docs/index.md
## How will the Data Scientists use a DaC package?
Say that the Data Engineers prepared the `demo-data` as code for you. Then you will install the code in your environment

```sh
python -m pip install demo-data
```
and then you will be able to access the data simply with

```python
from demo_data import load

data = load()
```

Data can be in any format. There is no constraint of any kind.

Not only will accessing data be this easy but, depending on how the data were prepared, you may also have access to useful metadata. How?

```python
from demo_data import Schema
```
With the schema you could, for example

- access the column names (e.g. `Schema.my_column`)
- unit test your functions by getting a data example with `Schema.example()`, as in the sketch below
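
For instance, a unit test built on `Schema.example()` could look like the sketch below. It assumes, for illustration, that `demo-data` ships tabular data as a pandas DataFrame and that `Schema` exposes the column name `my_column`, the `example()` method shown above, and a `validate` method (a requirement on DaC schemas described later on this page). The function under test, `keep_positive`, is a hypothetical user function, not part of the package.

```python
import pandas as pd

from demo_data import Schema


def keep_positive(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical user function: keep only rows where `my_column` is positive."""
    return data[data[Schema.my_column] > 0]


def test_keep_positive() -> None:
    # Schema.example() provides a small, schema-compliant fixture,
    # so no hand-crafted test data is needed.
    example = Schema.example()

    result = keep_positive(example)

    # The result should still satisfy the schema and contain only positive values.
    Schema.validate(result)
    assert (result[Schema.my_column] > 0).all()
```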
## How can a Data Engineer provide a DaC python package?
Install this library

```sh
python -m pip install dac
```
and use the command `dac pack` (run `dac pack --help` for detailed instructions).

At a high level, the most important elements you must provide are the following (a minimal sketch is shown after the list):
- python code to load the data
- a `Schema` class that at the very least contains a `validate` method, but possibly also

  - data field names (column names, if data is tabular)
  - an `example` method

- python dependencies
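
As a rough illustration of this contract, a hand-written equivalent of such a package could look like the sketch below. It is not code generated by `dac`: the column name `my_column` and the CSV location are placeholders, the data is assumed to be tabular and loaded with pandas, and the python dependency (`pandas`) would be declared in the package metadata so that installing the package pulls it in.

```python
# demo_data/__init__.py -- illustrative hand-written equivalent, not dac output
import pandas as pd

_DATA_URL = "https://example.com/demo-data.csv"  # placeholder data location


class Schema:
    # data field names, so users never hard-code column names
    my_column = "my_column"

    @classmethod
    def validate(cls, data: pd.DataFrame) -> pd.DataFrame:
        # at the very least: check that the expected fields are present
        missing = {cls.my_column} - set(data.columns)
        if missing:
            raise ValueError(f"missing columns: {missing}")
        return data

    @classmethod
    def example(cls) -> pd.DataFrame:
        # a tiny, schema-compliant sample for unit tests
        return pd.DataFrame({cls.my_column: [1, 2, 3]})


def load() -> pd.DataFrame:
    # the data source and its location stay hidden behind this function
    return Schema.validate(pd.read_csv(_DATA_URL))
```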
!!! hint "Use `pandera` to define the Schema"

    If the data type you are using is supported by [`pandera`](https://pandera.readthedocs.io/en/stable/index.html), consider using a [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) to define the Schema.
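
For example, a `pandera`-based `Schema` for tabular data might look like the sketch below. The column name `my_column` is a placeholder, and generating examples with `Schema.example()` relies on pandera's optional, hypothesis-based data-synthesis strategies, so treat this as an illustration rather than a drop-in recipe.

```python
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    # placeholder field; constraints and descriptions travel with the schema
    my_column: Series[int] = pa.Field(ge=0, description="An illustrative column")


# `Schema.validate(df)` is provided by `DataFrameModel` and raises if `df`
# does not match the schema. With the strategies extra installed,
# `Schema.example(size=3)` generates a small, schema-compliant DataFrame.
```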
## What are the advantages of distributing data in this way?
- The code needed to load the data, the data source, and locations are abstracted away from the user. This means that the data engineer can start from local files and transition to a SQL database, cloud file storage, or a Kafka topic without the user noticing or needing to adapt their code.
- *If you provide data field names in `Schema`* (e.g. `Schema.column_1`), the user code will not contain hard-coded column names, and changes in data source field names won't impact the user.
- *If you provide the `Schema.example` method*, users will be able to build robust code by writing unit tests for their functions effortlessly.
- Semantic versioning can be used to communicate significant changes:

  - a patch update corresponds to a fix in the data: its intended content is unchanged
  - a minor update corresponds to a change in the data that does not break the schema
  - a major update corresponds to a change in the schema, or any other breaking change

  In this way data pipelines can subscribe to the appropriate level of updates (for instance, a pipeline that pins `demo-data>=1,<2` keeps receiving fixes and compatible data updates, but never a schema-breaking release). Furthermore, it is easy to keep releasing data updates while maintaining backward compatibility (one can keep deploying `1.X.Y` updates even after version `2` has been rolled out).
- Descriptions of the data and columns can be included in the schema, and will therefore reach the user together with the data.
- Users will always know where to look for data: the PyPI index.