Commit 6c626e8 (1 parent: 14e4912)

Author: Francesco Calcavecchia
Commit message: Complete re-writing

File tree

5 files changed: +286 −64 lines

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -1,4 +1,5 @@
 .DS_Store
 .python-version
 venv
+.venv
 .vscode
```

.pre-commit-config.yaml

Lines changed: 5 additions & 0 deletions

```diff
@@ -12,6 +12,7 @@ repos:
       - id: check-json
       - id: check-toml
       - id: check-yaml
+        exclude: mkdocs.yml
   - repo: https://github.com/tcort/markdown-link-check
     rev: "v3.13.7"
     hooks:
@@ -22,3 +23,7 @@ repos:
     hooks:
       - id: mdformat
         args: ["--wrap", "120"]
+        additional_dependencies:
+          - mdformat-admon
+          - mdformat-mkdocs
+          - mdformat-tables
```

docs/examples.md

Lines changed: 0 additions & 5 deletions
This file was deleted.

docs/index.md

Lines changed: 262 additions & 51 deletions

New content:

# Data as Code

<div align="center">
<img src="img/motto.png" alt="drawing" width="450"/>
</div>

Data as Code (DaC) is a paradigm of distributing versioned data as versioned code.

!!! warning "Disclaimer"

    Currently the focus is on tabular and batch data, and Python code only.

    Future extensions may be possible, depending on community interest.

## Consumer - Data Scientist

??? info "Follow along"

    The code snippets below can be executed on your machine too!
    You just need to configure `pip` to point to the PyPI registry where we stored the example DaC package. You can do
    this by running

    ```shell
    ❯ export PIP_EXTRA_INDEX_URL=https://gitlab.com/api/v4/projects/43746775/packages/pypi/simple
    ```

    Of course, don't forget to create an isolated environment before using `pip` to install the package:

    ```shell
    ❯ python -m venv venv && . venv/bin/activate
    ```

Say that the Data Engineers prepared the DaC package `dac-example-energy` for you. Install it with

```shell
❯ python -m pip install dac-example-energy
...
Successfully installed ... dac-example-energy-2.0.2 ...
```

Have you noticed the version `2.0.2`? That is the version of your data. This is very important; you can read
[here](#make-releases) why.

Now, how do you grab the data?

```python
>>> from dac_example_energy import load
>>> df = load()
>>> df
                         nrg_bal_name            siec_name geo  TIME_PERIOD  OBS_VALUE
0      Final consumption - energy use   Solid fossil fuels  AL         1990   6644.088
1      Final consumption - energy use   Solid fossil fuels  AL         1991   3816.945
2      Final consumption - energy use   Solid fossil fuels  AL         1992   1067.475
3      Final consumption - energy use   Solid fossil fuels  AL         1993    525.540
4      Final consumption - energy use   Solid fossil fuels  AL         1994    459.514
...                               ...                  ...  ..          ...        ...
71155          Gross available energy  Non-renewable waste  XK         2015      0.000
71156          Gross available energy  Non-renewable waste  XK         2016      0.000
71157          Gross available energy  Non-renewable waste  XK         2017      0.000
71158          Gross available energy  Non-renewable waste  XK         2018      0.000
71159          Gross available energy  Non-renewable waste  XK         2019      0.000

[71160 rows x 5 columns]
```

One more very valuable tool is the `Schema` class. `Schema` is the implementation of a Data Contract: a contract
between the data producer and the data consumer. It describes the data, its structure, and the constraints that the
data must fulfill. At the very least it has a `validate` method that verifies whether a given data set fulfills the
data contract. Loaded data is guaranteed to pass the validation.
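
Validation can also be invoked directly; a minimal sketch, based only on the contract just described:

```python
>>> from dac_example_energy import load, Schema
>>> df = Schema.validate(load())  # raises if the Data Contract is violated, returns the data otherwise
```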

Let us see what we can do with the `Schema` in the `dac-example-energy` package.

```python
>>> from dac_example_energy import Schema
>>> import inspect
>>> print(inspect.getsource(Schema))
class Schema(pa.SchemaModel):
    source: Series[str] = pa.Field(
        isin=[
            "Solid fossil fuels",
            ...
            "Non-renewable waste",
        ],
        nullable=False,
        alias="siec_name",
        description="Source of energy",
    )
    value_meaning: Series[str] = pa.Field(
        isin=[
            "Gross available energy",
            ...
            "Final consumption - transport sector - energy use",
        ],
        nullable=False,
        alias="nrg_bal_name",
        description="Meaning of the value",
    )
    location: Series[str] = pa.Field(
        isin=[
            "AL",
            ...
            "XK",
        ],
        nullable=False,
        alias="geo",
        description="Location code, either two-digit ISO 3166-1 alpha-2 code or "
        "'EA19', 'EU27_2020', 'EU28' for the European Union",
    )
    year: Series[int] = pa.Field(
        ge=1990,
        le=3000,
        nullable=False,
        alias="TIME_PERIOD",
        description="Year of observation",
    )
    value_in_gwh: Series[float] = pa.Field(
        nullable=True,
        alias="OBS_VALUE",
        description="Value in GWh",
    )
```

In this case [`pandera`](https://pandera.readthedocs.io/en/stable/index.html) has been used to define the Schema. We can

- see which columns are available, and even reference their names in our code without cumbersome hardcoded strings:

  ```python
  >>> df[Schema.value_in_gwh]
  0        6644.088
  1        3816.945
  2        1067.475
  3         525.540
  4         459.514
             ...
  71155       0.000
  71156       0.000
  71157       0.000
  71158       0.000
  71159       0.000
  Name: OBS_VALUE, Length: 71160, dtype: float64
  ```

- know exactly what to expect of each column: its type, whether `None` values are allowed, whether specific categorical
  values are admitted, and so on;

- read a useful description of each column;

- generate synthetic data that is guaranteed to pass the schema validation, provided we install `pandera[strategies]`
  with `pip`. This is very useful for testing our code (see the sketch after the hint below):

  ```python
  >>> Schema.example(size=5)
              siec_name            nrg_bal_name geo  TIME_PERIOD  OBS_VALUE
  0         Natural gas  Gross available energy  AL         1990        0.0
  1  Solid fossil fuels  Gross available energy  AL         1990        0.0
  2  Solid fossil fuels  Gross available energy  AL         1990        0.0
  3  Solid fossil fuels  Gross available energy  AL         1990        0.0
  4  Solid fossil fuels  Gross available energy  AL         1990        0.0
  ```

!!! hint "Example data does not look right"

    The example data above does not look right. Does this mean that there is something wrong with the implementation of
    the `example` method? Not really! Read [here](#nice-to-have-schemaexample-method).
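
Here is what that testing use case can look like: a minimal unit-test sketch whose fixture is drawn from
`Schema.example` (`total_by_location` is a hypothetical consumer function, not part of the package):

```python
import pandas as pd

from dac_example_energy import Schema


def total_by_location(df: pd.DataFrame) -> pd.Series:
    # hypothetical consumer logic: aggregate energy values per location
    return df.groupby(Schema.location)[Schema.value_in_gwh].sum()


def test_total_by_location():
    # synthetic data fulfilling the Data Contract: no real data needed
    df = Schema.example(size=10)
    result = total_by_location(df)
    # every location in the input must appear in the aggregated output
    assert set(result.index) == set(df[Schema.location])
```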

## Producer - Data Engineer

Data as Code is a paradigm; it does not require any special tool or library. Anyone is free to implement it in their
own way and, by the way, may do so in programming languages other than Python. The tools that we built and describe
below (the template and the `dac` CLI tool) are just **convenience** tools, meaning that they may accelerate your
development process but are not strictly necessary.

!!! hint "Use [`pandera`](https://pandera.readthedocs.io/en/stable/index.html) to define the Schema"

    If the dataframe engine (pandas/polars/dask/spark...) you are using is supported by
    [`pandera`](https://pandera.readthedocs.io/en/stable/index.html), consider using a
    [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) to define the Schema.
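
For instance, a minimal pandas-based `DataFrameModel` could look like this (a sketch; the column names and constraints
are illustrative, not taken from any real package):

```python
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    # illustrative fields: a constrained categorical column and a nullable measurement
    location: Series[str] = pa.Field(isin=["AL", "XK"], nullable=False, description="Location code")
    value_in_gwh: Series[float] = pa.Field(nullable=True, description="Value in GWh")
```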

### Write the library

#### 1. Start from scratch

!!! warning "This approach expects you to be familiar with Python packaging"

Build your own library, respecting the following constraints:

##### Public function `load`

A public function named `load` is available at the root of the package. For example, if you build the package
`dac-my-awesome-data`, it should be possible to do the following:

```python
>>> from dac_my_awesome_data import load
>>> df = load()
```

Notice that it must be possible to call `load()` without any argument, and the version of the returned data must
correspond to the version of the package. This means that the data will be different at every build.
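
How `load` is implemented is entirely up to you. A minimal sketch, under the assumption (not mandated by the paradigm)
that the build step bakes the data into the package as a parquet file:

```python
from importlib.resources import files

import pandas as pd


def load() -> pd.DataFrame:
    # data.parquet is a hypothetical file shipped inside the package at build time
    with files("dac_my_awesome_data").joinpath("data.parquet").open("rb") as f:
        return pd.read_parquet(f)
```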

##### Data fulfills the `Schema.validate()` method

A public class named `Schema` is available at the root of the package and implements the Data Contract. `Schema` has a
`validate` method which takes data as input, raises an error if the Contract is not fulfilled, and returns the data
otherwise.

Notice that the Data Contract should be verified at build time, thereby ensuring that, given a Data as Code package,
the data coming from `load()` will always fulfill the Data Contract.

This means that, for example, it must be possible to do the following:

```python
>>> from dac_my_awesome_data import load, Schema
>>> Schema.validate(load())
```

and it will never raise an error.
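
One way to enforce this is a check that runs at build time, before the package is published; a sketch, assuming a
pytest-based setup:

```python
from dac_my_awesome_data import load, Schema


def test_loaded_data_fulfills_the_data_contract():
    # the released package must never ship data that violates its own Data Contract
    Schema.validate(load())
```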

##### [Nice to have] `Schema` contains column names

It is possible to reference the column names from the `Schema` class. For example:

```python
>>> from dac_my_awesome_data import load, Schema
>>> df = load()
>>> df[Schema.column_1]
```

##### [Nice to have] `Schema.example` method

It is possible to generate synthetic data that fulfills the Data Contract. For example:

```python
>>> from dac_my_awesome_data import Schema
>>> Schema.example(size=5)
   column_1  column_2
0         1         2
1         3         4
2         5         6
3         7         8
4         9        10
```

Ideally, the synthetic data should really stretch the limits of the Data Contract. By this we mean that the generated
data should be as far away as possible from the real data while still fulfilling the Data Contract. This makes the
tests built on this feature as robust as possible. It also pushes the developers to improve the Data Contract, and
therefore makes it as reliable as possible. For example, in the [Consumer](#consumer-data-scientist) section you may
have noticed that the rows nearly always have the same values. This is unlikely to be the case in the real data, but
since it is not encoded in the Schema it may still happen! It would probably be a good idea to add meaningful
constraint checks to the `Schema` class, as sketched below.
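
For instance, if observed energy values can never be negative (a hypothetical constraint, to be added only if it is
true of the real data), encoding it in the field makes both validation and the generated examples stricter:

```python
import pandera as pa
from pandera.typing import Series


class Schema(pa.SchemaModel):
    # ... other fields as shown in the Consumer section ...
    value_in_gwh: Series[float] = pa.Field(
        ge=0.0,  # hypothetical constraint: energy values are non-negative
        nullable=True,
        alias="OBS_VALUE",
        description="Value in GWh",
    )
```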

#### 2. Use the template

We provide a [Copier](https://copier.readthedocs.io/en/stable/) template to get started quickly.

[Take me to the template :material-cursor-default-click:](https://gitlab.com/data-as-code/template/src){ .md-button }

#### 3. Use the [`dac`](https://github.com/data-as-code/dac) CLI tool

Our `dac` CLI tool is a convenience tool capable of building a Python package that respects the Data as Code paradigm.

[Take me to the `dac` CLI tool :material-cursor-default-click:](https://github.com/data-as-code/dac){ .md-button }

#### Compare template and `dac pack`

Which one should you use?

|                              |         Template          |      `dac pack`       |
| :--------------------------: | :-----------------------: | :-------------------: |
|          Simplicity          | :material-thumbs-up-down: |  :material-thumb-up:  |
| Possibility of customization |    :material-thumb-up:    | :material-thumb-down: |

### Make Releases

Choosing the right release version plays a crucial role in the Data as Code paradigm.

Semantic versioning is used to communicate significant changes:

|           | Reason                                                                                |
| :-------: | :------------------------------------------------------------------------------------ |
| __Patch__ | Fix in the data. Intended content unchanged                                           |
| __Minor__ | Change in the data that does not break the Data Contract - typically, new batch data  |
| __Major__ | Change in the Data Contract, or any other breaking change                             |

__Typically, Patch and Major releases involve a manual process, while Minor releases can be automated.__ Our
[`dac` CLI tool](https://github.com/data-as-code/dac) can help you with the automated releases. Explore the command
`dac next-version`.
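
To make the automated part concrete, here is a hand-rolled sketch of the Minor bump that such automation performs
(`dac next-version` is the convenience equivalent; this snippet is illustrative, not the tool's implementation):

```python
from packaging.version import Version


def next_minor(current: str) -> str:
    # new batch data, Data Contract untouched -> bump the minor version
    v = Version(current)
    return f"{v.major}.{v.minor + 1}.0"


assert next_minor("1.4.2") == "1.5.0"
```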

## Why distribute Data as Code?

- __Data Scientists have a very convenient way to ensure that their code is not run on incompatible data.__ They can
  simply include the Data as Code package in their dependencies (e.g. `dac-example-energy~=1.0`), and then installing
  their code together with incompatible data (e.g. `dac-example-energy` version `2.0.0`) will fail.
- __Data pipelines can receive data updates without breaking.__ Pipelines can subscribe to a major version of the data
  and receive updates without breaking changes.
- __It provides a way to maintain multiple release streams__ (e.g. `1.X.Y` and `2.X.Y`). This is useful when a new
  version of the data is released but some users are still using the old version. In this case, the data engineer can
  keep releasing updates for both versions until all users have migrated to the new version.
- __The code needed to load the data, the data source, and locations are abstracted away from the consumer.__ This
  means that the Data Producer can start from local files and transition to a SQL database, cloud file storage, or a
  Kafka topic, without the consumer noticing it or needing to adapt their code.
- _If you provide column names in `Schema`_ (e.g. `Schema.column_1`), __the consumer's code will not contain hard-coded
  column names__, and changes in data source field names won't impact the user.
- _If you provide the `Schema.example` method_, __the consumer will be able to build robust code by writing unit tests
  for their functions__. This results in a more robust data pipeline.
- _If the description of the data and columns is included in the `Schema`_, __the data will be self-documented from the
  consumer's perspective__.
