# Data as Code

<div align="center">
<img src="img/motto.png" alt="drawing" width="450"/>
</div>

Data as Code (DaC) is a paradigm of distributing versioned data as versioned code.

!!! warning "Disclaimer"

    Currently the focus is on tabular and batch data, and Python code only.

    Future extensions may be possible, depending on community interest.

## Consumer - Data Scientist

??? info "Follow along"

    The code snippets below can be executed on your machine too!
    You just need to configure `pip` to point to the PyPI registry where we stored the example DaC package. You can do this by running

    ```shell
    ❯ export PIP_EXTRA_INDEX_URL=https://gitlab.com/api/v4/projects/43746775/packages/pypi/simple
    ```

    Of course, don't forget to create an isolated environment before using `pip` to install the package:

    ```shell
    ❯ python -m venv venv && . venv/bin/activate
    ```

Say that the Data Engineers prepared the DaC package `dac-example-energy` for you. Install it with

```shell
❯ python -m pip install dac-example-energy
...
Successfully installed ... dac-example-energy-2.0.2 ...
```

Have you noticed the version `2.0.2`? That is the version of your data. This is very important; you can read why [here](#make-releases).

Now, how do you grab the data?
33 | 48 |
|
34 | 49 | ```python
|
35 |
| -from demo_data import Schema |
| 50 | +>>> from dac_example_energy import load |
| 51 | +>>> df = load() |
| 52 | +>>> df |
| 53 | + nrg_bal_name siec_name geo TIME_PERIOD OBS_VALUE |
| 54 | +0 Final consumption - energy use Solid fossil fuels AL 1990 6644.088 |
| 55 | +1 Final consumption - energy use Solid fossil fuels AL 1991 3816.945 |
| 56 | +2 Final consumption - energy use Solid fossil fuels AL 1992 1067.475 |
| 57 | +3 Final consumption - energy use Solid fossil fuels AL 1993 525.540 |
| 58 | +4 Final consumption - energy use Solid fossil fuels AL 1994 459.514 |
| 59 | +... ... ... .. ... ... |
| 60 | +71155 Gross available energy Non-renewable waste XK 2015 0.000 |
| 61 | +71156 Gross available energy Non-renewable waste XK 2016 0.000 |
| 62 | +71157 Gross available energy Non-renewable waste XK 2017 0.000 |
| 63 | +71158 Gross available energy Non-renewable waste XK 2018 0.000 |
| 64 | +71159 Gross available energy Non-renewable waste XK 2019 0.000 |
| 65 | + |
| 66 | +[71160 rows x 5 columns] |
36 | 67 | ```

One more very valuable tool is the `Schema` class. `Schema` is the implementation of a Data Contract: an agreement between the data producer and the data consumer that describes the data, its structure, and the constraints the data must fulfill. At the very least it has a `validate` method that verifies whether a given data set fulfills the contract. Loaded data is guaranteed to pass this validation.

Let us see what we can do with the `Schema` in the `dac-example-energy` package.

```python
>>> from dac_example_energy import Schema
>>> import inspect
>>> print(inspect.getsource(Schema))
class Schema(pa.SchemaModel):
    source: Series[str] = pa.Field(
        isin=[
            "Solid fossil fuels",
            ...
            "Non-renewable waste",
        ],
        nullable=False,
        alias="siec_name",
        description="Source of energy",
    )
    value_meaning: Series[str] = pa.Field(
        isin=[
            "Gross available energy",
            ...
            "Final consumption - transport sector - energy use",
        ],
        nullable=False,
        alias="nrg_bal_name",
        description="Meaning of the value",
    )
    location: Series[str] = pa.Field(
        isin=[
            "AL",
            ...
            "XK",
        ],
        nullable=False,
        alias="geo",
        description="Location code, either two-digit ISO 3166-1 alpha-2 code or "
        "'EA19', 'EU27_2020', 'EU28' for the European Union",
    )
    year: Series[int] = pa.Field(
        ge=1990,
        le=3000,
        nullable=False,
        alias="TIME_PERIOD",
        description="Year of observation",
    )
    value_in_gwh: Series[float] = pa.Field(
        nullable=True,
        alias="OBS_VALUE",
        description="Value in GWh",
    )
```
44 | 125 |
|
45 |
| -Install this library |
| 126 | +In this case [`pandera`](https://pandera.readthedocs.io/en/stable/index.html) has been used to define the Schema. We can |
| 127 | + |
| 128 | +- see which columns are available and even reference their names in our code without cumbersome hardcoded strings: |
| 129 | + ```python |
| 130 | + >>> df[Schema.value_in_gwh] |
| 131 | + 0 6644.088 |
| 132 | + 1 3816.945 |
| 133 | + 2 1067.475 |
| 134 | + 3 525.540 |
| 135 | + 4 459.514 |
| 136 | + ... |
| 137 | + 71155 0.000 |
| 138 | + 71156 0.000 |
| 139 | + 71157 0.000 |
| 140 | + 71158 0.000 |
| 141 | + 71159 0.000 |
| 142 | + Name: OBS_VALUE, Length: 71160, dtype: float64 |
| 143 | + ``` |
| 144 | +- for each column, we exactly know what to expect. For example, what is the column type, are `None` values allowed, are |
| 145 | + there specific admitted categorical values, etc.; |
| 146 | +- we can read a useful description of the column; |
| 147 | +- if we install `pandera[strategies]` with `pip`, we can even generate synthetic data that is guaranteed to pass the |
| 148 | + schema validation. This is very useful for testing our code: |
46 | 149 |

```python
>>> Schema.example(size=5)
            siec_name            nrg_bal_name geo  TIME_PERIOD  OBS_VALUE
0         Natural gas  Gross available energy  AL         1990        0.0
1  Solid fossil fuels  Gross available energy  AL         1990        0.0
2  Solid fossil fuels  Gross available energy  AL         1990        0.0
3  Solid fossil fuels  Gross available energy  AL         1990        0.0
4  Solid fossil fuels  Gross available energy  AL         1990        0.0
```

!!! hint "Example data does not look right"

    The example data above does not look right. Does this mean that there is something wrong in the implementation of the `example` method? Not really! Read [here](#nice-to-have-schemaexample-method).

## Producer - Data Engineer

Data as Code is a paradigm: it does not require any special tool or library. Anyone is free to implement it in their own way, and may do so in programming languages other than Python. The tools we built and describe below (a template and the `dac` CLI tool) are **convenience** tools: they may accelerate your development process, but they are not strictly necessary.

!!! hint "Use [`pandera`](https://pandera.readthedocs.io/en/stable/index.html) to define the Schema"

    If the dataframe engine (pandas/polars/dask/spark...) you are using is supported by [`pandera`](https://pandera.readthedocs.io/en/stable/index.html), consider using a [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) to define the Schema.

### Write the library

#### 1. Start from scratch

!!! warning "This approach expects you to be familiar with Python packaging"

Build your own library, respecting the following constraints:

##### Public function `load`

A public function named `load` is available at the root of the package. For example, if you build the package `dac-my-awesome-data`, it should be possible to do the following:

```python
>>> from dac_my_awesome_data import load
>>> df = load()
```

Notice that it must be possible to call `load()` without any argument, and the version of the returned data must correspond to the version of the package. This means that the data is fixed at build time and will differ from build to build.
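
One common way to satisfy this constraint is to bundle the data file into the package at build time and read it back in `load`. The sketch below assumes a hypothetical package `dac_my_awesome_data` whose build pipeline writes a `data.csv` next to its `__init__.py` (both names are illustrative, not prescribed by the paradigm):

```python
# Hypothetical load(): the wheel ships a data.csv produced at build time,
# so the package version and the data version coincide by construction.
from importlib.resources import files

import pandas as pd


def load() -> pd.DataFrame:
    # Locate the data file bundled with this very package version.
    data_file = files("dac_my_awesome_data").joinpath("data.csv")
    with data_file.open("rb") as f:
        return pd.read_csv(f)
```

Because the data travels inside the wheel, `load()` needs no arguments, no credentials, and no network access.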

##### Data fulfill the `Schema.validate()` method

A public class named `Schema` is available at the root of the package and implements the Data Contract. `Schema` has a `validate` method which takes data as input, raises an error if the Contract is not fulfilled, and returns the data otherwise.

Notice that the Data Contract should be verified at build time, ensuring that, given a Data as Code package, the data coming from `load()` will always fulfill the Data Contract.

This means, for example, that the following must always run without raising an error:

```python
>>> from dac_my_awesome_data import load, Schema
>>> Schema.validate(load())
```

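The two constraints above can be sketched together in a minimal, hand-rolled package interface — plain pandas, no pandera, with hypothetical column names, and with the data constructed inline where a real package would bundle it at build time:

```python
import pandas as pd


class Schema:
    """Hypothetical Data Contract over two made-up columns."""

    year = "TIME_PERIOD"
    value = "OBS_VALUE"

    @classmethod
    def validate(cls, df: pd.DataFrame) -> pd.DataFrame:
        # Raise if the contract is not fulfilled; return the data otherwise.
        if list(df.columns) != [cls.year, cls.value]:
            raise ValueError(f"unexpected columns: {list(df.columns)}")
        if not df[cls.year].between(1990, 3000).all():
            raise ValueError("TIME_PERIOD out of the allowed range")
        return df


def load() -> pd.DataFrame:
    # Stand-in for data that a real package would ship at build time.
    df = pd.DataFrame({Schema.year: [1990, 1991], Schema.value: [6644.1, 3816.9]})
    return Schema.validate(df)  # enforce the contract before data reaches the consumer
```

With this shape in place, `Schema.validate(load())` never raises, and consumers can write `df[Schema.value]` instead of hard-coding `"OBS_VALUE"`.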

##### [Nice to have] `Schema` contains column names

It is possible to reference the column names from the `Schema` class. For example:

```python
>>> from dac_my_awesome_data import load, Schema
>>> df = load()
>>> df[Schema.column_1]
```

##### [Nice to have] `Schema.example` method

It is possible to generate synthetic data that fulfills the Data Contract. For example:

```python
>>> from dac_my_awesome_data import Schema
>>> Schema.example(size=5)
   column_1  column_2
0         1         2
1         3         4
2         5         6
3         7         8
4         9        10
```

Ideally, synthetic data should really stretch the limits of the Data Contract: the generated data should be as far away as possible from the real data while still fulfilling the Contract. This makes tests built on the `example` method as robust as possible, and it pushes developers to improve the Data Contract, making it as reliable as possible. For example, in the [Consumer](#consumer-data-scientist) section you may have noticed that the generated rows nearly always have the same values. This is unlikely to be the case in the real data, but since it is not encoded in the Schema it may still happen! It would probably be a good idea to add meaningful constraint checks to the `Schema` class.

#### 2. Use the template

We provide a [Copier](https://copier.readthedocs.io/en/stable/) template to get started quickly.

[Take me to the template :material-cursor-default-click:](https://gitlab.com/data-as-code/template/src){ .md-button }

#### 3. Use the [`dac`](https://github.com/data-as-code/dac) CLI tool

Our `dac` CLI tool is a convenience tool capable of building a Python package that respects the Data as Code paradigm.

[Take me to the `dac` CLI tool :material-cursor-default-click:](https://github.com/data-as-code/dac){ .md-button }

#### Compare template and `dac pack`

Which one should you use?

|                              | Template                  | `dac pack`            |
| :--------------------------: | :-----------------------: | :-------------------: |
| Simplicity                   | :material-thumbs-up-down: | :material-thumb-up:   |
| Possibility of customization | :material-thumb-up:       | :material-thumb-down: |

### Make Releases

Choosing the right release version plays a crucial role in the Data as Code paradigm.

Semantic versioning is used to communicate significant changes:

|           | Reason                                                                               |
| :-------: | :----------------------------------------------------------------------------------- |
| __Patch__ | Fix in the data; the intended content is unchanged                                   |
| __Minor__ | Change in the data that does not break the Data Contract - typically, new batch data |
| __Major__ | Change in the Data Contract, or any other breaking change                            |

__Typically, Patch and Major releases involve a manual process, while Minor releases can be automated.__ Our [`dac` CLI tool](https://github.com/data-as-code/dac) can help you with automated releases: explore the `dac next-version` command.
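
The table above boils down to a simple decision rule. The toy function below illustrates that policy (it is an illustration only, not the actual logic behind `dac next-version`):

```python
def next_version(current: str, schema_changed: bool, data_changed: bool) -> str:
    """Toy illustration of the release table: Major > Minor > Patch."""
    major, minor, patch = map(int, current.split("."))
    if schema_changed:  # breaking change in the Data Contract
        return f"{major + 1}.0.0"
    if data_changed:  # e.g. a new batch of data, Contract untouched
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # fix in the data, intended content unchanged


print(next_version("2.0.2", schema_changed=False, data_changed=True))  # → 2.1.0
```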

## Why distribute Data as Code?

- __Data Scientists have a very convenient way to ensure that their code is not run on incompatible data.__ They can simply include the Data as Code package in their dependencies (e.g. `dac-example-energy~=1.0`), and installing their code alongside incompatible data (e.g. `dac-example-energy` version `2.0.0`) will fail.
- __Data pipelines can receive data updates without breaking.__ Pipelines can subscribe to a major version of the data and receive updates that carry no breaking changes.
- __It provides a way to maintain multiple release streams__ (e.g. `1.X.Y` and `2.X.Y`). This is useful when a new version of the data is released but some users still depend on the old one: the Data Engineer can keep releasing updates for both versions until all users have migrated.
- __The code needed to load the data, the data source, and the data location are abstracted away from the consumer.__ This means that the Data Producer can start from local files and later transition to a SQL database, cloud file storage, or a Kafka topic without the consumer noticing or needing to adapt their code.
- _If you provide column names in the `Schema`_ (e.g. `Schema.column_1`), __the consumer's code will not contain hard-coded column names__, and changes in data source field names won't impact the user.
- _If you provide the `Schema.example` method_, __the consumer will be able to build robust code by effortlessly writing unit tests for their functions__. This results in a more robust data pipeline.
- _If the description of the data and columns is included in the `Schema`_, __the data is self-documented from the consumer's perspective__.
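
The first bullet relies on PEP 440 compatible-release pins. As a simplified sketch of what a pin like `~=1.0` means for plain `MAJOR.MINOR` version strings (real resolvers implement the full PEP 440 rules, including pre-releases and epochs):

```python
def satisfies_compatible_release(version: str, pin: str) -> bool:
    """Simplified ~= semantics: match the pin's leading components, and be >= the pin."""
    v = tuple(int(x) for x in version.split("."))
    p = tuple(int(x) for x in pin.split("."))
    # e.g. pin "1.0" accepts >=1.0, <2.0 — minor/patch updates flow in.
    return v[: len(p) - 1] == p[:-1] and v[: len(p)] >= p


print(satisfies_compatible_release("1.2.0", "1.0"))  # → True: minor data update accepted
print(satisfies_compatible_release("2.0.0", "1.0"))  # → False: breaking release excluded
```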