<img src="img/motto.png" alt="drawing" width="450" />
</div>

Data as Code (DaC) is a paradigm of distributing versioned data as code. Think of it as treating your data with the same
care and precision as your software.

!!! warning "Disclaimer"

    At the moment, we're focusing on tabular and batch data, with Python as the primary language.

    But who knows? With enough community interest, we might expand to other areas in the future!

## Consumer - Data Scientist

??? info "Follow along"

    Want to try the examples below on your own machine? It's easy! Just configure `pip` to point to the PyPI registry
    where the example DaC package is stored. Run this command:

    ```shell
    ❯ export PIP_EXTRA_INDEX_URL=https://gitlab.com/api/v4/projects/43746775/packages/pypi/simple
    ```

    And don't forget to create an isolated environment before installing the package:

    ```shell
    ❯ python -m venv venv && . venv/bin/activate
    ```

Imagine the Data Engineers have prepared a DaC package called `dac-example-energy` just for you. Install it like this:

```shell
❯ python -m pip install dac-example-energy
...
Successfully installed ... dac-example-energy-2.0.2 ...
```

Notice the version `2.0.2`? That's the version of your data! Curious to know why the version matters so much?
Check out [this section](#make-releases).

### Grab the data with a snap of your fingers: `load`

Now, let's grab the data:

```python
>>> from dac_example_energy import load
>>> df = load()
>>> df
...
[71160 rows x 5 columns]
```

### Meet the `Schema` Class: Your Data's Best Friend

The `Schema` class is the backbone of the Data Contract. It's a promise between the data producer and the data consumer.
It defines the structure, constraints, and expectations for the data. And here's the best part: any data you load is
guaranteed to pass validation.
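
For example, checking the loaded data against the contract is a single call (a quick sketch, reusing the `df` loaded
above):

```python
>>> from dac_example_energy import Schema
>>> Schema.validate(df)  # raises if the contract is violated, returns the data otherwise
```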

Let's explore what the `Schema` in the `dac-example-energy` package can do:

```python
>>> from dac_example_energy import Schema
...
```

This `Schema` is built using [`pandera`](https://pandera.readthedocs.io/en/stable/index.html). Here's why it's awesome:

- **Column names are accessible**: No more hardcoding strings! Reference column names directly in your code:
    ```python
    >>> df[Schema.value_in_gwh]
    0        6644.088
    ...
    71159       0.000
    Name: OBS_VALUE, Length: 71160, dtype: float64
    ```
- **Clear expectations**: Know exactly what each column should contain, including types and constraints.
- **Self-documenting**: Each column comes with a description.
- **Synthetic data generation**: Install `pandera[strategies]` and generate test data that passes validation:
    ```python
    >>> Schema.example(size=5)
                siec_name            nrg_bal_name geo  TIME_PERIOD  OBS_VALUE
    0         Natural gas  Gross available energy  AL         1990        0.0
    1  Solid fossil fuels  Gross available energy  AL         1990        0.0
    2  Solid fossil fuels  Gross available energy  AL         1990        0.0
    3  Solid fossil fuels  Gross available energy  AL         1990        0.0
    4  Solid fossil fuels  Gross available energy  AL         1990        0.0
    ```

!!! hint "Example data looks odd?"

    The synthetic data above might not look realistic. Does this mean the `example` method is broken? Not at all! Check
    out [this section](#nice-to-have-schemaexample-method) to learn more.

## Producer - Data Engineer

Data as Code is a paradigm, not a tool. You can implement it however you like, in any language. That said, we've built
some handy tools to make your life easier.

!!! hint "Pro Tip: Use [`pandera`](https://pandera.readthedocs.io/en/stable/index.html) for defining schemas"

    If your dataframe engine (pandas, polars, dask, spark, etc.) is supported by `pandera`, consider using a
    [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) to define your schema.

### Write the library

#### 1. Start from scratch

!!! warning "This approach requires Python packaging knowledge"

Build your own library while following these guidelines:

##### Public function `load`

Your package must have a public function named `load` at its root. For example, if your package is
`dac-my-awesome-data`, users should be able to do this:

```python
>>> from dac_my_awesome_data import load
>>> df = load()
```

It must be possible to call `load()` without any arguments, and the returned data must correspond to the version of the
package. This means the data can be different at every build.
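
To make this concrete, here is a minimal sketch of what the package root could look like. It assumes the build step
bundles a snapshot of the data as a Parquet file inside the package; the file name and storage format are illustrative,
not prescribed by the paradigm:

```python
# dac_my_awesome_data/__init__.py -- an illustrative sketch, not the official layout
from importlib.resources import files

import pandas as pd


def load() -> pd.DataFrame:
    """Return the data snapshot bundled with this version of the package."""
    # data.parquet is assumed to have been packaged alongside the code at build time
    with files(__package__).joinpath("data.parquet").open("rb") as f:
        return pd.read_parquet(f)
```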

##### Data must pass `Schema.validate()`

A public class named `Schema` is available at the root of the package and implements the Data Contract. `Schema` has a
`validate` method which takes data as input, raises an error if the Contract is not fulfilled, and returns the data
otherwise.
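
As an illustration, such a contract could be written with `pandera`; the column names and constraints below are made up
for the example:

```python
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    """Illustrative Data Contract for dac-my-awesome-data."""

    region: Series[str] = pa.Field(description="Identifier of the region")
    value: Series[float] = pa.Field(ge=0, description="Measured value, must be non-negative")


# Schema.validate(df) raises a pandera.errors.SchemaError when the contract is
# violated, and returns the validated dataframe otherwise.
```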

##### [Nice to have] `Schema.example` method

Provide a method to generate synthetic data that fulfills the Data Contract:

```python
>>> from dac_my_awesome_data import Schema
>>> Schema.example(size=5)
...
```

#### 2. Use the template

We've created a [Copier](https://copier.readthedocs.io/en/stable/) template to help you get started quickly.

[Check out the template :material-cursor-default-click:](https://gitlab.com/data-as-code/template/src){ .md-button }

#### 3. Use the [`dac`](https://github.com/data-as-code/dac) CLI tool

Our `dac` CLI tool simplifies building Python packages that follow the Data as Code paradigm.

[Explore the `dac` CLI tool :material-cursor-default-click:](https://github.com/data-as-code/dac){ .md-button }

#### Template vs. `dac pack`

Which one should you choose?

|                               |         Template          |     `dac pack`      |
| :---------------------------: | :-----------------------: | :-----------------: |

## Make releases

Choosing the right release version plays a crucial role in the Data as Code paradigm.
271
- Semantic versioning is used to communicate significant changes:
272
-
273
- | | Reason |
274
- | :------ - : | :---------------------------------------------------------------------------------- - |
275
- | __Patch__ | Fix in the data. Intended content unchanged |
276
- | __Minor__ | Change in the data that does not break the Data Contract - typically, new batch data |
277
- | __Major__ | Change in the Data Contract, or any other breaking change |
268
+ | | When to Use |
269
+ | :------ - : | :---------------------------------------------------------- - |
270
+ | __Patch__ | Fixes in the data without changing its intended content |
271
+ | __Minor__ | Non- breaking changes, like a fresh version of the batch data |
272
+ | __Major__ | Breaking changes, such as changes to the Data Contract |
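
To see why this discipline pays off downstream, here is a quick illustration (using the `packaging` library, which is
independent of `dac`) of how a consumer's compatible-release pin reacts to each kind of bump:

```python
from packaging.specifiers import SpecifierSet

# A consumer depending on `dac-example-energy~=1.0` accepts any 1.x release.
spec = SpecifierSet("~=1.0")

print("1.0.1" in spec)  # True:  a Patch release (data fix) flows in
print("1.4.0" in spec)  # True:  a Minor release (fresh batch) flows in
print("2.0.0" in spec)  # False: a Major (breaking) release is blocked
```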

__Patch and Major releases are usually manual, while Minor releases can be automated.__ Use the
[`dac` CLI tool](https://github.com/data-as-code/dac) to automate Minor releases with the `dac next-version` command.

## Why distribute Data as Code?

- **Seamless compatibility**: Data Scientists can ensure their code runs on compatible data by declaring the Data as
  Code package as a dependency. For example, if they add `dac-example-energy~=1.0` to their dependencies, installing
  their code together with `dac-example-energy==2.0.0` will fail.
- **Smooth updates**: Data pipelines can receive updates without breaking, as long as they subscribe to a major version.
- **Multiple release streams**: Maintain different versions (e.g., `1.X.Y` and `2.X.Y`) to support users on older
  versions.
- **Abstracted complexity**: Data loading, sources, and locations are hidden from consumers, allowing producers to
  change implementations without impacting users.
- **No hardcoded column names**: *If column names are included in the `Schema`*, consumers can avoid hardcoding field
  names.
- **Robust testing**: *If the `Schema.example` method is provided*, consumers can write strong unit tests for their
  code (see the sketch after this list).
- **Self-documenting data**: *If data and column descriptions are included in the `Schema`*, data will be easier for
  consumers to understand.
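
Here is what that unit-testing workflow could look like; `double_energy` below is a hypothetical consumer function, but
`Schema.example` and the column reference come straight from the package:

```python
import pandas as pd

from dac_example_energy import Schema


def double_energy(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical consumer transformation: double every energy value."""
    df = df.copy()
    df[Schema.value_in_gwh] = df[Schema.value_in_gwh] * 2
    return df


def test_double_energy():
    # Synthetic input that is guaranteed to fulfill the Data Contract
    df = Schema.example(size=5)
    result = double_energy(df)
    assert (result[Schema.value_in_gwh] == df[Schema.value_in_gwh] * 2).all()
```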