
Commit 073146c

Author: Francesco Calcavecchia
Commit message: simplify text
1 parent 2e9d47d commit 073146c

File tree

1 file changed: +75 -84 lines changed

docs/index.md

Lines changed: 75 additions & 84 deletions
@@ -4,43 +4,46 @@
<img src="img/motto.png" alt="drawing" width="450"/>
</div>

Data as Code (DaC) is a paradigm of distributing versioned data as code. Think of it as treating your data with the same
care and precision as your software.

!!! warning "Disclaimer"

    At the moment, we're focusing on tabular and batch data, with Python as the primary language.

    But who knows? With enough community interest, we might expand to other areas in the future!

## Consumer - Data Scientist

??? info "Follow along"

    Want to try the examples below on your own machine? It's easy! Just configure `pip` to point to the PyPI registry
    where the example DaC package is stored. Run this command:

    ```shell
    ❯ export PIP_EXTRA_INDEX_URL=https://gitlab.com/api/v4/projects/43746775/packages/pypi/simple
    ```

    And don't forget to create an isolated environment before installing the package:

    ```shell
    ❯ python -m venv venv && . venv/bin/activate
    ```

Imagine the Data Engineers have prepared a DaC package called `dac-example-energy` just for you. Install it like this:

```shell
❯ python -m pip install dac-example-energy
...
Successfully installed ... dac-example-energy-2.0.2 ...
```

Notice the version `2.0.2`? That’s the version of your data! Curious to know more about the importance of the version?
Check out [this section](#make-releases).
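Because the data version is just the package version, you can also read it at runtime with the standard library. A small sketch (the `dac-example-energy` package may not be installed in every environment, so the lookup is guarded):

```python
from importlib.metadata import PackageNotFoundError, version

try:
    # The installed package version doubles as the data version.
    data_version = version("dac-example-energy")
except PackageNotFoundError:
    data_version = None  # package not installed in this environment

print(data_version)
```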

### Grab the data with a snap of your fingers: `load`

Now, let’s grab the data:
```python
>>> from dac_example_energy import load
...
[71160 rows x 5 columns]
```

### Meet the `Schema` Class: Your Data’s Best Friend

The `Schema` class is the backbone of the Data Contract. It’s a promise between the data producer and the data consumer.
It defines the structure, constraints, and expectations for the data. And here’s the best part: any data you load is
guaranteed to pass validation.

Let’s explore what the `Schema` in the `dac-example-energy` package can do:
```python
>>> from dac_example_energy import Schema
...
)
```

This `Schema` is built using [`pandera`](https://pandera.readthedocs.io/en/stable/index.html). Here’s why it’s awesome:

- **Column names are accessible**: No more hardcoding strings! Reference column names directly in your code:

    ```python
    >>> df[Schema.value_in_gwh]
    0 6644.088
    ...
    71159 0.000
    Name: OBS_VALUE, Length: 71160, dtype: float64
    ```
- **Clear expectations**: Know exactly what each column should contain: types, constraints, and more.
- **Self-documenting**: Each column comes with a description.
- **Synthetic data generation**: Install `pandera[strategies]` and generate test data that passes validation:

    ```python
    >>> Schema.example(size=5)
    siec_name nrg_bal_name geo TIME_PERIOD OBS_VALUE
    0 Natural gas Gross available energy AL 1990 0.0
    1 Solid fossil fuels Gross available energy AL 1990 0.0
    2 Solid fossil fuels Gross available energy AL 1990 0.0
    3 Solid fossil fuels Gross available energy AL 1990 0.0
    4 Solid fossil fuels Gross available energy AL 1990 0.0
    ```

!!! hint "Example data looks odd?"

    The synthetic data above might not look realistic. Does this mean the `example` method is broken? Not at all! Check
    out [this section](#nice-to-have-schemaexample-method) to learn more.

## Producer - Data Engineer

Data as Code is a paradigm, not a tool. You can implement it however you like, in any language. That said, we’ve built
some handy convenience tools to make your life easier; they may accelerate your development, but they are not strictly
necessary.

!!! hint "Pro Tip: Use [`pandera`](https://pandera.readthedocs.io/en/stable/index.html) for defining schemas"

    If your dataframe engine (pandas, polars, dask, spark, etc.) is supported by `pandera`, consider using a
    [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) to define your schema.

### Write the library

#### 1. Start from scratch

!!! warning "This approach requires Python packaging knowledge"

Build your own library while following these guidelines:

##### Public function `load`

Your package must have a public function named `load` at its root. For example, if your package is
`dac-my-awesome-data`, users should be able to do this:

```python
>>> from dac_my_awesome_data import load
>>> df = load()
```

It must be possible to call `load()` without any argument, and the returned data must correspond to the version of the
package. This means the data will differ from build to build.
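As a sketch of these constraints, a minimal package module could look like the following. The layout, column names, and values are made up for illustration; they are not the real `dac-example-energy` data:

```python
# Hypothetical dac_my_awesome_data/__init__.py -- illustrative only.
import pandas as pd


def load() -> pd.DataFrame:
    # A real package would read the data bundled with this exact package
    # version (e.g. via importlib.resources); here we inline a toy frame.
    return pd.DataFrame(
        {
            "geo": ["AL", "AT"],
            "TIME_PERIOD": [1990, 1990],
            "OBS_VALUE": [0.0, 6644.088],
        }
    )


df = load()  # callable with no arguments, as required
```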

##### Data must pass `Schema.validate()`

A public class named `Schema` is available at the root of the package and implements the Data Contract. `Schema` has a
`validate` method which takes data as input, raises an error if the Contract is not fulfilled, and returns the data
otherwise.
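The contract can be sketched without any particular library (the documented example uses `pandera` instead); the class and column names below are hypothetical:

```python
# Dependency-light sketch of a Data Contract: validate() raises on
# violation and returns the data otherwise.
import pandas as pd


class Schema:
    value_in_gwh = "OBS_VALUE"  # made-up column-name attribute

    @classmethod
    def validate(cls, df: pd.DataFrame) -> pd.DataFrame:
        if cls.value_in_gwh not in df.columns:
            raise ValueError(f"missing required column {cls.value_in_gwh!r}")
        return df


validated = Schema.validate(pd.DataFrame({"OBS_VALUE": [1.0, 2.0]}))
```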
@@ -222,7 +219,7 @@ It is possible to reference the column names from the `Schema` class. For exampl
##### [Nice to have] `Schema.example` method

Provide a method to generate synthetic data that fulfills the Data Contract:

```python
>>> from dac_my_awesome_data import Schema
...
```
@@ -245,19 +242,19 @@ be probably a good idea to add meaningful constraint checks to the `Schema` clas
#### 2. Use the template

We’ve created a [Copier](https://copier.readthedocs.io/en/stable/) template to help you get started quickly.

[Check out the template :material-cursor-default-click:](https://gitlab.com/data-as-code/template/src){ .md-button }

#### 3. Use the [`dac`](https://github.com/data-as-code/dac) CLI tool

Our `dac` CLI tool simplifies building Python packages that follow the Data as Code paradigm.

[Explore the `dac` CLI tool :material-cursor-default-click:](https://github.com/data-as-code/dac){ .md-button }

#### Template vs. `dac pack`

Which one should you choose?

|                              | Template                  | `dac pack`            |
| :--------------------------: | :-----------------------: | :-------------------: |
@@ -268,34 +265,28 @@ Which one should you use?
Choosing the right release version plays a crucial role in the Data as Code paradigm. Semantic versioning is used to
communicate the significance of a change:

|           | When to Use |
| :-------: | :----------------------------------------------------------- |
| __Patch__ | Fixes in the data without changing its intended content |
| __Minor__ | Non-breaking changes, like a fresh version of the batch data |
| __Major__ | Breaking changes, such as changes to the Data Contract |

__Patch and Major releases are usually manual, while Minor releases can be automated.__ Use the
[`dac` CLI tool](https://github.com/data-as-code/dac) to automate Minor releases with the `dac next-version` command.

## Why distribute Data as Code?

- **Seamless compatibility**: Data Scientists can ensure their code runs only on compatible data by including the Data
  as Code package as a dependency of their code. For example, if they add `dac-example-energy~=1.0` to their
  dependencies, it will not be possible to install their code together with `dac-example-energy==2.0.0`.
- **Smooth updates**: Data pipelines can receive updates without breaking, as long as they subscribe to a major version.
- **Multiple release streams**: Maintain different versions (e.g., `1.X.Y` and `2.X.Y`) to support users who are still
  on older versions while others migrate.
- **Abstracted complexity**: Data loading, sources, and locations are hidden from consumers, allowing producers to move
  from local files to a SQL database, cloud file storage, or a Kafka topic without impacting users.
- **No hardcoded column names**: *If column names are included in the `Schema`* (e.g. `Schema.column_1`), consumers can
  avoid hardcoding field names, so renames in the data source won’t break their code.
- **Robust testing**: *If the `Schema.example` method is provided*, consumers can write strong unit tests for their
  functions, resulting in a more robust data pipeline.
- **Self-documenting data**: *If data and column descriptions are included in the `Schema`*, the data documents itself
  from a consumer’s perspective.
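The compatible-release pin in the first bullet can be checked mechanically. This sketch assumes the third-party `packaging` library is available:

```python
from packaging.specifiers import SpecifierSet

# "~=1.0" is a compatible-release pin: equivalent to >=1.0, ==1.*
spec = SpecifierSet("~=1.0")

assert "1.0.0" in spec
assert "1.9.3" in spec      # any 1.x release satisfies the pin
assert "2.0.0" not in spec  # a major bump is rejected
```

This is exactly why declaring the DaC package with `~=` lets `pip` accept new Minor/Patch data releases while refusing a breaking Major release.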
