Commit 53d45d5

initial setup - copy from dac repo

Francesco Calcavecchia committed
1 parent 30d514b commit 53d45d5

11 files changed: +232 -0 lines changed

.github/workflows/actions.yml

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+name: dac actions
+
+on:
+  push:
+    branches:
+      - '*'
+    tags:
+      - '*'
+
+jobs:
+
+  check-style:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout 🔖
+        uses: actions/checkout@v3
+        with:
+          fetch-depth: 1
+      - name: Setup python 🐍
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.12'
+      - name: Setup cache 💾
+        uses: actions/cache@v3
+        with:
+          path: ~/.cache/pre-commit
+          key: pre-commit
+      - name: Prepare pre-commit 🙆‍♂️👗
+        run: |
+          python -m venv venv || . venv/bin/activate
+          pip install -U pip wheel setuptools pre-commit
+          pre-commit install
+      - name: Run pre-commit 👗🚀
+        run: |
+          pre-commit run --all-files
+
+  docs:
+    needs: [check-style]
+    if: ${{ github.ref == 'refs/heads/main' }}
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout 🔖
+        uses: actions/checkout@v3
+      - name: Deploy docs
+        uses: mhausenblas/mkdocs-deploy-gh-pages@master
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          CONFIG_FILE: mkdocs.yml
+          REQUIREMENTS: requirements-docs.txt

.pre-commit-config.yaml

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+default_language_version:
+  python: python3
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.4.0
+    hooks:
+      - id: check-added-large-files
+      - id: check-ast
+        exclude: test/data/schema/wrong_syntax.py
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+      - id: check-json
+      - id: check-toml
+      - id: check-yaml
+        exclude: mkdocs.yml
+  - repo: https://github.com/psf/black
+    rev: 23.1.0
+    hooks:
+      - id: black
+        exclude: test/data/schema/wrong_syntax.py
+  - repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v1.0.0
+    hooks:
+      - id: mypy
+        exclude: test/data/schema/wrong_syntax.py
+  - repo: https://github.com/dosisod/refurb
+    rev: v1.11.1
+    hooks:
+      - id: refurb
+        exclude: test/data/schema/wrong_syntax.py
+  - repo: https://github.com/charliermarsh/ruff-pre-commit
+    rev: 'v0.0.247'
+    hooks:
+      - id: ruff
+        args: [--fix, --exit-non-zero-on-fix]
+  - repo: https://github.com/tcort/markdown-link-check
+    rev: 'v3.11.2'
+    hooks:
+      - id: markdown-link-check
+        args: [-q]

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 data-as-code
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 # docs
+
 Documentation for the Data as Code

docs/.DS_Store

6 KB
Binary file not shown.

docs/examples.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
# Examples
2+
3+
In [Energy DaC](https://gitlab.com/data-as-code/energy-dac-example) you can pip install some energy-related data as code.
4+
The Readme will guide you through a demo.
5+
You can also inspect the repo to see how the DaC package was built using `dac`.

docs/img/logo.jpg

201 KB

docs/img/motto.png

44.7 KB

docs/index.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+# `dac`: Data as Code
+
+<div align="center">
+<img src="img/motto.png" alt="drawing" width="450"/>
+</div>
+
+Data-as-Code (DaC) `dac` is a tool that supports the distribution of data as (python) code.
+
+<div align="center">
+<img src="img/logo.jpg" alt="drawing" width="250"/>
+</div>
+
+## How will the Data Scientists use a DaC package?
+
+Say that the Data Engineers prepared the `demo-data` as code for you. Then you will install the code in your environment
+```sh
+python -m pip install demo-data
+```
+and then you will be able to access the data simply with
+```python
+from demo_data import load
+
+data = load()
+```
+
+Data can be in any format. There is no constraint of any kind.
+
+Not only will accessing the data be this easy but, depending on how the data were prepared, you may also have access to useful metadata. How?
+```python
+from demo_data import Schema
+```
+
+With the schema you could, for example:
+
+* access the column names (e.g. `Schema.my_column`)
+* unit test your functions by getting a data example with `Schema.example()`
+
+## How can a Data Engineer provide a DaC python package?
+
+Install this library
+```sh
+python -m pip install dac
+```
+and use the command `dac pack` (run `dac pack --help` for detailed instructions).
+
+At a high level, the most important elements you must provide are:
+
+* python code to load the data
+* a `Schema` class that at the very least contains a `validate` method, but possibly also
+
+    - data field names (column names, if data is tabular)
+    - an `example` method
+
+* python dependencies
+
+!!! hint "Use `pandera` to define the Schema"
+
+    If the data type you are using is supported by [`pandera`](https://pandera.readthedocs.io/en/stable/index.html), consider using a [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) to define the Schema.
+
+
+## What are the advantages of distributing data in this way?
+
+* The code needed to load the data, the data source, and its location are abstracted away from the user.
+    This means that the data engineer can start from local files and later move to a SQL database, cloud file storage, or a Kafka topic, without the user noticing or needing to adapt their code.
+
+* *If you provide data field names in `Schema`* (e.g. `Schema.column_1`), the user code will not contain hard-coded column names, and changes in data source field names won't impact the user.
+
+* *If you provide the `Schema.example` method*, users will be able to build robust code by writing unit tests for their functions effortlessly.
+
+* Semantic versioning can be used to communicate significant changes:
+
+    * a patch update corresponds to a fix in the data: its intended content is unchanged
+    * a minor update corresponds to a change in the data that does not break the schema
+    * a major update corresponds to a change in the schema, or any other breaking change
+
+    In this way, data pipelines can subscribe to the appropriate updates. Furthermore, it will be easy to keep releasing data updates while maintaining backward compatibility (one can keep deploying `1.X.Y` updates even after version `2` has been rolled out).
+
+* Descriptions of the data and columns can be included in the schema, and will therefore reach the user together with the data.
+
+* Users will always know where to look for data: the PyPI index.
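
The elements that `docs/index.md` asks a Data Engineer to provide (a loader, a `Schema` with `validate`, field names, and an `example` method) could look roughly like the sketch below. It is only an illustration under assumptions: the field names, the placeholder rows, and the bodies of `load`, `validate`, and `example` are hypothetical, and plain Python lists stand in for whatever data type a real package would use (the hint above suggests `pandera` for data frames).

```python
"""Hypothetical sketch of what a `demo_data` DaC package might expose."""
from typing import Any

Row = dict[str, Any]


class Schema:
    # Field names exposed as attributes, so user code never hard-codes them.
    # Both names are invented for this sketch.
    country = "country"
    consumption_gwh = "consumption_gwh"

    @classmethod
    def validate(cls, rows: list[Row]) -> list[Row]:
        """Raise ValueError if a row breaks the schema, else return the rows."""
        for row in rows:
            if set(row) != {cls.country, cls.consumption_gwh}:
                raise ValueError(f"unexpected fields: {sorted(row)}")
            if not isinstance(row[cls.country], str):
                raise ValueError("country must be a string")
            if row[cls.consumption_gwh] < 0:
                raise ValueError("consumption_gwh must be non-negative")
        return rows

    @classmethod
    def example(cls) -> list[Row]:
        """Small synthetic dataset that satisfies the schema, for unit tests."""
        return [
            {cls.country: "IT", cls.consumption_gwh: 1.5},
            {cls.country: "DE", cls.consumption_gwh: 2.0},
        ]


def load() -> list[Row]:
    """Load the data; the source (file, SQL, cloud) stays hidden from users."""
    # Placeholder row standing in for a real data source.
    rows = [{Schema.country: "IT", Schema.consumption_gwh: 1.0}]
    return Schema.validate(rows)


def total_consumption(rows: list[Row]) -> float:
    """User-side function written against Schema attributes only."""
    return sum(row[Schema.consumption_gwh] for row in rows)
```

A user can then unit test `total_consumption` against `Schema.example()` without touching the real data source, which is the workflow the advantages list describes.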

mkdocs.yml

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+site_name: Data as Code
+
+repo_url: https://github.com/data-as-code/dac
+edit_uri: tree/main/doc
+
+nav:
+  - Home: index.md
+  - Examples: examples.md
+
+theme:
+  name: material
+  features:
+    - navigation.instant
+    - navigation.tracking
+    - navigation.sections
+    - navigation.expand
+    - content.code.annotate
+
+plugins:
+  - search
+
+markdown_extensions:
+  - tables
+  - attr_list
+  - admonition
+  - pymdownx.details
+  - pymdownx.superfences
+  - pymdownx.emoji:
+      emoji_index: !!python/name:materialx.emoji.twemoji
+      emoji_generator: !!python/name:materialx.emoji.to_svg
+  - pymdownx.superfences:
+      custom_fences:
+        - name: mermaid
+          class: mermaid
+          format: !!python/name:pymdownx.superfences.fence_code_format

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+mkdocs-material~=8.5
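
The `~=8.5` specifier is a compatible-release pin: it accepts any `8.x` release from `8.5` upward and rejects `9.0`. The same rule is what lets a data pipeline keep receiving `1.X.Y` updates of a DaC package, as the semantic versioning notes in `docs/index.md` describe. A minimal sketch of the rule for purely numeric dotted versions follows; `parse` and `is_compatible` are hypothetical helpers, not part of `pip` or `dac`, and they ignore pre-releases and other suffixes that real installers handle.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a numeric dotted version such as '8.5.3' into (8, 5, 3)."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(spec: str, candidate: str) -> bool:
    """Return True if `candidate` satisfies `~=spec`.

    `~=8.5` means `>=8.5` and `==8.*`; `~=1.4.5` means `>=1.4.5` and `==1.4.*`.
    """
    base = parse(spec)
    if len(base) < 2:
        raise ValueError("~= needs at least two version components")
    cand = parse(candidate)
    # Pad the candidate with zeros so '8' compares like '8.0'.
    padded = cand + (0,) * (len(base) - len(cand))
    prefix = base[:-1]  # everything except the last component must match
    return padded[: len(prefix)] == prefix and padded >= base
```

For example, `is_compatible("8.5", "8.5.3")` is true while `is_compatible("8.5", "9.0")` is false.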
