
Commit 6026517

Merge pull request #22 from khuyentran1401/add-contribution-guide

Add contribution guide

2 parents 4fbc862 + 5cf3f5e

File tree

8 files changed: +442 -476 lines changed

.github/workflows/publish-marimo.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -33,6 +33,7 @@ jobs:
           uv run marimo export html llm/pydantic_ai_examples.py -o build/llm/pydantic_ai_examples.html --sandbox
           uv run marimo export html data_science_tools/pandas_api_on_spark.py -o build/data_science_tools/pandas_api_on_spark.html --sandbox
           uv run marimo export html data_science_tools/pyspark_parametrize.py -o build/data_science_tools/pyspark_parametrize.html --sandbox
+          uv run marimo export html data_science_tools/narwhals.py -o build/data_science_tools/narwhals.html --sandbox
       - name: Upload Pages Artifact
         uses: actions/upload-pages-artifact@v3
         with:
```
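The new step can be sanity-checked before pushing by running the same export locally. This assumes uv is installed and reuses the command exactly as it appears in the workflow:

```bash
# Export the new notebook to static HTML, as the CI step does
uv run marimo export html data_science_tools/narwhals.py -o build/data_science_tools/narwhals.html --sandbox
```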

.pre-commit-config.yaml

Lines changed: 9 additions & 13 deletions
```diff
@@ -1,14 +1,10 @@
 repos:
-  - repo: https://github.com/ambv/black
-    rev: 20.8b1
-    hooks:
-      - id: black
-        additional_dependencies: ['click==8.0.4']
-  - repo: https://github.com/pycqa/flake8
-    rev: 3.8.4
-    hooks:
-      - id: flake8
-  - repo: https://github.com/timothycrosley/isort
-    rev: 5.12.0
-    hooks:
-      - id: isort
+  - repo: https://github.com/charliermarsh/ruff-pre-commit
+    rev: v0.11.6
+    hooks:
+      - id: ruff
+        args: [--fix]
+  - repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v1.15.0
+    hooks:
+      - id: mypy
```
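After switching from black/flake8/isort to ruff and mypy, it is worth exercising the new hooks once across the whole repo. A minimal local check, assuming pre-commit is available in the project environment (it is added to pyproject.toml below):

```bash
# Install the git hook scripts into .git/hooks
uv run pre-commit install

# Run every configured hook against all files once
uv run pre-commit run --all-files
```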

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -9,7 +9,7 @@ Collection of useful data science topics along with articles and videos.
 ## The Data Scientist's Toolkit: 100+ Essential Tools for Modern Analytics
 
 To receive a condensed overview of these tools and additional resources, sign up for [CodeCut's free PDF guide](https://codecut.ai/data-scientist-toolkit/?utm_source=github&utm_medium=data_science_repo&utm_campaign=free_pdf). This comprehensive 264-page document covers over 100 essential data science tools, providing you with a valuable reference for your work.
-
+
 ## How to Download the Code in This Repository to Your Local Machine
 
 To download the code in this repo, you can simply use git clone
```

contribution.md

Lines changed: 65 additions & 0 deletions
````markdown
# Contribution Guidelines

## Environment Setup

### Install uv

[uv](https://github.com/astral-sh/uv) is a fast Python package installer and resolver.

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify installation
uv --version
```

### Install Dependencies

```bash
# Install dependencies from pyproject.toml
uv sync
```

### Install Pre-commit Hooks

We use pre-commit to ensure code quality and consistency.

```bash
# Install pre-commit hooks
uv run pre-commit install
```

## Working with Marimo Notebooks

### Creating a New Notebook

Create a new notebook using marimo:

```bash
uv run marimo edit notebook.py --sandbox
```

### Publishing Notebooks

Add the following workflow to `.github/workflows/publish-marimo.yml`:

```yaml
...
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      ...
      - name: Export notebook
        run: |
          uv run marimo export html notebook.py -o build/notebook.html --sandbox
      ...
```

## Pull Request Process

1. Fork the repository
2. Create a new branch for your feature
3. Make your changes
4. Submit a pull request with a clear description of changes
````
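In command form, steps 1 to 3 of the pull request process above might look like the following. This is a sketch only: the fork URL and branch name are placeholders, and the fork itself is created through the GitHub UI or CLI.

```bash
# 1. Fork the repository on GitHub, then clone your fork (placeholder URL)
git clone https://github.com/<your-username>/<this-repo>.git
cd <this-repo>

# 2. Create a new branch for your feature (placeholder branch name)
git checkout -b my-feature

# 3. Make your changes, then commit and push them to your fork
git add .
git commit -m "Describe your change"
git push -u origin my-feature
```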

data_science_tools/narwhals.py

Lines changed: 231 additions & 0 deletions
```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "duckdb==1.2.2",
#     "marimo",
#     "narwhals==1.39.0",
#     "pandas==2.2.3",
#     "polars==1.29.0",
#     "pyarrow==20.0.0",
#     "pyspark==3.5.5",
#     "sqlframe==3.32.1",
# ]
# ///

import marimo

__generated_with = "0.13.6"
app = marimo.App(width="medium")


@app.cell
def _():
    import marimo as mo

    return (mo,)


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    # Dataframe-agnostic data science

    Let's define a dataframe-agnostic function to calculate monthly average prices. It needs to support pandas, Polars, PySpark, DuckDB, PyArrow, Dask, and cuDF, without doing any conversion between libraries.

    ## Bad solution: just convert to pandas

    This kind of works, but:

    - It doesn't return to the user the same class they started with.
    - It kills lazy execution.
    - It kills GPU acceleration.
    - It forces pandas as a required dependency.
    """
    )
    return


@app.function
def monthly_aggregate_bad(user_df):
    if hasattr(user_df, "to_pandas"):
        df = user_df.to_pandas()
    elif hasattr(user_df, "toPandas"):
        df = user_df.toPandas()
    elif hasattr(user_df, "_to_pandas"):
        df = user_df._to_pandas()
    else:
        # Assume the input is already a pandas DataFrame
        df = user_df
    return df.resample("MS", on="date")[["price"]].mean()


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ## Unmaintainable solution: different branches for each library

    This works, but is unfeasibly difficult to test and maintain, especially when also factoring in API changes between different versions of the same library (e.g. pandas `1.*` vs pandas `2.*`).
    """
    )
    return


@app.cell
def _():
    import duckdb
    import pandas as pd
    import polars as pl
    import pyspark
    import pyspark.sql.functions as F

    def monthly_aggregate_unmaintainable(user_df):
        if isinstance(user_df, pd.DataFrame):
            result = user_df.resample("MS", on="date")[["price"]].mean()
        elif isinstance(user_df, pl.DataFrame):
            result = (
                user_df.group_by(pl.col("date").dt.truncate("1mo"))
                .agg(pl.col("price").mean())
                .sort("date")
            )
        elif isinstance(user_df, pyspark.sql.dataframe.DataFrame):
            result = (
                user_df.groupBy(F.date_trunc("month", F.col("date")))
                .agg(F.mean("price"))
                .orderBy("date")
            )
        elif isinstance(user_df, duckdb.DuckDBPyRelation):
            result = user_df.aggregate(
                [
                    duckdb.FunctionExpression(
                        "time_bucket",
                        duckdb.ConstantExpression("1 month"),
                        duckdb.FunctionExpression("date"),
                    ).alias("date"),
                    duckdb.FunctionExpression("mean", "price").alias("price"),
                ],
            ).sort("date")
        # TODO: more branches for PyArrow, Dask, etc... :sob:
        return result

    return duckdb, pd, pl


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ## Best solution: Narwhals as a unified dataframe interface

    - Preserves lazy execution and GPU acceleration.
    - Users get back what they started with.
    - Easy to write and maintain.
    - Strong and complete static typing.
    """
    )
    return


@app.cell
def _():
    import narwhals as nw
    from narwhals.typing import IntoFrameT

    def monthly_aggregate(user_df: IntoFrameT) -> IntoFrameT:
        return (
            nw.from_native(user_df)
            .group_by(nw.col("date").dt.truncate("1mo"))
            .agg(nw.col("price").mean())
            .sort("date")
            .to_native()
        )

    return (monthly_aggregate,)


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""## Demo: let's verify that it works!""")
    return


@app.cell
def _():
    from datetime import datetime

    data = {
        "date": [datetime(2020, 1, 1), datetime(2020, 1, 8), datetime(2020, 2, 3)],
        "price": [1, 4, 3],
    }
    return (data,)


@app.cell
def _(data, monthly_aggregate, pd):
    # pandas
    df_pd = pd.DataFrame(data)
    monthly_aggregate(df_pd)
    return (df_pd,)


@app.cell
def _(data, monthly_aggregate, pl):
    # Polars
    df_pl = pl.DataFrame(data)
    monthly_aggregate(df_pl)
    return


@app.cell
def _(duckdb, monthly_aggregate):
    # DuckDB
    rel = duckdb.sql(
        """
        from values (timestamp '2020-01-01', 1),
                    (timestamp '2020-01-08', 4),
                    (timestamp '2020-02-03', 3)
             df(date, price)
        select *
        """
    )
    monthly_aggregate(rel)
    return


@app.cell
def _(data, monthly_aggregate):
    # PyArrow
    import pyarrow as pa

    tbl = pa.table(data)
    monthly_aggregate(tbl)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ## Bonus - can we generate SQL?

    Narwhals comes with an extra bonus feature: by combining it with [SQLFrame](https://github.com/eakmanrq/sqlframe), we can easily transpile the Polars API to any major SQL dialect. For example, to translate to the Databricks SQL dialect, we can do:
    """
    )
    return


@app.cell
def _(df_pd, monthly_aggregate):
    from sqlframe.duckdb import DuckDBSession

    sqlframe = DuckDBSession()
    sqlframe_df = sqlframe.createDataFrame(df_pd)
    sqlframe_result = monthly_aggregate(sqlframe_df)
    print(sqlframe_result.sql(dialect="databricks"))
    return


@app.cell
def _():
    return


if __name__ == "__main__":
    app.run()
```
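One claim in the notebook, that the Narwhals version preserves lazy execution, is easy to verify outside marimo. A minimal sketch, assuming the same narwhals and polars versions pinned in the notebook header:

```python
from datetime import datetime

import narwhals as nw
import polars as pl
from narwhals.typing import IntoFrameT


def monthly_aggregate(user_df: IntoFrameT) -> IntoFrameT:
    # Same dataframe-agnostic function as in the notebook above
    return (
        nw.from_native(user_df)
        .group_by(nw.col("date").dt.truncate("1mo"))
        .agg(nw.col("price").mean())
        .sort("date")
        .to_native()
    )


data = {
    "date": [datetime(2020, 1, 1), datetime(2020, 1, 8), datetime(2020, 2, 3)],
    "price": [1, 4, 3],
}

# Passing a Polars LazyFrame returns a Polars LazyFrame: nothing runs yet
lazy_result = monthly_aggregate(pl.LazyFrame(data))
print(type(lazy_result))  # <class 'polars.lazyframe.frame.LazyFrame'>

# Computation happens only at collect()
print(lazy_result.collect())
```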

pyproject.toml

Lines changed: 3 additions & 9 deletions
```diff
@@ -5,15 +5,9 @@ description = "Add your description here"
 readme = "README.md"
 requires-python = ">=3.11"
 dependencies = [
-    "loguru>=0.7.3",
-    "marimo==0.13.6",
-    "narwhals==1.36.0",
-    "nbformat>=5.10.4",
-    "pandas>=2.2.3",
-    "pyspark[sql]>=3.5.5",
+    "marimo>=0.13.7",
+    "pre-commit>=4.2.0",
 ]
 
 [dependency-groups]
-dev = [
-    "pytest>=8.3.5",
-]
+dev = []
```
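Locally, a dependency change like this takes effect after re-syncing the environment, using the same command from the contribution guide:

```bash
# Re-resolve and install the trimmed dependency set
uv sync
```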

requirements.txt

Lines changed: 0 additions & 1 deletion
This file was deleted.
