Commit 988d3c4

init commit
0 parents  commit 988d3c4

21 files changed, +5949 -0 lines changed

.dvc/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+/config.local
+/tmp
+/cache

.dvc/config

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+[core]
+    remote = origin
+['remote "origin"']
+    url = https://dagshub.com/khuyentran1401/prefect-dvc.dvc
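Review note: the file above is INI-style. DVC reads it with its own loader, but as a rough illustration Python's standard `configparser` can pull out the URL of the default remote. Treating the file as plain INI is an assumption of this sketch, and `default_remote_url` is a hypothetical helper, not part of the commit.

```python
import configparser

# Text of .dvc/config as added in this commit.
DVC_CONFIG = """\
[core]
    remote = origin
['remote "origin"']
    url = https://dagshub.com/khuyentran1401/prefect-dvc.dvc
"""


def default_remote_url(config_text: str) -> str:
    """Return the URL of the remote named by core.remote (hypothetical helper)."""
    # Interpolation is disabled so any '%' in a URL is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    remote_name = parser["core"]["remote"]
    # The section header keeps its quotes: 'remote "<name>"'.
    section = f"'remote \"{remote_name}\"'"
    return parser[section]["url"]


if __name__ == "__main__":
    print(default_remote_url(DVC_CONFIG))
```

The lookup goes through `core.remote` first, so the sketch keeps working if the default remote is renamed.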

.dvcignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Add patterns of files dvc should ignore, which could improve
+# the performance. Learn more at
+# https://dvc.org/doc/user-guide/dvcignore

.gitignore

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+.pytest_cache
+.mypy_cache
+.vscode
+__pycache__
+outputs
+processors
+.DS_Store
+.ipynb_checkpoints
+customer-segmentation-a-AULLE--py3.8
+*-workspace
+.tox
+wandb
+multirun
+*.log
+mlruns
+artifacts
+.venv
+.DS_Store
+venv
+*.ipynb
+/model
+/image

.pre-commit-config.yaml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+repos:
+  - repo: https://github.com/charliermarsh/ruff-pre-commit
+    rev: v0.11.6
+    hooks:
+      - id: ruff
+        args: [--fix]
+  - repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v1.15.0
+    hooks:
+      - id: mypy

README.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+[![View the article](https://img.shields.io/badge/CodeCut-View%20Article-blue)](https://codecut.ai/introduction-to-dvc-data-version-control-tool-for-machine-learning-projects-2/)
+
+# DVC Demo
+
+A demonstration of Data Version Control (DVC) for managing ML pipelines and data versioning.
+
+## What is DVC?
+
+[DVC](https://dvc.org/) is an open-source version control system for machine learning projects. It helps you:
+- Version control large files, data sets, machine learning models, and metrics
+- Track ML experiments
+- Create reproducible ML pipelines
+- Collaborate with team members
+
+## Project Structure
+
+```
+.
+├── data/         # Raw and processed data files
+│   └── raw.dvc   # DVC file for raw data
+├── src/          # Source code for data processing and model training
+├── config/       # Configuration files
+├── .dvc/         # DVC internal files
+├── dvc.yaml      # DVC pipeline definition
+├── dvc.lock      # DVC lock file for reproducible pipelines
+└── .dvcignore    # Files/directories to be ignored by DVC
+```
+
+## Setup
+
+1. Install project dependencies (including DVC) using uv:
+
+```bash
+uv sync
+```
+
+2. Pull the data from remote storage:
+
+```bash
+dvc pull
+```
+
+3. Run the pipeline to reproduce all stages:
+
+```bash
+dvc repro
+```
+
+## Version Control
+
+- Track data files: `dvc add <file>`
+- Push data to remote storage: `dvc push`
+- Pull data from remote storage: `dvc pull`
+- Check status: `dvc status`

config/main.yaml

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+defaults:
+  - process: process_1
+  - _self_
+
+raw_data:
+  path: data/raw/marketing_campaign.csv
+
+intermediate:
+  dir: data/intermediate
+  name: scale_features.csv
+  path: ${intermediate.dir}/${intermediate.name}
+
+final:
+  dir: data/final
+  name: segmented.csv
+  path: ${final.dir}/${final.name}
+
+model:
+  path: model/cluster.pkl
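Review note: the `${intermediate.dir}/${intermediate.name}` values look like Hydra/OmegaConf-style interpolation, where a reference is replaced by the value found at that dotted key. The commit does not show how the config is loaded, so the following is only a standard-library sketch of the idea. `lookup` and `resolve` are hypothetical names, and the real resolver supports far more than this.

```python
import re
from typing import Any, Dict

# Subset of config/main.yaml, written as a plain dict for illustration.
CONFIG: Dict[str, Any] = {
    "intermediate": {
        "dir": "data/intermediate",
        "name": "scale_features.csv",
        "path": "${intermediate.dir}/${intermediate.name}",
    },
    "final": {
        "dir": "data/final",
        "name": "segmented.csv",
        "path": "${final.dir}/${final.name}",
    },
}

# Matches one ${dotted.key} reference.
_REFERENCE = re.compile(r"\$\{([^}]+)\}")


def lookup(config: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a dotted key such as 'final.dir' through nested dicts."""
    node: Any = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(config: Dict[str, Any], dotted_key: str) -> str:
    """Expand every ${a.b} reference in the value stored at dotted_key."""
    raw = str(lookup(config, dotted_key))
    # Referenced values are resolved recursively; cycles are not detected here.
    return _REFERENCE.sub(lambda match: resolve(config, match.group(1)), raw)


if __name__ == "__main__":
    print(resolve(CONFIG, "intermediate.path"))
    print(resolve(CONFIG, "final.path"))
```

Keeping `dir` and `name` separate means a stage can declare the directory as its output while the script writes to the composed `path`.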

config/process/process_1.yaml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+name: process_1
+keep_columns:
+  - Income
+  - Recency
+  - NumWebVisitsMonth
+  - AcceptedCmp3
+  - AcceptedCmp4
+  - AcceptedCmp5
+  - AcceptedCmp1
+  - AcceptedCmp2
+  - Complain
+  - Response
+  - age
+  - total_purchases
+  - enrollment_years
+  - family_size
+
+remove_outliers_threshold:
+  age: 84
+  Income: 600000
+
+family_size:
+  Married: 2
+  Together: 2
+  Absurd: 1
+  Widow: 1
+  YOLO: 1
+  Divorced: 1
+  Single: 1
+  Alone: 1
+
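Review note: `src/process_data.py` is not rendered in this view, so how these three keys are consumed is not visible. The sketch below shows one plausible reading on plain dictionaries. The `Marital_Status` column name, the "drop rows at or above the threshold" rule, and the sample rows are all assumptions made for illustration, not the project's code or data.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Values copied from config/process/process_1.yaml (keep_columns shortened).
KEEP_COLUMNS = ["Income", "age", "family_size"]
OUTLIER_THRESHOLDS = {"age": 84, "Income": 600000}
FAMILY_SIZE = {
    "Married": 2, "Together": 2, "Absurd": 1, "Widow": 1,
    "YOLO": 1, "Divorced": 1, "Single": 1, "Alone": 1,
}


def process(rows: List[Row]) -> List[Row]:
    """Derive family_size, drop outliers, and keep only the configured columns."""
    processed = []
    for row in rows:
        enriched = dict(row)
        # Assumed: family_size is looked up from a Marital_Status column.
        enriched["family_size"] = FAMILY_SIZE[row["Marital_Status"]]
        # Assumed rule: a row is an outlier if any value reaches its threshold.
        if any(enriched[col] >= limit for col, limit in OUTLIER_THRESHOLDS.items()):
            continue
        processed.append({col: enriched[col] for col in KEEP_COLUMNS})
    return processed


if __name__ == "__main__":
    # Made-up rows, not taken from the dataset.
    sample = [
        {"Income": 58000, "age": 57, "Marital_Status": "Single"},
        {"Income": 650000, "age": 40, "Marital_Status": "Together"},
    ]
    print(process(sample))
```

Because the thresholds and the mapping live in the config file, a change to either one shows up as a changed parameter for the `process_data` stage.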

config/process/process_2.yaml

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+name: process_2
+keep_columns:
+  - Income
+  - Recency
+  - NumWebVisitsMonth
+  - Complain
+  - age
+  - total_purchases
+  - enrollment_years
+  - family_size
+
+remove_outliers_threshold:
+  age: 90
+  Income: 600000
+
+family_size:
+  Married: 2
+  Together: 2
+  Absurd: 1
+  Widow: 1
+  YOLO: 1
+  Divorced: 1
+  Single: 1
+  Alone: 1
+

config/process/process_3.yaml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+name: process_3
+keep_columns:
+  - Income
+  - Recency
+  - NumWebVisitsMonth
+  - NumDealsPurchases
+  - NumWebPurchases
+  - NumCatalogPurchases
+  - NumStorePurchases
+  - Complain
+  - Response
+  - age
+  - enrollment_years
+  - family_size
+
+remove_outliers_threshold:
+  age: 90
+  Income: 600000
+
+family_size:
+  Married: 2
+  Together: 2
+  Absurd: 1
+  Widow: 1
+  YOLO: 1
+  Divorced: 1
+  Single: 1
+  Alone: 1
+
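Review note: the three `process_*` variants differ mainly in `keep_columns` (and in the `age` threshold: 84 for `process_1`, 90 for the other two). A small set comparison makes the differences easier to see. The column names are copied from the three files; the variable names are only for this sketch.

```python
# keep_columns from config/process/process_1.yaml, process_2.yaml, process_3.yaml.
PROCESS_1 = {
    "Income", "Recency", "NumWebVisitsMonth", "AcceptedCmp3", "AcceptedCmp4",
    "AcceptedCmp5", "AcceptedCmp1", "AcceptedCmp2", "Complain", "Response",
    "age", "total_purchases", "enrollment_years", "family_size",
}
PROCESS_2 = {
    "Income", "Recency", "NumWebVisitsMonth", "Complain", "age",
    "total_purchases", "enrollment_years", "family_size",
}
PROCESS_3 = {
    "Income", "Recency", "NumWebVisitsMonth", "NumDealsPurchases",
    "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases", "Complain",
    "Response", "age", "enrollment_years", "family_size",
}

# Columns every variant keeps.
COMMON = PROCESS_1 & PROCESS_2 & PROCESS_3
# Columns that only process_3 adds relative to process_1.
ONLY_IN_3 = sorted(PROCESS_3 - PROCESS_1)
# Columns process_1 keeps but process_2 drops.
DROPPED_IN_2 = sorted(PROCESS_1 - PROCESS_2)

if __name__ == "__main__":
    print("common:", sorted(COMMON))
    print("only in process_3:", ONLY_IN_3)
    print("dropped in process_2:", DROPPED_IN_2)
```

In short, `process_2` is a strict subset of `process_1`, while `process_3` replaces `total_purchases` with the four per-channel purchase counts.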

data/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+/raw
+/intermediate
+/final

data/raw.dvc

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+outs:
+- md5: 10c3f643286f509fa7f6b4675d9efbad.dir
+  size: 222379
+  nfiles: 1
+  path: raw
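Review note: the `.dvc` file stores a content hash instead of the data itself. For a single file this is an MD5 digest of the file's bytes; the `.dir` suffix here marks a directory entry, whose hash covers a listing of the files inside rather than one file. The sketch below only covers the single-file case with `hashlib`. It is an illustration, not DVC's implementation, and some DVC versions may normalize text files before hashing, so a digest computed this way may not always match what DVC records.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python file_md5.py <path-to-file>
    print(file_md5(Path(sys.argv[1])))
```

Reading in chunks matters for the kind of large data files DVC is meant to track, since the whole file never has to fit in memory.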

dvc.lock

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
+schema: '2.0'
+stages:
+  process_data:
+    cmd: python src/process_data.py
+    deps:
+    - path: data/raw
+      md5: 10c3f643286f509fa7f6b4675d9efbad.dir
+      size: 222379
+      nfiles: 1
+    - path: src/process_data.py
+      hash: md5
+      md5: 5920b8b9838a6fdb8afdda6c82e35986
+      size: 2654
+    params:
+      config/process/process_1.yaml:
+        family_size:
+          Married: 2
+          Together: 2
+          Absurd: 1
+          Widow: 1
+          YOLO: 1
+          Divorced: 1
+          Single: 1
+          Alone: 1
+        keep_columns:
+        - Income
+        - Recency
+        - NumWebVisitsMonth
+        - AcceptedCmp3
+        - AcceptedCmp4
+        - AcceptedCmp5
+        - AcceptedCmp1
+        - AcceptedCmp2
+        - Complain
+        - Response
+        - age
+        - total_purchases
+        - enrollment_years
+        - family_size
+        name: process_1
+        remove_outliers_threshold:
+          age: 84
+          Income: 600000
+    outs:
+    - path: data/intermediate
+      hash: md5
+      md5: 69c6a4e21a7e575450a4ce26f70f394f.dir
+      size: 624234
+      nfiles: 1
+  train:
+    cmd: python src/segment.py
+    deps:
+    - path: data/intermediate
+      hash: md5
+      md5: 69c6a4e21a7e575450a4ce26f70f394f.dir
+      size: 624234
+      nfiles: 1
+    - path: src/segment.py
+      hash: md5
+      md5: b0f72dee173f4a36c4e9849fa3b0545c
+      size: 2245
+    params:
+      config/main.yaml:
+        defaults:
+        - process: process_1
+        - _self_
+        final:
+          dir: data/final
+          name: segmented.csv
+          path: ${final.dir}/${final.name}
+        intermediate:
+          dir: data/intermediate
+          name: scale_features.csv
+          path: ${intermediate.dir}/${intermediate.name}
+        model:
+          path: model/cluster.pkl
+        raw_data:
+          path: data/raw/marketing_campaign.csv
+    outs:
+    - path: data/final
+      hash: md5
+      md5: fcdc1dd0b9a2a1877736c356b9602f6a.dir
+      size: 610251
+      nfiles: 1
+    - path: model/cluster.pkl
+      hash: md5
+      md5: 8fd544c7627269bc5cbee2243e6cee58
+      size: 9701
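Review note: the lock file records the hash of every dependency, parameter and output at the time a stage last ran, and `dvc repro` reruns a stage when the current state no longer matches. The comparison can be sketched as follows for the `train` stage's dependencies. `changed_deps` is a hypothetical helper, and the real check also covers params and outputs.

```python
from typing import Dict, List

# Hashes recorded for the train stage's dependencies in dvc.lock.
RECORDED: Dict[str, str] = {
    "data/intermediate": "69c6a4e21a7e575450a4ce26f70f394f.dir",
    "src/segment.py": "b0f72dee173f4a36c4e9849fa3b0545c",
}


def changed_deps(recorded: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return dependency paths whose current hash differs from the recorded one.

    A path missing from `current` counts as changed.
    """
    return sorted(
        path for path, md5 in recorded.items() if current.get(path) != md5
    )


if __name__ == "__main__":
    # Pretend src/segment.py was edited; this placeholder hash is made up.
    current = dict(RECORDED, **{"src/segment.py": "0" * 32})
    print(changed_deps(RECORDED, current))
```

An empty result corresponds to a stage that is up to date; any non-empty result means the stage, and every stage downstream of it, would need to rerun.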

dvc.yaml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+stages:
+  process_data:
+    cmd: python src/process_data.py
+    params:
+      - config/process/process_1.yaml:
+    deps:
+      - data/raw
+      - src/process_data.py
+    outs:
+      - data/intermediate
+  train:
+    cmd: python src/segment.py
+    params:
+      - config/main.yaml:
+    deps:
+      - data/intermediate
+      - src/segment.py
+    outs:
+      - data/final
+      - model/cluster.pkl
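Review note: the two stages form a small dependency graph, because `data/intermediate` is an output of `process_data` and a dependency of `train`. The sketch below derives the execution order from that overlap using `graphlib` from the standard library (Python 3.9+, whereas the project declares `>=3.8`). It matches paths exactly, which is a simplification, and `execution_order` is a hypothetical helper rather than anything DVC exposes.

```python
from graphlib import TopologicalSorter  # available from Python 3.9
from typing import Dict, List

# deps and outs copied from dvc.yaml.
STAGES: Dict[str, Dict[str, List[str]]] = {
    "process_data": {
        "deps": ["data/raw", "src/process_data.py"],
        "outs": ["data/intermediate"],
    },
    "train": {
        "deps": ["data/intermediate", "src/segment.py"],
        "outs": ["data/final", "model/cluster.pkl"],
    },
}


def execution_order(stages: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Order stages so each runs after the stages that produce its deps."""
    # Map every output path to the stage that produces it.
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    # For each stage, collect the stages it depends on (its predecessors).
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STAGES))
```

Dependencies that no stage produces, such as `data/raw` and the source files, are treated as external inputs and do not add edges to the graph.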

pyproject.toml

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+[project]
+name = "customer_segmentation"
+version = "0.1.0"
+description = ""
+authors = [{ name = "Khuyen" }]
+requires-python = ">=3.8"
+dependencies = [
+    "dvc",
+    "pandas>=2.0.3",
+    "scikit-learn>=1.3.2",
+    "yellowbrick>=1.5",
+]
+
+[dependency-groups]
+dev = [
+    "pre-commit>=3.5.0",
+    "pytest>=8.3.5",
+]
+
+[tool.ruff]
+# Exclude a variety of commonly ignored directories.
+exclude = [
+    ".bzr",
+    ".direnv",
+    ".eggs",
+    ".git",
+    ".git-rewrite",
+    ".hg",
+    ".mypy_cache",
+    ".nox",
+    ".pants.d",
+    ".pytype",
+    ".ruff_cache",
+    ".svn",
+    ".tox",
+    ".venv",
+    "__pypackages__",
+    "_build",
+    "buck-out",
+    "build",
+    "dist",
+    "node_modules",
+    "venv",
+]
+
+# Same as Black.
+line-length = 88
+
+[tool.ruff.lint]
+ignore = ["E501"]
+select = ["B", "C", "E", "F", "W", "B9", "I", "Q"]
+
+[tool.ruff.format]
+quote-style = "double"
+indent-style = "tab"
+skip-magic-trailing-comma = false
+
+[tool.ruff.lint.mccabe]
+max-complexity = 10
+
+[tool.mypy]
+ignore_missing_imports = true

src/__init__.py

Whitespace-only changes.
