Commit 988d3c4

init commit
0 parents  commit 988d3c4

21 files changed, +5949 -0 lines changed

.dvc/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+/config.local
+/tmp
+/cache

.dvc/config

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+[core]
+    remote = origin
+['remote "origin"']
+    url = https://dagshub.com/khuyentran1401/prefect-dvc.dvc
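As a rough illustration of how the two sections relate, the sketch below reads the same text with Python's standard `configparser` and follows `core.remote` to the matching `remote` section. DVC ships its own config loader, so the use of `configparser` and the helper name `default_remote_url` are assumptions made only for this example.

```python
import configparser

# Contents of .dvc/config as added in this commit.
DVC_CONFIG = """\
[core]
    remote = origin
['remote "origin"']
    url = https://dagshub.com/khuyentran1401/prefect-dvc.dvc
"""


def default_remote_url(config_text: str) -> str:
    """Return the URL of the remote named by core.remote (hypothetical helper)."""
    # Interpolation is disabled so '%' in a URL would not be treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    remote_name = parser["core"]["remote"]
    # Remote sections keep their quotes in the section name: 'remote "origin"'
    section = f"'remote \"{remote_name}\"'"
    return parser[section]["url"]


print(default_remote_url(DVC_CONFIG))
```

With this config, `dvc push` and `dvc pull` run without an explicit remote argument use the `origin` remote.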

.dvcignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Add patterns of files dvc should ignore, which could improve
+# the performance. Learn more at
+# https://dvc.org/doc/user-guide/dvcignore

.gitignore

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+.pytest_cache
+.mypy_cache
+.vscode
+__pycache__
+outputs
+processors
+.DS_Store
+.ipynb_checkpoints
+customer-segmentation-a-AULLE--py3.8
+*-workspace
+.tox
+wandb
+multirun
+*.log
+mlruns
+artifacts
+.venv
+.DS_Store
+venv
+*.ipynb
+/model
+/image

.pre-commit-config.yaml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+repos:
+  - repo: https://github.com/charliermarsh/ruff-pre-commit
+    rev: v0.11.6
+    hooks:
+      - id: ruff
+        args: [--fix]
+  - repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v1.15.0
+    hooks:
+      - id: mypy

README.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+[![View the article](https://img.shields.io/badge/CodeCut-View%20Article-blue)](https://codecut.ai/introduction-to-dvc-data-version-control-tool-for-machine-learning-projects-2/)
+
+# DVC Demo
+
+A demonstration of Data Version Control (DVC) for managing ML pipelines and data versioning.
+
+## What is DVC?
+
+[DVC](https://dvc.org/) is an open-source version control system for machine learning projects. It helps you:
+- Version control large files, data sets, machine learning models, and metrics
+- Track ML experiments
+- Create reproducible ML pipelines
+- Collaborate with team members
+
+## Project Structure
+
+```
+.
+├── data/               # Raw and processed data files
+│   └── raw.dvc         # DVC file for raw data
+├── src/                # Source code for data processing and model training
+├── config/             # Configuration files
+├── .dvc/               # DVC internal files
+├── dvc.yaml            # DVC pipeline definition
+├── dvc.lock            # DVC lock file for reproducible pipelines
+└── .dvcignore          # Files/directories to be ignored by DVC
+```
+
+## Setup
+
+1. Install project dependencies using uv:
+
+```bash
+uv sync
+```
+
+2. Pull the data from remote storage:
+
+```bash
+dvc pull
+```
+
+3. Run the pipeline to reproduce all stages:
+
+```bash
+dvc repro
+```
+
+## Version Control
+
+- Track data files: `dvc add <file>`
+- Push data to remote storage: `dvc push`
+- Pull data from remote storage: `dvc pull`
+- Check status: `dvc status`

config/main.yaml

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+defaults:
+  - process: process_1
+  - _self_
+
+raw_data:
+  path: data/raw/marketing_campaign.csv
+
+intermediate:
+  dir: data/intermediate
+  name: scale_features.csv
+  path: ${intermediate.dir}/${intermediate.name}
+
+final:
+  dir: data/final
+  name: segmented.csv
+  path: ${final.dir}/${final.name}
+
+model:
+  path: model/cluster.pkl
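The `path` entries use `${...}` references, which is OmegaConf-style interpolation (the `defaults` list suggests the file is loaded through Hydra). The standard-library sketch below only mimics how such references expand; it is not the library's implementation, and the names `lookup` and `resolve` are hypothetical.

```python
import re

# The interpolated part of config/main.yaml, written as a plain dict.
CONFIG = {
    "intermediate": {
        "dir": "data/intermediate",
        "name": "scale_features.csv",
        "path": "${intermediate.dir}/${intermediate.name}",
    },
    "final": {
        "dir": "data/final",
        "name": "segmented.csv",
        "path": "${final.dir}/${final.name}",
    },
}

# Matches one ${dotted.key} reference and captures the dotted key.
_REFERENCE = re.compile(r"\$\{([^}]+)\}")


def lookup(config: dict, dotted_key: str) -> str:
    """Follow a dotted key such as 'final.dir' through nested dicts."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(config: dict, dotted_key: str) -> str:
    """Replace every ${a.b} reference in a value with the value it points to."""
    raw = lookup(config, dotted_key)
    return _REFERENCE.sub(lambda m: resolve(config, m.group(1)), raw)


print(resolve(CONFIG, "intermediate.path"))
print(resolve(CONFIG, "final.path"))
```

Keeping `dir` and `name` separate means either part can be changed in one place and `path` follows automatically.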

config/process/process_1.yaml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+name: process_1
+keep_columns:
+  - Income
+  - Recency
+  - NumWebVisitsMonth
+  - AcceptedCmp3
+  - AcceptedCmp4
+  - AcceptedCmp5
+  - AcceptedCmp1
+  - AcceptedCmp2
+  - Complain
+  - Response
+  - age
+  - total_purchases
+  - enrollment_years
+  - family_size
+
+remove_outliers_threshold:
+  age: 84
+  Income: 600000
+
+family_size:
+  Married: 2
+  Together: 2
+  Absurd: 1
+  Widow: 1
+  YOLO: 1
+  Divorced: 1
+  Single: 1
+  Alone: 1
+
config/process/process_2.yaml

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+name: process_2
+keep_columns:
+  - Income
+  - Recency
+  - NumWebVisitsMonth
+  - Complain
+  - age
+  - total_purchases
+  - enrollment_years
+  - family_size
+
+remove_outliers_threshold:
+  age: 90
+  Income: 600000
+
+family_size:
+  Married: 2
+  Together: 2
+  Absurd: 1
+  Widow: 1
+  YOLO: 1
+  Divorced: 1
+  Single: 1
+  Alone: 1
+
config/process/process_3.yaml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+name: process_3
+keep_columns:
+  - Income
+  - Recency
+  - NumWebVisitsMonth
+  - NumDealsPurchases
+  - NumWebPurchases
+  - NumCatalogPurchases
+  - NumStorePurchases
+  - Complain
+  - Response
+  - age
+  - enrollment_years
+  - family_size
+
+remove_outliers_threshold:
+  age: 90
+  Income: 600000
+
+family_size:
+  Married: 2
+  Together: 2
+  Absurd: 1
+  Widow: 1
+  YOLO: 1
+  Divorced: 1
+  Single: 1
+  Alone: 1
+