Welcome to the "Engineering and MLOps practices for Modern AI" course starter project! This repository provides a simple, end-to-end Machine Learning pipeline using the well-known Iris dataset. It's designed to be a practical starting point for you to explore and apply fundamental MLOps patterns and modern development tools.
This project emphasizes solid Software Engineering practices as the foundation for effective MLOps, including:
- ✨ Automated Code Quality: Linting (Ruff), formatting (Ruff), and static type checking (Mypy).
- 🧪 Robust Testing: Unit tests with Pytest and code coverage reporting.
- 🛡️ Pre-commit Hooks: Automating code quality checks before every commit.
- 🔄 Continuous Integration (CI): Automated validation of your code with GitHub Actions.
- ⚙️ Build Automation: Using `make` to streamline common development tasks.
- 📦 Dependency Management: Consistent and reproducible environments managed by `uv`.
- 🏗️ Code Modularity: An organized structure for your source code.
- ✍️ Readability & Maintainability: Enhanced through type annotations and docstrings.
The core objective is to train a classifier on the Iris dataset. However, the real learning comes from understanding the MLOps practices applied: versioning data and models (conceptually, for now), tracking experiments, automating the pipeline, and ensuring code quality and reproducibility throughout the development lifecycle.
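Conceptually, the four scripts in `src/` add up to something like the following condensed sketch. This is a simplification under assumed choices (scikit-learn, joblib, a logistic regression model); the real scripts split these steps across files and may differ in model, parameters, and paths:

```python
# Condensed, illustrative pipeline; the actual src/ scripts split these steps up.
import json

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load: fetch the Iris features and labels.
X, y = load_iris(return_X_y=True)

# Split: hold out a test set for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train: fit a simple classifier and persist it (assumes data/ exists, as in this repo).
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
joblib.dump(model, "data/model.joblib")

# Evaluate: score on the held-out set and write metrics as JSON.
metrics = {"accuracy": float(accuracy_score(y_test, model.predict(X_test)))}
with open("data/eval.json", "w") as f:
    json.dump(metrics, f)
```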
The repository is organized as follows:

```
mlops-get-started-iris/
├── .github/                  # GitHub-specific configurations (e.g., workflows for CI)
├── data/
│   ├── .gitignore            # Specifies data files not to be tracked by Git
│   ├── features_iris.csv     # Processed features (generated by src/load_data.py)
│   ├── train.csv             # Training dataset (generated by src/split_dataset.py)
│   ├── test.csv              # Test dataset (generated by src/split_dataset.py)
│   ├── model.joblib          # Trained model artifact (generated by src/train.py)
│   └── eval.json             # Evaluation metrics (generated by src/evaluate.py)
├── docs/
│   └── DEVELOPMENT.md        # Detailed guide for developers on coding standards and tools
├── models/                   # Intended for trained models (currently a placeholder; see notes below)
├── src/
│   ├── load_data.py          # Script for loading and initial preprocessing of Iris data
│   ├── split_dataset.py      # Script for splitting data into training and testing sets
│   ├── train.py              # Script for training the machine learning model
│   └── evaluate.py           # Script for evaluating the trained model's performance
├── tests/
│   ├── __init__.py           # Makes 'tests' a Python package
│   └── test_load_data.py     # Example unit tests for the data loading script
├── .gitignore                # Global Git ignore patterns for the project
├── .pre-commit-config.yaml   # Configuration for pre-commit hooks (Ruff, Mypy, etc.)
├── .python-version           # Specifies the preferred Python version (e.g., for pyenv or uv)
├── .secrets.baseline         # Baseline file for detect-secrets (prevents committing secrets)
├── Makefile                  # Defines useful development commands (e.g., make lint, make test)
├── pyproject.toml            # Project configuration, dependencies, and tool settings (PEP 621)
└── uv.lock                   # Lock file for reproducible Python dependencies (generated by uv)
```
Key Structure Notes:

- `data/` Directory: In this initial setup, `data/` holds both input (implicitly, as `load_data.py` likely fetches it) and output (processed data, model, metrics) files. In more advanced MLOps scenarios, you'd typically separate raw data, processed data, and model artifacts, often versioning them with tools like DVC.
- `models/` Directory: While present, it's not explicitly used by the scripts in `src/`, which save `model.joblib` to `data/`. This directory is a placeholder for when you start versioning models more formally (e.g., with DVC or the MLflow Model Registry).
- `pyproject.toml`: This is your central hub for defining project dependencies (managed by `uv`) and configuring tools like Ruff, Mypy, and Pytest (a sketch of what such a file can look like follows this list).
- `Makefile`: Your friend for running common tasks like linting, testing, and executing the pipeline with simple commands.
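As an illustration only (not this repo's actual file), a `pyproject.toml` for a setup like this might combine dependencies and tool settings along these lines; names, versions, and options here are assumptions:

```toml
# Illustrative sketch; see the repository's pyproject.toml for the real settings.
[project]
name = "mlops-get-started-iris"
requires-python = ">=3.12"
dependencies = ["scikit-learn", "pandas", "joblib"]

[dependency-groups]
dev = ["ruff", "mypy", "pytest", "pytest-cov", "pre-commit"]

[tool.ruff]
line-length = 88

[tool.pytest.ini_options]
addopts = "--cov=src --cov-report=html"
```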
Ensure you have the following installed on your system:
- Python 3.12+: We recommend using the version specified in the `.python-version` file.
- `uv`: A fast Python package installer and project manager. See the uv installation guide.
- Git: For version control.
- Make: Optional, but highly recommended for convenience.
  - macOS/Linux: Usually pre-installed.
  - Windows: Consider installing via Chocolatey (`choco install make`) or using Windows Subsystem for Linux (WSL).
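A quick way to confirm the prerequisites are on your `PATH` (on some systems the interpreter is `python3` rather than `python`):

```bash
python --version   # should report 3.12 or newer
uv --version
git --version
make --version     # optional, if you installed Make
```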
1. Clone the Repository:

   ```bash
   git clone <YOUR_REPOSITORY_URL>
   cd mlops-get-started-iris
   ```

2. Set Up the Python Environment:

   ```bash
   # Create a virtual environment (e.g., named .venv) using the project's Python version
   uv venv .venv --python 3.12

   # Activate the virtual environment:
   # On macOS and Linux:
   source .venv/bin/activate
   # On Windows (PowerShell):
   # .\.venv\Scripts\Activate.ps1
   # On Windows (Command Prompt):
   # .\.venv\Scripts\activate.bat
   ```

3. Install Dependencies: We'll use `uv` to install the project dependencies into the environment.

   ```bash
   # Install project dependencies, including development tools:
   uv sync --dev

   # Initialize pre-commit hooks:
   uv run pre-commit install
   ```

   Alternatively, the `Makefile` provides a shortcut:

   ```bash
   make install  # This target in the Makefile should execute the uv commands above
   ```
You're now ready to start!
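As a quick sanity check that dependencies landed in the environment (assuming scikit-learn is among them, as the training scripts imply):

```bash
uv run python -c "import sklearn; print(sklearn.__version__)"
```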
The pipeline consists of data loading, splitting, model training, and evaluation.

Option 1: Using `make` (Recommended for Simplicity)

The `Makefile` includes a target to run the entire pipeline:

```bash
make run-pipeline
```

This will execute the Python scripts in `src/` in the correct order. Check the `Makefile` to see the exact commands if you're curious!
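If you'd like a mental model before opening the file, here is a sketch of what such a target might look like; the actual target names and commands in your `Makefile` may differ:

```makefile
# Hypothetical sketch; consult the repository's Makefile for the real target.
# (Recipe lines in a real Makefile must be indented with a tab.)
.PHONY: run-pipeline
run-pipeline:
	uv run python src/load_data.py
	uv run python src/split_dataset.py
	uv run python src/train.py
	uv run python src/evaluate.py
```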
Option 2: Manual Execution of Python Scripts

If you want to run each step individually to understand the flow:

```bash
# Ensure your virtual environment is activated:
source .venv/bin/activate

python src/load_data.py
python src/split_dataset.py --test_size 0.2  # Example of passing an argument
python src/train.py
python src/evaluate.py
```

After running, you should find `features_iris.csv`, `train.csv`, `test.csv`, `model.joblib`, and `eval.json` inside the `data/` directory.
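To sanity-check the artifacts, you can load them back from a Python shell. A minimal sketch, assuming the model is a scikit-learn estimator saved with joblib and trained on the four raw Iris measurements (adapt if `features_iris.csv` applies extra preprocessing):

```python
import json

import joblib

# Load the trained model artifact (assumes src/train.py saved it with joblib).
model = joblib.load("data/model.joblib")

# Inspect the evaluation metrics written by src/evaluate.py.
with open("data/eval.json") as f:
    print(json.load(f))

# Predict on a single Iris-like sample: sepal length/width, petal length/width (cm).
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))
```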
This project is set up with tools to help you write high-quality code efficiently.

- Makefile: Your primary interface for common tasks. Run `make help` to see all available commands. This will list targets like `lint`, `format-check`, `test`, `clean`, etc.
- Code Formatting (Ruff Formatter):
  - Check if files need formatting: `make format-check` (uses `uv run ruff format --check .`)
  - Automatically format files: `make format` (uses `uv run ruff format .`)
- Linting (Ruff Linter):
  - Check for style and logical errors: `make lint` (uses `uv run ruff check .`)
  - Attempt to auto-fix lint issues: `make lint-fix` (uses `uv run ruff check --fix .`), if this target exists in your Makefile
- Type Checking (Mypy):
  - Verify type annotations: `make mypy` (uses `uv run mypy src tests`)
- Testing (Pytest):
  - Run all tests: `make test` (uses `uv run pytest`)
  - This command also generates a code coverage report, viewable by opening `htmlcov/index.html` in your browser. (A small example of the typed, tested code these checks exercise follows this list.)
- Pre-commit Hooks: These run automatically on `git commit`. If they find issues (e.g., formatting errors, lint problems), the commit is stopped and you'll see messages about what to fix. After fixing (or if the hooks fix things automatically), `git add` the changed files and commit again.
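To make the type-checking and testing story concrete, here is a small hypothetical example of the style these tools encourage: a typed, documented function plus a Pytest test. The module and file names are illustrative, not part of this repo:

```python
# src/metrics.py (hypothetical module, shown for illustration)
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

```python
# tests/test_metrics.py (hypothetical test, runnable with `uv run pytest`)
import pytest

from src.metrics import accuracy


def test_accuracy_perfect_match() -> None:
    assert accuracy([0, 1, 2], [0, 1, 2]) == 1.0


def test_accuracy_rejects_mismatched_lengths() -> None:
    with pytest.raises(ValueError):
        accuracy([0, 1], [0])
```

Mypy verifies the annotations (e.g., it would flag passing a `list[str]` to `accuracy`), while Pytest exercises both the happy path and the error case.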
For a deeper dive into the development workflow, tool configurations, and contribution guidelines, please consult the `docs/DEVELOPMENT.md` file.
Happy MLOps journey! We hope this starter project helps you learn and experiment. 🚀