🚀 Engineering and MLOps Practices for Modern AI - Starter Project Iris

Welcome to the "Engineering and MLOps practices for Modern AI" course starter project! This repository provides a simple, end-to-end Machine Learning pipeline using the well-known Iris dataset. It's designed to be a practical starting point for you to explore and apply fundamental MLOps patterns and modern development tools.

This project emphasizes solid Software Engineering practices as the foundation for effective MLOps, including:

  • ✨ Automated Code Quality: Linting (Ruff), formatting (Ruff), and static type checking (MyPy).
  • 🧪 Robust Testing: Unit tests with Pytest and code coverage reporting.
  • 🛡️ Pre-commit Hooks: Automating code quality checks before every commit.
  • 🔄 Continuous Integration (CI): Automated validation of your code with GitHub Actions.
  • ⚙️ Build Automation: Using make to streamline common development tasks.
  • 📦 Dependency Management: Consistent and reproducible environments managed by uv.
  • 🏗️ Code Modularity: An organized structure for your source code.
  • ✍️ Readability & Maintainability: Enhanced through type annotations and docstrings.

🎯 Project Goal

The core objective is to train a classifier on the Iris dataset. However, the real learning comes from understanding the MLOps practices applied: versioning data and models (conceptually, for now), tracking experiments, automating the pipeline, and ensuring code quality and reproducibility throughout the development lifecycle.

📂 Project Structure

mlops-starter-project-iris/
├── .github/                # GitHub specific configurations (e.g., Workflows for CI)
├── data/
│   ├── .gitignore          # Specifies data files not to be tracked by Git
│   ├── features_iris.csv   # Processed features (generated by src/load_data.py)
│   ├── train.csv           # Training dataset (generated by src/split_dataset.py)
│   ├── test.csv            # Test dataset (generated by src/split_dataset.py)
│   └── eval.json           # Evaluation metrics (generated by src/evaluate.py)
├── docs/
│   └── DEVELOPMENT.md      # Detailed guide for developers on coding standards and tools
├── models/                 # Intended for trained models
│   └── model.joblib        # Trained model artifact (generated by src/train.py)
├── src/
│   ├── load_data.py        # Script for loading and initial preprocessing of Iris data
│   ├── split_dataset.py    # Script for splitting data into training and testing sets
│   ├── train.py            # Script for training the machine learning model
│   └── evaluate.py         # Script for evaluating the trained model's performance
├── tests/
│   ├── __init__.py         # Makes 'tests' a Python package
│   └── test_load_data.py   # Example unit tests for the data loading script
├── .gitignore              # Global Git ignore patterns for the project
├── .pre-commit-config.yaml # Configuration for pre-commit hooks (Ruff, Mypy, etc.)
├── .python-version         # Specifies the preferred Python version (e.g., for pyenv or uv)
├── .secrets.baseline       # Baseline file for detect-secrets (prevents committing secrets)
├── Makefile                # Defines useful development commands (e.g., make lint, make test)
├── pyproject.toml          # Project configuration, dependencies, and tool settings (PEP 621)
└── uv.lock                 # Lock file for reproducible Python dependencies (generated by uv)

Key Structure Notes:

  • data/ Directory: In this initial setup, data/ holds both input (implicitly, as load_data.py likely fetches it) and output (processed data, model, metrics) files. In more advanced MLOps scenarios, you'd typically separate raw data, processed data, and model artifacts, often versioning them with tools like DVC.
  • models/ Directory: Although present, it is not used by the scripts in src/, which currently save model.joblib to data/. This directory is a placeholder for when you start versioning models more formally (e.g., with DVC or the MLflow Model Registry).
  • pyproject.toml: This is your central hub for defining project dependencies (managed by uv) and configuring tools like Ruff, Mypy, and Pytest.
  • Makefile: Your friend for running common tasks like linting, testing, and executing the pipeline with simple commands.
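As an illustration of how pyproject.toml acts as that central hub, the tool sections typically look something like the fragment below (the actual settings in this repository may differ):

```toml
[project]
name = "mlops-starter-project-iris"
requires-python = ">=3.12"

# Linting and formatting are both handled by Ruff.
[tool.ruff]
line-length = 100

# Static type checking.
[tool.mypy]
strict = true

# Pytest picks up its options from here, including the coverage report.
[tool.pytest.ini_options]
addopts = "--cov=src --cov-report=html"
```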

🛠️ Prerequisites

Ensure you have the following installed on your system:

  • Python 3.12+: We recommend using the version specified in the .python-version file.
  • uv: A fast Python package installer and project manager. See the uv installation guide.
  • Git: For version control.
  • Make: (Optional, but highly recommended for convenience).
    • macOS/Linux: Usually pre-installed.
    • Windows: Consider installing via Chocolatey (choco install make) or using Windows Subsystem for Linux (WSL).

🚀 Quick Start: Installation & Setup

  1. Clone the Repository:

    git clone <YOUR_REPOSITORY_URL>
    cd mlops-starter-project-iris
  2. Set Up Python Environment

    # Create a virtual environment (e.g., named .venv) using the project's Python version
    uv venv .venv --python 3.12
    
    # Activate the virtual environment:
    # On macOS and Linux:
    source .venv/bin/activate
    # On Windows (PowerShell):
    # .\.venv\Scripts\Activate.ps1
    # On Windows (Command Prompt):
    # .\.venv\Scripts\activate.bat
    
  3. Install Dependencies: With the virtual environment active, use uv to install the project's dependencies.

    # Install project dependencies, including development tools:
    uv sync --dev
    
    # Initialize Pre-commit Hooks
    uv run pre-commit install

    Alternatively, the Makefile provides a shortcut:

    make install # This target in the Makefile should execute the uv commands above

You're now ready to start!

▶️ Running the ML Pipeline

The pipeline consists of data loading, splitting, model training, and evaluation.

Option 1: Using make (Recommended for Simplicity)

The Makefile includes a target to run the entire pipeline:

make run-pipeline

This will execute the Python scripts in src/ in the correct order. Check the Makefile to see the exact commands if you're curious!
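For illustration, a run-pipeline target like the one above might be defined as follows; the actual Makefile in this repository may differ:

```make
# Hypothetical sketch: run the four pipeline stages in order via uv.
run-pipeline:
	uv run python src/load_data.py
	uv run python src/split_dataset.py --test_size 0.2
	uv run python src/train.py
	uv run python src/evaluate.py
```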

Option 2: Manual Execution of Python Scripts

If you want to run each step individually to understand the flow:

# Ensure your virtual environment is activated: source .venv/bin/activate

python src/load_data.py
python src/split_dataset.py --test_size 0.2 # Example of passing an argument
python src/train.py
python src/evaluate.py

After running, you should find features_iris.csv, train.csv, test.csv, model.joblib, and eval.json inside the data/ directory.

💻 Development Workflow & Tools

This project is set up with tools to help you write high-quality code efficiently.

  • Makefile: Your primary interface for common tasks. Run make help to see all available commands:

    make help

    This will list targets like lint, format-check, test, clean, etc.

  • Code Formatting (Ruff Formatter):

    • Check if files need formatting: make format-check (uses uv run ruff format --check .)
    • Automatically format files: make format (uses uv run ruff format .)
  • Linting (Ruff Linter):

    • Check for style and logical errors: make lint (uses uv run ruff check .)
    • Attempt to auto-fix lint issues: make lint-fix (uses uv run ruff check --fix .) (If this target exists in your Makefile)
  • Type Checking (Mypy):

    • Verify type annotations: make mypy (uses uv run mypy src tests)
  • Testing (Pytest):

    • Run all tests: make test (uses uv run pytest)
    • This command also generates a code coverage report, viewable by opening htmlcov/index.html in your browser.
  • Pre-commit Hooks: These run automatically on git commit. If they find issues (e.g., formatting errors, lint problems), the commit will be stopped, and you'll see messages on what to fix. After fixing (or if the hooks fix them automatically), git add the changed files and commit again.
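As a rough example of how those hooks are wired up, a .pre-commit-config.yaml using the Ruff and Mypy pre-commit repositories might look like the sketch below; the revisions and hook set in this repository's actual file may differ:

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4   # hypothetical pin; check the repo for the current tag
    hooks:
      - id: ruff          # linting, with auto-fix
        args: [--fix]
      - id: ruff-format   # formatting
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0  # hypothetical pin
    hooks:
      - id: mypy
```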

For a deeper dive into the development workflow, tool configurations, and contribution guidelines, please consult the DEVELOPMENT.md file.


Happy MLOps journey! We hope this starter project helps you learn and experiment. 🚀
