Click the green "Use this template" button at the top of the page, then choose "Create a new repository".
This will create your own copy of this project, which you can modify freely — no need to fork!
- ML Engineering best practices with database persistence
- ML model deployment as a FastAPI service with PostgreSQL integration, containerized with Docker for scalability and reproducibility
$\textcolor{#FF4500}{\text{You can easily adapt the repository to work with any dataset of your choice.}}$
- The structure is flexible and can be applied to various machine learning models, including $\textcolor{#1E90FF}{\text{regression, classification, and clustering}}$.
This project demonstrates an end-to-end ML engineering workflow with database integration. You can train a machine learning model, serve it using a FastAPI application, store predictions in a PostgreSQL database, and experiment interactively with Jupyter Notebook – all within Docker containers. The project is designed to be reproducible and maintainable.
The project provides a comprehensive learning resource with:
- Core Application Components:
  - A well-structured FastAPI application with proper routing and middleware
  - Complete PostgreSQL integration using SQLAlchemy ORM
  - Properly organized database package with models, schemas, and CRUD operations
  - Dependency injection for clean, testable code (see the sketch after this list)
- Docker Infrastructure:
  - Containerized services for API, database, and Jupyter notebook
  - Volume mounting for data persistence and code accessibility
  - Environment variable configuration for flexible deployment
  - Health checks and container orchestration
- Development Environment:
  - Jupyter notebooks for data exploration and model development
  - Scripts for database setup and model training
  - Complete test suite for API endpoints and database operations
- Documentation:
  - Extensive README files in each directory
  - Detailed explanations of architectural decisions
  - Complete setup and usage instructions
  - Best practices for database integration
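As a concrete illustration of the dependency-injection pattern listed above, a session-per-request setup in FastAPI usually looks like the following minimal sketch. The identifiers here (SessionLocal, get_db, and the connection URL) are illustrative assumptions, not necessarily the exact code in app/database/deps.py:

```python
# Minimal sketch of FastAPI dependency injection for database sessions.
# SessionLocal, get_db, and the connection URL are illustrative assumptions.
from typing import Generator

from fastapi import Depends, FastAPI
from sqlalchemy import create_engine
from sqlalchemy.orm import Session, sessionmaker

engine = create_engine("postgresql://user:password@db:5432/ml_api_db")
SessionLocal = sessionmaker(bind=engine, autoflush=False)

app = FastAPI()


def get_db() -> Generator[Session, None, None]:
    """Yield a database session and always close it after the request."""
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


@app.get("/health")
def health(db: Session = Depends(get_db)):
    # The injected session is available here; real endpoints would hand it to CRUD helpers.
    return {"status": "ok"}
```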
This project follows clean architecture principles with a clear separation of concerns, making it an excellent reference for building production-ready machine learning APIs with database integration.
This project is designed for anyone interested in machine learning, API development, database integration, or containerization with Docker. Whether you're a student, developer, or data scientist, this resource will guide you through building and deploying a machine learning API with database persistence using FastAPI, PostgreSQL, and Docker.
By the end of this project, you will:
- Develop a foundational understanding of FastAPI and its setup
- Learn how to integrate PostgreSQL with a FastAPI application
- Understand how to containerize applications using Docker
- Explore how to train and deploy simple machine learning models
- Work with practical examples to build scalable APIs with data persistence
- Gain insights into storing and retrieving ML predictions from a database
This project is suitable for three types of learners:
- For those familiar with Docker, FastAPI, and databases: You can dive straight into the deployment phase. The examples and configurations provided will help you enhance your skills and explore best practices in building and deploying APIs with database integration.
- For those who know Docker and FastAPI but are new to database integration: This project will introduce you to integrating PostgreSQL with FastAPI, guiding you through building and deploying an API with database persistence.
- For beginners: This project is designed with you in mind. You'll start with the basics, learning how to set up Docker, FastAPI, and PostgreSQL, and then move on to building and deploying a machine learning model with database integration.
Folder PATH listing
.
+---app <-- Contains the main application code
| | config.py <-- Application configuration
| | database.py <-- Database connection configuration
| | main.py <-- Main FastAPI application
| | README.md <-- API Application documentation
| |
| \---database <-- Database components
| | crud.py <-- Database CRUD operations
| | deps.py <-- Dependency injection
| | models.py <-- SQLAlchemy models
| | README.md <-- Database module documentation
| | schema.py <-- Pydantic schemas
| | session.py <-- Database session management
| | __init__.py <-- Package initialization and exports
| |
| \---migrations <-- Alembic migrations
| | env.py <-- Alembic environment configuration
| | README.md <-- Migrations documentation
| | script.py.mako <-- Migration script template
| |
| \---versions <-- Migration version scripts
| README.md <-- Versions directory documentation
|
+---assets <-- Contains static assets (images, styles, etc.)
| logo.png <-- Project logo image
| README.md <-- Assets directory documentation
|
+---data <-- Directory for storing datasets
| original_dataset.csv <-- Example dataset for model training
| README.md <-- Data directory documentation
|
+---docker <-- Contains Docker configuration files
| Dockerfile <-- Dockerfile for building the API service
| Dockerfile.jupyter <-- Dockerfile for setting up Jupyter Notebook
| entrypoint.sh <-- Container startup script
| README.md <-- Docker setup documentation
|
+---models <-- Stores trained machine learning models
| ml_model.pkl <-- Serialized ML model
| README.md <-- Models directory documentation
|
+---notebooks <-- Jupyter notebooks for experiments and analysis
| data_exploration.ipynb <-- Notebook for data exploration
| train_dev.ipynb <-- Notebook for training and development
| README.md <-- Notebooks directory documentation
|
+---scripts <-- Contains utility scripts
| db_setup.py <-- Script for setting up the database
| train.py <-- Script for training models
| README.md <-- Scripts directory documentation
|
+---tests <-- Contains automated tests
| __init__.py <-- Initializes the tests package
| test_api.py <-- API endpoint tests
| README.md <-- Tests directory documentation
|
| .dockerignore <-- Specifies files to exclude from Docker builds
| .gitignore <-- Specifies files to ignore in Git version control
| alembic.ini <-- Alembic configuration
| docker-compose.yml <-- Docker Compose configuration
| LICENSE <-- License information for the project
| README.md <-- Project overview and instructions
| requirements.txt <-- Lists Python dependencies
- docker/Dockerfile
  Dockerfile for building the FastAPI application container. It installs the dependencies required for serving the model and connecting to PostgreSQL in a production environment.
- docker/Dockerfile.jupyter
  Dockerfile for building the Jupyter Notebook container. It installs additional dependencies for interactive development and experimentation, including PostgreSQL client libraries.
- docker-compose.yml
  Defines three services:
  - web: Runs the FastAPI application, exposing it on port 8000.
  - jupyter: Runs a Jupyter Notebook server accessible on port 8888.
  - db: Runs the PostgreSQL database for storing predictions.
The project uses volume mounts to ensure data persistence and code accessibility across containers.
- models/ml_model.pkl
  A pre-trained machine learning model for Iris flower classification.
- scripts/train.py
  A training script that loads the Iris dataset, trains a logistic regression model, evaluates its performance, and saves the trained model.
- app/main.py
  The FastAPI application that loads the model and exposes several endpoints for making predictions and accessing the database.
- app/database/
  Contains all the database-related components (see the sketch after this list):
  - models.py: Defines the SQLAlchemy ORM model for storing predictions.
  - schema.py: Defines Pydantic models for request/response validation.
  - crud.py: Implements database operations for creating and retrieving predictions.
  - session.py: Manages database session creation and configuration.
  - deps.py: Provides dependency injection for FastAPI.
- scripts/db_setup.py
  A script for setting up the PostgreSQL database and creating the necessary tables.
- tests/
  Contains automated tests for the API endpoints and database operations:
  - Tests for the health check and root endpoints
  - Tests for the prediction endpoint and database storage
  - Tests for retrieving predictions
  - Error handling tests
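To make the division of labor between models.py, schema.py, and crud.py concrete, here is a minimal sketch of how such a package typically fits together. The table and column names (predictions, sepal_length, predicted_class, and so on) are assumptions based on the Iris example, not a copy of the repository's actual code:

```python
# Hypothetical sketch of the app/database package: ORM model, Pydantic schema,
# and one CRUD helper. Column names are assumptions based on the Iris example.
from datetime import datetime

from pydantic import BaseModel
from sqlalchemy import Column, DateTime, Float, Integer, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Prediction(Base):  # models.py: SQLAlchemy ORM model
    __tablename__ = "predictions"

    id = Column(Integer, primary_key=True, index=True)
    sepal_length = Column(Float)
    sepal_width = Column(Float)
    petal_length = Column(Float)
    petal_width = Column(Float)
    predicted_class = Column(String)
    created_at = Column(DateTime, default=datetime.utcnow)


class PredictionCreate(BaseModel):  # schema.py: request validation
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


def create_prediction(db: Session, data: PredictionCreate, predicted_class: str) -> Prediction:
    """crud.py: persist one prediction and return the stored row."""
    row = Prediction(**data.dict(), predicted_class=predicted_class)
    db.add(row)
    db.commit()
    db.refresh(row)
    return row
```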
- app/
  Contains the main application code:
  - main.py: Main FastAPI application with endpoints and model loading.
  - config.py: Application configuration settings using Pydantic.
  - database.py: Database connection configuration.
  - database/: Database components for PostgreSQL integration.
- assets/
  Contains static assets like images and styles, including the project logo.
- data/
  Directory for storing datasets used in the project, including the Iris dataset.
- docker/
  Contains Docker configuration files:
  - Dockerfile: Builds the FastAPI API container.
  - Dockerfile.jupyter: Builds the Jupyter Notebook container.
- models/
  Stores trained machine learning models, specifically the Iris classifier.
- notebooks/
  Jupyter notebooks for experiments and analysis:
  - data_exploration.ipynb: Notebook for exploring the Iris dataset.
  - train_dev.ipynb: Notebook for model training and development.
- scripts/
  Contains utility scripts:
  - train.py: Script for training the ML model (a sketch of the training flow follows this list).
  - db_setup.py: Script for setting up the PostgreSQL database.
- tests/
  Contains automated tests for the application.
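The training flow described above for scripts/train.py boils down to a few scikit-learn calls. The sketch below is illustrative; the split ratio, hyperparameters, and serialization library used in the actual script may differ:

```python
# Illustrative training flow in the spirit of scripts/train.py.
# Split ratio, hyperparameters, and the pickling library are assumptions.
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the Iris dataset and hold out a test split for a quick evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression classifier
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")

# Serialize the trained model to the models/ directory
with open("models/ml_model.pkl", "wb") as f:
    pickle.dump(model, f)
```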
- Clone the repository:
  git clone <repository-url>
  cd ML_API_with_PostgreSQL_Integration
- Build and start the containers:
  docker-compose up --build
This will start three containers:
- FastAPI application: available at http://localhost:8000
- PostgreSQL database: available at localhost:5432
- Jupyter Notebook: available at http://localhost:8888
Once the application is running, you can test the endpoints:
- Swagger UI documentation:
  http://localhost:8000/docs
- Make a prediction using curl (a Python equivalent is shown after this list):
  curl -X 'POST' 'http://localhost:8000/predict' \
    -H 'Content-Type: application/json' \
    -d '{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}'
- Get all predictions:
  curl -X 'GET' 'http://localhost:8000/predictions'
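If you prefer Python over curl, the same requests can be issued with the requests library. The payload fields match the curl example above; the exact shape of the JSON responses depends on the application's schemas:

```python
# Call the API from Python instead of curl. The payload mirrors the curl example;
# the structure of the JSON responses is an assumption about the app's schemas.
import requests

payload = {
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
}

resp = requests.post("http://localhost:8000/predict", json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. the predicted class and/or the stored record

# Retrieve all stored predictions
print(requests.get("http://localhost:8000/predictions", timeout=10).json())
```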
To run the automated tests:
docker-compose exec web pytest tests/
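For reference, the tests in tests/test_api.py are written against the FastAPI application itself. A minimal example in that style might look like the sketch below; the import path and asserted status codes are assumptions, and the endpoint paths come from this README:

```python
# Illustrative API test in the style of tests/test_api.py.
# The app import path and the asserted status codes are assumptions.
from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)


def test_predict_returns_a_result():
    payload = {
        "sepal_length": 5.1,
        "sepal_width": 3.5,
        "petal_length": 1.4,
        "petal_width": 0.2,
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 200


def test_list_predictions():
    response = client.get("/predictions")
    assert response.status_code == 200
```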
Jupyter Notebook provides an interactive environment for model development, data exploration, and experimentation. You have two options to work with the Jupyter notebooks:
- Through IDE (Recommended):
  - For VS Code:
    a. Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac)
    b. Select "Dev Containers: Attach to Running Container..."
    c. Choose the container named ml_api_with_postgresql_integration-jupyter-1
    d. When prompted to open a folder, navigate to /app
    e. You can now work with notebooks directly in VS Code with all dependencies available
  - For Cursor:
    a. Click the Remote Explorer icon in the sidebar (or press Ctrl+Shift+P and search for "Attach to Running Container")
    b. Select the container named ml_api_with_postgresql_integration-jupyter-1
    c. When prompted to open a folder, navigate to /app
    d. You can now work with notebooks directly in Cursor with all dependencies available
- Through Web Browser:
  - Open your browser and navigate to http://localhost:8888
  - All required dependencies are already installed
  - Changes are automatically saved to your local files through Docker volumes
When opening a notebook, make sure to select the correct kernel:
- For a new notebook, click on "Select Kernel" in the top right
- Choose "Python 3.10.16 (/usr/local/bin/python)" from the dropdown
- This kernel has access to all dependencies installed in the container
If you want to confirm you're working inside the container, run the following code in a notebook cell:
# Check Python version
!python --version
# View container environment variables
import os
print(os.environ)
# Verify the operating system
!cat /etc/os-release
The output should show:
- Python 3.10.x
- Container-specific environment variables (including DATABASE_URL with 'db' hostname)
- Debian Linux as the operating system
The provided notebooks are:
- notebooks/data_exploration.ipynb: Explore the Iris dataset with visualizations and statistical analysis
- notebooks/train_dev.ipynb: Develop and train the model, and experiment with hyperparameters
To connect to the PostgreSQL database from your notebook:
from sqlalchemy import create_engine
import os
# Get database URL from environment variable
db_url = os.environ.get("DATABASE_URL")
# Create engine
engine = create_engine(db_url)
# Example query
import pandas as pd
df = pd.read_sql("SELECT * FROM predictions", engine)
df.head()
- Kernel not starting:
  - Verify the container is running with docker ps
  - Check the container logs with docker logs ml_api_with_postgresql_integration-jupyter-1
- Missing dependencies:
  - Dependencies are installed from requirements.txt. If you need additional packages, install them from a notebook cell with !pip install package_name
- Database connection issues:
  - Ensure the PostgreSQL container is running
  - Verify the DATABASE_URL environment variable with os.environ.get("DATABASE_URL")
Note: When working through the IDE, you're actually working inside the container where all dependencies are already installed. This ensures consistency between development and production environments.
The project uses PostgreSQL for storing prediction data. There are two approaches to database schema management:
By default, this project uses direct table creation for simplicity:
- Tables are automatically created at application startup using SQLAlchemy's Base.metadata.create_all() method in app/main.py.
- The custom setup script scripts/db_setup.py runs during container initialization to ensure the tables exist.
This approach was chosen for simplicity in a containerized environment where databases are often recreated from scratch.
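In code, the direct-creation approach amounts to something like the following sketch; the exact import paths and startup hook used in app/main.py and scripts/db_setup.py may differ:

```python
# Sketch of direct table creation at application startup.
# The import paths below are assumptions about this repository's layout.
from fastapi import FastAPI

from app.database.models import Base      # declarative Base carrying the ORM models
from app.database.session import engine   # SQLAlchemy engine bound to DATABASE_URL

app = FastAPI()


@app.on_event("startup")
def create_tables() -> None:
    # Creates any missing tables; existing tables and their data are left untouched.
    Base.metadata.create_all(bind=engine)
```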
While the project is configured with Alembic for migrations, it currently doesn't use migration scripts. To switch to a migration-based approach:
- Modify the SQLAlchemy models in app/database/models.py
- Generate a migration script: alembic revision --autogenerate -m "Description" (see the example revision script below)
- Apply the migration: alembic upgrade head
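Step 2 places an autogenerated revision script under app/migrations/versions/. Such a script typically looks like the sketch below; the revision identifiers and column definitions here are illustrative assumptions:

```python
# Illustrative autogenerated Alembic revision for a predictions table.
# Revision identifiers and column definitions are assumptions, not repository code.
import sqlalchemy as sa
from alembic import op

revision = "0001_create_predictions"
down_revision = None
branch_labels = None
depends_on = None


def upgrade() -> None:
    op.create_table(
        "predictions",
        sa.Column("id", sa.Integer(), primary_key=True),
        sa.Column("sepal_length", sa.Float(), nullable=True),
        sa.Column("sepal_width", sa.Float(), nullable=True),
        sa.Column("petal_length", sa.Float(), nullable=True),
        sa.Column("petal_width", sa.Float(), nullable=True),
        sa.Column("predicted_class", sa.String(), nullable=True),
    )


def downgrade() -> None:
    op.drop_table("predictions")
```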
When working with databases in a containerized environment, you have multiple approaches for database interaction. This section covers best practices for database development within Docker.
Working with the database from Jupyter Notebooks
Advantages:
- All-in-one environment: Code execution, data visualization, and database interaction in one place
- Documentation as you go: Notebooks serve as self-documenting SQL exploratory sessions
- Visualization integration: Easily plot query results using pandas and matplotlib
- Reproducibility: Queries and their results are saved together in the notebook
Best for:
- Data exploration and analysis
- Creating data processing pipelines
- Prototyping database operations
- Learning and experimentation
Working with the database from IDE database extensions (VS Code/Cursor)
Advantages:
- Purpose-built tools: Specialized features like schema visualization and query optimization
- SQL-focused interface: Better syntax highlighting and autocompletion
- Database administration: Easier to perform administrative tasks
- Separate concerns: Keeps database work distinct from application code
Best for:
- Schema design and management
- Complex query development and testing
- Database administration tasks
- Day-to-day database operations
The most effective workflow combines both approaches:
- Use VS Code/Cursor Database Extensions for:
  - Initial database setup and schema management
  - Complex SQL query development
  - Production database migrations
  - Database administration tasks
- Use Jupyter Notebooks for:
  - Data exploration and analysis
  - Combining SQL queries with data processing
  - Documenting database workflows with explanations
  - Visualizing query results
  - Prototyping database interactions for your ML pipeline
- Install the Database Client or SQLTools + SQLTools PostgreSQL/Cockroach Driver extensions in VS Code/Cursor:
  - Database Client: by cweijan
  - SQLTools: by Matheus Teixeira
  - SQLTools PostgreSQL/Cockroach Driver: by mtxr
- Configure a new connection using these settings:
  - Host: localhost
  - Port: 5432 (the forwarded port)
  - Username: user
  - Password: password
  - Database: ml_api_db
- Database Client vs. SQLTools + SQLTools PostgreSQL/Cockroach Driver:
  - Database Client: if you prefer a simple GUI, easy browsing, and minimal setup.
  - SQLTools + SQLTools PostgreSQL/Cockroach Driver: if you want powerful code integration, deep customization, and adapter flexibility.
Option 1: Database Client
- Single installation - All-in-one solution with built-in PostgreSQL support
- Best for: Visual learners, beginners, and those who prioritize ease of use
- Modern interface with intuitive data exploration features
- Simplified connection setup with fewer configuration steps
- Better table visualization with interactive data grid view
Option 2: SQLTools + SQLTools PostgreSQL/Cockroach Driver
- Two-part installation - Core SQLTools extension plus PostgreSQL/Cockroach Driver
- Best for: Those comfortable with SQL who need advanced features
- More comprehensive SQL tooling with deeper code integration
- Robust intellisense & linting capabilities
- Highly extensible with plugins and fine-grained configuration
Both options will connect to the project's PostgreSQL database using the same connection parameters. Consider your learning style and preferences when choosing - visual, GUI-focused learners may prefer Database Client, while code-centric, customization-focused learners might prefer SQLTools.
Feature Comparison:
| Aspect | SQLTools | Database Client |
| --- | --- | --- |
| Setup & Configuration | ✅ Large community & docs<br>✅ Multiple adapters (MySQL, Postgres, etc.)<br>❌ More config-heavy (JSON files)<br>❌ Manual adapter installs needed | ✅ Minimal config required<br>✅ Quick setup out of the box<br>❌ Limited advanced settings<br>❌ Not as flexible with niche databases |
| UI & Navigation | ✅ Familiar code-based workflow<br>✅ Lightweight, integrates well into editor<br>❌ Less visual exploration<br>❌ Feels "techy" to newcomers | ✅ GUI-based table browsing & editing<br>✅ Easy schema view in the sidebar<br>❌ Can feel heavier/cluttered if you prefer code<br>❌ Fewer inline code features |
| Auto-Completion | ✅ Robust intellisense & linting<br>✅ Good for large/complex schemas<br>❌ Requires fine-tuning for some DB engines | ✅ Basic suggestions for major SQL dialects<br>✅ Enough for quick queries<br>❌ Not as powerful or customizable as SQLTools<br>❌ Lacks deeper query analysis |
| Customization | ✅ Highly extensible (plugins, adapters)<br>✅ Fine-grained JSON config<br>❌ Can be daunting for beginners<br>❌ Takes time to master | ✅ Straightforward "turnkey" experience<br>✅ Simple configuration menus<br>❌ Smaller ecosystem<br>❌ Fewer ways to adapt to unusual or advanced requirements |
| Community & Support | ✅ Very large user base, lots of tutorials<br>✅ Frequent updates & GitHub activity<br>❌ Some help documents can be spread out<br>❌ Might lag behind if a DB is extremely new | ✅ Growing but smaller community<br>✅ Clear official docs<br>❌ Not as many community-driven resources<br>❌ Might not have direct support for very new or specialized databases |
Symbols Used
- ✅ = Primary strength / core functionality
- ❌ = Limitation / drawback
from sqlalchemy import create_engine, text
import pandas as pd
import os
# Get database URL from environment variable
db_url = os.environ.get("DATABASE_URL")
engine = create_engine(db_url)
# For exploratory queries
def run_query(query):
    """Run a SQL query and return results as a DataFrame"""
    with engine.connect() as conn:
        result = conn.execute(text(query))
        return pd.DataFrame(result.fetchall(), columns=result.keys())
# Example usage
df = run_query("SELECT * FROM predictions LIMIT 10")
df.describe() # Quick statistics
This hybrid approach gives you the best of both worlds - specialized database tools when needed and seamless integration with your data science workflow in notebooks.
Unlike SQLite which uses a single file, PostgreSQL is a client-server database system that manages data internally. This section explains how data is stored and accessed in this project.
When you collect and store data using this API:
- Data Storage: PostgreSQL stores all data in its internal directory structure, which is mounted as a Docker volume (postgres_data:/var/lib/postgresql/data/).
- No Single Database File: You won't see a database file in your project directory. Instead, PostgreSQL manages its data files internally.
- Persistence: The Docker volume ensures data persists even if containers are stopped or rebuilt.
There are multiple ways to access the stored data:
- Through API Endpoints: Use the provided API endpoints (/predictions) to retrieve data.
- Direct Database Access: Connect to the PostgreSQL instance (see the connection sketch after this list):
  - Host: localhost
  - Port: 5432
  - Username: user
  - Password: password
  - Database: ml_api_db
- From Jupyter Notebooks: Use the SQLAlchemy code pattern provided in section 6.3.2.
- Using Database GUIs: Connect with tools like pgAdmin, DBeaver, or database extensions in VS Code/Cursor.
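For example, from a script or notebook running on the host machine (outside the containers), the connection URL can be built from the parameters above. This is a sketch assuming the default credentials shown in this README:

```python
# Connect to the containerized PostgreSQL from the host machine.
# Credentials and database name are the defaults listed above.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/ml_api_db")

df = pd.read_sql("SELECT * FROM predictions LIMIT 10", engine)
print(df)
```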
When the containers are running, you can inspect the database using:
# Connect to PostgreSQL container's shell
docker exec -it ml_api_with_postgresql_integration-db-1 bash
# Connect to PostgreSQL with psql client
psql -U user -d ml_api_db
# Useful psql commands:
# \dt - list tables
# \d+ predictions - show table structure
# SELECT * FROM predictions LIMIT 5; - view data
# \q - quit psql
To back up and restore your PostgreSQL data:
# Backup: from host machine
docker exec -t ml_api_with_postgresql_integration-db-1 pg_dump -U user ml_api_db > backup.sql
# Restore: from a backup file on the host machine into the container
cat backup.sql | docker exec -i ml_api_with_postgresql_integration-db-1 psql -U user -d ml_api_db
The client-server architecture of PostgreSQL provides benefits including better performance, concurrent access, and more robust data management compared to file-based databases.
This project demonstrates how to build a machine learning API with PostgreSQL integration using FastAPI and Docker. It provides a complete framework for:
- Training and serving machine learning models
- Storing predictions in a PostgreSQL database
- Retrieving prediction history
- Running everything in Docker containers for reproducibility and scalability
The project is designed to be easily adaptable for different machine learning tasks and datasets while maintaining a strong focus on engineering best practices.
If you have any questions or encounter issues while working with this project, here are several resources to help you:
- GitHub Issues: Open an issue in the GitHub repository
- Documentation: Refer to the documentation in the respective README files within each directory
- External Resources:
- Contact: Feel free to reach out to the maintainers of this project