A minimal recommendation system framework using Metaflow for workflow orchestration.
This repository demonstrates how to build a production-ready recommendation system using Metaflow to solve the cross-selling problem.
Data: Uses the Amazon product dataset (May 1996 – July 2014) with product metadata and user reviews. The data includes product relationships like "also_bought", "also_viewed", and "bought_together" to understand customer purchasing patterns.
Approach: Transforms user-item interactions into a graph-based representation where products are nodes and relationships (co-purchases, similar items) are edges. Generates random walks through the product graph to create training sequences for embedding models.
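As a rough illustration of the random-walk step, here is a minimal sketch assuming the product graph is stored as an adjacency dict keyed by product ID (the actual graph construction lives in `flows/data_flow.py`; the function and parameter names below are illustrative):

```python
import random

def random_walks(graph, walks_per_node=10, walk_length=20, seed=42):
    """Generate random-walk 'sentences' over a product graph.

    graph: dict mapping a product ID to a list of related product IDs
           (e.g. built from also_bought / also_viewed / bought_together edges).
    Returns a list of product-ID sequences usable as Word2Vec training data.
    """
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            current = start
            for _ in range(walk_length - 1):
                neighbors = graph.get(current)
                if not neighbors:
                    break  # dead end: stop this walk early
                current = rng.choice(neighbors)
                walk.append(current)
            walks.append(walk)
    return walks
```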
Models: Supports Word2Vec for product embeddings using skip-gram architecture, and Matrix Factorization for collaborative filtering, both with optional bias terms.
recsys-metaflow/
├── flows/ # Metaflow pipeline definitions
│ ├── data_flow.py # Data preparation pipeline
│ ├── model_flow.py # Model training pipeline
│ └── recommendation_deploy_flow.py # Recommendation generation pipeline
├── models/ # Model implementations
│ ├── word2vec.py # Word2Vec model
│ ├── matrix_factorization.py # Matrix factorization model
│ ├── training.py # Training utilities
│ └── datasets.py # Dataset classes
├── notebooks/ # Jupyter notebooks for exploration
└── DATA_README.md # Data documentation
When you run the flows, Metaflow resolves their dependencies automatically via the `@conda_base` decorator, as in the sketch below.
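For example, a flow can pin its Python version and packages at the class level and let Metaflow build the environment for each step (the package pins below are illustrative, not the repo's exact versions):

```python
from metaflow import FlowSpec, step, conda_base

@conda_base(python="3.10", packages={"pandas": "2.1.4"})  # illustrative pins
class HelloFlow(FlowSpec):

    @step
    def start(self):
        import pandas as pd  # resolved inside the conda environment Metaflow builds
        self.df = pd.DataFrame({"x": [1, 2, 3]})
        self.next(self.end)

    @step
    def end(self):
        print(self.df)

if __name__ == "__main__":
    HelloFlow()
```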
To create a workstation environment where we can run, develop, and debug the Metaflow flows,
create a local environment with Metaflow and any dependencies needed by flow steps that run locally.
First, install mamba or your preferred conda-compatible alternative. Then run:
mamba env create -f environment-ob.yaml
mamba activate recsys-metaflow-ob
or
mamba env create -f environment-oss.yaml
mamba activate recsys-metaflow-oss
Processes the Amazon dataset into graph relationships and training sequences:
python flows/data_flow.py --environment=fast-bakery run --category Electronics --sample_size 1000
Note: `fast-bakery` is an Outerbounds-only offering. Open-source users can use `pypi`, `conda`, or build Docker images and assign tasks to them through compute decorators like `@kubernetes` and `@slurm`.
Parameters:
- `category`: Amazon category (Electronics, Books, All_Beauty)
- `sample_size`: Number of records for testing (-1 for all data)
- `test_size`: Train/validation split ratio (default: 0.33)
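These knobs are typically exposed as Metaflow `Parameter` objects on the flow class; a hedged sketch (the real definitions live in `flows/data_flow.py` and may differ in defaults and wording):

```python
from metaflow import FlowSpec, Parameter, step

class DataFlow(FlowSpec):
    # Names mirror the CLI options above; defaults here are illustrative.
    category = Parameter("category", default="Electronics",
                         help="Amazon category (Electronics, Books, All_Beauty)")
    sample_size = Parameter("sample_size", default=-1, type=int,
                            help="Number of records to load (-1 for all data)")
    test_size = Parameter("test_size", default=0.33, type=float,
                          help="Train/validation split ratio")

    @step
    def start(self):
        print(f"Preparing {self.category} with sample_size={self.sample_size}")
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    DataFlow()
```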
Trains recommendation models on processed data (triggered by DataFlow completion):
python flows/model_flow.py --environment=fast-bakery --trigger DataFlow/1401 run --model_type word2vec --embedding_dim 128 --epochs 5
Parameters:
- `--trigger DataFlow/[run_id]`: Run ID from a completed DataFlow (e.g., DataFlow/1401). This is useful for debugging; when deployed to a production orchestrator, Metaflow passes this information along automatically (see the sketch after this list).
- `model_type`: Model to train (word2vec, mf, or mf_bias)
- `embedding_dim`: Embedding dimension (default: 128)
- `batch_size`: Training batch size (default: 128)
- `epochs`: Number of training epochs (default: 5)
- `learning_rate`: Learning rate (default: 0.01)
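On a production orchestrator, the manual `--trigger` handoff is replaced by event triggering. A minimal sketch using Metaflow's `@trigger_on_finish` decorator (the decorator and `current.trigger` API are real; the flow body is illustrative, not the repo's ModelFlow):

```python
from metaflow import FlowSpec, step, trigger_on_finish, current

@trigger_on_finish(flow="DataFlow")  # fire whenever a deployed DataFlow run completes
class ModelFlow(FlowSpec):

    @step
    def start(self):
        # Populated only when the flow is actually event-triggered on the orchestrator.
        triggering_run = current.trigger.run  # the upstream DataFlow run
        print(f"Training on data from {triggering_run.pathspec}")
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ModelFlow()
```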
Generates recommendations using trained models (triggered by ModelFlow completion):
python flows/recommendation_deploy_flow.py --environment=fast-bakery --trigger ModelFlow/1402 run --top_k 10
Parameters:
- `--trigger ModelFlow/[run_id]`: Run ID from a completed ModelFlow (e.g., ModelFlow/1402)
- `top_k`: Number of recommendations to generate (default: 10)
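Conceptually, the top-k step is a nearest-neighbour lookup in embedding space. A hedged sketch of that idea (the deployed flow loads trained embeddings from the triggering ModelFlow run; the helper below is illustrative):

```python
import numpy as np

def top_k_similar(item_id, item_ids, embeddings, k=10):
    """Return the k most similar products to item_id by cosine similarity.

    item_ids:   list of product IDs, aligned row-wise with `embeddings`
    embeddings: (num_items, embedding_dim) array of trained item vectors
    """
    idx = item_ids.index(item_id)
    vectors = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = vectors @ vectors[idx]
    scores[idx] = -np.inf  # never recommend the query item itself
    top = np.argsort(-scores)[:k]
    return [(item_ids[i], float(scores[i])) for i in top]
```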
The implementation includes:
- Word2Vec: Skip-gram architecture for product embeddings
- Matrix Factorization: Classic collaborative filtering approach
- Matrix Factorization with Bias: Enhanced MF with bias terms
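For reference, the bias-augmented variant predicts a score as the dot product of user and item embeddings plus per-user and per-item bias terms. A minimal sketch, assuming PyTorch (see `models/matrix_factorization.py` for the actual implementation, which may differ):

```python
import torch
import torch.nn as nn

class MFBias(nn.Module):
    """Matrix factorization with user/item bias terms (illustrative sketch)."""

    def __init__(self, num_users, num_items, embedding_dim=128):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, embedding_dim)
        self.item_emb = nn.Embedding(num_items, embedding_dim)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)

    def forward(self, user_ids, item_ids):
        # score = <u, v> + b_u + b_i
        dot = (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(dim=1)
        return dot + self.user_bias(user_ids).squeeze(1) + self.item_bias(item_ids).squeeze(1)
```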