outerbounds/recsys-metaflow

Metaflow RecSys

Business Problem

A minimal recommendation system framework using Metaflow for workflow orchestration.

Overview

Metaflow System Architecture

This repository demonstrates how to build a production-ready recommendation system using Metaflow to solve the cross-selling problem.

Data: Uses the Amazon product dataset (May 1996 – July 2014), which contains product metadata and user reviews. The data includes product relationships such as "also_bought", "also_viewed", and "bought_together" that capture customer purchasing patterns.

Approach: Transforms user-item interactions into a graph-based representation where products are nodes and relationships (co-purchases, similar items) are edges. Generates random walks through the product graph to create training sequences for embedding models.
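The graph construction and random-walk step can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the record layout (`asin`, `related`), the helper names `build_graph` and `random_walks`, and the choice to treat edges as undirected are all assumptions made for the example.

```python
import random

RELATIONS = ("also_bought", "also_viewed", "bought_together")

def build_graph(products):
    """Build an adjacency map: product id -> set of related product ids.

    Assumption: each record looks like {"asin": ..., "related": {relation: [ids]}}
    and every relationship is treated as an undirected edge.
    """
    graph = {}
    for product in products:
        source = product["asin"]
        graph.setdefault(source, set())
        for relation in RELATIONS:
            for target in product.get("related", {}).get(relation, []):
                graph[source].add(target)
                graph.setdefault(target, set()).add(source)
    return graph

def random_walks(graph, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = sorted(graph[walk[-1]])
                if not neighbors:  # isolated product: the walk ends early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy records (hypothetical ids) standing in for the product metadata.
products = [
    {"asin": "A", "related": {"also_bought": ["B", "C"]}},
    {"asin": "B", "related": {"also_viewed": ["C"]}},
    {"asin": "D", "related": {}},
]
graph = build_graph(products)
walks = random_walks(graph, walk_length=4, walks_per_node=2)
```

Each walk is a list of product ids that an embedding model can treat like a sentence of tokens.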

Models: Supports Word2Vec for product embeddings using a skip-gram architecture, and Matrix Factorization for collaborative filtering, both with optional bias terms.
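To make the skip-gram idea concrete, the sketch below turns one product sequence into (center, context) training pairs. The function name `skipgram_pairs` and the `window` parameter are illustrative assumptions; the repository may build its training pairs differently (for example, with negative sampling).

```python
def skipgram_pairs(sequence, window):
    """Return (center, context) pairs for items at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a product is not its own context
                pairs.append((center, sequence[j]))
    return pairs

# Hypothetical walk through the product graph.
walk = ["A", "B", "C", "D"]
pairs = skipgram_pairs(walk, window=1)
```

A skip-gram model is then trained to predict the context product from the center product, so products that co-occur in walks end up with similar embeddings.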

Structure

recsys-metaflow/
├── flows/                    # Metaflow pipeline definitions
│   ├── data_flow.py          # Data preparation pipeline
│   ├── model_flow.py         # Model training pipeline 
│   └── recommendation_deploy_flow.py # Recommendation generation pipeline
├── models/                   # Model implementations
│   ├── word2vec.py           # Word2Vec model
│   ├── matrix_factorization.py # Matrix factorization model
│   ├── training.py           # Training utilities
│   └── datasets.py          # Dataset classes
├── notebooks/                # Jupyter notebooks for exploration
└── DATA_README.md           # Data documentation

Getting Started

Running the Flows

Workstation setup

The flows declare their dependencies with the @conda_base decorator, so Metaflow resolves the packages each step needs automatically. To run, develop, and debug the flows from your workstation, you still need a local environment that contains Metaflow and any dependencies required by workflow steps that run locally.

First, install mamba, or use your preferred conda-compatible alternative. Then run:

mamba env create -f environment-ob.yaml
mamba activate recsys-metaflow-ob

or

mamba env create -f environment-oss.yaml
mamba activate recsys-metaflow-oss

Data Flow

Processes Amazon dataset into graph relationships and training sequences:

python flows/data_flow.py --environment=fast-bakery run --category Electronics --sample_size 1000

Note: fast-bakery is an Outerbounds-only offering. Open-source users can use the pypi or conda environments, or build Docker images and assign tasks to them through compute decorators such as @kubernetes or @slurm.

Parameters:

  • category: Amazon category (Electronics, Books, All_Beauty)
  • sample_size: Number of records for testing (-1 for all data)
  • test_size: Train/validation split ratio (default: 0.33)
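The interaction between sample_size and test_size can be sketched as follows. This is an assumption about the semantics described above (-1 keeps all records, test_size is the validation fraction), not the flow's actual code; `sample_and_split` and the `seed` argument are hypothetical.

```python
import random

def sample_and_split(records, sample_size=-1, test_size=0.33, seed=42):
    """Optionally truncate the data, then split it into train and validation sets."""
    records = list(records)
    if sample_size != -1:  # -1 means "use all data"
        records = records[:sample_size]
    rng = random.Random(seed)
    rng.shuffle(records)
    n_val = round(len(records) * test_size)
    return records[n_val:], records[:n_val]

# 100 hypothetical records, keep the first 10, hold out about a third.
train, val = sample_and_split(range(100), sample_size=10, test_size=0.33)
```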

Model Flow

Trains recommendation models on processed data (triggered by DataFlow completion):

python flows/model_flow.py --environment=fast-bakery --trigger DataFlow/1401 run --model_type word2vec --embedding_dim 128 --epochs 5

Parameters:

  • --trigger DataFlow/[run_id]: Run ID from completed DataFlow (e.g., DataFlow/1401)
    • Passing the run ID by hand is useful for debugging. When the flow is deployed to a production orchestrator, Metaflow passes this information along automatically.
  • model_type: Model to train (word2vec, mf, or mf_bias)
  • embedding_dim: Embedding dimension (default: 128)
  • batch_size: Training batch size (default: 128)
  • epochs: Number of training epochs (default: 5)
  • learning_rate: Learning rate (default: 0.01)

Recommendation Flow

Generates recommendations using trained models (triggered by ModelFlow completion):

python flows/recommendation_deploy_flow.py --environment=fast-bakery --trigger ModelFlow/1402 run --top_k 10

Parameters:

  • --trigger ModelFlow/[run_id]: Run ID from completed ModelFlow (e.g., ModelFlow/1402)
  • top_k: Number of recommendations to generate (default: 10)
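One common way to produce top_k recommendations from trained embeddings is nearest-neighbor search by cosine similarity. The sketch below assumes that approach; the source does not state which similarity measure the flow uses, and the `recommend` helper and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(embeddings, query_id, top_k=10):
    """Return the ids of the top_k products most similar to query_id."""
    query = embeddings[query_id]
    scored = [
        (cosine(query, vector), product_id)
        for product_id, vector in embeddings.items()
        if product_id != query_id  # never recommend the query product itself
    ]
    # Highest similarity first; ties broken by product id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [product_id for _, product_id in scored[:top_k]]

# Toy 2-dimensional embeddings (real ones would use embedding_dim, e.g. 128).
embeddings = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [-1.0, 0.0],
}
top = recommend(embeddings, "A", top_k=2)
```

For large catalogs, an exhaustive scan like this is usually replaced by an approximate nearest-neighbor index.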

Models

The implementation includes:

  1. Word2Vec: Skip-gram architecture for product embeddings
  2. Matrix Factorization: Classic collaborative filtering approach
  3. Matrix Factorization with Bias: Enhanced MF with bias terms
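The difference between the two matrix factorization variants can be illustrated with a plain-Python sketch: the biased model adds user, item, and global offsets to the dot product of the embeddings. The function names and the squared-error SGD update below are assumptions for illustration; the repository's models may use a different framework, loss, or optimizer.

```python
def predict(user_vec, item_vec, user_bias=0.0, item_bias=0.0, global_bias=0.0):
    """Score = dot(user, item) + biases; with all biases at 0 this is plain MF."""
    dot = sum(p * q for p, q in zip(user_vec, item_vec))
    return dot + user_bias + item_bias + global_bias

def sgd_step(user_vec, item_vec, user_bias, item_bias, rating, lr=0.01):
    """One SGD update on squared error for a single (user, item, rating) triple."""
    error = rating - predict(user_vec, item_vec, user_bias, item_bias)
    new_user = [p + lr * error * q for p, q in zip(user_vec, item_vec)]
    new_item = [q + lr * error * p for p, q in zip(user_vec, item_vec)]
    return new_user, new_item, user_bias + lr * error, item_bias + lr * error

# Toy example with 2-dimensional embeddings and an observed rating of 1.0.
user, item, rating = [0.1, 0.2], [0.3, 0.4], 1.0
before = predict(user, item)
user2, item2, user_b, item_b = sgd_step(user, item, 0.0, 0.0, rating, lr=0.1)
after = predict(user2, item2, user_b, item_b)
```

After the update, `after` is closer to the observed rating than `before`, which is the behavior repeated over many triples during training.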
