This repository contains a demo experiment showing how to run LensKit experiments on public data sets with current best practices for moderately-sized experiments.
The experiment is scripted with DVC and laid out in several subcomponents:
- `lkdemo` is a Python package containing support code (e.g. log configurations) and algorithm definitions. Two files are of particular interest:
    - `lkdemo/algorithms.py` defines the different algorithms we can train, with sensible default configurations.
    - `lkdemo/datasets.py` defines the different data sets, so that any supported data set can be loaded into the format LensKit expects in a uniform fashion (see the sketch after this list).
- `data` contains data files and controls.
- `data-split` contains cross-validation splits, produced by `split-data.py`. These splits only contain the test files, to save disk space; the train files can be obtained with `lkdemo.datasets.ds_diff`, as seen in `run-algo.py`.
- `runs` contains the results of running LensKit train/test runs.
- Various Python scripts run individual pieces of the analysis. They use `docopt` for parsing their arguments and thus have comprehensive usage docs in their docstrings.
- Jupyter notebooks analyze the results. These are parameterized and run with Papermill to analyze different data sets with the same notebook (see the Papermill sketch below).
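
As a rough illustration of how these pieces fit together, the sketch below loads a data set through `lkdemo.datasets` and reconstructs a train set from a saved test split with `ds_diff`. The data set attribute, file path, and exact `ds_diff` signature here are assumptions for illustration only, not the package's documented API; `run-algo.py` is the authoritative example.

```python
import pandas as pd

from lkdemo import datasets

# Load the full ratings for a data set in LensKit's expected format.
# 'ml100k' and the '.ratings' attribute are assumed names for illustration.
ratings = datasets.ml100k.ratings

# Read one saved test partition (hypothetical path and file format).
test = pd.read_parquet('data-split/ml100k/test-1.parquet')

# Recover the corresponding train set, as run-algo.py does
# (the exact ds_diff signature is an assumption).
train = datasets.ds_diff(ratings, test)
```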
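
Similarly, the analysis notebooks are executed with Papermill. A minimal sketch of parameterized execution looks like the following; the notebook filenames and the parameter name are hypothetical, while `papermill.execute_notebook` is Papermill's real entry point.

```python
import papermill as pm

# Render the analysis notebook for one data set.
# Notebook names and the 'dataset' parameter are assumptions for illustration.
pm.execute_notebook(
    'EvalReport.ipynb',              # parameterized template notebook
    'runs/EvalReport-ml100k.ipynb',  # executed copy with outputs
    parameters={'dataset': 'ml100k'},
)
```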
This experiment comes with its dependencies specified in `pyproject.toml` and locked with `uv.lock` for use with [uv][].
To set up, run:

```
$ uv sync
```

This will create a virtual environment in `.venv/`, which you can activate with:

```
$ . ./.venv/bin/activate
```

The `dvc` program controls runs of individual steps, including downloading data.
For example, to download the ML-20M data set and recommend with ALS, run:
```
dvc repro runs/dvc.yaml:ml20m@ALS
```
To re-run the whole experiment:
```
dvc repro
```
To reproduce results on one data set:
```
dvc repro eval-report-ml100k
```
The various `dvc.yaml` files control the run. Look at them to modify and extend!
You will probably want to consult the DVC user guide.