[Work-In-Progress] A research framework for analyzing differences between language models using interpretability techniques. This project enables systematic comparison of base models and their variants (model organisms) through various diffing methodologies.
Note: The toolkit is based on a heavily modified version of the saprmarks/dictionary_learning repository, available at science-of-finetuning/dictionary_learning. We may eventually merge the two repositories, but they have diverged significantly and merging is not currently a priority.
This framework consists of two main pipelines:
- Preprocessing Pipeline: Extract and cache activations from pre-existing models
- Diffing Pipeline: Analyze differences between models using interpretability techniques
The framework is designed to work with pre-existing model pairs (e.g., base models vs. model organisms) rather than training new models.
- Clone the repository:
git clone https://github.com/science-of-finetuning/diffing-game
cd diffing-game
- Install dependencies:
pip install -r requirements.txt
Run the complete pipeline (preprocessing + diffing) with default settings:
python main.py
Run preprocessing only (extract activations):
python main.py pipeline.mode=preprocessing
Run diffing analysis only (assumes activations already exist):
python main.py pipeline.mode=diffing
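The three commands above differ only in which stages run. A minimal sketch of that dispatch, for illustration only: the function `stages_for_mode` is hypothetical, and the mode name `full` for the default run is an assumption (only `preprocessing` and `diffing` are named above).

```python
# Hypothetical helper; not part of the toolkit's API.
STAGES_BY_MODE = {
    "full": ["preprocessing", "diffing"],  # assumed name for the default mode
    "preprocessing": ["preprocessing"],    # extract and cache activations only
    "diffing": ["diffing"],                # reuse cached activations
}


def stages_for_mode(mode):
    """Return the ordered list of stages to run for a pipeline.mode value."""
    try:
        return list(STAGES_BY_MODE[mode])
    except KeyError:
        raise ValueError(f"unknown pipeline mode: {mode!r}") from None


for mode in ("full", "preprocessing", "diffing"):
    print(mode, "->", stages_for_mode(mode))
```

Running diffing on its own only makes sense once the preprocessing stage has cached activations for the same model and organism.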
Analyze a specific organism and model combination:
python main.py organism=caps model=gemma3_1B
Use different diffing methods:
python main.py diffing/method=kl
python main.py diffing/method=normdiff
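To give a feel for what these two method names suggest, here is a small pure-Python sketch. It assumes `kl` compares the next-token distributions of the two models and `normdiff` measures the L2 norm of the difference between their activations; both readings, the KL direction, and all function names are assumptions for illustration, not the toolkit's implementation.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def norm_diff(a, b):
    """L2 norm of the difference between two activation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy logits and activations at one token position (made-up numbers).
base_logits = [2.0, 1.0, 0.1]
finetuned_logits = [0.5, 2.5, 0.1]
kl = kl_divergence(softmax(finetuned_logits), softmax(base_logits))
nd = norm_diff([1.0, 2.0, 2.0], [1.0, 0.0, 2.0])
print(f"KL(finetuned || base) = {kl:.4f}")
print(f"activation norm difference = {nd:.4f}")
```

In both cases, larger values at a token position indicate that the finetuned model behaves more differently from the base model there.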
Run experiments across multiple configurations:
python main.py --multirun organism=caps,roman_concrete model=gemma3_1B
Sweep over several diffing methods in a single multirun:
python main.py --multirun diffing/method=kl,normdiff
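The `--multirun` flag comes from Hydra, whose default sweeper launches one job per element of the Cartesian product of the comma-separated values. The sketch below mimics that expansion in plain Python so you can estimate how many jobs a sweep will start; `expand_sweep` is a hypothetical helper, not Hydra code.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand comma-separated sweep overrides into one override list per job."""
    keys, choices = [], []
    for item in overrides:
        key, _, value = item.partition("=")
        keys.append(key)
        choices.append(value.split(","))
    # One job for every combination of choices, in left-to-right order.
    return [
        [f"{k}={v}" for k, v in zip(keys, combo)]
        for combo in product(*choices)
    ]


jobs = expand_sweep([
    "organism=caps,roman_concrete",
    "model=gemma3_1B",
    "diffing/method=kl,normdiff",
])
for job in jobs:
    print("python main.py " + " ".join(job))
```

Combining both sweeps in one command would therefore launch four jobs (two organisms times two methods).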
The framework includes a Streamlit-based interactive dashboard for visualizing and exploring model diffing results.
- Dynamic Discovery: Automatically detects available models, organisms, and diffing methods
- Real-time Visualization: Interactive plots and visualizations of diffing results
- Model Integration: Direct links to Hugging Face model pages
- Multi-method Support: Compare results across different diffing methodologies
- Interactive Model Testing: Test custom inputs and steering vectors on both base and finetuned models in real time
Launch the dashboard with:
streamlit run dashboard.py
The dashboard will be available at http://localhost:8501 by default.
You can also pass configuration overrides to the dashboard by placing them after `--`:
streamlit run dashboard.py -- model.dtype=float32
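Streamlit forwards everything after `--` to the script's `sys.argv`. As a rough sketch of how dotted `key=value` arguments can be turned into a nested configuration, the hypothetical helper below parses them by hand; the dashboard itself may rely on a configuration library instead, so treat this only as an illustration of the override syntax.

```python
def parse_overrides(args):
    """Turn ['model.dtype=float32'] into {'model': {'dtype': 'float32'}}."""
    config = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {arg!r}")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Walk (or create) the nested dictionaries for each dotted prefix.
            node = node.setdefault(part, {})
        node[leaf] = value  # values stay strings in this sketch
    return config


# In a script this would typically be called with sys.argv[1:].
overrides = parse_overrides(["model.dtype=float32"])
print(overrides)
```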
- Select Base Model: Choose from available base models
- Select Organism: Pick the model organism (finetuned variant)
- Select Diffing Method: Choose the analysis method to visualize
- Explore Results: Interact with the generated visualizations
The dashboard only visualizes existing results, so run the diffing experiments before launching it.