
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

This repository contains the code for the paper "Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning".

The trained models, along with statistics and maximally activating examples for each latent, are hosted on our Hugging Face page. We also provide an interactive Colab notebook and training logs on Weights & Biases (wandb).
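If you want to browse the released artifacts locally, they can be fetched with the Hugging Face CLI. This is a minimal sketch: the repository id below is a placeholder, so substitute the actual repository listed on our Hugging Face page.

# Fetch a released crosscoder and its latent statistics into a local folder.
# "<org>/<crosscoder-repo>" is a placeholder; use the repository id from our Hugging Face page.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <org>/<crosscoder-repo> --local-dir ./crosscoder-artifacts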

Requirements

Our code relies heavily on an adapted version of the dictionary_learning library. Install the requirements with:

pip install -r requirements.txt

Reproduce experiments

We cache model activations to disk. Our code assumes that you have around 4 TB of storage available per model and that the environment variable $DATASTORE points to it. The training scripts log progress to wandb, all models are checkpointed to the checkpoints folder, and the resulting plots are written to $DATASTORE/results.
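For example, a typical environment setup before launching a training run might look like the following; the datastore path is only an illustration and should point to wherever you have enough free space.

# Point $DATASTORE at a disk with enough free space for the cached activations
# (roughly 4 TB per model). The path below is an example, not a required location.
export DATASTORE=/mnt/bigdisk/crosscoder-datastore
mkdir -p "$DATASTORE/results"
# Authenticate with Weights & Biases so the training scripts can log progress.
wandb login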

For Gemma 2 2b:

bash train_gemma2b.sh

For Llama 3.2 1b:

bash train_llama1b.sh

For Llama 3.1 8b:

bash train_llama8b.sh

Check out notebooks/art.py for generating the more complex plots.
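Assuming notebooks/art.py is runnable as a plain Python script (if it is a notebook-style file, open it in Jupyter instead), the plots can be regenerated with:

# Run after the training scripts have populated $DATASTORE;
# the figures end up in $DATASTORE/results.
python notebooks/art.py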

Code structure

The code that implements the actual crosscoders is found in our dictionary_learning fork. This repository is organized into two main directories:

The scripts folder contains the main execution scripts.

The tools folder contains various utility functions. The steering_app folder contains a Streamlit app for generating steered outputs.
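To try the steering app, it can be launched with Streamlit; the entry-point filename below is an assumption, so check the steering_app folder for the actual script name.

# Launch the steering demo locally (app.py is an assumed filename; adjust to the actual entry point).
streamlit run steering_app/app.py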

The notebooks/dashboard-and-demo.ipynb notebook lets you explore the crosscoders and their latents.
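For example, after installing the requirements you can open it with Jupyter:

# Open the exploration notebook in Jupyter.
jupyter notebook notebooks/dashboard-and-demo.ipynb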
