Authors: Alex Gulko, Yusen Peng; Advisor: Dr. Sachin Kumar
- missing citations - "The paper fails to cite a number of tools and methods it uses, such as the Gemma models, p-annealing SAEs [1], JumpReLU SAEs [2], and others."
- the flaw of supervised training - "The interpretability score trains a linear regression using SAE-Bench scores as ground truth. However, SAE-Bench itself uses auto-interp as one of its core metrics. CE-Bench therefore inherits whatever noise, bias or prompt-instability those LLM judges introduce, even though its inference stage is LLM-free."
- missing train/test split - "Since there is no explicit train-test split, one cannot tell whether the proposed metric generalises beyond the SAE-Bench results or merely memorises SAE-Bench results. The authors also never test whether the regressor can predict auto-interp ranking for new SAEs whose SAE-Bench scores are hidden. Without such a holdout, one cannot claim that CE-Bench is a reliable proxy for SAE-Bench."
- previous work discussion - "the lack of meaningful comparison with relevant previous work, or at least a better positioning of this work with previous work"
- disagreement with the "contrastive" part - "Consider a minimally contrastive example of two stories or concepts like "victory" and "defeat" - intuitively, one would want the feature spaces of these two to overlap significantly"
- evaluation results discussion - "a longer discussion and description of the evaluation results is necessary"
The two existing interpretability evaluation methods are based on LLM prompting, which is inherently nondeterministic, unstable, and inconsistent; running the same prompt multiple times only slightly alleviates the problem. Instead of relying on an LLM to evaluate or simulate neuron activations, we propose a contrastive evaluation framework, CE-Bench. Its architecture is illustrated below:
We first construct a contrastive dataset in which each entry consists of a subject and three stories. The stories are generated synthetically with GPT-4o from the subject and two contrastive prefixes, using the prompts specified below.
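As a rough illustration, the dataset can be loaded directly from the Hugging Face Hub; the split and column names below are assumptions and may not match the actual schema of GulkoA/contrastive-stories-v3.

```python
# Illustrative sketch: load the contrastive-stories dataset from the Hugging Face Hub.
# The split name and column layout are assumptions and may differ from the real schema.
from datasets import load_dataset

ds = load_dataset("GulkoA/contrastive-stories-v3", split="train")
entry = ds[0]
print(entry.keys())  # expected: the subject plus the synthetically generated contrastive stories
```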
We hypothesize that the more differently neurons activate on tokens with contrastive meanings across the two contrastive paragraphs, the more interpretable the latent space is. To implement this (the left side of the architecture), we compute, for each input paragraph, the average activation over all tokens and jointly normalize the two resulting vectors. We then take the element-wise absolute difference between them and assign the maximum element-wise difference as the contrastive score.
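A minimal sketch of the contrastive score under these definitions; the input arrays, their shapes, and the joint normalization (shared mean and standard deviation) are assumptions for illustration, not the exact CE-Bench implementation.

```python
import numpy as np

def contrastive_score(acts_a: np.ndarray, acts_b: np.ndarray, eps: float = 1e-8) -> float:
    """acts_a, acts_b: SAE latent activations of shape (num_tokens, num_latents)
    for the two contrastive paragraphs (hypothetical inputs)."""
    # Average activations over all tokens of each paragraph.
    mean_a = acts_a.mean(axis=0)
    mean_b = acts_b.mean(axis=0)

    # Jointly normalize the two averaged activation vectors (assumed shared mean/std).
    stacked = np.stack([mean_a, mean_b])
    stacked = (stacked - stacked.mean()) / (stacked.std() + eps)
    norm_a, norm_b = stacked

    # Contrastive score: the largest element-wise gap between the two paragraphs.
    return float(np.abs(norm_a - norm_b).max())
```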
We also hypothesize that the more differently neurons activate on marked versus unmarked tokens, regardless of which paragraph those tokens appear in, the more interpretable the latent space is. To realize this (the other branch of the architecture), we compute the average activation of the marked tokens and the average activation of the unmarked tokens across both paragraphs, then jointly normalize them. We take the element-wise absolute difference between the two vectors and assign the maximum element-wise difference as the independence score.
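A corresponding sketch of the independence score; here `acts` is assumed to stack the token activations of both paragraphs and `marked` is a boolean mask over those tokens, with the same caveats about normalization as above.

```python
import numpy as np

def independence_score(acts: np.ndarray, marked: np.ndarray, eps: float = 1e-8) -> float:
    """acts: SAE latent activations of shape (num_tokens, num_latents) for both
    paragraphs stacked together; marked: boolean mask of marked tokens (hypothetical inputs)."""
    # Average activations of marked vs. unmarked tokens, pooled over both paragraphs.
    mean_marked = acts[marked].mean(axis=0)
    mean_unmarked = acts[~marked].mean(axis=0)

    # Jointly normalize the two averaged activation vectors (assumed shared mean/std).
    stacked = np.stack([mean_marked, mean_unmarked])
    stacked = (stacked - stacked.mean()) / (stacked.std() + eps)
    norm_m, norm_u = stacked

    # Independence score: the largest element-wise gap between marked and unmarked tokens.
    return float(np.abs(norm_m - norm_u).max())
```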
Finally, we hypothesize that the simple sum of the two scores is a naive yet reasonable indicator of the interpretability of sparse autoencoder probing: interpretable neurons, and interpretable sparse autoencoders as a whole, should demonstrate both strong contrastivity and strong independence.
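Under the same assumptions as the sketches above, the combined score is then simply:

```python
# Combined interpretability score: sum of the two components, reusing the
# hypothetical inputs from the sketches above.
acts_all = np.concatenate([acts_a, acts_b])  # all tokens of both paragraphs
interpretability_score = (
    contrastive_score(acts_a, acts_b)
    + independence_score(acts_all, marked)   # `marked` masks tokens in acts_all
)
```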
dataset: GulkoA/contrastive-stories-v3; SAE suite: Gemma-2-2B, 65k width
dataset: GulkoA/contrastive-stories-v2; SAE suite: gemma-scope-2b-pt-res, 16k width
dataset: GulkoA/contrastive-stories-v1; SAE suite: gemma-scope-2b-pt at layer 12, 16k width
dataset: GulkoA/contrastive-stories-v2; SAE suite: gemma-scope-2b-pt-res at layer 12, 16k width
To reproduce the CE-Bench scores:

```bash
python ce_bench/CE_Bench.py --sae_regex_pattern "gemma-scope-2b-pt-res" --sae_block_pattern "layer_12/width_16k/average_l0_.*"
```

To run neuron steering:

```bash
python ce_bench/neuron_steering.py
```