Activation Steering

👉 (Aug-2025) Added the pca_pairwise method and made it the default. Use method="pca_pairwise" to reproduce results closer to those reported in the paper. The Colab demos (see below) have been updated accordingly and should work as expected.

👉 (Jul-2025) Bug fixed: PCA_centering (@Reason239)

👉 (Apr-2025) Conditional Activation Steering is a spotlight paper at ICLR 2025!

👉 (Nov-2024) Added a few Colab demos.

👉 (Sep-2024) Preprint released on arXiv.

Overview

This is a general-purpose activation steering library for (1) extracting steering vectors and (2) steering model behavior. We release it alongside our recent paper, Programming Refusal with Conditional Activation Steering, to provide an intuitive toolchain for activation steering work.

Installation

git clone https://github.com/IBM/activation-steering

pip install -e activation-steering

Activation Steering

Activation steering is a technique for influencing the behavior of language models by modifying their internal activations during inference. This library provides tools for:

Extracting steering vectors from contrastive examples
Applying steering vectors to modify model behavior

This part of the library is conceptually similar to Steering Language Models With Activation Engineering, although our implementation may differ.
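The two steps listed above can be illustrated with a toy sketch in plain Python. It is not this library's API: the function names extract_steering_vector and apply_steering are hypothetical, the "activations" are hand-written 3-dimensional lists rather than real model hidden states, and the vector is computed as a simple mean difference rather than with the library's PCA-based methods.

```python
from statistics import fmean


def extract_steering_vector(positive, negative):
    """Hypothetical helper: mean difference between two sets of activations.

    `positive` holds activations for examples that show the target behavior,
    `negative` holds activations for contrasting examples that do not.
    """
    dim = len(positive[0])
    return [
        fmean(p[i] for p in positive) - fmean(n[i] for n in negative)
        for i in range(dim)
    ]


def apply_steering(hidden, vector, strength=1.0):
    """Hypothetical helper: add the scaled steering vector to one hidden state."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Toy activations standing in for one layer's hidden states.
positive = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

vector = extract_steering_vector(positive, negative)
steered = apply_steering([0.5, 0.5, 0.5], vector, strength=2.0)

print(vector)
print(steered)
```

In a real model the same addition would be applied to the hidden states of chosen layers during the forward pass. Refer to the documentation for how this library actually does it.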

Conditional Activation Steering

Conditional activation steering selectively applies or withholds activation steering based on the input context. It extends the activation steering framework by introducing:

Context-dependent control capabilities through condition vectors
Logical composition of multiple condition vectors
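As a rough illustration of the two ideas above, the sketch below gates a steering step on a condition check and combines two conditions with ordinary boolean operators. It is a simplification under stated assumptions, not the library's API or the paper's exact condition rule: the helper names, the plain cosine-similarity test, the 0.8 threshold, and the toy "legal", "medical", and "refusal" vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def condition_met(hidden, condition_vector, threshold):
    """Hypothetical check: does the hidden state align with the condition vector?"""
    return cosine_similarity(hidden, condition_vector) > threshold


def conditional_steer(hidden, behavior_vector, active, strength=1.0):
    """Apply the behavior vector only when the (composed) condition is active."""
    if not active:
        return list(hidden)
    return [h + strength * v for h, v in zip(hidden, behavior_vector)]


# Toy vectors; real condition and behavior vectors are extracted from a model.
legal = [1.0, 0.0, 0.0]
medical = [0.0, 1.0, 0.0]
refusal = [0.0, 0.0, 1.0]

hidden = [0.9, 0.1, 0.0]  # an input that mostly aligns with the "legal" direction
is_legal = condition_met(hidden, legal, threshold=0.8)
is_medical = condition_met(hidden, medical, threshold=0.8)

# Logical composition: refuse if legal OR medical, versus legal AND medical.
steered_or = conditional_steer(hidden, refusal, is_legal or is_medical, strength=2.0)
steered_and = conditional_steer(hidden, refusal, is_legal and is_medical, strength=2.0)

print(is_legal, is_medical)
print(steered_or)
print(steered_and)
```

Here the OR rule triggers steering because the "legal" condition fires, while the AND rule leaves the hidden state unchanged because the "medical" condition does not.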

Refer to our paper and documentation for implementation details and usage of activation steering and conditional activation steering.

Documentation

See /docs for the library documentation. We recommend starting with the Quick Start Tutorial, which covers most of the concepts you need to get started with activation steering and conditional activation steering.

Quick Start Tutorial (10 to 60 minutes, depending on your hardware) 👉 here!
FAQ 👉 here!

Colab Demos

Adding Refusal Behavior to LLaMA 3.1 8B Inst 👉 here!
Adding CoT Behavior to Gemma 2 9B 👉 here!
Making Hermes 2 Pro Conditionally Refuse Legal Instructions 👉 here!

Acknowledgement

This library builds on top of the excellent work done in the following repositories:

Some parts of the documentation for this library are generated by ml-tooling/lazydocs:

lazydocs activation_steering/ --no-watermark

Citation

@misc{lee2024programmingrefusalconditionalactivation,
      title={Programming Refusal with Conditional Activation Steering}, 
      author={Bruce W. Lee and Inkit Padhi and Karthikeyan Natesan Ramamurthy and Erik Miehling and Pierre Dognin and Manish Nagireddy and Amit Dhurandhar},
      year={2024},
      eprint={2409.05907},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2409.05907}, 
}
