Model Steering

tinyurl.com/llm-steering

In this session, we will explore recent work on LLM steering. This technique adds abstract concept vectors to a model's hidden state to alter its output. For example, we can add an Eiffel Tower-related embedding and the model will speak as a "large metal structure" rather than a "helpful assistant." Come by to learn more and tinker with this new method.

  • Signs of introspection in large language models
  • Steering LLM Behavior Without Fine-Tuning
  • The Eiffel Tower Llama
  • Steering GPT-2-XL by adding an activation vector
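
As a rough illustration of the idea (not the method used by any of the posts above), steering can be written as h' = h + α·v, where h is a hidden state, v is a concept vector, and α is a steering coefficient. The toy sketch below uses plain Python lists and invented numbers; `add_steering` is a hypothetical helper, not part of any library.

```python
# Toy sketch of activation addition: h' = h + alpha * v at every token position.
# All values are invented; real hidden states have hundreds or thousands of dimensions.

def add_steering(hidden, vector, alpha):
    """Return a copy of `hidden` ([seq_len][d_model]) with alpha * vector added to each position."""
    return [[h + alpha * v for h, v in zip(position, vector)] for position in hidden]

hidden = [[0.0, 1.0, 2.0],   # position 0
          [1.0, 1.0, 1.0]]   # position 1
concept = [1.0, 0.0, -1.0]   # hypothetical "concept" direction

steered = add_steering(hidden, concept, alpha=0.5)
print(steered)
```

A larger `alpha` pushes every position further along the concept direction; the cells below do the same thing on real model activations.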
In [ ]:
%pip install -q nnsight sae-lens
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-contrib-python 4.13.0.92 requires numpy>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
shap 0.50.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.
opencv-python-headless 4.13.0.92 requires numpy>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
opencv-python 4.13.0.92 requires numpy>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
plum-dispatch 2.6.1 requires beartype>=0.16.2, but you have beartype 0.14.1 which is incompatible.
jax 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
tobler 0.13.0 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
pytensor 2.37.0 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
rasterio 1.5.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.
jaxlib 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
In [ ]:
%pip install numpy --upgrade
Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (1.26.4)
Collecting numpy
  Downloading numpy-2.4.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (6.6 kB)
Downloading numpy-2.4.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (16.6 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
transformer-lens 2.17.0 requires numpy<2,>=1.26; python_version == "3.12", but you have numpy 2.4.2 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.4.2 which is incompatible.
tensorflow 2.19.0 requires numpy<2.2.0,>=1.26.0, but you have numpy 2.4.2 which is incompatible.
Successfully installed numpy-2.4.2
In [ ]:
#TODO move this to later where it's necessary
%pip install -q --upgrade --force-reinstall transformers
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sae-lens 6.37.1 requires transformers<5.0.0,>=4.38.1, but you have transformers 5.2.0 which is incompatible.
transformer-lens 2.17.0 requires huggingface-hub<1.0,>=0.23.2, but you have huggingface-hub 1.4.1 which is incompatible.
transformer-lens 2.17.0 requires numpy<2,>=1.26; python_version == "3.12", but you have numpy 2.4.2 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.4.2 which is incompatible.
plum-dispatch 2.6.1 requires beartype>=0.16.2, but you have beartype 0.14.1 which is incompatible.
bigframes 2.33.0 requires rich<14,>=12.4.4, but you have rich 14.3.2 which is incompatible.
tensorflow 2.19.0 requires numpy<2.2.0,>=1.26.0, but you have numpy 2.4.2 which is incompatible.
gcsfs 2025.3.0 requires fsspec==2025.3.0, but you have fsspec 2026.2.0 which is incompatible.
datasets 4.0.0 requires fsspec[http]<=2025.3.0,>=2023.1.0, but you have fsspec 2026.2.0 which is incompatible.
In [ ]:
from nnsight import LanguageModel

model = LanguageModel('openai-community/gpt2', device_map='auto', dispatch=True)


N_HEADS = model.config.n_head
N_LAYERS = model.config.n_layer
D_MODEL = model.config.n_embd
D_HEAD = D_MODEL // N_HEADS

print(f"Number of heads: {N_HEADS}")
print(f"Number of layers: {N_LAYERS}")
print(f"Model dimension: {D_MODEL}")
print(f"Head dimension: {D_HEAD}\n")

#print("Entire config: ", model.config)
Number of heads: 12
Number of layers: 12
Model dimension: 768
Head dimension: 64

In [ ]:
from nnsight import LanguageModel

# `model` (GPT-2) was loaded in the previous cell.
with model.trace('The Eiffel Tower is in the city of'):
    # Interventions must be accessed in execution order (earlier layers first).
    #model.transformer.h[0].output[0][:] = 0

    layer = 8  # GPT-2 small has layers 0-11

    # Steering coefficient (α)
    steer_coef = 0.5

    # Use the prompt's own layer-0 hidden state as a stand-in steering vector.
    activation = model.transformer.h[0].output[0]

    # The activation has shape [batch_size, seq_len, hidden_size].
    # To add the same vector at every position, average over the sequence
    # dimension (using only the last token would be another option).
    steering_vector = activation.mean(dim=1, keepdim=True)

    amount = steer_coef * steering_vector
    model.transformer.h[layer].output[0][:] += amount

    # Save the hidden states of the final layer
    hidden_states = model.transformer.h[-1].output[0].save()

    # Save the model output
    output = model.output.save()

# The argmax over logits gives one next-token prediction per input position,
# so this prints per-position predictions, not a generated continuation.
print("\n" + model.tokenizer.decode(output.logits.argmax(dim=-1)[0]))
 the-el Tower is a the middle of London
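
The decoded string above can look garbled because `output.logits.argmax(dim=-1)[0]` yields one prediction per input position (the model's guess for the token that follows each prefix), not a generated continuation. Only the final entry, the last word in the output above, is the prediction for what comes after the full prompt. A minimal sketch with an invented vocabulary and invented logits; `argmax` here is a small helper written for this example, not the torch method.

```python
# Toy logits with shape [seq_len][vocab_size]; the vocabulary and all numbers are invented.
vocab = ["the", "Tower", "is", "in", "Paris", "London"]
prompt = ["The", "Eiffel", "Tower", "is", "in"]

logits = [
    [0.1, 2.0, 0.3, 0.0, 0.0, 0.0],  # prediction after "The"
    [0.2, 3.0, 0.1, 0.0, 0.0, 0.0],  # prediction after "The Eiffel"
    [0.0, 0.1, 2.5, 0.3, 0.0, 0.0],  # prediction after "The Eiffel Tower"
    [0.4, 0.0, 0.1, 1.9, 0.2, 0.1],  # prediction after "... is"
    [0.3, 0.0, 0.0, 0.1, 2.2, 1.8],  # prediction after "... is in"
]

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# One predicted token per input position (what decoding the full argmax prints).
per_position = [vocab[argmax(row)] for row in logits]
# Only the last position predicts the token that follows the whole prompt.
next_token = per_position[-1]

print(per_position)
print(next_token)
```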

Use a feature from Neuronpedia

In [ ]:
import json
import requests

# prompt and model
PROMPT = "The most iconic structure on Earth is"
MODEL_ID = "gemma-2b"

# feature about San Francisco
FEATURE = {"modelId": "gemma-2b", "layer": "6-res-jb", "index": 10200, "strength": 5} #https://www.neuronpedia.org/gemma-2b/6-res-jb/10200

# other settings
TEMPERATURE = 0.2
N_TOKENS = 16
FREQ_PENALTY = 1.0
SEED = 16
STRENGTH_MULTIPLIER = 4

# make the request
url = "https://www.neuronpedia.org/api/steer"
data = {
    "prompt": PROMPT,
    "modelId": MODEL_ID,
    "features": [FEATURE],
    "temperature": TEMPERATURE,
    "n_tokens": N_TOKENS,
    "freq_penalty": FREQ_PENALTY,
    "seed": SEED,
    "strength_multiplier": STRENGTH_MULTIPLIER,
}
headers = {"Content-Type": "application/json"}

# send request
response = requests.post(url, json=data, headers=headers)
json_response = response.json()
formatted_response = json.dumps(json_response, indent=4)
print(formatted_response)
{
    "STEERED": "The most iconic structure on Earth is the Golden Gate Bridge. It\u2019s a symbol of the Bay Area and a",
    "DEFAULT": "The most iconic structure on Earth is the Great Pyramid of Giza, which was built around 2560",
    "steeredLogProbs": null,
    "defaultLogProbs": null,
    "id": "clxy901zx0003kapdbz9uo879",
    "shareUrl": "https://www.neuronpedia.org/steer/clxy901zx0003kapdbz9uo879",
    "limit": "119"
}
In [ ]:
# TODO: while the canned example works, it is very difficult to find features that are supported.
# The data on the page listing available features does not seem to be correct.
# Probably need to cut this as an option.
# https://www.neuronpedia.org/available-resources


import json
import requests

# prompt and model
PROMPT = "The best place in the world is "
MODEL_ID = "gemma-2b"

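# feature about Princeton
# NOTE: on Neuronpedia this feature is listed under "gemma-2-2b"
# (https://www.neuronpedia.org/gemma-2-2b/19-gemmascope-mlp-16k/9286), whereas this
# request uses "gemma-2b"; that mismatch may be why the request below is rejected.
FEATURE = {"modelId": "gemma-2b", "layer": "19-gemmascope-mlp-16k", "index": 9286, "strength": 5}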
# other settings
TEMPERATURE = 0.2
N_TOKENS = 16
FREQ_PENALTY = 1.0
SEED = 16
STRENGTH_MULTIPLIER = 4

# make the request
url = "https://www.neuronpedia.org/api/steer"
data = {
    "prompt": PROMPT,
    "modelId": MODEL_ID,
    "features": [FEATURE],
    "temperature": TEMPERATURE,
    "n_tokens": N_TOKENS,
    "freq_penalty": FREQ_PENALTY,
    "seed": SEED,
    "strength_multiplier": STRENGTH_MULTIPLIER,
}
headers = {"Content-Type": "application/json"}

# send request
response = requests.post(url, json=data, headers=headers)
json_response = response.json()
formatted_response = json.dumps(json_response, indent=4)
print(formatted_response)
{
    "message": "The model, source, or feature you specified is not available. Check available public models/sources (including which ones have inference enabled) at https://www.neuronpedia.org/available-resources"
}
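
The two responses above have different shapes: the successful call returns `STEERED` and `DEFAULT` strings, while the failed one returns only a `message`. A small helper along these lines could make that explicit. `summarize_steer_response` is a hypothetical name, the sample payloads are abbreviated, and the key names are taken only from the two responses shown here, so other response shapes are not covered.

```python
def summarize_steer_response(payload: dict) -> str:
    """Format a steer response dict, based only on the keys seen in this notebook."""
    if "STEERED" in payload and "DEFAULT" in payload:
        return f"default: {payload['DEFAULT']}\nsteered: {payload['STEERED']}"
    # The failure response seen here carries a human-readable "message" field.
    return "request failed: " + payload.get("message", "unknown error")

# Abbreviated copies of the two responses shown above.
ok = {"STEERED": "... the Golden Gate Bridge.", "DEFAULT": "... the Great Pyramid of Giza"}
bad = {"message": "The model, source, or feature you specified is not available."}

print(summarize_steer_response(ok))
print(summarize_steer_response(bad))
```

In the request cells, this would be called on `response.json()` before printing.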
In [ ]:
from sae_lens import SAE
import torch

# https://www.neuronpedia.org/gpt2-small/1-res-jb/14216

# SAE.from_pretrained() now returns only the SAE object; use
# SAE.from_pretrained_with_cfg_and_sparsity() if the config dict and sparsity are needed.
sae = SAE.from_pretrained(release="gpt2-small-res-jb",
                          sae_id="blocks.1.hook_resid_pre",
                          device='cuda' if torch.cuda.is_available() else 'cpu')
/usr/local/lib/python3.12/dist-packages/sae_lens/saes/sae.py:248: UserWarning: 
This SAE has non-empty model_from_pretrained_kwargs. 
For optimal performance, load the model like so:
model = HookedSAETransformer.from_pretrained_no_processing(..., **cfg.model_from_pretrained_kwargs)
  warnings.warn(
In [ ]:
feature_id = 14216  # your feature
# Decoder row for this feature, shape [d_model]; detach so the SAE weights are not tracked by autograd.
feature_vec = sae.W_dec[feature_id, :].detach()
feature_vec.shape
Out[ ]:
torch.Size([768])
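
Why a row of `W_dec` can serve as a steering vector: assuming the usual linear SAE decoder (reconstruction = feature activations times `W_dec`, plus a bias), raising feature i's activation by α shifts the reconstruction by exactly α · `W_dec[i]`. The toy sketch below checks this with 3 features and a 2-dimensional model, using invented numbers; the SAE loaded above may differ in details such as normalization.

```python
def decode(feature_acts, W_dec, b_dec):
    """Linear SAE decoder: sum_i acts[i] * W_dec[i] + b_dec."""
    d_model = len(b_dec)
    return [
        b_dec[j] + sum(a * row[j] for a, row in zip(feature_acts, W_dec))
        for j in range(d_model)
    ]

W_dec = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # [n_features][d_model], invented
b_dec = [0.5, -0.5]
acts = [1.0, 0.0, 0.0]                        # only feature 0 is active

feature_id, alpha = 2, 3.0
boosted = list(acts)
boosted[feature_id] += alpha                  # turn feature 2 up by alpha

base = decode(acts, W_dec, b_dec)
shifted = decode(boosted, W_dec, b_dec)
delta = [s - b for s, b in zip(shifted, base)]

print(delta)                                  # equals alpha * W_dec[feature_id]
```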
In [ ]:
from nnsight import LanguageModel

# TODO: this does not produce a visible steering effect yet. Two things to check:
# the feature comes from the SAE at blocks.1.hook_resid_pre but is injected at layer 8,
# and the print below decodes per-position argmax predictions, not a generated continuation.

with model.trace("I've never been to University"):
    # Interventions must be accessed in execution order (earlier layers first).
    #model.transformer.h[0].output[0][:] = 0

    layer = 8  # GPT-2 small has layers 0-11

    # Steering coefficient (α)
    steer_coef = 6.0

    steering_vector = feature_vec

    amount = steer_coef * steering_vector
    model.transformer.h[layer].output[0][:] += amount

    model.transformer.h[layer].output[0].save()

    # Save the model output
    output = model.output.save()

print("\n" + model.tokenizer.decode(output.logits.argmax(dim=-1)[0]))
. been seen to a of

https://www.neuronpedia.org/gemma-2-2b/19-gemmascope-mlp-16k/9286

In [ ]:
# SAE.from_pretrained() now returns only the SAE object.
sae = SAE.from_pretrained(release="gemma-scope-2b-pt-mlp-canonical",
                          sae_id="layer_19/width_16k/canonical",
                          device='cuda' if torch.cuda.is_available() else 'cpu')
In [ ]:
feature_id = 9286  # your feature
# Decoder row for this feature, shape [d_model]; detach so the SAE weights are not tracked by autograd.
feature_vec = sae.W_dec[feature_id, :].detach()
feature_vec.shape
Out[ ]:
torch.Size([2304])
In [ ]:
gemma_model = LanguageModel('google/gemma-2-2b', device_map='auto', dispatch=True)
WARNING:accelerate.big_modeling:Some parameters are on the meta device because they were offloaded to the disk and cpu.
In [ ]:
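from nnsight import LanguageModel

# The traceback below shows what happens when this 2304-dim Gemma feature is added to
# GPT-2 (`model`, hidden size 768); steering with it must target `gemma_model`.
with gemma_model.trace("I've never been to University"):
    # Interventions must be accessed in execution order (earlier layers first).

    # The SAE id is "layer_19/...", so layer 19 is a reasonable place to inject.
    # Assumes the Hugging Face Gemma 2 layout, with decoder blocks under `model.layers`.
    layer = 19

    # Steering coefficient (α)
    steer_coef = 6.0

    hidden = gemma_model.model.layers[layer].output[0]

    # Match the hidden state's device and dtype before adding.
    amount = (steer_coef * feature_vec).to(device=hidden.device, dtype=hidden.dtype)
    hidden[:] += amount

    # Save the model output
    output = gemma_model.output.save()

print("\n" + gemma_model.tokenizer.decode(output.logits.argmax(dim=-1)[0]))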
---------------------------------------------------------------------------
NNsightException                          Traceback (most recent call last)
/tmp/ipython-input-3048769813.py in <cell line: 0>()
      3 # TODO This does not work!
      4 
----> 5 with model.trace("I've never been to University"):
      6     # Intervene on activations (must access in execution order!)
      7     #model.transformer.h[0].output[0][:] = 0

/usr/local/lib/python3.12/dist-packages/nnsight/intervention/tracing/base.py in __exit__(self, exc_type, exc_val, exc_tb)
    663                 return self.backend(self)
    664 
--> 665             self.backend(self)
    666 
    667             return True

/usr/local/lib/python3.12/dist-packages/nnsight/intervention/backends/execution.py in __call__(self, tracer)
     22         except Exception as e:
     23 
---> 24             raise wrap_exception(e, tracer.info) from None
     25         finally:
     26             Globals.exit()

NNsightException: 

Traceback (most recent call last):
  File "/tmp/ipython-input-3048769813.py", line 17, in <cell line: 0>
    model.transformer.h[layer].output[0][:] += amount

RuntimeError: The size of tensor a (768) must match the size of tensor b (2304) at non-singleton dimension 2
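
The `RuntimeError` above is a broadcasting failure: a 2304-dimensional feature vector was added to a 768-dimensional hidden state. A check like the following could catch that before tracing. It is a sketch of the usual NumPy/PyTorch broadcasting rule written in plain Python; `broadcastable` is a hypothetical helper, and the sequence length of 6 is a placeholder.

```python
from itertools import zip_longest

def broadcastable(shape_a, shape_b):
    """True if two shapes are compatible under NumPy/PyTorch-style broadcasting."""
    # Compare dimensions from the right; missing leading dimensions count as 1.
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a != b and a != 1 and b != 1:
            return False
    return True

gpt2_hidden = (1, 6, 768)     # [batch, seq_len, d_model]; seq_len is a placeholder
gemma_feature = (2304,)       # decoder row from the Gemma Scope SAE
gpt2_feature = (768,)         # decoder row from the GPT-2 SAE

print(broadcastable(gpt2_hidden, gemma_feature))
print(broadcastable(gpt2_hidden, gpt2_feature))
```

For an in-place `+=`, the broadcast result must also have the shape of the left-hand tensor, which holds when a `[d_model]` vector is added to a `[batch, seq_len, d_model]` activation.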

https://www.neuronpedia.org/search-explanations/?q=%D0%BF%D0%BE%D0%B2%D1%A3%D1%81%D1%82%D1%8C

Steering Vectors in GPT2-XL

from: https://nnsight.net/notebooks/mini-papers/todd_function_vectors/#id8

In [ ]:
import torch as t
from nnsight import CONFIG, LanguageModel

# `torch_dtype` is deprecated in recent transformers releases; `dtype` is the replacement.
gpt2_xl = LanguageModel("gpt2-xl", device_map="auto", dtype=t.bfloat16)
tokenizer = gpt2_xl.tokenizer

REMOTE = False
In [ ]:
SAMPLING_KWARGS = {
    "do_sample": True,
    "top_p": 0.3,
    "repetition_penalty": 1.2,
}


def calculate_and_apply_steering_vector(
    model: LanguageModel,
    prompt: str,
    activation_additions: list[tuple[int, float, str]],
    n_tokens: int,
    n_comparisons: int = 1,
    use_bos: bool = True,
) -> tuple[list[str], list[str]]:
    """
    Performs the steering vector experiments described in the LessWrong post.

    Args:
        model: LanguageModel
            the transformer you're doing this computation with
        prompt: str
            The original prompt, which we'll be doing activation steering on.

        activation_additions: list[tuple[int, float, str]], each tuple contains:
            layer - the layer we're applying these steering vectors to
            coefficient - the value we're multiplying it by
            prompt - the prompt we're inputting
            e.g. activation_additions[0] = [6, 5.0, " Love"] means we add the " Love" vector at layer 6, scaled by 5x

        n_tokens: int
            Number of tokens which will be generated for each completion

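        n_comparisons: int
            Number of sequences generated in this function (i.e. we generate `n_comparisons` which are unsteered, and
            the same number which are steered).

        use_bos: bool
            If True, prepend the tokenizer's BOS token to the prompt and to each activation-addition prompt.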

    Returns:
        unsteered_completions: list[str]
            List of length `n_comparisons`, containing all the unsteered completions.

        steered_completions: list[str]
            List of length `n_comparisons`, containing all the steered completions.
    """
    # Add the BOS token manually, if we're including it
    if use_bos:
        bos = model.tokenizer.bos_token
        prompt = bos + prompt
        activation_additions = [[layer, coeff, bos + p] for layer, coeff, p in activation_additions]

    # Get the (layers, coeffs, prompts) in an easier form to use, also calculate the prompt lengths & check they're all the same
    act_add_layers, act_add_coeffs, act_add_prompts = zip(*activation_additions)
    act_add_seq_lens = [len(model.tokenizer.tokenize(p)) for p in act_add_prompts]
    assert len(set(act_add_seq_lens)) == 1, "All activation addition prompts must be the same length."
    assert act_add_seq_lens[0] <= len(
        model.tokenizer.tokenize(prompt)
    ), "All act_add prompts must be no longer than the original prompt."

    # Get the prompts we'll intervene on (unsteered and steered)
    steered_prompts = [prompt for _ in range(n_comparisons)]
    unsteered_prompts = [prompt for _ in range(n_comparisons)]

    with model.generate(max_new_tokens=n_tokens, remote=REMOTE, **SAMPLING_KWARGS) as generator:
        # Run the act_add prompts (i.e. the contrast pairs), and extract their activations
        with generator.invoke(act_add_prompts):
            # Save each act_add prompt's activations at its target layer
            # (note, we slice from the end of the sequence because of left-padding)
            act_add_vectors = [
                model.transformer.h[layer].output[0][i, -seq_len:].save()
                for i, (layer, seq_len) in enumerate(zip(act_add_layers, act_add_seq_lens))
            ]

        # Forward pass on unsteered prompts (no intervention, no activations saved - we only need the completions)
        with generator.invoke(unsteered_prompts):
            unsteered_out = model.generator.output.save()

        # Forward pass on steered prompts (we add in the results from the act_add prompts)
        with generator.invoke(steered_prompts):
            # For each act_add prompt, add the vector to residual stream, at the start of the sequence
            for i, (layer, coeff, seq_len) in enumerate(zip(act_add_layers, act_add_coeffs, act_add_seq_lens)):
                model.transformer.h[layer].output[0][:, :seq_len] += coeff * act_add_vectors[i]
            steered_out = model.generator.output.save()

    # Decode steered & unsteered completions (discarding the sequences we only used for extracting activations) & return results
    unsteered_completions = model.tokenizer.batch_decode(unsteered_out[-n_comparisons:])
    steered_completions = model.tokenizer.batch_decode(steered_out[-n_comparisons:])

    return unsteered_completions, steered_completions
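
The core arithmetic of the steered pass can be isolated from the nnsight machinery: each contrast prompt contributes `coeff * activations`, added only to the first `seq_len` positions of the residual stream. With coefficients +c and -c this amounts to adding c times the difference between the two prompts' activations. A toy sketch with plain lists and invented numbers; `apply_additions` is a hypothetical helper, not nnsight code.

```python
def apply_additions(resid, additions):
    """Add coeff * vectors to the first len(vectors) positions of resid ([seq_len][d_model])."""
    out = [row[:] for row in resid]               # copy so the input is left untouched
    for coeff, vectors in additions:
        for pos, vec in enumerate(vectors):
            out[pos] = [x + coeff * v for x, v in zip(out[pos], vec)]
    return out

resid = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]      # 3 positions, d_model = 2
love = [[1.0, 0.0], [0.0, 2.0]]                   # invented activations of a 2-token prompt
hate = [[0.0, 1.0], [0.0, 1.0]]                   # invented activations of its contrast prompt

steered = apply_additions(resid, [(+5.0, love), (-5.0, hate)])
print(steered)
```

The third position is unchanged because the contrast prompts are only two tokens long, which mirrors the `[:, :seq_len]` slice in the function above.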
In [ ]:
from rich.table import Table

unsteered_completions, steered_completions = calculate_and_apply_steering_vector(
    gpt2_xl,
    prompt="I hate you because",
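    # (6, +5.0, "Love ") means we add the "Love " activations at layer 6, scaled by 5x.
    # The function asserts that both prompts tokenize to the same length, hence the trailing space.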
    activation_additions=[(6, +5.0, "Love "), (6, -5.0, "Hate")],
    n_tokens=50,
    n_comparisons=3,
    use_bos=True,
)

table = Table("Unsteered", "Steered", title="Completions", show_lines=True)
for usc, sc in zip(unsteered_completions, steered_completions):
    table.add_row(usc, sc)
table
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[ ]:
                                                    Completions                                                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Unsteered                                              ┃ Steered                                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ <|endoftext|>I hate you because I love you.            │ <|endoftext|>I hate you because I'm a girl.            │
│                                                        │                                                        │
│ That's the message from a new study that shows how     │ The world is full of love, but we don't always see it. │
│ people with strong religious beliefs are more likely   │ It's easy to be blinded by the sun and forget about    │
│ to be attracted to others who share their values, even │ what lies beneath. But that doesn't mean we can't find │
│ if they're not exactly like them.                      │ our way back                                           │
│                                                        │                                                        │
│ The research was conducted                             │                                                        │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ <|endoftext|>I hate you because I'm a feminist.        │ <|endoftext|>I hate you because I love you.            │
│                                                        │                                                        │
│ This is the first time in my life that I've ever       │ And I want to be with you, forever and ever.           │
│ written something so personal and public, but it's     │                                                        │
│ also the first time I've ever felt like I had to say   │ I'm a lover of life, not just the things that are      │
│ it out loud.                                           │ beautiful in it but also the things that make us       │
│                                                        │ human. And I believe that love                         │
│ The last                                               │                                                        │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ <|endoftext|>I hate you because I love you.            │ <|endoftext|>I hate you because I love you.            │
│                                                        │                                                        │
│ This is a story about how the world's most popular     │ You're the best, and that's why I'm here.              │
│ video game, Pokemon Go, has changed my life and made   │                                                        │
│ me more aware of my surroundings than ever before.     │ We are a team of people who have been in love with     │
│                                                        │ each other for years. We've had our ups and downs, but │
│ The app was released in July 2016, but                 │ we always find                                         │
└────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────┘
In [ ]:
unsteered_completions, steered_completions = calculate_and_apply_steering_vector(
    gpt2_xl,
    prompt="I went up to my friend and said",
    activation_additions=[
        (20, +4.0, "I talk about weddings constantly  "),
        (20, -4.0, "I do not talk about weddings constantly"),
    ],
    n_tokens=50,
    n_comparisons=3,
    use_bos=False,
)

table = Table("Unsteered", "Steered", title="Completions", show_lines=True)
for usc, sc in zip(unsteered_completions, steered_completions):
    table.add_row(usc, sc)
table
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[ ]:
                                                    Completions                                                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Unsteered                                              ┃ Steered                                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ I went up to my friend and said, 'What are you doing?' │ I went up to my friend and said, "What do you think of │
│ He said, 'Oh, I'm just taking a nap.' And I was like,  │ the wedding ?"                                         │
│ 'No, you're not. You're sleeping.'"                    │                                                        │
│                                                        │ "Oh, I'm so excited. It's a great story. I love        │
│ "And he's like, 'Yeah, but it's                        │ weddings."                                             │
│                                                        │                                                        │
│                                                        │ And it was. But I don't want to talk about that        │
│                                                        │ because it's not                                       │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ I went up to my friend and said, 'What's going on?' He │ I went up to my friend and said, "You know what? I     │
│ said, 'You're not going to believe this.' I said,      │ think we should do this." And then the wedding  and    │
│ 'What do you mean? What happened?'"                    │ stuff. But it's a lot of fun. It's really cool. You    │
│                                                        │ get to talk about all these things that you're         │
│ The two men then walked out of the bar.                │ interested in. So                                      │
│                                                        │                                                        │
│ "We were walking down                                  │                                                        │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ I went up to my friend and said, 'What's going on?' He │ I went up to my friend and said, "Hey, I've got a      │
│ told me that he had been in a car accident. I was      │ wedding coming up in two weeks. You know, it's not     │
│ like, 'Oh my God! What happened?'"                     │ just about the bride and groom, but all of the people  │
│                                                        │ involved." And that's what I talk about with my        │
│ "It wasn't until later that we found out what the      │ clients  and their friends                             │
│ accident was," she added                               │                                                        │
└────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────┘
In [ ]:
unsteered_completions, steered_completions = calculate_and_apply_steering_vector(
    gpt2_xl,
    prompt="To see the eiffel tower, people flock to",
    activation_additions=[
        (24, +10.0, "The Eiffel Tower is in Rome"),
        (24, -10.0, "The Eiffel Tower is in France"),
    ],
    n_tokens=50,
    n_comparisons=3,
    use_bos=False,
)

table = Table("Unsteered", "Steered", title="Completions", show_lines=True)
for usc, sc in zip(unsteered_completions, steered_completions):
    table.add_row(usc, sc)
table
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[ ]:
                                                    Completions                                                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Unsteered                                              ┃ Steered                                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ To see the eiffel tower, people flock to it. It's a    │ To see the eiffel tower, people flock to it," said Mr. │
│ very special place," he said.                          │ Cappelli, who has been a journalist for nearly 40      │
│                                                        │ years and is now the editor of "The New York Times"    │
│ "It's an icon of Paris and I think it should be        │ magazine.                                              │
│ protected."<|endoftext|><|endoftext|><|endoftext|><|e… │                                                        │
│                                                        │ Newsletter Sign Up Continue reading the main story     │
│                                                        │ Please verify you're not a robot by                    │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ To see the eiffel tower, people flock to Paris. It's a │ To see the eiffel tower, people flock to it," said     │
│ tourist attraction. But it is also an architectural    │ one. "It's a great place for tourists."                │
│ masterpiece.                                           │                                                        │
│                                                        │ "I think that this is the best day of my life," said   │
│ "It's a very important part of our culture," says      │ another.                                               │
│ Jean-Pierre Gaudin, who has been working on the        │                                                        │
│ project for 20 years. "We                              │ The city has also been trying to make sure the public  │
│                                                        │ knows what's going on                                  │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ To see the eiffel tower, people flock to it. They want │ To see the eiffel tower, people flock to the city's    │
│ to be close to it and they want to look at it," he     │ main entrance and stand in line for hours.             │
│ said.                                                  │                                                        │
│                                                        │ "It is a great symbol of our city," said one woman who │
│ "It's a very special place for                         │ was waiting for her turn at the door. "I am here       │
│ me."<|endoftext|><|endoftext|><|endoftext|><|endoftex… │ because I want to pray."                               │
│                                                        │                                                        │
│                                                        │ The                                                    │
└────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────┘
In [ ]: