🕳️ Attention Sinks in LLMs for endless fluency
GitHub - mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks
mit-han-lab • github.com
Darren LI and Nicolay Gerold added
The sliding window attention mechanism is essentially a fixed-size attention window: the current token can attend only to a limited number of the most recent previous tokens (instead of all previous tokens).
How does this relate to the Attention Sinks paper?
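As a rough sketch of the difference (plain NumPy, illustrative function and parameter names): a sliding window alone lets each token see only its most recent predecessors, while the StreamingLLM / attention-sinks idea additionally keeps the first few tokens' KV entries around, because evicting them was observed to destabilize generation.

```python
import numpy as np

def streaming_mask(seq_len: int, window: int = 4, n_sinks: int = 2) -> np.ndarray:
    """Boolean attention mask: mask[i, j] is True if query i may attend to key j.

    - Sliding window alone: each token attends to the last `window` tokens (causal).
    - Attention sinks: additionally always attend to the first `n_sinks` tokens,
      whose KV cache entries are never evicted.
    """
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    in_window = (i - j) < window
    is_sink = j < n_sinks
    return causal & (in_window | is_sink)

print(streaming_mask(8, window=3, n_sinks=1).astype(int))
```

Dropping the `is_sink` term gives plain sliding-window attention, which is the setup the question above contrasts with.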
Text embeddings are a critical piece of many pipelines, from search, to RAG, to vector databases and more. Most embedding models are BERT/Transformer-based and typically have short context lengths (e.g., 512 tokens). That’s only about two pages of text, but documents can be very long – books, legal cases, TV screenplays, code repositories, etc. can be tens...
Long-Context Retrieval Models with Monarch Mixer
Nicolay Gerold added
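As a rough illustration of the constraint described above (illustrative helper, with whitespace words standing in for real tokenizer tokens): with a 512-token embedding model, long documents have to be chunked and embedded piece by piece, which is exactly the overhead that long-context retrieval models such as the Monarch Mixer work aim to reduce.

```python
from typing import List

def chunk_document(text: str, max_tokens: int = 512, overlap: int = 64) -> List[str]:
    """Split a long document into overlapping chunks that fit a short-context
    embedding model. Whitespace-split words approximate tokens here; a real
    pipeline would count tokens with the embedding model's own tokenizer."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_tokens, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # overlap so sentences cut at a chunk boundary are not lost
    return chunks

# Each chunk is then embedded separately and the per-chunk vectors are stored
# in the vector database, rather than one vector for the whole document.
```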
TL;DR
LLMLingua utilizes a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens in prompts. This approach enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.
microsoft • GitHub - microsoft/LLMLingua: To speed up LLMs' inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, which achieves up to 20x compression with minimal performance loss.
Nicolay Gerold added
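A simplified sketch of the underlying idea, not Microsoft's actual implementation (which adds budget control and coarse-to-fine selection to reach the reported 20x): score each prompt token's surprisal under a small causal LM such as GPT-2 and drop the most predictable, least informative tokens.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative prompt compression: keep the tokens the small LM finds hardest
# to predict (high surprisal), drop the rest.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Surprisal of token t given tokens < t (the first token has no context, so keep it).
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -logp.gather(1, ids[0, 1:, None]).squeeze(1)
    k = max(1, int(keep_ratio * surprisal.numel()))
    keep = torch.zeros(ids.shape[1], dtype=torch.bool)
    keep[0] = True
    keep[1:][surprisal.topk(k).indices] = True
    return tok.decode(ids[0][keep])

print(compress("Please kindly summarize the following long report for me in detail."))
```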
Simply accessing LLMs via APIs has limitations. Instead, combining them with other data sources and tools can enable more powerful applications. In this chapter, we will introduce LangChain as a way to overcome LLM limitations and build innovative language-based applications.
Ben Auffarth • Generative AI with LangChain: Build large language model (LLM) apps with Python, ChatGPT, and other LLMs
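A minimal sketch of the pattern the chapter points at, without LangChain's actual classes: `search_documents` and `call_llm` below are hypothetical stand-ins for a retriever and an LLM API call, showing how external data gets folded into the prompt before generation.

```python
def search_documents(query: str) -> list[str]:
    # Hypothetical stand-in: in practice this would query a vector store,
    # SQL database, or web search tool.
    return ["LangChain wraps LLM calls, prompts, tools, and memory into chains."]

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: in practice this would be an API call
    # (OpenAI, Anthropic, a local model, ...).
    return f"[answer grounded in the retrieved context for prompt: {prompt[:60]}...]"

def answer(question: str) -> str:
    context = "\n".join(search_documents(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

print(answer("What does LangChain add beyond a raw LLM API call?"))
```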
The first feature is strong meta-attention (attention of attention).
Chade-Meng Tan • Search Inside Yourself: Increase Productivity, Creativity and Happiness [ePub edition]
Self-Attention at a High Level
Don’t be fooled by me throwing around the word “self-attention” like it’s a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.
Say the following sentence is an input sentence we want to translate:
” The ...
Jay Alammar • The Illustrated Transformer
Luc Cheung added
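To make “self-attention” concrete before the walkthrough, here is a minimal single-head NumPy sketch of scaled dot-product self-attention (illustrative shapes and names): every token builds its output as a weighted sum over all tokens in the sentence, with weights from comparing its query against every key.

```python
import numpy as np

def self_attention(x: np.ndarray, w_q: np.ndarray, w_k: np.ndarray, w_v: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention for one head.

    x: (seq_len, d_model) input embeddings; w_q/w_k/w_v: (d_model, d_k) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                               # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                         # 5 tokens, d_model = 16
w = [rng.normal(size=(16, 8)) for _ in range(3)]     # d_k = 8
print(self_attention(x, *w).shape)                   # -> (5, 8)
```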