Context Compression

Definition

Context compression is the practice of condensing long inputs down to their key points or the information actually needed, saving tokens while preserving accuracy. It matters for balancing quality, latency, and cost.

Passing 10 retrieved chunks directly to an LLM from a RAG pipeline can consume thousands of tokens, and much of that text may not be relevant to the user's question. Context compression extracts or summarizes only the necessary information from lengthy retrieved documents, so that a shorter and more focused context is passed to the LLM.
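To make the token arithmetic concrete, here is a minimal sketch that estimates the size of an uncompressed context. The four-characters-per-token ratio is only a rough rule of thumb assumed for illustration; real counts depend on the model's tokenizer and the language of the text. The function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers vary by model and language."""
    return max(1, round(len(text) / chars_per_token))


def context_tokens(chunks: list[str]) -> int:
    """Estimated tokens consumed if every chunk is passed to the LLM as-is."""
    return sum(estimate_tokens(chunk) for chunk in chunks)


# Ten hypothetical retrieved chunks of 2,000 characters each.
chunks = ["x" * 2000 for _ in range(10)]
total = context_tokens(chunks)
print(total)  # 5000 under the 4-characters-per-token assumption
```

For real measurements, the tokenizer that ships with the target model should replace this heuristic.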

Why Compression Is Necessary

LLM context windows are limited, and input token counts translate directly to cost. Research has also identified the "Lost in the Middle" problem, where information in the middle of a long context tends to be overlooked. Including too much low-relevance information can likewise cause the response to lose focus. Context compression addresses both concerns: it reduces token costs and, by removing distracting material, can improve response accuracy.

Extractive and Abstractive Approaches

Context compression has two main approaches. The extractive approach pulls out only the sentences or paragraphs relevant to the question from the search results. LangChain's LLMChainExtractor, for example, calls an LLM on each chunk with the instruction "extract only the parts relevant to this question," discarding the unnecessary portions.
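The extractive idea can be sketched without an LLM. The example below is not how LLMChainExtractor works internally: it swaps the LLM's relevance judgment for a simple word-overlap test so that it runs on its own. The function names, the stopword list, and the `min_overlap` parameter are all hypothetical.

```python
import re

# Small illustrative stopword list; an assumption, not a standard resource.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in", "and", "for", "on"}


def _terms(text: str) -> set[str]:
    """Lowercased content words of a text, with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def extract_relevant(chunk: str, question: str, min_overlap: int = 2) -> str:
    """Keep only the sentences that share at least min_overlap terms with the question."""
    question_terms = _terms(question)
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if len(_terms(s) & question_terms) >= min_overlap]
    return " ".join(kept)


chunk = (
    "The cache stores rendered pages. Our office is in Berlin. "
    "Cache entries expire after 300 seconds."
)
compressed = extract_relevant(chunk, "When do cache entries expire?")
print(compressed)  # Cache entries expire after 300 seconds.
```

Because the kept sentences are copied verbatim, figures such as "300 seconds" survive unchanged. The weakness of a lexical test is that it misses paraphrases, which is one reason an LLM or embedding model is typically used for this step.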

The abstractive approach summarizes the search results as a whole into a shorter form. It integrates information spanning multiple chunks and generates a concise summary focused on the question. Because integration and compression happen in the same step, this approach has the advantage of restructuring fragmented search results into coherent context.
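A minimal sketch of the abstractive approach follows. It assumes the LLM is available as a plain callable that takes a prompt string and returns text; that interface, the prompt wording, and the function names are hypothetical and do not correspond to any particular library.

```python
from typing import Callable


def build_summary_prompt(question: str, chunks: list[str], max_words: int = 80) -> str:
    """Combine all chunks into one question-focused summarization prompt."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        f"Summarize the passages below in at most {max_words} words, "
        "keeping only what helps answer the question. "
        "Preserve numbers and names exactly.\n\n"
        f"Question: {question}\n\nPassages:\n{numbered}"
    )


def compress_abstractive(
    question: str,
    chunks: list[str],
    llm: Callable[[str], str],  # hypothetical interface: prompt in, text out
    max_words: int = 80,
) -> str:
    """One LLM call that merges and shortens all chunks at once."""
    return llm(build_summary_prompt(question, chunks, max_words)).strip()


# Stand-in for a real model so the sketch runs without network access.
def fake_llm(prompt: str) -> str:
    return " Entries expire after 300 seconds; pages are cached per URL. "


summary = compress_abstractive(
    "When do cache entries expire?",
    ["Cache entries expire after 300 seconds.", "Pages are cached per URL."],
    fake_llm,
)
print(summary)
```

Note that this design makes a single call over all chunks, whereas per-chunk extraction makes one call per chunk.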

Combining with Filtering

As a pre-compression step, filtering out low-relevance documents is also effective. A reranking model scores each chunk, chunks scoring below a threshold are excluded, and compression is applied only to the chunks that remain. Because fewer chunks reach the compression stage, the number of LLM calls is reduced as well. A pipeline of reranking, filtering, compression, and generation tends to give a good balance of accuracy and cost.
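The filter-then-compress step can be sketched as follows. The relevance scores are assumed to come from some reranker and are hard-coded here; the 0.5 threshold, the `ScoredChunk` type, and the `compress` callable are hypothetical placeholders rather than any library's API.

```python
from typing import Callable, NamedTuple, Optional


class ScoredChunk(NamedTuple):
    text: str
    score: float  # relevance score from a reranker; the scale is an assumption


def filter_then_compress(
    chunks: list[ScoredChunk],
    threshold: float,
    compress: Callable[[str], str],
    top_k: Optional[int] = None,
) -> list[str]:
    """Drop low-scoring chunks first, then compress only the survivors."""
    kept = sorted(
        (c for c in chunks if c.score >= threshold),
        key=lambda c: c.score,
        reverse=True,
    )
    if top_k is not None:
        kept = kept[:top_k]
    compressed = [compress(c.text) for c in kept]
    # Discard chunks for which compression found nothing relevant.
    return [text for text in compressed if text]


calls = []

def counting_compress(text: str) -> str:
    """Stand-in compressor that records how often it is invoked."""
    calls.append(text)
    return text.split(".")[0] + "."


reranked = [
    ScoredChunk("Entries expire after 300 seconds. Unrelated detail.", 0.91),
    ScoredChunk("Our office is in Berlin.", 0.12),
    ScoredChunk("The cache is keyed by URL. More detail.", 0.64),
]
context = filter_then_compress(reranked, threshold=0.5, compress=counting_compress)
print(context, len(calls))
```

Here the compressor runs twice instead of three times, because the low-scoring chunk never reaches it. A suitable threshold depends on the reranker's score distribution and has to be tuned on real queries.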

Risks and Countermeasures

Context compression comes with caveats. There is a risk of losing important details or numerical values during compression. Additionally, using an LLM for the compression itself incurs extra inference costs. It is therefore important to quantitatively measure the effect of compression and compare response quality with and without compression. Adjusting the degree of compression while monitoring the balance between token savings and response accuracy is the practical approach.
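Two of these checks are easy to automate. The sketch below measures how much text was removed and flags numbers that appear in the original but not in the compressed version. Both helpers are hypothetical illustrations; the character-based ratio and the simple number pattern are assumptions, and neither replaces an evaluation of answer quality with and without compression.

```python
import re

NUMBER_PATTERN = r"\d+(?:[.,]\d+)*"  # matches 2023, 12.5, 4,200


def compression_ratio(original: str, compressed: str) -> float:
    """Fraction of characters removed (0.0 means nothing was removed)."""
    if not original:
        return 0.0
    return 1.0 - len(compressed) / len(original)


def missing_numbers(original: str, compressed: str) -> set[str]:
    """Numbers present in the original text but absent after compression."""
    before = set(re.findall(NUMBER_PATTERN, original))
    after = set(re.findall(NUMBER_PATTERN, compressed))
    return before - after


original = "Revenue rose 12.5% to 4,200 units in 2023."
compressed = "Revenue rose to 4,200 units."

ratio = compression_ratio(original, compressed)
lost = missing_numbers(original, compressed)
print(f"{ratio:.0%} removed, missing numbers: {sorted(lost)}")
```

A non-empty result from `missing_numbers` does not always indicate a problem, since the dropped figure may be irrelevant to the question, but it is a cheap signal for deciding when to loosen the degree of compression.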