Ryan S. Chiang

Optimizing Retrieval-Augmented Generation (Strategies and Tricks)


Retrieval-augmented generation (RAG) is simple to get up and running, but optimizing it for your specific use case is more challenging and often comes down to trial and error.

Here I'll be documenting my learnings as I dive deeper into RAG optimization strategies.

This post assumes you have a basic understanding of RAG models and are looking to improve their performance, quality of responses, or scalability. Let's dive in!

Evaluation Metrics

Before we dive into optimization strategies, it's important to understand how you'll evaluate the performance of your RAG model.

Here are some tools and strategies:

Automatic Evaluation Metrics

  • RAGAS: Most popular RAG evaluation framework
  • ARES: Another automated RAG evaluation framework
  • ROUGE: Popular metric for evaluating text summarization
  • BLEU: Another popular metric for evaluating text generation

Prompt Benchmarking

Manual Evaluation

  • RAGExplorer: Visualize and explore the retrieval-augmented generation process
  • LangChain Evaluators

Manually Evaluate Retrieval Quality:

  1. Create a sample set of queries and corresponding documents you expect to retrieve
  2. Retrieve documents for each query
  3. Compare the retrieved documents to the expected documents (e.g., using precision, recall, cosine similarity, F1 score, etc.)
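For the comparison step, a minimal sketch of per-query precision, recall, and F1 (the document IDs here are hypothetical):

```python
def retrieval_scores(retrieved_ids: set[str], expected_ids: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 for one query's retrieved documents vs. the expected set."""
    true_positives = len(retrieved_ids & expected_ids)
    precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0
    recall = true_positives / len(expected_ids) if expected_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# retrieval_scores({"doc_3", "doc_7"}, {"doc_3", "doc_9"})
# -> {"precision": 0.5, "recall": 0.5, "f1": 0.5}
```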

Manually Evaluate Generation Quality:

  1. Create a sample set of queries
  2. Run RAG pipeline on each query
  3. Manually inspect generated responses and retrievals
  4. Evaluate on factors such as factual correctness, relevance, coherence, etc.

Chunk Size / Boundary Conditions

Small chunks can help the model focus on specific parts of the document, but too small can lead to fragmented information and loss of context. Large chunks can provide more context but may introduce noise and irrelevant information.

Boundary conditions refer to the criteria used to determine where to split the document into chunks, such as sentence boundaries, paragraph boundaries, or section headings.

Example chunk sizes to test:

  • Character-level
    • e.g. 50 chars, 250 chars, 500 chars
  • Sentence-level
    • e.g. 1 sentence, 2 sentences, 3 sentences
  • Paragraph-level
    • e.g. 1 paragraph, 2 paragraphs
  • Section-level (e.g. separated by H1, H2, H3... headings)

The goal of segmentation (chunking) is to maximize the relevance and usefulness of the retrieved information while minimizing the noise.

Start off by experimenting with different chunk sizes and boundary conditions to find the optimal balance between context and efficiency.

In advanced stages, you can try dynamic chunking, where you adjust the chunk size based on the document structure or content.
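To make those experiments concrete, here's a minimal sketch of character-level and sentence-level chunkers in plain Python (the sizes are just the examples from above; the regex sentence splitter is intentionally naive):

```python
import re

def chunk_by_chars(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def chunk_by_sentences(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Split text into chunks of N sentences using a naive regex sentence splitter."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```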

Utilizing Formatting

If your documents are well-formatted (e.g., headings, bullet points, numbered lists), you should likely avoid splitting the document at arbitrary points (such as sentence or character level).

You almost never want a chunk that starts in the middle of a list or has a heading in the middle of it, because headings and lists represent meaningful boundaries in the document.

You can use document formatting to your advantage by splitting the document at headings, bullet points, or numbered lists, as these typically represent a shift in topic or subtopic.
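For markdown documents, a rough sketch of heading-based splitting might look like this (splitting on H1–H3 is an assumption; adjust max_level to your documents):

```python
import re

def split_on_headings(markdown: str, max_level: int = 3) -> list[str]:
    """Split a markdown document at headings up to H3, keeping each heading with its body."""
    heading_pattern = rf"^(#{{1,{max_level}}} .+)$"
    parts = re.split(heading_pattern, markdown, flags=re.MULTILINE)
    chunks: list[str] = []
    current = ""
    for part in parts:
        if re.match(heading_pattern, part):  # this part is a heading line
            if current.strip():
                chunks.append(current.strip())
            current = part + "\n"
        else:                                # this part is body text
            current += part
    if current.strip():
        chunks.append(current.strip())
    return chunks
```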

Dealing with Tables

Tables are tricky because they come in all shapes and sizes.

When ingesting a document with tables, you have a few options:

1. Ignore tables completely

Simple, but you lose potentially valuable information. Before trying to embed tables, consider whether the information in them is actually necessary for your model to generate a response.

2. Add tables to existing segments

For example, you could add the table to the end of the previous chunk or the beginning of the next chunk. This can help maintain context while reducing noise, but likely only works for smaller tables.

3. Treat tables as separate chunks

You could treat each table as a separate chunk and embed it as its own segment. This maintains contextual integrity, but can lead to a lot of noise, especially with large or dense tables.

4. Chunk up the table and propagate headings

For large tables, you can chunk up the table and prepend the table heading to each chunk to maintain context.

For example, add "Table 1: Sales Data" to the beginning of each chunk, followed by slices of rows of the table until the end of the table.

You can also use this approach for long lists or bullet points.
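Here's a rough sketch of that approach, assuming the table has already been parsed into a list of row strings (the heading text and row format are placeholders):

```python
def chunk_table(heading: str, rows: list[str], rows_per_chunk: int = 20) -> list[str]:
    """Split a large table into row slices, prepending the table heading to each slice."""
    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        slice_text = "\n".join(rows[i:i + rows_per_chunk])
        chunks.append(f"{heading}\n{slice_text}")
    return chunks

# e.g. chunk_table("Table 1: Sales Data", rows_from_your_table_parser)
```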

5. Store a reference to table data and ingest headings

Another approach for large tables, which Jerry Liu of LlamaIndex describes, involves embedding only the headings of the table and storing a reference to the raw table data (as a Pandas DataFrame, for example) elsewhere.

Then, when performing retrieval, you retrieve on the headings, fetch the associated table data, and use something like LlamaIndex's PandasQueryEngine to query that data and generate the response.

6. Generate a summary of the table

You could use an LLM in a preprocessing step to generate a summary of the table, then use that summary as a chunk in the retrieval process.

Here you could also store a reference to the table to query the raw data if needed, or use the summary as a standalone chunk.
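A minimal sketch of the summarization step using the OpenAI chat API (the model name and prompt are assumptions, not a recommendation):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_table(table_markdown: str) -> str:
    """Ask an LLM for a short, retrieval-friendly summary of a table."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; use whatever model you prefer
        messages=[
            {"role": "system", "content": "Summarize the table in 2-3 sentences, noting what it measures and any notable trends."},
            {"role": "user", "content": table_markdown},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```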

Chunk Overlap

Overlapping chunks can help maintain context between adjacent chunks, but too much overlap can introduce redundancy and increase computation time. Experiment with different overlap percentages and character counts to find the optimal balance between context and efficiency.

Example chunk overlaps to test:

  • No overlap
  • Percentage overlap: 10%, 20% overlap
  • Character overlap: 50, 100 characters

At least some level of overlap is typically beneficial because:

  1. Relevant information may be split across chunks
  2. Overlap can help maintain context between adjacent chunks
  3. Robustness to retrieval noise: in cases where retrieval fails to return the most relevant chunks, your generation model may still be able to access the necessary information

But too much overlap can result in unwanted reinforcement of information and redundancy.
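For reference, a minimal sketch of character-level chunking with a configurable overlap (the 500/50 defaults mirror the examples above):

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size chunks where each chunk repeats the last `overlap` characters of the previous one."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 10% overlap on 500-character chunks:
# chunk_with_overlap(document_text, chunk_size=500, overlap=50)
```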

Pre-Processing

Optimizing retrieval can involve pre-processing the text used for vector retrieval separately from the text used for generation.

For example, this helpful gist describes how it can be beneficial to preprocess the text before embedding using:

  • Lemmatization or stemming
  • Lowercase
  • Stop word removal

You can also add advanced NLP techniques to further preprocess, such as:

  • Pronoun and acronym disambiguation
  • Coreference resolution
  • Part-of-speech tagging
  • Named entity recognition (NER)

And, as usual, you'd ingest the preprocessed text into your vector database, and store the original text in metadata for generation.

As described in the gist, the goal of preprocessing is to "reduce the possible token space down so that we maximize our chances of finding them."

In other words, we want to reduce the dimensionality of the text (the possible token space) while preserving its semantic meaning.

Your LLM used for generation will never see the preprocessed text, so you can try anything and everything to optimize for retrieval.
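As an illustration, here's a rough preprocessing sketch using NLTK (assuming the punkt, stopwords, and wordnet resources are downloaded; spaCy or another library would work just as well):

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_for_embedding(text: str) -> str:
    """Lowercase, drop stop words, and lemmatize before embedding.
    The original text stays in metadata and is what the LLM sees at generation time."""
    tokens = word_tokenize(text.lower())
    kept = [lemmatizer.lemmatize(t) for t in tokens if t.isalnum() and t not in STOP_WORDS]
    return " ".join(kept)
```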

Embedding Models

Your choice of embedding model can have a real impact on the quality of your retrievals.

Popular choices are text-embedding-3-small, text-embedding-3-large, or text-embedding-ada-002 from OpenAI.

Personally, I've been satisfied with text-embedding-3-small for ease of use and affordability.

But you have a lot of other options beyond OpenAI.

I'd recommend browsing the MTEB English Leaderboard to find the best model for your use case.
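For reference, a minimal sketch of batching embeddings with the OpenAI client (using the model mentioned above):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed a batch of chunk texts and return one vector per chunk."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]
```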

Language-Specific Embeddings

Some models are trained on specific languages, and may perform better on documents in that language.

If your document is in a language other than English, you may want to consider using a model trained or fine-tuned on that language.

Embedding Fine-Tuning

As with each model used in the pipeline, you can fine-tune the embedding model on your specific domain or dataset to improve the relevance of the retrieved documents.

This is often a "last resort" as it can be computationally expensive and time-consuming, but it can be a powerful optimization strategy.

Top K Retrieval

This is pretty self-explanatory, but worth mentioning.

Limiting the number of retrieved documents can help reduce noise and improve the quality of the generated response. Experiment with different values of K to find the optimal number of documents to retrieve.

Example top K values to test:

  • K = 1, 3, 5, 10, 20

Top k = 4 is a common starting place, but depending on your segmentation strategy, you may find that a higher or lower top k is more effective.
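If you're rolling your own retrieval, top K is just a slice of the most similar chunks. A minimal sketch with NumPy and cosine similarity, assuming embeddings are already computed:

```python
import numpy as np

def top_k_retrieve(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 4) -> list[int]:
    """Return the indices of the k chunks most similar to the query (cosine similarity)."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-10
    )
    return np.argsort(sims)[::-1][:k].tolist()
```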

Dynamic Top K

You can also experiment with dynamic top K retrieval, where you adjust the top K based on the query or the retrieved documents.

Queries that are highly specific may benefit from a lower top K, while more general or summarizing queries may benefit from a higher top K. Implementing a dynamic top K would involve a semantic analysis of the query—such as keyword extraction, topic modeling, or adding an agentic step—to determine the optimal top K.

Retrieval Pruning

In addition to restricting top K, you can prune the retrievals based on relevance or other criteria.

A common strategy would be to set a lower bound threshold for similarity scores, but ensure at least N retrievals are returned regardless.
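A sketch of that pruning rule, assuming you already have (chunk, score) pairs from your retriever (the threshold and minimum are arbitrary examples):

```python
def prune_retrievals(
    scored_chunks: list[tuple[str, float]], threshold: float = 0.75, min_results: int = 2
) -> list[str]:
    """Keep chunks above a similarity threshold, but always return at least `min_results`."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    kept = [chunk for chunk, score in ranked if score >= threshold]
    if len(kept) < min_results:
        kept = [chunk for chunk, _ in ranked[:min_results]]
    return kept
```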

Small-to-Big Retrieval

"Small-to-big" is a RAG technique which involves chunking small, retrieving big. This approach can help the model focus on specific parts of the document while still providing context from the entire document.

In practice, this may look like chunking into sentences, then either:

  1. Using high top K retrieval
  2. Or, using low top K with neighbor retrieval or referenced retrieval.

Referenced Retrieval

Referenced retrieval involves ingesting chunks and storing a reference to a larger document in metadata.

This is similar to the table technique mentioned earlier, but can be used for any part of the document.

This is also useful for images, videos, or other non-textual data.

For example, you may have a document broken up with H1, H2, and H3 headings.

```markdown
## Example heading

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

### Example subheading

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
```

The first, and typical approach, would be to chunk and embed the entire document, perhaps using headings as boundaries. When performing retrieval, you would perform a similarity search against each chunk.

But perhaps your input queries are more likely to be relevant to the headings, rather than the details in the body of the document. As users, we often search for information hierarchically, which is the purpose of such headings.

So instead, you could chunk and embed the H1, H2, H3 headings and store a reference to the corresponding section for each heading in metadata.

In other words, only ingest Example heading and Example subheading, and store the section as a reference in metadata.

When performing retrieval, you would retrieve the headings, then fetch the corresponding section from the metadata to generate the response.

This could improve retrieval performance because your queries are more likely to be relevant to the headings.
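A rough sketch of the ingestion side, where only the heading is embedded and the section body rides along in metadata (the record shape is hypothetical; adapt it to your vector store):

```python
from typing import Callable

def ingest_headings(
    sections: list[dict],
    embed_fn: Callable[[list[str]], list[list[float]]],  # your embedding function
) -> list[dict]:
    """Embed only the headings; keep each section body in metadata for generation.
    `sections` is assumed to look like [{"heading": "## Example heading", "body": "..."}]."""
    vectors = embed_fn([s["heading"] for s in sections])
    return [
        {
            "embedding": vec,
            "text": section["heading"],
            "metadata": {"section_body": section["body"]},
        }
        for section, vec in zip(sections, vectors)
    ]

# Upsert the returned records into your vector store; at query time, retrieve on the
# heading embeddings and pass metadata["section_body"] to the generator.
```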

Neighbor / Window Retrieval

Neighbor retrieval (aka window retrieval) involves retrieving documents that are adjacent (or within a "window") to the retrieved documents.

This can help provide additional context and reduce redundancy in the generated response.

Let's say you chunk a document into 5 chunks. When performing retrieval, you could retrieve the top K = 2 chunks, and for each retrieval chunk, retrieve the adjacent chunks as well.

If when analyzing your retrievals you notice that the model often misses important information because it's in the adjacent chunk, this could be a good strategy to try.
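A minimal sketch, assuming chunks are stored in document order and your retriever returns their indices:

```python
def with_neighbors(retrieved_indices: list[int], chunks: list[str], window: int = 1) -> list[str]:
    """Expand each retrieved chunk index to include its neighbors within `window` positions."""
    expanded = set()
    for idx in retrieved_indices:
        for neighbor in range(max(0, idx - window), min(len(chunks), idx + window + 1)):
            expanded.add(neighbor)
    return [chunks[i] for i in sorted(expanded)]
```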

Reranking

Reranking retrieved documents based on relevance or other criteria can help improve the quality of the generated response.

There are models like Cohere Rerank specifically made for this purpose.

With reranking, you retrieve with a higher top K, then rerank the documents and use only the top N for generation.

Reranking may or may not improve the quality of generations. In my personal experience, reranking has hurt the quality of generations, but it's worth testing given the affordability of such models (Cohere Rerank is $1/1000 searches as of writing).
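For completeness, a sketch using Cohere's Python client (the model name is an assumption; check their docs for the current one):

```python
import cohere

co = cohere.Client()  # assumes CO_API_KEY is set (or pass api_key=...)

def rerank_chunks(query: str, chunks: list[str], top_n: int = 4) -> list[str]:
    """Retrieve with a generous top K first, then keep only the top_n chunks after reranking."""
    results = co.rerank(
        model="rerank-english-v3.0",  # assumed model name; substitute the current one
        query=query,
        documents=chunks,
        top_n=top_n,
    )
    return [chunks[r.index] for r in results.results]
```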

Fine-tuning Rerankers

Fine-tuning a reranker on your specific domain or dataset can help improve the relevance of the retrieved documents.

This is self-explanatory, and another optimization strategy to consider.

Clustering

Clustering retrieved documents based on similarity can help identify related information and reduce redundancy in the generated response.

This only really applies if you're performing multiple generations simultaneously.

Instead of passing each retrieval to the generator separately, you can use a clustering algorithm to group similar retrievals together, then pass each cluster to the generator.

For both performance and cost-saving reasons, clustering can be a powerful optimization strategy.
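A rough sketch of grouping retrieved chunks with scikit-learn's KMeans before generation (the cluster count is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_retrievals(
    chunks: list[str], embeddings: np.ndarray, n_clusters: int = 3
) -> dict[int, list[str]]:
    """Group similar retrieved chunks so each cluster can be passed to the generator together."""
    k = min(n_clusters, len(chunks))
    labels = KMeans(n_clusters=k, n_init="auto", random_state=0).fit_predict(embeddings)
    clusters: dict[int, list[str]] = {}
    for chunk, label in zip(chunks, labels):
        clusters.setdefault(int(label), []).append(chunk)
    return clusters
```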

Query Expansion

Query expansion involves expanding the original query (using an LLM) with the goal of retrieving more relevant documents.

Here's an example:

```markdown
User Raw Query: What RAG technique involves chunking small, retrieving big?

Expanded Query: What retrieval-augmented generation technique involves segmenting or chunking smaller parts of a document and using larger retrievals to generate a response?
```

Query expansion can add more relevant (and potentially more diverse) information to the retrieval process, which can help improve the quality of the generated response.

But over-expansion could introduce irrelevant keywords that could lead to noisy retrievals, which means you'll need to carefully prompt the LLM to avoid this.
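A minimal sketch of the expansion step with the OpenAI client (the model and instructions are assumptions; tune them to keep expansions on topic):

```python
from openai import OpenAI

client = OpenAI()

def expand_query(query: str) -> str:
    """Rewrite a user query with additional relevant terms before retrieval."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice
        messages=[
            {"role": "system", "content": "Rewrite the user's query for document retrieval. Expand acronyms and add closely related terms, but do not introduce new topics."},
            {"role": "user", "content": query},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```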


Generation Optimization

While not the focus of this post, optimizing the generation process is equally important for improving the quality of the response.

Here's several strategies to consider:

Prompt Engineering

Depending on the LLM you're using, you will need to tweak your prompts according to the model's quirks and strengths.

For example, Claude works better with XML tags, both for generating structured output and adding semantic structure to the prompt.

Instead of just passing in all retrieved documents as plain text, you could wrap them in XML tags to help the model understand the structure of the context.

```markdown
<instructions>
Use the following pieces of context to answer the question:
</instructions>

<source1>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</source1>

<source2>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</source2>

<query>
What is the meaning of life?
</query>
```

OpenAI's GPTs, anecdotally, perform better with Markdown-style prompts.

The 26 Prompt Principles is a great resource for prompt engineering.

Providing examples of the expected output is also a reliable and effective practice, but will increase costs.

Hyperparameter Tuning

Many LLMs offer hyperparameters to tune performance. You can adjust temperature, top-p, frequency and presence penalties, and other parameters.

For most RAG pipelines, you should aim for deterministic output.

With OpenAI's GPTs, this can be achieved with temperature=0 or top_p=0 (but not both).

OpenAI also offers a seed parameter to attempt to generate the same response given the same parameters and seed value.
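Putting those together with the OpenAI client might look like this (the model and seed value are arbitrary):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",   # hypothetical choice
    temperature=0,         # aim for deterministic output
    seed=42,               # best-effort reproducibility across identical requests
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "<context>...</context>\n\nQuestion: ..."},
    ],
)
print(response.choices[0].message.content)
```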

Fine-Tuning

Self-explanatory: fine-tuning the generator on your specific domain or dataset can help improve the quality of the generated response. Often a "last resort" optimization strategy due to computational expense and time required.

Chain-of-Thought (CoT) Prompting

  1. The model is given a prompt that includes a question or task and is asked to generate a series of intermediate reasoning steps.
  2. The model generates a chain of thought, which consists of a sequence of steps that describe the reasoning process it follows to arrive at the final answer.
  3. The final answer is generated based on the intermediate steps, which can help improve the accuracy and transparency of the model's output.
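A minimal sketch of folding a chain-of-thought instruction into a RAG prompt template (the wording is just an example):

```python
COT_TEMPLATE = """Use the context below to answer the question.

<context>
{context}
</context>

Question: {question}

First, write out your reasoning step by step, citing which parts of the context you rely on.
Then give the final answer on a new line prefixed with "Answer:"."""

def build_cot_prompt(context: str, question: str) -> str:
    """Fill the chain-of-thought template with retrieved context and the user's question."""
    return COT_TEMPLATE.format(context=context, question=question)
```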

Multi-Agent Generation

Sometimes, breaking up the generation process into multiple agentic steps can help improve the quality of the response.

For example, your task may be to generate a summary based on retrievals, and also provide citations for each sentence in the summary.

You could attempt a multi-agent approach where one agent generates the summary, and another agent generates the citations.
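A rough sketch of that two-step split, with the actual LLM call left as a generate callable you supply:

```python
from typing import Callable

def summarize_with_citations(
    retrieved_chunks: list[str],
    question: str,
    generate: Callable[[str], str],  # your LLM call, e.g. a chat completion wrapper
) -> tuple[str, str]:
    """Agent 1 writes the summary; agent 2 maps each sentence back to its sources."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    summary = generate(
        f"Using only the numbered sources below, summarize the answer to: {question}\n\n{context}"
    )
    citations = generate(
        "For each sentence in the summary, list the source numbers it relies on.\n\n"
        f"Sources:\n{context}\n\nSummary:\n{summary}"
    )
    return summary, citations
```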


Wrapping Up

This post is a work in progress and will be updated as I continue to explore and experiment with RAG optimization strategies.

My goal is to provide a comprehensive overview of the various techniques and tricks you can use to optimize your RAG model for performance, quality, and scalability.

If you have any suggestions, feedback, or questions, feel free to reach out!

Ryan Chiang

