This is part 2 of the read-through for this paper! Check out PART 1 if you haven’t read it yet. This side of things will be far more technical.
In this post I will be going through the paper titled “Retrieval-Augmented Generation for Large Language Models: A Survey” by Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang from the Shanghai Research Institute for Intelligent Autonomous Systems [1].
Link to paper if you want to follow along.
Note: The authors updated the paper while I was writing this, so some sections are no longer included in the latest version. Here is the link to the version that includes the equations, etc. that we go through below (link).
My goal is to make a series of posts as I read these papers myself, walking through the material in a way that is easy and quick to read for others who are interested. I will highlight the things I found interesting or important and try to cover any needed background within a reasonable scope.
Anywhere you see a block quote is an excerpt from the paper in the section given in the heading:
Example of a block quote
I will also try to follow the same order as the paper.
Now we get to dive into the technical details of the second half of the paper! Let’s get to work.
Retriever
As we discussed in part 1, retrievers are our way of grabbing some K number of documents to give to the LLM to ground the answer in knowledge we want the LLM to have. The next part of the paper talks about optimizing this process in general. First up: semantic representations.
Semantic Representation
If you have tried retrieving with an embedder, you will know that it is incredibly hard to make sure you are getting back the best documents for a given query. The problem is that varying sentence lengths, queries better suited to keyword search, and other complications cannot be handled well by a sentence embedder alone. The authors give the following collection of advice for dealing with this issue.
Chunk optimization - When we ingest a document, we need to break it into chunks so that manageable pieces can be fed to the LLM, and so that we don’t exceed our embedding model’s token limit. But the right size for these chunks is extremely use-case dependent.
When choosing a chunking strategy, important considerations include: the characteristics of the content being indexed, the embedding model used and its optimal block size, the expected length and complexity of user queries, and how the retrieval results are used in a specific application. For example, different chunking models should be selected for longer or shorter content.
Another useful technique the authors give follows:
Current research in RAG employs diverse block optimization methods to improve retrieval efficiency and accuracy. Techniques such as sliding window technology implement layered retrieval by aggregating globally related information through multiple retrievals.
…
The Small2big technique utilizes small text blocks during the search process and provides larger affiliated text blocks to the language model for processing.
Keep in mind when you are applying this to your own pipeline, the important part is that the embedder returns the correct chunk. Sometimes embedding only a small but key piece of the chunk is the best option! Lastly, another extremely helpful strategy here is metadata filtering, but we talked about that in part 1.
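To make Small2big concrete, here is a minimal sketch of the idea (my own illustration, not code from the paper): we embed small child chunks for search, but hand the larger parent chunk back to the LLM. The toy embed function, chunk sizes, and index layout are all stand-ins you would swap for your real embedder and your own tuning.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedder: replace with your actual embedding model."""
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def build_small2big_index(document: str, parent_size: int = 1000, child_size: int = 200):
    """Split into large parent chunks, then embed small child chunks that point back to their parent."""
    parents = [document[i:i + parent_size] for i in range(0, len(document), parent_size)]
    index = []  # list of (child_embedding, parent_text) pairs
    for parent in parents:
        for j in range(0, len(parent), child_size):
            index.append((embed(parent[j:j + child_size]), parent))
    return index

def retrieve(query: str, index, k: int = 3):
    """Score the small chunks, but return the larger parent chunks to feed the LLM."""
    q = embed(query)
    scored = sorted(index, key=lambda item: float(q @ item[0]), reverse=True)
    results, seen = [], set()
    for _, parent in scored:
        if parent not in seen:
            seen.add(parent)
            results.append(parent)
        if len(results) == k:
            break
    return results
```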
Fine-tuning - As I wrote in part 1, I think fine-tuning an embedder for retrieval is intuitive. Show it examples - it gets better. For that reason, I’m not going to go much more in depth here. The one caveat is downstream task fine-tuning.
LLM-Embedder [Zhang et al., 2023a] uses the Large Language Model to output reward values for data from multiple downstream tasks, fine-tuning the retriever with two different supervised signals via hard labeling of the dataset and the soft reward derived from LLM.
This is an example of an LLM-in-the-loop version of downstream-task fine-tuning; the point is that we give the embedder a sense of what the actual correct answer was further down the pipeline, instead of just fine-tuning for semantic similarity.
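As a rough sketch of what that soft-reward signal could look like (my own illustration with made-up tensor shapes and a placeholder temperature, not the actual LLM-Embedder training code): we turn the LLM’s reward scores into a target distribution over the candidate documents and pull the retriever’s similarity distribution toward it.

```python
import torch
import torch.nn.functional as F

def soft_reward_loss(query_emb, doc_embs, llm_rewards, temperature=0.05):
    """
    query_emb:   (d,)   query embedding from the retriever being fine-tuned
    doc_embs:    (n, d) embeddings of the candidate documents
    llm_rewards: (n,)   scores from the LLM for how useful each doc was downstream
    """
    # The retriever's distribution over the candidate documents.
    sims = doc_embs @ query_emb / temperature            # (n,)
    retriever_log_probs = F.log_softmax(sims, dim=-1)
    # Soft target distribution derived from the LLM's rewards (treated as fixed).
    target_probs = F.softmax(llm_rewards / temperature, dim=-1).detach()
    # Minimizing this pulls the retriever's ranking toward the LLM's preferences.
    return F.kl_div(retriever_log_probs, target_probs, reduction="sum")

# Toy usage with random tensors standing in for real embeddings and rewards.
query_emb = F.normalize(torch.randn(384), dim=-1)
doc_embs = F.normalize(torch.randn(8, 384), dim=-1)
llm_rewards = torch.randn(8)
print(float(soft_reward_loss(query_emb, doc_embs, llm_rewards)))
```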
Matching Semantic Space
I think the authors say it best:
In the RAG application, some retrievers use the same embedding model to encode the query and doc, while others use two models to separately encode the query and doc. Moreover, the original query of the user may have problems of poor expression and lack of semantic information.
So we end up with two potential issues… a badly expressed query from the user and a mismatch in embedding space when we jump from query to doc.
Query Rewrite (solves the badly expressed query) - Basically, why not just ask the LLM to rewrite the query, since it knows exactly the type of context we are looking for? With few-shot prompting, we can make sure any user query is much more in line with what the embedding space is expecting. Different examples are mentioned, but one popular one is HyDE (Hypothetical Document Embeddings).
In HyDE[Gao et al., 2022], query vectors are established through the use of text indicators, using these indicators to generate a ’hypothetical’ document that is relevant, yet may not truly exist, it only needs to capture the relevant pattern.
What we are really trying to do is nudge the query embedding toward the correct answer in vector space by giving it more of the content we would expect to see, knowing which documents the user might be looking for.
See also RRR (Rewrite-Retrieve-Read), which combines the above strategy with active access to web search and then performs step-back prompting. In my experience this is less useful in most cases, because HyDE can take care of a lot of the discrepancy more easily.
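Here is a minimal HyDE-style sketch (not the paper’s implementation; llm, embed, and index are placeholders for whatever stack you use): we ask the LLM for a hypothetical answer passage and embed that instead of the raw query.

```python
def hyde_retrieve(query: str, llm, embed, index, k: int = 5):
    """
    HyDE-style retrieval sketch:
      1. Ask the LLM to write a *hypothetical* passage answering the query.
      2. Embed that hypothetical passage instead of the raw query.
      3. Search the vector index with the hypothetical embedding.
    `llm`, `embed`, and `index.search` are placeholders for your own stack.
    """
    prompt = (
        "Write a short passage that would plausibly answer the question below. "
        "It does not need to be factually perfect; it just needs to look like "
        "the kind of document that contains the answer.\n\n"
        f"Question: {query}\nPassage:"
    )
    hypothetical_doc = llm(prompt)          # any text-completion callable
    query_vector = embed(hypothetical_doc)  # embed the fake passage, not the query
    return index.search(query_vector, k)    # nearest-neighbor lookup
```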
Embedding Transformation (solves the mismatched embedding space) - It may not be clear why a query and a document might sit in mismatched regions of vector space, but there are actually many reasons this can happen. Queries can be extremely short while documents are long and dense, the language can be completely different, pre-training variations might teach the model assumptions that aren’t true, etc.
Just in case, here is an example:
Query: "How do I fix a leaky faucet?"
Document: A detailed plumbing manual with sections on faucet components, types of valves, and step-by-step repair instructions. The manual uses technical terms and is structured with headings, subheadings, and lists.
It’s likely the query and the document are not really that similar in terms of vector space.
In LlamaIndex[Liu, 2023], it is possible to connect an adapter after the query encoder, and fine-tune the adapter to optimize the representation of query embeddings, mapping it to a latent space that is better suited for specific tasks
On top of this, you can use strategies like in SANTA[Li et al., 2023d] to align your embeddings with structured data as well.
1) Using the natural alignment relationship between structured data and unstructured data for contrastive learning for structured-aware pre-training.
2) Masked Entity Prediction, which designs an entity-oriented mask strategy and asks language models to fill in the masked entities.
In short, the first maximizes the distance between positive and negative examples in vector space to teach the model which structured data represents which unstructured data. The second is a bit more self-explanatory.
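To illustrate the adapter idea in code (a generic sketch, not LlamaIndex’s actual adapter or SANTA’s training), here is a small trainable layer placed after a frozen query encoder and trained contrastively, so that queries get mapped closer to their matching documents:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAdapter(nn.Module):
    """A small trainable layer sitting after a frozen query encoder; it remaps
    query embeddings into a space that lines up better with document embeddings."""
    def __init__(self, dim: int = 384):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, query_emb: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(query_emb), dim=-1)

def adapter_loss(adapter, query_embs, pos_doc_embs, temperature=0.05):
    """In-batch contrastive loss: each query should land closest to its own
    document; the other documents in the batch act as negatives."""
    q = adapter(query_embs)                      # (b, d)
    d = F.normalize(pos_doc_embs, dim=-1)        # (b, d)
    logits = q @ d.T / temperature               # (b, b) similarity matrix
    labels = torch.arange(q.size(0))             # the diagonal holds the positives
    return F.cross_entropy(logits, labels)

# Toy training step with random stand-ins for frozen-encoder outputs.
adapter = QueryAdapter(dim=384)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
query_embs = F.normalize(torch.randn(16, 384), dim=-1)
pos_doc_embs = F.normalize(torch.randn(16, 384), dim=-1)
loss = adapter_loss(adapter, query_embs, pos_doc_embs)
loss.backward()
optimizer.step()
```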
Aligning Retriever Output and LLM’s Preference
The point of this section is: how can we make sure that the documents given to the LLM not only match the query well, but also contain the information the LLM needs to answer it?
Supervised Training - Use the LLM’s preferred documents (through a scoring system you set up) to fine-tune the embedder (basically downstream-task fine-tuning).
AAR [Yu et al., 2023b]
By determining the LM’s preferred documents through FiD cross-attention scores, the retriever is then fine-tuned with hard negative sampling and standard cross-entropy loss. Ultimately, the fine-tuned retriever can directly be used to enhance unseen target LMs, thereby performing better in the target task.
The training loss of the retriever is:

$$\mathcal{L} = \sum_{d^{+} \in D^{a+}} \sum_{d^{-} \in D^{a-}} l\big(s(q, d^{+}),\, s(q, d^{-})\big)$$

where $s(q, d)$ is the retriever's relevance score for query $q$ and document $d$, $D^{a+}$ is the set of retrieved documents preferred by the LLM, $D^{a-}$ is the set that is not preferred, and $l$ is the standard cross-entropy loss.
This equation calculates a score by comparing how well a query matches good documents versus bad ones.
REPLUG [Shi et al., 2023] uses a retriever and an LLM to calculate the probability distributions of the retrieved documents, and then performs supervised training by calculating the KL divergence.
$$\mathcal{L} = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} KL\big(P_R(d \mid x)\; \|\; Q_{\mathrm{LM}}(d \mid x, y)\big)$$

where $\mathcal{D}$ is the set of input contexts, $P_R$ is the retrieval likelihood, and $Q_{\mathrm{LM}}$ is the LM likelihood of each document.
This equation measures the average difference between what the retriever thinks are the right documents and what the language model thinks, using something called KL divergence, which tells us how one set of predictions differs from another.
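A tiny sketch of what that training signal could look like in code (my own illustration; the beta and gamma temperatures are hypothetical, and the tensors stand in for real retriever scores and LM likelihoods):

```python
import torch
import torch.nn.functional as F

def replug_style_loss(retrieval_scores, lm_log_likelihoods, beta=1.0, gamma=1.0):
    """
    retrieval_scores:   (n,) retriever similarity scores for the n retrieved documents
    lm_log_likelihoods: (n,) log-likelihood of the correct answer given each document
    The retriever distribution P_R is trained to match the LM-derived distribution Q_LM;
    the LM signal is treated as a fixed target (no gradient flows into it).
    """
    p_r = F.softmax(retrieval_scores / beta, dim=-1)                # P_R(d | x)
    q_lm = F.softmax(lm_log_likelihoods / gamma, dim=-1).detach()   # Q_LM(d | x, y)
    # KL(P_R || Q_LM) = sum_d P_R * (log P_R - log Q_LM)
    return torch.sum(p_r * (torch.log(p_r + 1e-12) - torch.log(q_lm + 1e-12)))

# Toy usage: 8 retrieved documents; scores require grad so the retriever can learn.
scores = torch.randn(8, requires_grad=True)
lm_ll = torch.randn(8)
replug_style_loss(scores, lm_ll).backward()
```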
The authors also discuss UPRISE, Atlas, EMDR2, and LOOP as different methods that enhance retrieval models by using language models for supervision, distilling attention patterns, leveraging latent variables, and optimizing prediction impact, respectively, to fine-tune information retrieval processes.
Lastly, plug in an adapter! Remember, adapters are like plug-in layers that adjust the model’s overall behavior in a focused way for your use case.
PRCA[Yang et al., 2023b] trains the Adapter through the Contextual Extraction Stage and the Reward-Driven Stage, and optimizes the output of the retriever based on a token-based autoregressive strategy.
TokenFiltering[Berchansky et al., 2023] method calculates cross-attention scores, selecting the highest scoring input tokens to effectively filter tokens.
RECOMP[Xu et al., 2023a] proposes extractive and generative compressors, which generate summaries by selecting relevant sentences or synthesizing document information to achieve multi-document query focus summaries.
PKG[Luo et al., 2023], infuses knowledge into a white-box model through directive fine-tuning, and directly replaces the retriever module, used to directly output relevant documents based on the query.
PKG shows that you can “bake in” model knowledge by fine-tuning adapters with extra information.
Alright, we made it through retrieval. I think this can be summed up as: get the query embedding as close as possible to the document embedding, narrow the search as much as possible, and retrieval will likely improve.
Generator
The RAG generation step transforms retrieved data into coherent text, incorporating relevant information from the retriever, ensuring the generated responses are not only natural but also filled with accurate, context-aware content. It uses diverse inputs for fine-tuning to tailor the model's output to align closely with the query and retrieved documents. The generated output is obviously what the user sees, so it better be good.
Post-retrieval Processing
Post-retrieval processing refers to the process of further treating, filtering, or optimizing the relevant information retrieved by the retriever from a large document database. Its primary purpose is to enhance the quality of retrieval results to better meet user needs or for subsequent tasks. It can be understood as a process of reprocessing the documents obtained in the retrieval phase.
Couldn’t have said it better myself.
Information Compression - There is likely an immense amount of potential information that could be included in your pipeline… and the bigger the database or the more general the chatbot, the bigger this issue becomes. The authors give the following strategies to address it.
PRCA [Yang et al., 2023b] addressed this issue by training an information extractor. In the context extraction stage, given an input text S_input, it can generate an output sequence C_extracted, which represents the condensed context from the input document.
The loss is designed so that the extractor $f_{\theta}$ (with parameters $\theta$) produces a condensed context $C_{extracted} = f_{\theta}(S_{input})$ that loses as little as possible of the information in the original document, so everything you might need is retained.
RECOMP[Xu et al., 2023a] similarly trains an information condenser by leveraging contrastive learning. For each training data point, there exists one positive sample and five negative samples. The encoder is trained using contrastive loss…
The loss function here is a standard contrastive loss:

$$\mathcal{L} = -\log \frac{\exp\big(\mathrm{sim}(x_i, p_i)\big)}{\exp\big(\mathrm{sim}(x_i, p_i)\big) + \sum_{n_j \in N_i} \exp\big(\mathrm{sim}(x_i, n_j)\big)}$$

where $x_i$ is the training example, $p_i$ is its positive sample, the $n_j \in N_i$ are the negative samples, and $\mathrm{sim}(x, y)$ measures the similarity between $x$ and $y$. Just like the contrastive loss earlier, the goal is to maximize the separation between the positive and negative examples so that the model can bias towards the positive.
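For reference, here is that contrastive loss written out in code (a generic sketch with one positive and five negatives, not RECOMP’s exact training code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """
    anchor:    (d,)   embedding of the training example x_i
    positive:  (d,)   embedding of its positive sample p_i
    negatives: (5, d) embeddings of the negative samples n_j
    Pushes the anchor toward the positive and away from the negatives.
    """
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1) / temperature                    # scalar
    sim_neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1) / temperature      # (5,)
    logits = torch.cat([sim_pos.unsqueeze(0), sim_neg])                                      # (6,)
    # Cross-entropy with the positive at index 0 is exactly -log softmax_0(logits).
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))

# Toy usage with random stand-in embeddings.
loss = contrastive_loss(torch.randn(384), torch.randn(384), torch.randn(5, 384))
print(float(loss))
```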
[Ma et al., 2023b] proposed the “Filter-Ranker” paradigm, which integrates the strengths of Large Language Models (LLMs) and Small Language Models (SLMs). In this paradigm, SLMs serve as filters, while LLMs function as reordering agents. By prompting LLMs to rearrange portions of difficult samples identified by SLMs, the research results indicate significant improvements across various Information Extraction (IE) tasks.
Or, more simply: ask the SLM whether the information is relevant, then ask the LLM to reorder it by priority as necessary.
Rerank - Reranking is exactly what it sounds like, and I talked about this a bit in part 1.
The core idea involves rearranging document records to place the most relevant items at the top, thereby reducing the total number of documents to a fixed quantity. This not only resolves the issue of context window expansion that may be encountered during retrieval but also contributes to improving retrieval efficiency and responsiveness[Zhuang et al., 2023].
Pretty straightforward, but also note the lost-in-the-middle comment I made in part 1: the more “middled” the information is in the prompt, the less likely the model is to pay attention to it.
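If you want to try reranking yourself, a cross-encoder is the usual tool. Here is a minimal sketch using sentence-transformers; the MS MARCO checkpoint named below is just one commonly used choice, so swap in whichever reranker you prefer.

```python
from sentence_transformers import CrossEncoder

def rerank(query: str, docs: list[str], top_k: int = 5) -> list[str]:
    """Rerank retrieved docs with a cross-encoder so the most relevant ones end up on top."""
    # Any cross-encoder reranking checkpoint works here; this is a common example.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in docs])  # one relevance score per pair
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```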
Okay awesome we now know how the industry is retrieving well and processing the retrievals to make sure the model has everything it needs.
Now let’s explore how to make sure the LLM is saying what we want to the user! After all this work, it would be a shame to have an LLM communicating in the wrong way or context.
Optimizing Generator to Adapt Input Data
With RAG, we are essentially stuffing a ton of knowledge into the prompt and expecting the model to just deal with it. But how can we make sure that the model is equipped to effectively use the knowledge provided to answer the question given?
After all, the generator is receiving more than just a query, and that can drastically change how it should be crafting responses.
General Optimization - The paper describes two standard setups here: the Joint-Encoder and the Dual Encoder [Xia et al., 2019, Cai et al., 2021, Cheng et al., 2022].
For Joint-Encoder, a standard model based on encoder-decoder is used, where the encoder initially encodes the input, and the decoder, through attention mechanisms, combines the encoded results to generate tokens in an autoregressive manner:

$$H = \mathrm{Encoder}\big(\mathrm{concat}(q, d)\big), \qquad P(y \mid q, d) = \prod_{t} P\big(y_t \mid H,\, y_{<t}\big)$$
This process describes a sequential generation of tokens, where each token is chosen based on a probability distribution conditioned on both the initial input and the tokens that have been generated so far.
The same goes for the Dual Encoder system, where the input and the retrieved documents are encoded separately and the decoder attends to both:

$$H_q = \mathrm{Encoder}_q(q), \quad H_d = \mathrm{Encoder}_d(d), \qquad P(y \mid q, d) = \prod_{t} P\big(y_t \mid H_q, H_d,\, y_{<t}\big)$$
Contrastive Learning -
This section largely explores the specifics of massaging the contrastive loss function. Since I don’t have a lot to add here (the paper basically lists the methods and equations and calls it a day), and we have already talked about contrastive learning in general, I’ll skip this part and recommend you read the paper if interested. However, I’ll point out the key pieces here for posterity :)
…graph-text contrastive learning method has been proposed by SURGE [Kang et al., 2023]
SANTA [Li et al., 2023d] utilized a three-stage training process to fully understand the structural and semantic information. Specifically, in the training phase of the retriever, contrastive learning was adopted, with the main goal of optimizing the embedding representations of the queries and documents.
Initial stage: the contrastive loss over queries and documents. Later stage: the masked entity prediction loss described earlier (see the paper for the exact equations).
Augmentation in RAG
As the authors say, they split this section into the following:
…the stage of augmentation, augmentation data sources, and the process of augmentation…
In general, these sections summarize work the ML community has done to lay out the augmentation step holistically. See the taxonomy figure in the paper for an overview.
For these sections, I would highly recommend going through the papers listed there if you have not spent much time understanding the progression of embedders and encoder/decoder models in this space. The authors do a great job of condensing these papers into a very dense two pages, so I don’t think an overview here would add much.
I tried, but I didn’t have that much to add without making this article 5x longer :D
The most important information from this section might be Iterative / Recursive / Adaptive retrieval.
Iterative Retrieval - Repeatedly collecting documents based on the initial query and the generated text, enhancing answer robustness by providing additional context across multiple iterations. This process (multi-hop retrieval, ITER-RETGEN) works to refine and deepen the knowledge base for more accurate and contextually relevant responses.
Recursive Retrieval - Enhances search depth and relevance by iteratively refining queries based on previous search results, creating a feedback loop (IRCoT and ToC). Recursive retrieval uses chain-of-thought and clarification trees to resolve query ambiguity, which is particularly useful in complex searches where user needs are unclear or the information is specialized, leading to continuous adaptation and improved search outcomes.
Adaptive Retrieval - Allows LLMs to actively decide the timing and content of information retrieval, improving efficiency and relevance. These methods (tools like AutoGPT, Toolformer, and Graph-Toolformer that exercise active judgment) feature proactive steps in information retrieval, such as using Self-Ask techniques and few-shot prompts. WebGPT, for example, trains GPT-3 to autonomously use a search engine with reinforcement learning. FLARE automates retrieval timing based on the generation process’s confidence, while Self-RAG introduces “reflection tokens” for the model to introspect and decide on retrieval, enhancing its autonomous judgment in generating responses. Basically, you let the LLM do the heavy lifting of deciding whether more information is needed or not. Really cool!
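To make the adaptive idea concrete, here is a FLARE-flavoured sketch (heavily simplified and not the actual FLARE implementation; llm and retriever are placeholders, and the confidence threshold is made up): the model only retrieves when its own draft looks shaky.

```python
def adaptive_generate(query: str, llm, retriever, confidence_threshold: float = 0.8):
    """
    Adaptive retrieval sketch:
      1. Draft an answer and look at the model's own token confidence.
      2. If the draft is low-confidence, retrieve documents (using the draft itself
         as the search query) and regenerate with the new context.
    `llm(prompt)` is assumed to return (text, min_token_probability);
    `retriever(text)` is assumed to return a list of document strings.
    """
    draft, min_prob = llm(f"Answer the question: {query}")
    if min_prob >= confidence_threshold:
        return draft  # the model was confident, so no retrieval needed
    context = "\n".join(retriever(draft))  # let the shaky draft drive retrieval
    regenerated, _ = llm(f"Context:\n{context}\n\nAnswer the question: {query}")
    return regenerated
```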
Evaluation
Okay only one more section, I promise. Lastly, we can dive into evaluation of RAG pipelines. Sure, this whole time we have been discussing strategies to make the pipeline better… but ultimately how do we know it is ACTUALLY better?
Most metrics focus on two key areas: retrieval quality and generation quality. But since there aren’t really correct answers or labels to compare against, this is trickier than you might think.
Table 2 shows a great summary of the most popular frameworks for measuring the quality of a RAG pipeline. Let’s quickly define the metrics across the top.
Context Relevance - Precision and Specificity of retrieved context (How much does the context actually relate to the query?)
Answer Faithfulness - Is the answer true to the retrieved context? Is it making anything up that isn’t within the context?
Answer Relevance - Is the answer actually relevant to the core meaning of the query?
Noise Robustness - How well can the model ignore useless information that is retrieved?
Negative Rejection - How well can the model refrain from responding when the context does not have the necessary information included?
Information Integration - How well can the model combine all of the information into a clean and summarized answer?
Counterfactual Robustness - How well can the model recognize that the provided context is actually wrong, and discard the information?
Most of these metrics are measured on a scale of something like 0 to 1 or -1 to 1, and the actual measurement is done by incorporating LLMs.
For example, let’s look at RAGAS.
In each of these cases, we pass many examples to an LLM and ask it to judge them.

For faithfulness, for instance, we ask the LLM to break the answer into individual claims and then check how many of those claims are actually supported by the retrieved context. The ratio of supported claims to total claims gives us the “faithfulness score”.
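Here is a rough sketch of how a faithfulness-style score can be computed with an LLM judge (simplified for illustration; these are not RAGAS’s actual prompts, and llm is a placeholder for any text-completion callable):

```python
def faithfulness_score(question: str, answer: str, context: str, llm) -> float:
    """
    Faithfulness sketch:
      1. Ask the LLM to split the answer into standalone factual claims.
      2. Ask it which of those claims the retrieved context supports.
      3. faithfulness = supported claims / total claims.
    """
    claims = llm(
        f"Break the following answer to the question '{question}' into a "
        f"numbered list of standalone factual claims:\n{answer}"
    ).splitlines()
    claims = [c for c in claims if c.strip()]
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        verdict = llm(
            "Does the context below support this claim? Reply Yes or No.\n"
            f"Context: {context}\nClaim: {claim}"
        )
        supported += verdict.strip().lower().startswith("yes")
    return supported / len(claims)
```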
All of the aforementioned metrics are some iteration of this method so that the performance measurement of the pipeline can still be automated in a meaningful way.
So what if you have bad scores?
Well, I recommend starting from the beginning of this article :D
But more seriously, the biggest performance gains generally come from improving the retrieval step as much as possible. If your precision and recall are bad, your pipeline is likely to perform poorly; if your generation is great but gets bad information, the application isn’t useful!
Alright that marks the end of the paper! We made it.
Thanks for sticking with me through this 2 part dive into this RAG paper… If you thought this was helpful, expedited your reading or understanding, or you want to see more, please leave a comment and say so!
See you next time!
Citations
[1] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., & Wang, H. (2023, December 18). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv. https://arxiv.org/abs/2312.10997v1
[2] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., & Wang, H. (2023, December 18). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv. https://arxiv.org/abs/2312.10997