Retrieval augmented generation (RAG) is an advanced technique in natural language processing that combines the power of information retrieval with generative models. It aims to generate more informative and contextually appropriate responses to user queries by retrieving relevant passages or documents and grounding the generation in them. This technique has significant potential in a variety of applications. Among them is Open Domain Question Answering (ODQA), a field of research that focuses on developing systems capable of comprehending and answering a wide range of questions posed by users, extracting the relevant information from vast amounts of unstructured data with information retrieval techniques.

Information retrieval used to rely on sparse techniques based on word statistics. In the traditional approach, documents were represented by a bag-of-words model, where the presence or absence of specific words determined the relevance of a document to a query. Scoring functions, like BM25 and TF-IDF, use word frequencies to score documents, balancing how frequently a keyword appears in a document against how prevalent the word is in the corpus in general. One popular database is Elasticsearch, which uses the Lucene text search engine; it is based on word statistics and edit distance (a syntactic measure).
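For reference, this is the standard BM25 scoring function for a query $Q = q_1, \dots, q_n$ and a document $D$, where $f(q_i, D)$ is the frequency of term $q_i$ in $D$, $|D|$ is the document length in words, $\mathrm{avgdl}$ is the average document length in the corpus, and $k_1$ and $b$ are free parameters (typically $k_1 \in [1.2, 2.0]$ and $b = 0.75$):

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

The IDF factor down-weights terms that are common across the corpus, while the length normalization in the denominator keeps long documents from being favored merely for containing more words.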
With advancements in natural language processing and machine learning, information retrieval has shifted towards denser representations based on embeddings. Embeddings capture the semantic meaning of words and phrases, allowing for a more nuanced understanding of the content; this enables more accurate matching of queries with relevant documents, as embeddings can capture subtle semantic similarities that traditional word statistics miss. The shift from sparse to dense representations has significantly improved the performance and precision of retrieval systems. For an introduction to neural IR, see (Mitra and Craswell 2018); for a review of IR for Q&A, see (Abbasiantaeb and Momtazi 2021).

This blog serves as an introduction to dense retrieval, and we focus on two dense document retrieval models: Dense Passage Retrieval (DPR; Karpukhin et al. 2020) and Contextualized Late Interaction over BERT (ColBERT; Khattab and Zaharia 2020). Both models use semantic search to find the relevant documents: a text's dense representation is used to measure the similarity between a given query and the potentially relevant documents. The two models differ in how they store the documents' vectors and how they measure similarity between queries and documents. We will compare the models by measuring accuracy and latency on a well-known benchmark called Natural Questions (NQ; Kwiatkowski et al. 2019), a collection of user-submitted questions whose answers can be found in Wikipedia articles.

We would like to introduce fastRAG, a framework developed at Intel Labs and recently released as open-source software. The goal of the framework is to enable rapid research and development of retrieval-augmented generative AI applications. These can be used for generative tasks such as question answering, summarization, dialogue systems, and content creation, while utilizing information-retrieval components to anchor LLM output in external knowledge. An application is represented by a pipeline, typically comprised of a knowledge base (KB), a retriever, a ranker, and a reader, typically an LLM, which "reads" the query and the retrieved documents and generates an output. One can experiment with different architectures and models, benchmarking the results for performance and latency. Several of the models we offer are better suited for Intel hardware, achieving lower latency with comparable accuracy; more on that in the next blog post.

In the field of information retrieval, relatively recent work promotes the use of transformer encoder models as retrievers: documents in the knowledge base are encoded as vectors and stored in an index. At runtime, the query is encoded as a vector, and vector similarity search is used to find the most relevant documents. A similar process is used for re-ranking retrieved documents, where the encoding is done on the fly, specifically for the retrieved documents.

Among dense retrievers there are several approaches. One approach is to use a single token's embedding as a representative of the entire document. DPR (Karpukhin et al. 2020) is an example of this approach, where the encoders are trained to "summarize" the entire document in the first token's embedding. The method is a form of bi-encoder, since it uses two encoders, one for the query and another for the documents.

Another approach is called late interaction, as first defined in ColBERT (Khattab and Zaharia 2020). The idea is to save (and index) the encoded vectors of all the words in the documents. At runtime the query token vectors are compared with all of the documents' token vectors (hence the "late" in late interaction), thus retrieving more relevant documents than DPR. Notice that indexing every token, instead of just the first token of each document, can increase the index size.
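To make the contrast concrete, here is a minimal sketch of the two scoring schemes. It assumes the query and document token embeddings have already been produced by the respective encoders; the helper names and shapes are illustrative, not fastRAG's or ColBERT's actual API.

```python
import numpy as np

def bi_encoder_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Bi-encoder (DPR-style) scoring: one vector per text.

    query_vec: (d,) embedding of the query's first ([CLS]) token.
    doc_vec:   (d,) embedding of the document's first ([CLS]) token.
    """
    # A single dot product per (query, document) pair.
    return float(query_vec @ doc_vec)

def late_interaction_score(query_tokens: np.ndarray,
                           doc_tokens: np.ndarray) -> float:
    """Late interaction (ColBERT-style "MaxSim") scoring.

    query_tokens: (n_q, d) embeddings, one per query token.
    doc_tokens:   (n_d, d) embeddings, one per document token.
    """
    # Similarity between every query token and every document token.
    sim = query_tokens @ doc_tokens.T          # shape (n_q, n_d)
    # Each query token keeps only its best-matching document token,
    # and the per-token maxima are summed into the document score.
    return float(sim.max(axis=1).sum())
```

The storage trade-off follows directly: a bi-encoder index stores one d-dimensional vector per passage, while a late-interaction index stores one per token.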
Later refinements of this work, namely ColBERT v2 and PLAID (Santhanam, Khattab, Saad-Falcon, et al. 2022; Santhanam, Khattab, Potts, et al. 2022), helped reduce the index size and the latency thanks to two main improvements. The first is quantization and compression of the vectors in the index. The second is a set of heuristics that cluster the vectors using the K-means algorithm and hierarchically choose the relevant document tokens for the query tokens based on the clusters' centroids. ColBERT v2 with a PLAID index achieves state-of-the-art retrieval performance with low latency, close to the order of sparse retrieval (BM25, Lucene, Elasticsearch, etc.) but with much higher accuracy.

The first step is creating a document store of the type PLAIDDocumentStore. The store requires three paths: a checkpoint, a collection, and an index. A ColBERT checkpoint is an encoder model, based on the BERT architecture and fine-tuned for the retrieval task. One can download a trained checkpoint, for example the one trained by the paper authors, here; encoders can also be fine-tuned using these instructions: training. Next is the collection of documents that comprise the corpus. The collection should be a single TSV file with the columns: id, text, title (optional). Finally, the index is the vector index created with the same checkpoint by encoding all tokens in the corpus, compressing the vectors, and saving the result. We provide a script for creating a PLAID vector index from a ColBERT encoder and a document collection here.

Once we have all the ingredients, we initialize the document store, define a retriever on top of it, and assemble a pipeline; the Haystack pipeline API can also connect with external components. In this example the pipeline contains just the retriever. Running queries through the pipeline is then very easy: the result is a dictionary whose documents key contains the list of retrieved documents with their relevance scores. The two sketches below walk through these steps.
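Here is a minimal sketch of the setup, assuming fastRAG's Haystack-style (v1) API. The class names (PLAIDDocumentStore, ColBERTRetriever) follow fastRAG's components, but the module paths, keyword arguments, and file paths are assumptions that may differ between versions:

```python
# A minimal sketch, assuming fastRAG's Haystack (v1) style API;
# module paths and keyword arguments may differ between versions.
from haystack import Pipeline

from fastrag.retrievers.colbert import ColBERTRetriever
from fastrag.stores import PLAIDDocumentStore

# The store is built from the three ingredients described above: a
# ColBERT checkpoint, a TSV collection, and a pre-built PLAID index.
document_store = PLAIDDocumentStore(
    checkpoint_path="path/to/colbert-checkpoint",  # illustrative paths
    collection_path="path/to/collection.tsv",
    index_path="path/to/plaid-index",
)

# The retriever wraps the store and encodes queries on the fly.
retriever = ColBERTRetriever(document_store=document_store)

# In this example the pipeline contains just the retriever.
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
```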
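Running queries and reading the results might then look like this, continuing the same sketch (the query string is just an example):

```python
# Run a query through the pipeline; top_k caps the number of
# documents returned by the retriever.
results = pipeline.run(
    query="who wrote the opera Carmen?",
    params={"Retriever": {"top_k": 10}},
)

# The result is a dictionary; its "documents" key holds the retrieved
# documents together with their relevance scores.
for doc in results["documents"]:
    print(doc.score, doc.content[:80])
```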
To test ColBERT, we will use the Natural Questions benchmark (Kwiatkowski et al. 2019). The external knowledge is a collection of Wikipedia passages. As a baseline, we use the original implementation of DPR, together with a checkpoint that was fine-tuned on Natural Questions; see the download instructions. The DPR model is released under the CC-BY-NC 4.0 license.

DPR uses the Faiss vector search library (Johnson, Douze, and Jégou 2019). We test two configurations for storing the vectors: flat and HNSW. A flat index is slow but accurate, since an exhaustive similarity search is performed. HNSW (Malkov and Yashunin 2018) is an approximate vector search method in which the vectors are organized into a graph to enable faster-than-linear search. Building an optimal HNSW graph requires some parameter tuning; the parameters control the trade-off between speed, accuracy, and index size.

For ColBERT, we use the ColBERTv2 checkpoint from here, which was fine-tuned on the MS MARCO dataset (Bajaj et al. 2018), comprised of Bing questions and answers based on web search results.

We report recall and MRR values for k values of 5, 10, 20, 50, and 100 (MRR, the mean reciprocal rank, averages the reciprocal of the rank of the first relevant document over all queries). We also measure latency at k=100, in ms/query, and report the vector index size in GB, as there is a trade-off between performance, accuracy, and memory.

Measurements were done on an Intel AWS instance with a Xeon processor. The AWS instance type is r6i.16xlarge: 32 cores, 512GB RAM, Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz; Conda 23.1.0, Python 3.9.16, AMI image ami-0f1a5f5ada0e7da53, Amazon Linux 2, kernel version 5.10.177-158.645.amzn2.x86_64. The Wikipedia text collection size is 13GB.

First, we note there is a difference in accuracy between the ColBERT model and DPR. The quality of the embeddings generated by a trained encoder is crucial for high-quality retrieval; as the DPR encoders were fine-tuned on the Natural Questions dataset itself, this is probably one of the reasons for the difference.

Next, we compare the two indexing methods for DPR: flat and HNSW. A flat index query takes 35x longer than HNSW, at almost 1.5 seconds per query. HNSW is faster, with only a small accuracy penalty; however, its index is bigger, at ~2.3x the size of the flat index.

It is notable that although ColBERT encodes and stores all of the documents' tokens, thanks to its optimizations the index size is comparable to that of a flat Faiss index, which stores only the first token's embedding for each document.

One of our goals was to present the clear trade-off between accuracy and performance; more specifically, between recall, latency, and memory usage (the index size, as the indexes are stored in memory). To summarize, we introduced two dense retrieval algorithms: ColBERT with a PLAID index, and DPR. We tested them on the open-domain Q&A dataset Natural Questions, measuring accuracy and latency.

Experience the capabilities of ColBERT in fastRAG through the following Notebook example. Familiarize yourself with fastRAG by exploring our user-friendly UI demos at Running Demos in fastRAG. Start using the ColBERT encoder, accessible from the HuggingFace hub. Easily create a document index, as detailed in our guide at Indexing in fastRAG. Furthermore, we offer full support for the DPR retriever; see the example DPR configuration. Unleash the potential of fastRAG and revolutionize your workflow today!

Tests done by Intel on March 14th, 2023. Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Abbasiantaeb, Zahra, and Saeedeh Momtazi. 2021. "Text-Based Question Answering from Information Retrieval and Deep Neural Network Perspectives: A Survey." WIREs Data Mining and Knowledge Discovery 11 (6): e1412. https://doi.org/10.1002/widm.1412.

Bajaj, Payal, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, et al. 2018. "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset." October 31, 2018. https://doi.org/10.48550/arXiv.1611.09268.

Johnson, Jeff, Matthijs Douze, and Hervé Jégou. 2019. "Billion-Scale Similarity Search with GPUs." IEEE Transactions on Big Data 7 (3): 535–47. https://doi.org/10.1109/TBDATA.2019.2921572.

Karpukhin, Vladimir, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. "Dense Passage Retrieval for Open-Domain Question Answering." September 30, 2020. https://doi.org/10.48550/arXiv.2004.04906.

Khattab, Omar, and Matei Zaharia. 2020. "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." June 4, 2020. https://doi.org/10.48550/arXiv.2004.12832.

Kwiatkowski, Tom, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, et al. 2019. "Natural Questions: A Benchmark for Question Answering Research." Transactions of the Association for Computational Linguistics 7 (August): 453–66. https://doi.org/10.1162/tacl_a_00276.

Malkov, Yu A., and D. A. Yashunin. 2018. "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs." August 14, 2018. https://doi.org/10.48550/arXiv.1603.09320.

Mitra, Bhaskar, and Nick Craswell. 2018. "An Introduction to Neural Information Retrieval." Foundations and Trends® in Information Retrieval 13 (1): 1–126. https://doi.org/10.1561/1500000061.

Santhanam, Keshav, Omar Khattab, Christopher Potts, and Matei Zaharia. 2022. "PLAID: An Efficient Engine for Late Interaction Retrieval." May 19, 2022. https://doi.org/10.48550/arXiv.2205.09707.

Santhanam, Keshav, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction." July 10, 2022. https://doi.org/10.48550/arXiv.2112.01488.