What is RAG? A technical guide
Knowledge, search and retrieval
This is the technical companion to our non-technical pillar at /what-is/rag. Retrieval-augmented generation (RAG) is a system architecture that fetches relevant passages from an external store at query time and conditions a language model's generation on them. This guide treats the engineering: ingestion, chunking, embeddings, vector indexing, dense and sparse retrieval, reranking, context assembly, generation, evaluation and failure diagnosis. It assumes you know what RAG is and why it matters commercially, and focuses on the design choices that decide whether a system works in production.
What this means
If you have read the non-technical pillar at /what-is/rag, you already have the retrieve-then-generate concept, the open-book-exam analogy and the business case. This companion does not repeat them. It is written for the people who will design, build, evaluate or buy a RAG system: engineering leads, solution architects, data and ML practitioners.
The framing for everything below is that RAG is not a model, it is a pipeline. A query passes through a sequence of stages, each of which can independently degrade quality. The retriever can miss the relevant passage. The reranker can bury it. The chunker can have split it so the answer is no longer in any single unit. The generator can ignore what it was given. Most of the discipline of building RAG is making each stage measurable so you can tell which one failed. The rest of this article walks the pipeline end to end, then covers evaluation, failure modes and the build versus buy decision.
Why it matters
The four business benefits are covered on the non-technical page and are not repeated here. What matters technically is where quality, cost and reliability are actually decided, and the answer is rarely the language model. It is the retrieval layer. A strong model given the wrong passages produces a confident, fluent, wrong answer that is harder to catch than a hallucination, because it looks grounded. The leverage in a RAG system sits upstream of generation: in how documents are parsed and split, which embedding model represents them, how the index trades recall for latency, and whether retrieval is dense, sparse or hybrid.
These choices also set your cost and latency budget. Every stage adds milliseconds and, in many designs, an API call. Reranking a large candidate set, generating contextual summaries for every chunk, or running multiple retrieval passes all buy quality at the price of latency and spend. A senior engineer's job is to know which of those trades is worth making for a given workload, and to be able to prove it with numbers rather than intuition. The sections that follow are organised so you can reason about each trade independently.
How it works
The pipeline end to end
A production RAG system is two pipelines that meet at the index. The first is the offline, or ingestion, pipeline: source documents are loaded, parsed into clean text, split into chunks, optionally enriched with context or metadata, embedded into vectors, and written to an index along with their original text and metadata. The second is the online, or query, pipeline: a user query is optionally transformed, embedded with the same model used at ingestion, used to retrieve candidate chunks, optionally reranked, assembled into a prompt within the context window, and passed to the generator, whose output is then post-processed for citation, formatting and safety checks.
The data flow is concrete. A 40-page PDF policy document might be parsed into roughly 12,000 words, split into about 60 chunks of 200 to 300 words each, embedded into 60 vectors of, say, 1,024 dimensions, and stored. At query time, the question becomes one vector, the index returns the 20 or 50 nearest chunks, a reranker narrows that to the best 5, those 5 are concatenated into a prompt with an instruction to answer only from the supplied text and to cite, and the model produces an answer with references. Every arrow in that flow is a design decision and a potential point of failure. The remaining subsections take them one at a time.
Document processing and chunking
Chunking is the most underrated lever in the pipeline. The retriever can only return units you created at ingestion, so if the answer to a question is split across a chunk boundary, no retrieval strategy can recover it intact. Four families of strategy are in common use. Fixed-size chunking splits on a token or character count with optional overlap; it is simple and fast but ignores document structure, so it routinely cuts sentences, tables and clauses in half. Recursive chunking splits hierarchically on a priority list of separators (paragraphs, then sentences, then words) until chunks fit a target size, which respects natural boundaries far better. Semantic chunking places boundaries where the embedding similarity between adjacent sentences drops, grouping text by meaning rather than length. Structure-aware chunking uses the document's own markup, such as Markdown headings, HTML tags, or PDF layout, to split along sections, rows and list items.
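As a rough illustration of the recursive strategy, the sketch below tries each separator in priority order and falls back to a finer one only when a piece is still too large. The function name recursive_split, the separator list and the character-based size limit are assumptions made for this example, not a reference implementation; production splitters usually count tokens rather than characters.

```python
def recursive_split(text: str, max_chars: int = 200,
                    separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split text on the coarsest separator that yields pieces within max_chars."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut (illustrative last resort).
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too big: retry it with the finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks
```

Note that in this simplified version the separator at a chunk boundary is dropped, so a sentence-level split loses its trailing full stop; a real splitter would normally keep it.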
The evidence on which to pick is nuanced. Structure-aware chunking tends to give the highest retrieval effectiveness on documents that have real structure, and at lower computational cost than semantic chunking. Semantic chunking improves contextual coherence but its gains over fixed-length baselines are not always large enough to justify its cost, and depend heavily on the dataset. Chunk size and overlap are the two parameters you will tune most. Smaller chunks raise precision (each unit is tightly on-topic) but risk fragmenting answers and losing context; larger chunks preserve context but dilute the embedding with off-topic text and waste context-window budget. Overlap, typically 10 to 20 per cent of chunk size, reduces boundary-cut losses at the price of duplication in the index.
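To make the size and overlap parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It counts whitespace-separated words for simplicity (an assumption; most systems count tokens), and the name chunk_words and its defaults are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 250, overlap: int = 50) -> list[str]:
    """Split text into word windows; consecutive windows share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    stride = chunk_size - overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With these defaults (250-word windows, 20 per cent overlap) a 12,000-word document yields 60 chunks, which shows how overlap inflates the index: the same text without overlap would need only 48.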
Tables, PDFs and headings deserve specific handling. Naive text extraction from a PDF linearises columns and destroys tables, so a figure ends up next to the wrong label. Layout-aware parsers, table-extraction passes, and converting tables to Markdown or sentence form before chunking all materially help. Headings should be retained and ideally prepended to each child chunk as metadata, so a chunk taken from deep in a document still knows which section it belongs to. Metadata captured at ingestion (source, author, date, section, document type, access level) is not decoration: it powers filtering, freshness control and permission-aware retrieval later in the pipeline. Treat chunking as a core research problem with its own evaluation, not a preprocessing afterthought.
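The heading-retention idea can be sketched for Markdown input as follows. This is a simplified illustration under stated assumptions: it only recognises ATX-style headings (lines starting with #), it does not skip fenced code blocks, and the function name and the shape of the returned dictionaries are hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_sections(markdown: str, source: str) -> list[dict]:
    """One chunk per section body, with the heading path prepended and stored as metadata."""
    path: list[str] = []   # current heading trail, e.g. ["Leave", "Parental"]
    body: list[str] = []   # lines collected since the last heading
    chunks: list[dict] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            section = " > ".join(path)
            chunks.append({
                "text": f"{section}\n{text}" if section else text,
                "metadata": {"source": source, "section": section},
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # the collected body belongs to the previous heading
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Because the section path is part of the chunk text, it is embedded along with the body; because it is also in the metadata, it remains available for filtering.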
Embeddings in practice
The concept of embeddings is covered at /what-is/embeddings; here the concern is selection and operations. An embedding model maps text to a dense vector such that semantically similar text lands nearby in the vector space. The single most important operational rule is that the same model must embed both the indexed chunks and the incoming query. The vector space is model-specific; a query vector from one model and a chunk vector from another are not comparable, and similarity scores between them are meaningless. This sounds obvious and is violated constantly, usually after a well-meaning model upgrade.
Model choice should be driven by evidence rather than reputation. No single embedding method dominates across all tasks, so a model that tops a leaderboard for semantic similarity may not be best for your retrieval workload. The Massive Text Embedding Benchmark (MTEB) exists precisely because performance does not transfer cleanly across tasks; use it to shortlist, then evaluate candidates on your own data. Dimensionality is a direct trade: higher-dimensional vectors can capture more nuance but cost more memory and compute per comparison, and the gain often plateaus. Domain fit matters more than raw dimension count: a model trained on general web text may underperform a smaller domain-tuned model on legal, clinical or code corpora.
The operational reality that catches teams out is re-embedding. Changing the embedding model is not a configuration change; it invalidates every vector in your index. You must re-embed the entire corpus, which for large knowledge bases is a substantial compute cost and an availability problem if done in place. Plan embedding-model changes as migrations: build the new index alongside the old, validate retrieval quality on your eval set, then cut over. Treat the embedding model as a long-lived dependency, because changing it is expensive and disruptive.
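One inexpensive safeguard against mixing vector spaces is to record the embedding model alongside the index and refuse vectors that came from anywhere else. The sketch below is a toy in-memory illustration; the class name GuardedIndex, its fields and the model identifiers in the usage note are hypothetical, and a real vector store would keep this information in its collection metadata.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedIndex:
    """Toy index that remembers which embedding model produced its vectors."""
    model_id: str
    dimensions: int
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def _check(self, vector: list[float], model_id: str) -> None:
        if model_id != self.model_id:
            raise ValueError(
                f"vector from {model_id!r} cannot be used with an index built by {self.model_id!r}"
            )
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")

    def add(self, chunk_id: str, vector: list[float], model_id: str) -> None:
        self._check(vector, model_id)
        self.vectors[chunk_id] = vector

    def validate_query(self, vector: list[float], model_id: str) -> None:
        """Call before searching so a mismatched query fails loudly."""
        self._check(vector, model_id)
```

An index created as GuardedIndex("embed-v1", 3) accepts vectors tagged "embed-v1" and raises ValueError for a query tagged "embed-v2", which turns a silent quality regression after a model upgrade into an immediate, visible error.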
Indexing and vector search
The concept of a vector database is covered at /what-is/vector-database. The engineering question is how nearest-neighbour search scales. Exact nearest-neighbour search over millions of high-dimensional vectors is too slow for interactive use, so production systems use approximate nearest neighbour (ANN) methods that trade a small, controllable loss of recall for large gains in speed and memory. Three families dominate. Hierarchical Navigable Small World (HNSW) builds a multi-layer proximity graph and navigates from a coarse top layer down to fine layers, achieving logarithmic-scaling search with high recall; it is the default in most modern vector stores and is memory-hungry but fast. Inverted file (IVF) indexes partition the space into clusters and search only the clusters nearest the query, trading recall against the number of clusters probed. Product quantization (PQ) compresses vectors into short codes by splitting each vector into subvectors and quantizing each separately, which dramatically cuts memory at some cost to accuracy; IVF and PQ are frequently combined (IVF-PQ) for billion-scale corpora.
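A back-of-envelope calculation shows why product quantization matters at scale. The sketch below compares raw float32 storage with PQ codes; the corpus size and PQ settings are illustrative assumptions, and the estimate deliberately ignores codebooks, IVF lists and graph links, so treat it as an order-of-magnitude figure only.

```python
def flat_bytes(n_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Storage for uncompressed vectors (float32 by default)."""
    return n_vectors * dims * bytes_per_value


def pq_bytes(n_vectors: int, n_subvectors: int, bits_per_code: int = 8) -> int:
    """Storage for PQ codes: one small code per subvector."""
    return n_vectors * n_subvectors * bits_per_code // 8


n, dims = 100_000_000, 1024            # hypothetical corpus
raw = flat_bytes(n, dims)              # 409_600_000_000 bytes, about 410 GB
compressed = pq_bytes(n, 64)           # 6_400_000_000 bytes, about 6.4 GB
print(f"raw={raw / 1e9:.1f} GB, pq={compressed / 1e9:.1f} GB, ratio={raw // compressed}x")
```

Under these assumptions the codes are 64 times smaller than the raw vectors, which is the difference between an index that needs a cluster and one that fits in the memory of a single machine.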
The governing trade-off is recall versus latency and memory. HNSW exposes parameters (the graph degree M, the build-time breadth efConstruction and the search-time breadth efSearch) that let you dial recall up at the cost of memory, build time and query time. There is no universally correct setting; it depends on corpus size, dimensionality and your latency budget, and should be tuned against measured recall on your own queries. Where the index lives is a separate decision. You can run a dedicated vector store, or use vector features increasingly built into existing databases and search engines (PostgreSQL with pgvector, OpenSearch and Elasticsearch, and others). For many teams the right first move is to add vector search to a database they already operate, rather than introduce a new system, then graduate to a dedicated store only when scale or feature needs demand it.
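Measured recall for an ANN index is usually computed by comparing its results with an exact brute-force search over a sample of queries. The following is a minimal sketch of that comparison in pure Python; the function names are hypothetical, and the approximate results would in practice come from whatever index you are tuning.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force ground truth: score every vector, keep the k most similar."""
    ranked = sorted(vectors, key=lambda cid: cosine(query, vectors[cid]), reverse=True)
    return ranked[:k]


def ann_recall(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Fraction of the true nearest neighbours that the ANN index returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

Averaging ann_recall over a few hundred representative queries at each parameter setting gives the recall-versus-latency curve that the tuning decision should rest on.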
Retrieval strategies
There are two fundamentally different ways to find relevant text, and the strongest systems use both. Dense retrieval embeds query and documents into the same vector space and finds nearest neighbours; it captures meaning and handles paraphrase and synonymy well, which is its great advantage over keyword search. Dense passage retrieval demonstrated that a learned dual-encoder can beat a strong lexical baseline substantially on open-domain question answering: the dense retriever outperformed a strong Lucene-BM25 system by 9 to 19 per cent absolute in top-20 passage retrieval accuracy, establishing dense retrieval as the default semantic method. Sparse retrieval, classically BM25, scores documents on weighted exact-term overlap; it is hard to beat for rare tokens, codes, product numbers, names and acronyms that an embedding may smear together, and it needs no training.
Hybrid retrieval runs both and fuses the results, because their failure modes are complementary: dense retrieval finds the conceptually relevant chunk that shares no keywords, sparse retrieval finds the exact identifier that the embedding glossed over. The standard fusion method is Reciprocal Rank Fusion (RRF), which scores each document by summing 1/(k + rank) over every result list in which it appears, where the constant k controls how steeply influence decays with rank. RRF was shown to consistently yield better results than any individual system and better results than the standard Condorcet Fuse method; the value k = 60 used in the original study remains the default in OpenSearch, Elasticsearch, Azure AI Search and Weaviate. RRF is popular because it operates on ranks, not raw scores, so it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales, and it usually works well without tuning.
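RRF is short enough to write out in full. The sketch below fuses any number of ranked ID lists; the function name and the tie-breaking by document ID are choices made for this example, while k = 60 is the default discussed above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs; returns (doc_id, score) sorted best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["a", "b", "c"]    # hypothetical dense-retriever ranking
sparse = ["c", "a", "d"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
print([doc_id for doc_id, _ in fused])  # ['a', 'c', 'b', 'd']
```

Documents that appear in both lists ("a" and "c") rise above those found by only one retriever, even though no raw BM25 or cosine score was ever compared.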
Beyond the core retrievers, several techniques lift recall and precision. Metadata filtering restricts the search to chunks matching structured predicates (date ranges, document type, access level) and is essential for both relevance and permissions. Query transformation rewrites the user's raw query into a form that retrieves better: expansion adds synonyms or related terms; multi-query generates several paraphrases and unions their results to cover more of the relevant space. Hypothetical Document Embeddings (HyDE) is a notable transformation that asks a model to draft a hypothetical answer, then embeds that draft to retrieve real documents near it, which can outperform a strong unsupervised dense retriever in zero-shot settings. The top-k parameter (how many candidates retrieval returns) is the recall-cost dial: a larger k raises the chance the answer is somewhere in the set but adds noise and downstream cost, which is exactly what reranking exists to resolve.
Reranking
Retrieval and reranking use different model architectures for a reason. A bi-encoder (the dual-encoder used for dense retrieval) embeds query and document separately, which is what makes indexing possible: you embed every chunk once, ahead of time, and only the query at runtime. The price of that efficiency is that query and document never interact until the final dot product, so subtle relevance signals are lost. A cross-encoder feeds the query and a candidate document through the model together, letting every query token attend to every document token, which produces a far more accurate relevance score. A fine-tuned BERT-Large cross-encoder reached MRR@10 of 35.8 on the MS MARCO passage-ranking eval set as the top leaderboard entry, outperforming the previous state of the art by 27 per cent relative in MRR@10, and roughly doubling the MRR@10 of BM25.
The catch is cost. A cross-encoder must run once per candidate document at query time, so you cannot run it over a whole corpus. This is why the established design is two-stage retrieve-then-rerank: a cheap bi-encoder or hybrid retriever fetches a candidate set (say the top 50 to 100), then an expensive cross-encoder reorders that set and you keep the top handful. Zero-shot evaluations across diverse retrieval tasks confirm the pattern: reranking-based models achieve the best accuracy but at high computational cost, while dense retrievers are cheaper but often less accurate. Reranking earns its latency when first-stage retrieval returns the right chunks but in the wrong order, or when you need high precision in a small final context. If your retriever is already returning the answer at rank one, a reranker adds latency for little gain, which is why you measure before adding it.
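The two-stage shape can be sketched independently of any particular model. In the example below the cheap and expensive scorers are toy stand-ins (term overlap and bigram overlap) chosen only so the code runs without dependencies; in a real system they would be a bi-encoder or hybrid retriever and a cross-encoder respectively. All names are hypothetical.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(query: str, corpus: dict[str, str],
                         cheap_score: Scorer, expensive_score: Scorer,
                         candidates: int = 50, final_k: int = 5) -> list[str]:
    """Stage 1 scores the whole corpus cheaply; stage 2 rescoring touches only the shortlist."""
    shortlist = sorted(corpus, key=lambda cid: cheap_score(query, corpus[cid]),
                       reverse=True)[:candidates]
    reranked = sorted(shortlist, key=lambda cid: expensive_score(query, corpus[cid]),
                      reverse=True)
    return reranked[:final_k]


def term_overlap(query: str, text: str) -> float:
    """Toy first-stage scorer: number of distinct shared words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def bigram_overlap(query: str, text: str) -> float:
    """Toy second-stage scorer: number of shared adjacent word pairs."""
    def bigrams(s: str) -> set[tuple[str, str]]:
        tokens = s.lower().split()
        return set(zip(tokens, tokens[1:]))
    return float(len(bigrams(query) & bigrams(text)))
```

The property worth noticing is the call count: the expensive scorer runs at most `candidates` times per query regardless of corpus size, which is what makes a cross-encoder affordable.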
Context assembly and generation
Once you have your final chunks, you must fit them into the context window, the fixed token budget the model can attend to, covered at /what-is/context-window. Naively concatenating everything you retrieved is a mistake on two counts. First, it wastes budget and money on marginal passages. Second, and less obviously, where you place a passage matters: language models use information at the very beginning and the very end of a long context far more reliably than information buried in the middle, a degradation that holds even for models marketed as long-context. The practical consequence is to put the most important retrieved passages first and last, keep the retrieved set tight, and not assume that a bigger window means you can dump more in.
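One simple way to act on the first-and-last advice is to interleave the ranked chunks so that the strongest sit at the two ends of the context and the weakest end up in the middle. The sketch below is one possible arrangement, not an established standard, and the function name is hypothetical.

```python
def order_for_context(ranked: list[str]) -> list[str]:
    """Take chunks ranked best-first and place the best at the edges of the context."""
    front: list[str] = []
    back: list[str] = []
    for position, chunk in enumerate(ranked):
        # Even positions (1st, 3rd, ...) fill from the start, odd ones from the end.
        (front if position % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(order_for_context(["r1", "r2", "r3", "r4", "r5"]))  # ['r1', 'r3', 'r5', 'r4', 'r2']
```

Here the top-ranked chunk opens the context, the second-ranked closes it, and the lowest-ranked chunk lands in the middle, where attention is least reliable.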
Prompt construction for grounded answers is its own small discipline. The system prompt should instruct the model to answer only from the supplied passages, to cite which passage each claim came from (pass stable chunk identifiers alongside the text so it can), and crucially to abstain, to say it does not know, when the passages do not contain the answer. That last instruction is what converts a confident fabrication into an honest "not found", and it is the single highest-value line in most RAG prompts. RAG reduces but does not eliminate hallucination, as the non-technical pillar notes; explicit abstention instructions, citation requirements and keeping the context focused are the levers that push the residual rate down.
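A minimal sketch of such a prompt builder follows. The wording of the instructions, the bracketed ID format and the function name are illustrative assumptions; the three elements taken from the text are answer-only-from-passages, cite-by-ID and abstain.

```python
SYSTEM_PROMPT = (
    "Answer the question using only the passages below. "
    "After each claim, cite the passage ID in square brackets. "
    "If the passages do not contain the answer, reply exactly: "
    "I do not know based on the provided documents."
)


def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble a grounded prompt from (chunk_id, text) pairs."""
    passages = "\n\n".join(f"[{chunk_id}]\n{text}" for chunk_id, text in chunks)
    return f"{SYSTEM_PROMPT}\n\nPassages:\n{passages}\n\nQuestion: {question}\nAnswer:"


prompt = build_prompt(
    "How long is parental leave?",
    [("handbook#12", "Parental leave is twenty weeks."),
     ("handbook#13", "Sick leave is ten days.")],
)
```

Keeping the chunk IDs stable between ingestion and prompting is what lets the citations in the answer be mapped back to source documents afterwards.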
Advanced and current patterns
Several patterns extend the basic pipeline, and it is worth being clear about what is established versus emerging. Query routing, sending a query to one of several indexes, tools or prompts based on its type, is established and low-risk. Parent-document, or small-to-big, retrieval is established and widely used: you embed and retrieve on small, precise child chunks for accuracy, then feed the larger parent chunk or full section to the model for context, getting the best of both granularities. Multi-hop, or iterative, retrieval, where the system retrieves, reasons, then retrieves again using what it learned, is necessary for questions whose answer requires chaining facts across documents, and is moving from research into mainstream practice.
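Parent-document retrieval needs little more than a mapping from each child chunk to its parent. The sketch below assumes the child hits have already been retrieved and ranked; the function name, the dictionaries and the cap on the number of parents are hypothetical.

```python
def expand_to_parents(child_hits: list[str], child_to_parent: dict[str, str],
                      parents: dict[str, str], max_parents: int = 3) -> list[str]:
    """Map ranked child-chunk hits to their parent sections, deduplicated in rank order."""
    parent_ids: list[str] = []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
        if len(parent_ids) == max_parents:
            break  # cap the context cost: parents are much larger than children
    return [parents[parent_id] for parent_id in parent_ids]
```

Deduplication matters here: several high-ranking children often share one parent, and sending the same section twice wastes context-window budget.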
Contextual retrieval, prepending a short model-generated summary that situates each chunk within its source document before embedding and indexing, is a recent and well-evidenced technique. Adding contextual embeddings alone reduced the top-20-chunk retrieval failure rate by 35 per cent (from 5.7 per cent to 3.7 per cent); combining contextual embeddings with contextual BM25 reduced it by 49 per cent (to 2.9 per cent); and reranking the contextualised hybrid results reduced it by 67 per cent (to 1.9 per cent). GraphRAG, which uses a language model to build an entity-and-relationship knowledge graph from the corpus and pre-generates community summaries, targets a specific weakness: global "sensemaking" questions over an entire corpus ("what are the main themes?") that vector retrieval handles poorly because no single chunk contains the answer. It is powerful but expensive to build and maintain, so it is justified by the question type, not adopted by default. Agentic RAG, where a model-driven agent decides when to retrieve, what to retrieve, and whether to retrieve again, and can critique its own draft, is the leading edge; self-reflective approaches that train a model to retrieve on demand and grade its own output have shown significant gains in factuality and citation accuracy for long-form generation, for example scoring roughly 80 per cent factuality on biography generation against ChatGPT's roughly 71 per cent, alongside improved accuracy on the PubHealth fact-verification task. Treat agentic RAG as promising and fast-moving rather than settled, and instrument it heavily, because every added decision point is another failure mode.
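The contextual-retrieval percentages quoted above are relative reductions in failure rate, not percentage-point drops, which is easy to misread. A short check using only the figures already given makes the distinction explicit (the function name is hypothetical):

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which a failure rate fell, relative to where it started."""
    return (baseline - improved) / baseline


BASELINE = 5.7  # top-20-chunk retrieval failure rate, in per cent
for label, rate in [("contextual embeddings", 3.7),
                    ("plus contextual BM25", 2.9),
                    ("plus reranking", 1.9)]:
    points = BASELINE - rate
    print(f"{label}: {rate}% failures, {points:.1f} points lower, "
          f"{relative_reduction(BASELINE, rate):.0%} relative reduction")
```

So the headline 67 per cent corresponds to a drop of 3.8 percentage points: large in relative terms, but worth weighing against the per-chunk generation cost on a corpus where the baseline failure rate is already low.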
Evaluation
Evaluation is the discipline that separates a working system from a convincing demo. You cannot eyeball RAG quality; a system that answers ten cherry-picked questions well can fail on the eleventh in ways you will only catch with measurement. The foundation is a golden, or eval, set: a curated collection of representative questions paired with known-good answers and the passages that support them, built from real user queries where possible. Without it you are tuning blind.
Measurement splits cleanly into two layers. Retrieval metrics ask whether the right chunks were found and ranked well: recall@k (what fraction of the relevant chunks appeared in the top k), precision@k (what fraction of the top k returned chunks were relevant), Mean Reciprocal Rank (how high the first relevant chunk ranked, averaged over queries), and normalised Discounted Cumulative Gain (nDCG, which rewards putting the most relevant chunks highest). These let you tune chunking, embeddings, top-k and reranking in isolation, before generation muddies the picture. End-to-end and generation metrics ask whether the final answer is good: faithfulness or groundedness (is every claim supported by the retrieved context, the direct measure of hallucination), answer relevance (does the answer address the question), and context precision and recall (was the retrieved context focused and complete). Recognised frameworks such as RAGAS provide reference-free metrics for faithfulness, answer relevance and context relevance, and benchmarks such as BEIR (zero-shot retrieval) and MTEB (embeddings) help you choose components. Instrument both layers, because an end-to-end failure tells you something broke but not where; the retrieval metrics tell you whether to look upstream or down.
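The retrieval-layer metrics are simple enough to implement directly, which is useful for understanding what a framework reports. The sketch below uses the common textbook definitions and assumes each retrieved list contains unique chunk IDs; the function names are hypothetical.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k slots occupied by relevant chunks."""
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """nDCG with graded relevance; `gains` maps chunk ID to its relevance grade."""
    dcg = sum(gains.get(chunk_id, 0.0) / math.log2(rank + 1)
              for rank, chunk_id in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1) for rank, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Run over the golden set after every change to chunking, embeddings or reranking, these functions give the before-and-after numbers that the rest of this section relies on.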
Failure modes and architecture choices
The value of a two-layer evaluation is that it localises failure. When a RAG answer is wrong, isolate the stage. A retrieval miss means the relevant chunk was never in the candidate set: check chunking, the embedding model, and whether you need hybrid or a higher k. A reranking miss means the chunk was retrieved but the reranker pushed it out of the final set: check the reranker and the final-k cut. A generation miss means the right chunk was in the prompt but the model ignored it, contradicted it, or hallucinated around it: check the prompt, the abstention instruction, and context ordering. A missing-source failure means the answer simply is not in the corpus, in which case the correct behaviour is abstention, not invention. Diagnosing these requires logging the intermediate artefacts (retrieved chunk IDs, reranker scores, final prompt) for every query, which most teams add only after their first production incident.
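With the intermediate artefacts logged, the triage described above becomes mechanical. The sketch below classifies one evaluated query; the function name, argument names and returned labels are hypothetical, and it assumes the eval set records which chunk IDs support each answer (an empty set meaning the answer is not in the corpus).

```python
def localise_failure(gold_chunk_ids: set[str], candidate_ids: list[str],
                     final_ids: list[str], answer_correct: bool) -> str:
    """Name the first pipeline stage at which a query went wrong."""
    if not gold_chunk_ids:
        # Nothing in the corpus supports an answer: abstention is the right behaviour.
        return "missing source"
    if not gold_chunk_ids & set(candidate_ids):
        return "retrieval miss"    # never made it into the candidate set
    if not gold_chunk_ids & set(final_ids):
        return "reranking miss"    # retrieved, then cut before the prompt
    if not answer_correct:
        return "generation miss"   # in the prompt, but the answer is still wrong
    return "ok"
```

Counting these labels across the eval set shows at a glance which stage deserves the next week of work.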
This framing also clarifies the larger architecture choices, including how RAG relates to fine-tuning and long-context approaches. The conceptual difference (behaviour versus information) is on the non-technical page; the engineering trade is this. Fine-tuning changes what the model is and how it behaves, costs training compute, and bakes knowledge in at a point in time, so it suits style, format and narrow-domain behaviour but not fast-changing facts. Long-context prompting, putting whole documents in the window, avoids a retrieval layer but pays per token on every call, scales badly in cost and latency as documents grow, and runs into the lost-in-the-middle degradation; a larger window is not a memory architecture and does not remove the need to decide what to show the model. RAG keeps knowledge external, current and auditable, at the cost of pipeline complexity. These are complements, not rivals: a mature system may fine-tune for tone, use RAG for facts, and reserve long-context for whole-document reasoning on a single retrieved file. The build-versus-buy decision follows the same logic: managed retrieval and RAG services remove undifferentiated heavy lifting and are a sensible default for a first system, while the case for building grows with scale, latency requirements, data-residency constraints and the need to control every stage. Budget latency and cost across the whole pipeline (embedding, retrieval, reranking, generation, and any extra retrieval or summarisation passes) rather than optimising one stage, because the slowest or most expensive stage sets your envelope.
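Budgeting across the pipeline can start as a very small calculation. The sketch below sums per-stage latencies and reports the dominant stage; the function name is hypothetical and the millisecond figures in the usage example are invented purely for illustration, not measurements of any system.

```python
def summarise_budget(stage_ms: dict[str, float], budget_ms: float) -> dict:
    """Total the per-stage latencies and identify the stage that dominates."""
    total = sum(stage_ms.values())
    dominant = max(stage_ms, key=stage_ms.get)
    return {
        "total_ms": total,
        "dominant_stage": dominant,
        "dominant_share": stage_ms[dominant] / total,
        "within_budget": total <= budget_ms,
    }


# Illustrative numbers only.
report = summarise_budget(
    {"embed_query": 20, "retrieve": 30, "rerank": 150, "generate": 800},
    budget_ms=1200,
)
print(report)
```

In this made-up example generation accounts for 80 per cent of the total, so shaving milliseconds off retrieval would barely move the end-to-end figure; the same calculation applies to cost per query.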
Examples
Hybrid-retrieval knowledge assistant with reranking. An internal assistant over policy, HR and engineering documents. Ingestion uses structure-aware chunking that respects Markdown and PDF headings, prepends the section path to each chunk as metadata, and stores access-level tags. Retrieval is hybrid: a dense retriever for conceptual questions and BM25 for exact policy codes and product names, fused with RRF. The fused top 50 is reranked by a cross-encoder down to the best 5, which are placed in the prompt with the strongest passages first and last. The generator is instructed to cite chunk IDs and to abstain when the passages do not contain the answer. This is the sensible default architecture for most enterprise question-answering, and the baseline every team should measure against before adding complexity.
High-recall regulated-document system. A system over clinical or financial filings where missing a relevant clause is the cardinal sin. Here the design biases everything toward recall: smaller chunks with generous overlap so no clause is cut, a high top-k, hybrid retrieval to catch both exact regulatory references and paraphrased obligations, and a reranker to restore precision after the wide first-stage net. Permission-aware filtering is applied at retrieval time, not after generation, so the model never sees a passage the user is not cleared for. Evaluation weights recall@k heavily and tracks faithfulness strictly, and the abstention behaviour is tested as a first-class requirement because a wrong confident answer here carries real liability.
Multi-hop research assistant. A system answering questions that require chaining facts ("which suppliers mentioned in the 2024 risk report also appear in the procurement exceptions log?"). A single retrieval pass cannot answer this, so the design is iterative: retrieve, let the model identify what it still needs, retrieve again with a refined query, and synthesise. Parent-document retrieval gives precise matching on child chunks with full-section context for reasoning. This is the most capable and the most fragile of the three; it needs the heaviest instrumentation, per-hop logging, and tight guardrails on the number of iterations to bound latency and cost.
Common misunderstandings
"A big enough context window removes the need for retrieval." It does not. Models use information at the start and end of a long context far more reliably than the middle, so stuffing a window degrades quality even when the answer is present, and you pay per token on every call. Retrieval decides what deserves to be in the window; a large window is a bigger desk, not a librarian.
"More chunks are always better." Raising top-k raises recall but adds noise, dilutes the prompt, increases cost, and can push the model toward irrelevant passages. Beyond a point, more retrieved chunks lower answer quality. The fix is a tight final set, which is what reranking produces.
"Vector search alone beats hybrid." Dense retrieval handles meaning but smears exact tokens, codes, names and rare terms that BM25 nails. Their errors are complementary, which is why fused hybrid retrieval consistently beats either alone on heterogeneous corpora.
"Embeddings from different models are comparable." They are not. Each model defines its own vector space; mixing a query vector from one model with chunk vectors from another produces meaningless similarities. Index and query must use the same model, and changing it means re-embedding everything.
"Evaluation can be eyeballed." A demo proves a system can work, not that it does. Without a golden set and both retrieval and generation metrics, you cannot tell whether a change helped, and you cannot localise a failure to a stage. Measurement is the difference between engineering and hoping.
Risks and boundaries
The failure-mode taxonomy above (retrieval miss, reranking miss, generation miss, missing source) is the core diagnostic discipline; log intermediate artefacts so you can apply it. Beyond accuracy, three operational risks dominate. Freshness and synchronisation: an index is a snapshot, so when a source document changes or is deleted, the index must be updated or the system will confidently cite stale or withdrawn content. Deletion is the harder half; a retracted document must leave the index, and "right to be forgotten" obligations make this a compliance matter, not just hygiene. Permission-aware retrieval: access control must be enforced at the retrieval stage through metadata filtering, so the model never receives a passage the user is not authorised to see. Filtering after generation is too late, because the answer may already embed the protected content.
Security failure modes deserve a technical pointer rather than a full treatment here; consult dedicated security explainers and the OWASP and NIST resources for depth. Two categories are RAG-specific. Knowledge-base poisoning is when an attacker plants malicious or misleading content in a source the system ingests, so the poisoned passage is later retrieved and trusted; OWASP describes this data-poisoning risk as content that "can originate from insiders, prompts, data seeding, or unverified data providers, leading to manipulated model outputs." Indirect prompt injection is when retrieved content itself contains instructions that hijack the model, for example hidden text in an ingested document telling the model to ignore its instructions; OWASP defines these as occurring "when an LLM accepts input from external sources, such as websites or files", and notes that techniques like RAG "do not fully mitigate prompt injection vulnerabilities." Because RAG deliberately feeds external content into the prompt, it is structurally exposed to this. OWASP treats "Vector and Embedding Weaknesses" (LLM08:2025) as a distinct risk, warning that "weaknesses in how vectors and embeddings are generated, stored, or retrieved can be exploited... to inject harmful content, manipulate model outputs, or access sensitive information", and flags unauthorised access, cross-tenant context leakage in multi-tenant vector stores, and embedding inversion attacks that "recover significant amounts of source information." Prompt injection (LLM01:2025) is OWASP's top-ranked LLM application risk. Mitigations include treating retrieved content as untrusted, isolating and clearly delimiting it in the prompt, applying least-privilege access to the store, and validating ingested sources. 
The NIST Generative AI Profile (NIST AI 600-1, published July 2024) provides a complementary risk vocabulary, enumerating twelve GenAI risk categories including confabulation (its term for hallucination), information security and information integrity, with data poisoning treated within those areas.
Finally, the limits of RAG as an architecture. RAG grounds answers in retrieved text; it does not reason over the whole corpus at once, so global sensemaking questions need graph-augmented or summarisation approaches. It cannot answer what is not in its sources, and the correct behaviour there is abstention. And it inherits the source-quality dependence the non-technical page frames as "garbage in, garbage out": no retrieval cleverness compensates for a corpus that is wrong, contradictory or out of date.
What to do next
These are technical-lead actions, distinct from the sequenced non-technical steps on /what-is/rag.
First, build the eval set before you build the system. Curate representative questions with known-good answers and supporting passages from real user queries. This is the asset that lets every later decision be measured rather than argued, and it is the single highest-return thing you can do.
Second, instrument every stage. Log retrieved chunk IDs, reranker scores and the final assembled prompt for each query from day one. Without this you cannot localise failures, and you will add it anyway after your first incident, so add it now.
Third, start with a hybrid retrieval plus reranking baseline. Dense plus BM25 fused with RRF, a cross-encoder reranker over the top 50, a tight final context with abstention instructions. Measure it on your eval set. Only add contextual retrieval, multi-hop, GraphRAG or agentic patterns when the metrics show the baseline failing on a specific question type, and prove each addition earns its latency and cost.
Fourth, budget latency and cost across the whole pipeline. Add up embedding, retrieval, reranking, generation and any extra passes, and find the stage that sets your envelope before optimising. Reranking and per-chunk contextualisation buy quality for latency and spend; know the exchange rate for your workload.
Fifth, decide build versus buy deliberately. A managed retrieval service is a sensible default for a first system and removes undifferentiated work. The case for building grows with scale, strict latency, data-residency constraints and the need to control each stage. Revisit the decision when any of those thresholds is crossed, not before.
The benchmark that should change your plan: if recall@k on your eval set is low, your problem is upstream (chunking, embeddings, retrieval) and no prompt engineering will fix it. If recall is high but faithfulness is low, your problem is downstream (prompt, ordering, abstention, model). Let those two numbers, not intuition, direct your effort.
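That decision rule can be written down directly. In this sketch recall@k is computed from retrieved and relevant chunk IDs, and the two 0.8 floors are placeholder assumptions to be replaced with thresholds that suit your workload; the faithfulness score is assumed to come from a separate evaluator.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("each eval question needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def triage(mean_recall, mean_faithfulness, recall_floor=0.8, faithfulness_floor=0.8):
    """Point effort upstream or downstream. The floors are assumed values."""
    if mean_recall < recall_floor:
        return "upstream"    # chunking, embeddings, retrieval
    if mean_faithfulness < faithfulness_floor:
        return "downstream"  # prompt, ordering, abstention, model
    return "ok"
```

Recall is checked first on purpose: a low faithfulness score means little while the right passages are not reaching the generator.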
FAQs
How do I choose a chunk size?
There is no universal answer; it depends on your documents and queries. Smaller chunks raise precision but risk splitting answers and losing context; larger chunks preserve context but dilute the embedding and waste context-window budget. Start with structure-aware splitting on natural boundaries, use 10 to 20 per cent overlap, and tune size against recall@k on your eval set rather than guessing.
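As a starting point for tuning, here is a minimal sliding-window chunker with proportional overlap. It is a sketch under simplifying assumptions: it counts whitespace-separated words rather than model tokens, and it is meant to be applied to the text of one structural unit (a section or similar) after structure-aware splitting, not to a whole document.

```python
def chunk_words(text, size=200, overlap_ratio=0.15):
    """Split text into word windows of `size` words with proportional overlap."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    words = text.split()
    # Advance by the window size minus the overlap, and by at least one word.
    step = max(1, size - int(size * overlap_ratio))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Sweeping size and overlap_ratio and re-measuring recall@k on the eval set turns the choice into a measurement rather than a guess.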
Dense, sparse or hybrid retrieval?
Hybrid, in almost all cases. Dense retrieval captures meaning and paraphrase; sparse retrieval (BM25) nails exact terms, codes, names and rare tokens. Their failure modes are complementary, so fusing them with Reciprocal Rank Fusion typically beats either alone on mixed corpora. Dense-only retrieval is defensible only if your queries are purely conceptual and contain no exact identifiers.
When is reranking worth the latency?
When first-stage retrieval returns the right chunks but in the wrong order, or when you need high precision in a small final context. Run a cheap retriever to get a wide candidate set, then a cross-encoder to reorder it. If your retriever already ranks the answer first, a reranker adds cost for little gain, so measure before adding it.
How do I actually evaluate a RAG system?
Build a golden set of representative questions with known answers and supporting passages. Measure retrieval separately (recall@k, precision, MRR, nDCG) and generation separately (faithfulness, answer relevance, context precision and recall). Frameworks such as RAGAS automate the generation-side metrics. The two layers together let you localise failures, which eyeballing cannot.
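Two of the retrieval-side metrics are short enough to sketch without a framework. The code assumes binary relevance for MRR and a per-chunk graded gain for nDCG, using the common log2 discount; the function names are chosen for this example.

```python
import math


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved, gains, k):
    """nDCG@k, where `gains` maps chunk ID to a graded relevance value."""
    dcg = sum(
        gains.get(chunk_id, 0.0) / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

MRR rewards putting one correct passage first, while nDCG rewards ordering several passages of differing value well, so the two can disagree on the same run.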
Does a long context window replace RAG?
No. Models use the start and end of a long context far more reliably than the middle, so quality degrades even when the answer is present, and you pay per token on every call. Long context suits reasoning over a single retrieved document; retrieval is still what decides which document and which passages belong in the window.
What is the difference between a bi-encoder and a cross-encoder?
A bi-encoder embeds query and document separately, which lets you index documents ahead of time and is what makes fast retrieval possible. A cross-encoder processes query and document together for a much more accurate relevance score, but must run per candidate at query time, so it is too slow to search a whole corpus. Retrieval uses the bi-encoder; reranking uses the cross-encoder.
How should I handle tables and PDFs?
Do not rely on naive text extraction, which linearises columns and destroys tables. Use layout-aware parsing, extract tables explicitly and convert them to Markdown or sentence form, retain headings and prepend them to child chunks as metadata, and chunk along document structure. Poor parsing is a silent source of retrieval failure that no downstream stage can fix.
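The conversion and heading steps might look like the sketch below. It assumes a layout-aware parser has already produced a header row, data rows and the chain of headings above each chunk; the function names and the " > " separator are choices made for this example.

```python
def table_to_markdown(header, rows):
    """Render an extracted table as a Markdown table string."""
    def fmt(cells):
        # Escape pipes so cell text cannot break the column structure.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


def rows_to_sentences(header, rows):
    """Alternative: one self-contained sentence per table row."""
    return [
        "; ".join(f"{h}: {c}" for h, c in zip(header, row)) + "."
        for row in rows
    ]


def with_heading_path(chunk_text, headings):
    """Prepend the chain of parent headings so a child chunk keeps its context."""
    return " > ".join(headings) + "\n\n" + chunk_text
```

Sentence form repeats the column names in every row, which costs tokens but keeps each row meaningful if the table is split across chunks.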
How do I trade recall against latency in the index?
Approximate nearest-neighbour methods expose tuning parameters that dial recall up at the cost of query time and memory: graph degree and search breadth for HNSW, the number of clusters probed for IVF. Product quantization cuts memory at some accuracy cost. There is no universal setting; measure recall on your own queries against your latency budget and tune to the knee of that curve.
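Once you have swept a parameter and recorded recall and latency at each setting, choosing among the settings is mechanical. The sketch below applies one simple rule, the highest recall that fits the latency budget, which is an assumption of this example and not the only reasonable rule; the measurements in the usage lines are invented.

```python
def pick_setting(measurements, latency_budget_ms):
    """Return the measured setting with the best recall within the budget.

    Each measurement is a dict with "param", "recall" and "p95_ms" keys.
    Ties on recall go to the faster setting. Returns None if nothing fits.
    """
    within = [m for m in measurements if m["p95_ms"] <= latency_budget_ms]
    if not within:
        return None
    return max(within, key=lambda m: (m["recall"], -m["p95_ms"]))


# Hypothetical sweep over a search-breadth parameter.
sweep = [
    {"param": 16, "recall": 0.82, "p95_ms": 4.0},
    {"param": 64, "recall": 0.95, "p95_ms": 9.0},
    {"param": 256, "recall": 0.97, "p95_ms": 31.0},
]
chosen = pick_setting(sweep, latency_budget_ms=20.0)
```

In a sweep shaped like this one, the last step buys a small recall gain for a large latency increase, which is what the knee of the curve looks like in practice.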
When should I use GraphRAG or agentic RAG?
Use GraphRAG for global "sensemaking" questions over a whole corpus that ordinary vector retrieval handles poorly, accepting its higher build and maintenance cost. Use agentic or multi-hop patterns when answers require chaining facts across documents. Both are justified by the question type, not adopted by default, and both need heavier instrumentation because each added decision point is another failure mode.
How do I stop the system citing deleted or stale content?
Treat the index as a snapshot that must be synchronised with its sources. Build a pipeline that updates and, critically, deletes index entries when source documents change or are removed; deletion also carries compliance weight under right-to-be-forgotten obligations. Stale citations are a freshness failure, not a model failure, and are fixed in the ingestion pipeline.
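A synchronisation pass reduces to comparing what the sources contain now with what the index holds. This sketch assumes each indexed document stores a content hash and that document IDs are stable across runs; plan_sync is a hypothetical helper that only plans the work and does not call any vector store.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source_docs, indexed_hashes):
    """Decide which documents to re-index and which to delete.

    source_docs:    document ID -> current text in the source system
    indexed_hashes: document ID -> content hash stored at last indexing
    """
    to_upsert = sorted(
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    # Anything indexed but no longer in the sources must be removed.
    to_delete = sorted(set(indexed_hashes) - set(source_docs))
    return to_upsert, to_delete
```

When a document is re-indexed, its old chunks need deleting as well as replacing, since a changed document may now split into fewer chunks than before.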
Sources
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (NeurIPS (Advances in Neural Information Processing Systems 33)). The original RAG formulation; retrieve-then-generate architecture and end-to-end framing.
Dense Passage Retrieval for Open-Domain Question Answering (ACL Anthology (EMNLP 2020)). Dense dual-encoder retrieval and its measured 9 to 19 per cent absolute advantage over BM25 in top-20 passage retrieval accuracy.
Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (IEEE Transactions on Pattern Analysis and Machine Intelligence). HNSW approximate nearest neighbour indexing and recall-versus-latency behaviour.
Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods (ACM SIGIR (Proceedings of the 32nd ACM SIGIR Conference)). Reciprocal Rank Fusion for hybrid retrieval fusion, including the k = 60 default.
Passage Re-ranking with BERT (arXiv (New York University)). Cross-encoder reranking and two-stage retrieve-then-rerank gains (MRR@10 35.8 on MS MARCO).
Lost in the Middle: How Language Models Use Long Contexts (Transactions of the Association for Computational Linguistics (MIT Press)). Positional degradation in long contexts; context-ordering guidance.
RAGAs: Automated Evaluation of Retrieval Augmented Generation (ACL Anthology (EACL 2024 System Demonstrations)). Reference-free RAG evaluation metrics (faithfulness, answer relevance, context relevance).
OWASP Top 10 for LLM Applications 2025 (LLM01 Prompt Injection; LLM08 Vector and Embedding Weaknesses) (OWASP Gen AI Security Project). RAG security failure modes: prompt injection, data poisoning, and vector/embedding weaknesses.
