RAG (Retrieval-Augmented Generation) — Definition

RAG addresses three fundamental LLM limitations: (1) **knowledge cutoff**, since the model's training data includes nothing created after its cutoff date; (2) **hallucinations**, since the model may fabricate an answer when it lacks the relevant knowledge; (3) **private data**, since the model has no knowledge of a given company's internal content.

Standard pipeline: indexing (split documents into chunks, embed each chunk, store in a vector DB) then runtime (embed the query, retrieve the top-k nearest chunks, inject them into the prompt with an instruction like "answer using only these sources").
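The two phases above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production setup: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector DB, and the function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing phase: embed each chunk and store it (list = toy vector DB)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Runtime phase: embed the query and return the top-k nearest chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, sources: list[str]) -> str:
    """Inject the retrieved chunks into the prompt with a grounding instruction."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, 1))
    return (
        "Answer using only these sources.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {query}"
    )


# Hypothetical sample data, for illustration only.
chunks = [
    "The refund window is 30 days from the delivery date.",
    "Support is available by email on weekdays.",
    "Refund requests are processed within 5 business days.",
]
index = build_index(chunks)
question = "How long is the refund window?"
top = retrieve(index, question, k=2)
prompt = build_prompt(question, top)
```

In a real system, `embed` would call an embedding model and `build_index` would write to a vector store; the shape of the flow stays the same.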

Common variants in 2026: hybrid RAG (vector cosine similarity combined with BM25 keyword scoring), agentic RAG (the agent decides what to retrieve across multiple turns), tool-RAG (retrieving tools rather than documents), GraphRAG (retrieval over a graph rather than a flat vector store).
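As one illustration of the hybrid variant, the sketch below blends a cosine score with a BM25 score. Several choices here are assumptions rather than a standard: the vector side uses toy term-count vectors in place of real embeddings, the BM25 formula is one common formulation with default-looking `k1` and `b` values, and the min-max normalization and `alpha` weighting are only one of several ways to fuse the two rankings.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_scores(query: str, docs: list[str]) -> list[float]:
    """Vector side: cosine over term counts (toy stand-in for embeddings)."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    scores = []
    for doc in docs:
        d = Counter(tokenize(doc))
        dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
        norm = q_norm * math.sqrt(sum(v * v for v in d.values()))
        scores.append(dot / norm if norm else 0.0)
    return scores


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Keyword side: BM25 with a smoothed, always-positive IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = (sum(len(t) for t in tokenized) / n) or 1.0
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]


def hybrid_rank(query: str, docs: list[str], alpha: float = 0.5) -> list[int]:
    """Return document indices, best first; alpha weights the vector side."""
    if not docs:
        return []
    vec = min_max(cosine_scores(query, docs))
    kw = min_max(bm25_scores(query, docs))
    combined = [alpha * v + (1 - alpha) * k for v, k in zip(vec, kw)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)


# Hypothetical sample data, for illustration only.
docs = [
    "Error code E42 means the disk is full.",
    "The disk cleanup tool frees space.",
    "Billing questions go to finance.",
]
ranking = hybrid_rank("What does E42 mean?", docs)
```

The keyword side is what lets an exact identifier such as `E42` win even when an embedding model would not place it close to the query.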

FAQ

What chunk size should I use for RAG?

For technical text: 500–1,000 tokens. For marketing or brand voice content: 200–500 tokens. The rule of thumb: a chunk should be relevant and self-contained when read in isolation.
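A rough sketch of a chunker that respects a token budget while keeping sentences whole, which is one way to keep chunks self-contained. It rests on two assumptions: whitespace-separated words approximate tokens (a real tokenizer will count differently), and sentences end with `.`, `!`, or `?`. The function name `chunk_text` is hypothetical.

```python
import re


def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Pack whole sentences into chunks of at most max_tokens words.

    A single sentence longer than max_tokens is kept intact as its own
    oversized chunk rather than being cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())  # word count as a token approximation
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine."
pieces = chunk_text(sample, max_tokens=6)
```

For technical text you might call it with `max_tokens` somewhere in the 500 to 1,000 range, and for marketing copy in the 200 to 500 range, then spot-check that each chunk still makes sense on its own.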

Which embedding model is recommended in 2026?

OpenAI text-embedding-3-large for quality, or voyage-3 / Cohere Embed v3 (embed-multilingual-v3.0) for multilingual use cases with a better price-to-quality ratio. For non-English languages specifically, Voyage models frequently rank at or near the top of retrieval benchmarks.