Glossary
RAG (Retrieval-Augmented Generation)
RAG (Retrieval-Augmented Generation) is an AI architecture that pairs a large language model with a semantic retrieval step, typically backed by a vector database, that runs before generation. Retrieval grounds the output in fresh, private, or domain-specific data that is not encoded in the model's weights.
Also known as
- RAG
- Retrieval-Augmented Generation
RAG addresses three fundamental LLM limitations: (1) **knowledge cutoff**: the model's training data includes nothing created after its cutoff date; (2) **hallucinations**: the model may fabricate an answer when it doesn't know, which grounding in retrieved sources reduces but does not eliminate; (3) **private data**: the model has no knowledge of a given company's internal content.
Standard pipeline: an indexing phase (split documents into chunks, embed each chunk, store the vectors in a vector DB) followed by a runtime phase (embed the query, retrieve the top-k nearest chunks, inject them into the prompt with an instruction like "answer using only these sources").
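The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `embed` is a toy hashed bag-of-words stand-in for a real embedding model, a plain list stands in for the vector DB, and every function name here is hypothetical.

```python
import hashlib
import math
import re

DIM = 1024  # toy vector width; an assumption, not a recommendation


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def split_into_chunks(document: str, max_words: int = 40) -> list[str]:
    """Naive fixed-size chunking on whitespace-separated words."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(documents: list[str]) -> list[tuple[str, list[float]]]:
    """Indexing phase: chunk, embed, store. The list stands in for a vector DB."""
    return [(chunk, embed(chunk)) for doc in documents for chunk in split_into_chunks(doc)]


def retrieve(index: list[tuple[str, list[float]]], query: str, k: int = 2) -> list[str]:
    """Runtime phase, step 1: embed the query and return the top-k nearest chunks."""
    q = embed(query)
    # Vectors are unit length, so the dot product equals cosine similarity.
    ranked = sorted(index, key=lambda item: sum(a * b for a, b in zip(q, item[1])), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Runtime phase, step 2: inject the retrieved chunks into the prompt."""
    sources = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"
```

In a real system, `embed` would call an embedding model and `build_index` / `retrieve` would write to and query a vector store; the control flow stays the same.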
2026 variants: hybrid RAG (vector cosine similarity combined with BM25 keyword scoring), agentic RAG (the agent decides what to retrieve across multiple turns), tool-RAG (retrieving tools rather than documents), GraphRAG (retrieval over a graph rather than a flat vector store).
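To make the hybrid variant concrete, the sketch below ranks documents twice, once with a cosine score and once with BM25, then merges the two rankings. The merge uses reciprocal rank fusion, which is an assumption here: it is one common fusion method, not something this glossary prescribes. The bag-of-words cosine is only a stand-in for real embedding similarity, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n if n else 0.0
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def cosine_scores(query: str, docs: list[str]) -> list[float]:
    """Bag-of-words cosine; a stand-in for embedding similarity (assumption)."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    scores = []
    for d in docs:
        v = Counter(tokenize(d))
        v_norm = math.sqrt(sum(c * c for c in v.values()))
        dot = sum(q[t] * v[t] for t in q)
        scores.append(dot / (q_norm * v_norm) if q_norm and v_norm else 0.0)
    return scores


def rank(scores: list[float]) -> list[int]:
    """Document indices ordered best-first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge rankings: each list contributes 1 / (k + rank) per document."""
    fused: Counter = Counter()
    for ranking in rankings:
        for position, doc_id in enumerate(ranking):
            fused[doc_id] += 1.0 / (k + position + 1)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


def hybrid_search(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    vector_ranking = rank(cosine_scores(query, docs))
    keyword_ranking = rank(bm25_scores(query, docs))
    fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
    return [docs[i] for i in fused[:top_k]]
```

Fusing ranks rather than raw scores sidesteps the fact that cosine and BM25 values live on different scales; a weighted sum of normalized scores is another option.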
In the getchatsocial.com product
getchatsocial.com uses multi-stage hybrid RAG: tool-RAG (pgvector plus BM25) for per-turn tool selection, KG-RAG (knowledge-graph RAG) for brand context retrieval, and a tool-result cache that deduplicates repeated tool calls within a conversation.
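The idea of a tool-result cache can be illustrated generically. The sketch below is not the product's implementation; `ToolResultCache` and its methods are hypothetical names, and it assumes tool arguments are JSON-serializable and that a tool's result does not change during one conversation.

```python
import json
from typing import Any, Callable


class ToolResultCache:
    """Per-conversation cache keyed by tool name plus canonicalized arguments."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}
        self.hits = 0

    @staticmethod
    def _key(tool_name: str, args: dict[str, Any]) -> str:
        # sort_keys makes argument order irrelevant when building the key.
        return tool_name + ":" + json.dumps(args, sort_keys=True)

    def call(self, tool_name: str, tool: Callable[..., Any], args: dict[str, Any]) -> Any:
        key = self._key(tool_name, args)
        if key in self._results:
            self.hits += 1  # identical call already made in this conversation
            return self._results[key]
        result = tool(**args)
        self._results[key] = result
        return result
```

Creating one cache instance per conversation keeps results from leaking across conversations and bounds how stale a cached result can get.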
FAQ
What chunk size should I use for RAG?
For technical text: 500–1,000 tokens. For marketing or brand voice content: 200–500 tokens. The rule of thumb: a chunk should be relevant and self-contained when read in isolation.
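A chunker that respects these ranges might look like the sketch below. It packs whole paragraphs up to the upper bound of the range so each chunk stays readable on its own. Two assumptions to note: whitespace-separated words are used as a rough proxy for model tokens (a real tokenizer will count differently), and the names `CHUNK_TOKEN_RANGES` and `chunk_text` are hypothetical.

```python
# Token ranges from the guidance above: (lower bound, upper bound).
CHUNK_TOKEN_RANGES = {
    "technical": (500, 1000),
    "marketing": (200, 500),
}


def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word approximates one token.
    return len(text.split())


def chunk_text(text: str, content_type: str) -> list[str]:
    """Greedily pack paragraphs into chunks no larger than the range's upper bound."""
    _, max_tokens = CHUNK_TOKEN_RANGES[content_type]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # Hard-split any paragraph that exceeds the budget on its own.
        pieces = [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
        for piece in pieces:
            piece_len = count_tokens(piece)
            if current and current_len + piece_len > max_tokens:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            current.append(piece)
            current_len += piece_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries rather than at a fixed offset is a design choice that serves the rule of thumb: a chunk that ends mid-sentence is rarely self-contained.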
Which embedding model is recommended in 2026?
OpenAI text-embedding-3-large for quality, or voyage-3 / Cohere Embed v3 for multilingual use cases with a better price-to-quality ratio. For non-English languages specifically, Voyage models frequently rank at or near the top of retrieval benchmarks; validate on a sample of your own data before committing.