A Visual Deep Dive

What is Retrieval‑Augmented Generation?

RAG is a technique that gives language models a long-term memory — letting them answer questions grounded in your documents, not just what they learned during training.

Explore the Pipeline
RAG at a glance
📄
Your Documents
PDFs, DOCX, text files
🔢
Chunked & Embedded
Split into pieces → vectors
🔍
Semantic Retrieval
Query finds relevant chunks
🤖
LLM Generates Answer
Grounded in retrieved context
Cited Response
Answer + source evidence
The Problem

LLMs have a knowledge problem

Large language models are trained on data up to a certain date. They can't read your internal documents, latest reports, or private knowledge bases. Ask them about your company's Q3 policy and they'll guess — or hallucinate.

The core limitation
"What is in our Q3 2024 earnings report?"

LLM alone: Makes up numbers. No access to your actual report.

LLM + RAG: Finds your actual report, reads the right sections, cites them.
R
Retrieval
Given a user query, search a vector database to find the most semantically similar document chunks. Not keyword search — meaning-based search.
A
Augmented
Take those retrieved chunks and add them to the LLM's prompt as context. The model now "sees" your documents before answering.
G
Generation
The LLM generates a response grounded in the retrieved context — not from memory or training data. Every claim is traceable.
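The three steps compose into a single loop. A minimal sketch in Python, where `search` and `llm` are stand-ins for a real vector-store query and a real model call:

```python
def rag_answer(question, search, llm, k=5):
    """Retrieval -> Augmentation -> Generation in one pass.

    `search` and `llm` are placeholders for a real vector-store
    query and a real LLM client call.
    """
    chunks = search(question, k)           # R: fetch relevant chunks
    context = "\n\n".join(chunks)          # A: add them to the prompt
    prompt = (f"Context:\n{context}\n\n"
              f"Question: {question}\n"
              "Answer only from the context above.")
    return llm(prompt)                     # G: grounded generation
```

With a real vector store and model client wired in, this loop is the whole pattern; the rest of the pipeline exists to make `search` return the right chunks.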
Step by Step

The RAG Pipeline

Click each stage to explore what happens under the hood — from raw documents to a grounded answer.

Stage 01
Document Ingestion
The pipeline starts by loading your raw documents. RAG is format-agnostic — PDFs, Word docs, plain text, HTML pages, code files — any source of knowledge can be indexed.
PDF Loader — extracts text from each page, preserves structure
DOCX Loader — reads Word documents including headings and paragraphs
Text Loader — handles plain text, markdown, CSV files
Metadata — source filename, page number, doc type are attached to every chunk
Documents loaded
PDF · report.pdf · 2.4 MB · 48 pages
DOCX · policy.docx · 890 KB · 22 pages
TXT · notes.txt · 14 KB · plain text
PDF · manual.pdf · 5.1 MB · 120 pages
DOCX · faq.docx · 340 KB · 8 pages
TXT · data.txt · 62 KB · CSV data
6 documents · 198 pages · ready to chunk
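The loading step can be sketched as follows. The `Document` shape and the toy loader are illustrative: a real pipeline would parse PDFs and Word files with libraries such as pypdf or python-docx and emit one record per page.

```python
from dataclasses import dataclass

@dataclass
class Document:
    """One loaded source: raw text plus the metadata that every
    chunk derived from it will inherit."""
    text: str
    source: str    # filename
    page: int      # page number within the source
    doc_type: str  # "pdf", "docx", "txt", ...

def load_plain_text(name, text):
    """Toy loader for plain text. Real loaders parse PDF/DOCX
    pages here and return one Document per page."""
    return [Document(text=text, source=name, page=1, doc_type="txt")]

docs = load_plain_text("notes.txt", "Meeting notes: review the Q3 report.")
```

The key point is the metadata: because source, page, and type ride along from the very first step, every later chunk stays traceable to where it came from.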
Stage 02
Chunking
Documents are too large to embed as a whole. They're split into smaller, overlapping pieces called chunks. Chunk size and overlap are critical tuning parameters.
Chunk size — typically 256–1024 tokens. Smaller = precise retrieval, larger = more context per result
Overlap — consecutive chunks share ~20% of content to prevent answers from being split across boundaries
Metadata preserved — each chunk remembers its source doc, page number, and position
Sentence-aware splitting — chunks never cut mid-sentence to preserve semantic meaning
One document → many chunks
report.pdf · page 4
"The Q3 revenue grew by 23% year-over-year, driven primarily by enterprise subscription growth in the APAC region. Customer retention improved to 94%..."
↓ split into chunks ↓
C-001
C-002
C-003
C-004
C-005
C-006
C-007
C-008
198 pages → 1,247 chunks · 512 tokens each · 64-token overlap
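The split above can be sketched with a word-based chunker. Words stand in for tokens here; production splitters count model tokens and avoid cutting mid-sentence.

```python
def chunk_text(text, chunk_size=32, overlap=8):
    """Split text into overlapping word-based chunks.

    Consecutive chunks share `overlap` words so an answer that
    straddles a boundary still appears whole in one chunk.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

With `chunk_size=32` and `overlap=8`, each new chunk starts 24 words after the previous one, so the last 8 words of chunk N reappear as the first 8 words of chunk N+1.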
Stage 03
Embedding
Each chunk is converted into a vector — a fixed-length list of numbers (often hundreds or thousands of dimensions) that captures its meaning. Chunks with similar meaning end up close together in vector space.
Embedding model — transforms text into a 1024-dimensional numeric vector
Semantic proximity — "revenue growth" and "sales increase" will have similar vectors
Stored in ChromaDB — vectors are persisted in a vector database for fast similarity search
One-time cost — indexing runs once. Querying reuses the stored vectors every time
Vector space visualization
Each dot = one chunk · proximity = semantic similarity
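The indexing step looks roughly like this. The `embed` function below is only a placeholder with the right *shape* (fixed-length, unit-normalized): a real learned model such as a sentence transformer is what makes paraphrases like "revenue growth" and "sales increase" land near each other, and the resulting vectors would be persisted in a vector database like ChromaDB rather than a Python list.

```python
import hashlib
import math

def embed(text, dim=16):
    """Placeholder embedding: hashed character trigrams,
    unit-normalized. Stands in for a learned embedding model."""
    vec = [0.0] * dim
    t = text.lower()
    for i in range(max(len(t) - 2, 0)):
        slot = int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16) % dim
        vec[slot] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Index once: every chunk is stored as (vector, metadata).
chunks = [("Q3 revenue grew 23% year-over-year.", "report.pdf", 4)]
index = [(embed(text), {"source": src, "page": page})
         for text, src, page in chunks]
```

This is the one-time cost the panel mentions: the loop runs once at indexing time, and every later query reuses the stored vectors.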
Stage 04
Retrieval
When a user asks a question, it's embedded using the same model. The vector database finds the chunks whose vectors are closest — the most semantically relevant passages.
Query embedding — the user's question is converted to a vector on the fly
Cosine similarity — scores each chunk by the cosine of the angle between its vector and the query vector
Top-K results — returns the K most similar chunks (e.g. top 5) to use as context
Multi-query — advanced RAG rewrites the question into several variants (e.g., using conversation history) to improve recall
Similarity search results
USER QUERY
"What was the revenue growth in Q3?"
"Q3 revenue grew by 23% year-over-year..." 94%
"Enterprise subscriptions drove Q3 growth..." 88%
"APAC region contributed 41% of new revenue..." 79%
"Customer retention reached 94% in Q3..." 71%
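The scoring step can be sketched in a few lines. A real vector database replaces this linear scan with an approximate nearest-neighbor index; the hand-made two-dimensional vectors are purely for illustration.

```python
def top_k(query_vec, index, k=5):
    """Rank stored chunks by cosine similarity to the query.
    Vectors are assumed unit-normalized, so the dot product
    is exactly the cosine of the angle between them."""
    scored = [(sum(q * v for q, v in zip(query_vec, vec)), meta)
              for vec, meta in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hand-made unit vectors standing in for real embeddings:
index = [
    ([1.0, 0.0], {"text": "Q3 revenue grew 23%..."}),
    ([0.6, 0.8], {"text": "Customer retention reached 94%..."}),
]
hits = top_k([1.0, 0.0], index, k=2)  # revenue chunk ranks first
```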
Stage 05
Generation
The retrieved chunks are assembled into a prompt and sent to the LLM. The model reads the context and generates an answer grounded entirely in what was retrieved — not from training memory.
Prompt construction — chunks are formatted as context blocks with source labels
Instruction — model is told to answer only from the provided context, not from training data
LLM reads chunks — the model synthesizes a coherent answer from multiple passages
Reduced hallucination — if the answer isn't in the retrieved chunks, the model is instructed to say so rather than guess
Prompt construction
RETRIEVED CONTEXT
[1] "Q3 revenue grew 23% YoY..." (report.pdf p.4)
[2] "Enterprise subs drove growth..." (report.pdf p.5)
[3] "APAC contributed 41%..." (report.pdf p.6)
+
USER QUESTION
"What was the revenue growth in Q3?"
↓ LLM generates ↓
"Based on the Q3 report, revenue grew 23% year-over-year, driven by enterprise subscription growth, particularly in the APAC region which contributed 41% of new revenue."
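Prompt assembly is plain string formatting. A sketch of the construction shown above — the instruction wording is illustrative, and exact phrasing varies by system:

```python
def build_prompt(question, hits):
    """Format retrieved chunks as numbered, source-labeled
    context blocks, then append the grounding instruction
    and the user's question."""
    context = "\n".join(
        f"[{i}] {h['text']} ({h['source']} p.{h['page']})"
        for i, h in enumerate(hits, start=1))
    return ("Answer ONLY from the context below. If the answer "
            "is not there, say so.\n\n"
            f"RETRIEVED CONTEXT:\n{context}\n\n"
            f"USER QUESTION: {question}")

prompt = build_prompt(
    "What was the revenue growth in Q3?",
    [{"text": "Q3 revenue grew 23% YoY...",
      "source": "report.pdf", "page": 4}])
```

The numbered labels matter: because each context block carries its source and page, the model can cite `[1]`, `[2]`, ... and every claim stays traceable.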
Stage 06
Response
The final response is returned to the user — not just the answer, but structured metadata about what was retrieved, from where, and how confident the system is.
answer — the generated response text from the LLM
evidence[] — the exact chunks used, with source document and similarity score
evidence_count — how many chunks were retrieved and used
Full traceability — every claim in the answer can be traced to a source page and document
API Response JSON
{
  "answer": "Revenue grew 23% YoY...",
  "evidence_count": 4,
  "vectors_searched": 1247,
  "evidence": [
    {
      "text": "Q3 revenue grew 23%...",
      "score": 0.94,
      "source": "report.pdf"
    }, ...
  ]
}
Every answer is grounded · Every source is cited
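Assembling that payload is the last, simplest step. A sketch mirroring the JSON example above — the field names follow that example and are illustrative, not a standard:

```python
def build_response(answer, evidence, vectors_searched):
    """Bundle the generated answer with the evidence chunks
    that support it, so the caller can verify every claim."""
    return {
        "answer": answer,
        "evidence_count": len(evidence),
        "vectors_searched": vectors_searched,
        "evidence": [
            {"text": e["text"],
             "score": round(e["score"], 2),
             "source": e["source"]}
            for e in evidence
        ],
    }

resp = build_response(
    "Revenue grew 23% YoY...",
    [{"text": "Q3 revenue grew 23%...", "score": 0.94,
      "source": "report.pdf"}],
    vectors_searched=1247)
```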
Comparison

RAG vs Vanilla LLM

Why not just ask ChatGPT directly? Here's exactly where RAG wins.

Without RAG
Vanilla LLM
Answers from training data only
  • No access to your private documents
  • Knowledge frozen at training cutoff date
  • Can hallucinate confident-sounding facts
  • No source citations — unverifiable
  • Can't answer about your internal data
  • One context window, no long-term memory
VS
With RAG
RAG System
Retrieval + generation combined
  • Reads your documents, PDFs, knowledge base
  • Always up to date — just re-index new docs
  • Grounded answers — only says what's in the docs
  • Full source citations with page numbers
  • Works on private, confidential, proprietary data
  • Conversation history improves retrieval quality
Applications

Where RAG Shines

Any domain where you need accurate, cited answers from a private knowledge base.

01
Enterprise Knowledge Base
Query internal wikis, HR policies, SOPs, and company documentation. New employees can ask questions and get precise, cited answers from official docs.
Internal Tools
02
Legal Document Analysis
Search across hundreds of contracts, case files, and regulations. Get specific clause references with exact page numbers and document sources.
Legal Tech
03
Medical Research Assistant
Query clinical studies, drug databases, and research papers. RAG ensures answers are grounded in peer-reviewed sources, not LLM guesses.
Healthcare
04
Customer Support Bot
Answer support tickets using your product documentation, FAQ, and release notes. Responses are accurate and point to the exact help article.
Customer Success
05
Financial Research
Analyze earnings reports, SEC filings, and financial models. RAG retrieves the exact figure from the right document — no hallucinated numbers.
Finance
06
Personal Document Assistant
Index your resume, notes, research papers, and project files. Ask "what did I write about X last year?" and get the exact passage back.
Personal Productivity