A Visual Deep Dive

What is Retrieval‑Augmented Generation?

RAG is a technique that gives language models a long-term memory — letting them answer questions grounded in your documents, not just what they learned during training.

Explore the Pipeline
RAG at a glance
📄
Your Documents
PDFs, DOCX, text files
🔢
Chunked & Embedded
Split into pieces → vectors
🔍
Semantic Retrieval
Query finds relevant chunks
🤖
LLM Generates Answer
Grounded in retrieved context
Cited Response
Answer + source evidence
The Problem

LLMs have a knowledge problem

Large language models are trained on data up to a certain date. They can't read your internal documents, latest reports, or private knowledge bases. Ask them about your company's Q3 policy and they'll guess — or hallucinate.

The core limitation
"What is in our Q3 2024 earnings report?"

LLM alone: Makes up numbers. No access to your actual report.

LLM + RAG: Finds your actual report, reads the right sections, cites them.
R
Retrieval
Given a user query, search a vector database to find the most semantically similar document chunks. Not keyword search — meaning-based search.
A
Augmented
Take those retrieved chunks and add them to the LLM's prompt as context. The model now "sees" your documents before answering.
G
Generation
The LLM generates a response grounded in the retrieved context — not from memory or training data. Every claim is traceable.
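The three steps compose into a single loop. A minimal sketch in Python, where `search` and `llm` are stand-ins for a real vector-store query and a real model call:

```python
def rag_answer(question, search, llm, k=5):
    """Retrieval -> Augmentation -> Generation in one pass.

    `search` and `llm` are placeholders for a real vector-store
    query and a real LLM client call.
    """
    chunks = search(question, k)           # R: fetch relevant chunks
    context = "\n\n".join(chunks)          # A: add them to the prompt
    prompt = (f"Context:\n{context}\n\n"
              f"Question: {question}\n"
              "Answer only from the context above.")
    return llm(prompt)                     # G: grounded generation
```

With a real vector store and model client wired in, this loop is the whole pattern; the rest of the pipeline exists to make `search` return the right chunks.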
Step by Step

The RAG Pipeline

Click each stage to explore what happens under the hood — from raw documents to a grounded answer.

Stage 01
Document Ingestion
The pipeline starts by loading your raw documents. RAG is format-agnostic — PDFs, Word docs, plain text, HTML pages, code files — any source of knowledge can be indexed.
PDF Loader — extracts text from each page, preserves structure
DOCX Loader — reads Word documents including headings and paragraphs
Text Loader — handles plain text, markdown, CSV files
Metadata — source filename, page number, doc type are attached to every chunk
Documents loaded
PDF · report.pdf · 2.4 MB · 48 pages
DOCX · policy.docx · 890 KB · 22 pages
TXT · notes.txt · 14 KB · plain text
PDF · manual.pdf · 5.1 MB · 120 pages
DOCX · faq.docx · 340 KB · 8 pages
TXT · data.txt · 62 KB · CSV data
6 documents · 198 pages · ready to chunk
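The loading step can be sketched as follows. The `Document` shape and the toy loader are illustrative: a real pipeline would parse PDFs and Word files with libraries such as pypdf or python-docx and emit one record per page.

```python
from dataclasses import dataclass

@dataclass
class Document:
    """One loaded source: raw text plus the metadata that every
    chunk derived from it will inherit."""
    text: str
    source: str    # filename
    page: int      # page number within the source
    doc_type: str  # "pdf", "docx", "txt", ...

def load_plain_text(name, text):
    """Toy loader for plain text. Real loaders parse PDF/DOCX
    pages here and return one Document per page."""
    return [Document(text=text, source=name, page=1, doc_type="txt")]

docs = load_plain_text("notes.txt", "Meeting notes: review the Q3 report.")
```

The key point is the metadata: because source, page, and type ride along from the very first step, every later chunk stays traceable to where it came from.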
Stage 02
Chunking
Documents are too large to embed as a whole. They're split into smaller, overlapping pieces called chunks. Chunk size and overlap are critical tuning parameters.
Chunk size — typically 256–1024 tokens. Smaller = precise retrieval, larger = more context per result
Overlap — consecutive chunks share ~20% of content to prevent answers from being split across boundaries
Metadata preserved — each chunk remembers its source doc, page number, and position
Sentence-aware splitting — chunks never cut mid-sentence to preserve semantic meaning
One document → many chunks
report.pdf · page 4
"The Q3 revenue grew by 23% year-over-year, driven primarily by enterprise subscription growth in the APAC region. Customer retention improved to 94%..."
↓ split into chunks ↓
C-001
C-002
C-003
C-004
C-005
C-006
C-007
C-008
198 pages → 1,247 chunks · 512 tokens each · 64-token overlap
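The split above can be sketched with a word-based chunker. Words stand in for tokens here; production splitters count model tokens and avoid cutting mid-sentence.

```python
def chunk_text(text, chunk_size=32, overlap=8):
    """Split text into overlapping word-based chunks.

    Consecutive chunks share `overlap` words so an answer that
    straddles a boundary still appears whole in one chunk.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

With `chunk_size=32` and `overlap=8`, each new chunk starts 24 words after the previous one, so the last 8 words of chunk N reappear as the first 8 words of chunk N+1.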
Stage 03
Embedding
Each chunk is converted into a vector — a fixed-length list of numbers (often hundreds or thousands of dimensions) that captures its meaning. Chunks with similar meaning end up close together in vector space.
Embedding model — transforms text into a 1024-dimensional numeric vector
Semantic proximity — "revenue growth" and "sales increase" will have similar vectors
Stored in ChromaDB — vectors are persisted in a vector database for fast similarity search
One-time cost — indexing runs once. Querying reuses the stored vectors every time
Vector space visualization
Each dot = one chunk · proximity = semantic similarity
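The indexing step looks roughly like this. The `embed` function below is only a placeholder with the right *shape* (fixed-length, unit-normalized): a real learned model such as a sentence transformer is what makes paraphrases like "revenue growth" and "sales increase" land near each other, and the resulting vectors would be persisted in a vector database like ChromaDB rather than a Python list.

```python
import hashlib
import math

def embed(text, dim=16):
    """Placeholder embedding: hashed character trigrams,
    unit-normalized. Stands in for a learned embedding model."""
    vec = [0.0] * dim
    t = text.lower()
    for i in range(max(len(t) - 2, 0)):
        slot = int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16) % dim
        vec[slot] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Index once: every chunk is stored as (vector, metadata).
chunks = [("Q3 revenue grew 23% year-over-year.", "report.pdf", 4)]
index = [(embed(text), {"source": src, "page": page})
         for text, src, page in chunks]
```

This is the one-time cost the panel mentions: the loop runs once at indexing time, and every later query reuses the stored vectors.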
Stage 04
Retrieval
When a user asks a question, it's embedded using the same model. The vector database finds the chunks whose vectors are closest — the most semantically relevant passages.
Query embedding — the user's question is converted to a vector on the fly
Cosine similarity — scores each chunk by the cosine of the angle between its vector and the query vector
Top-K results — returns the K most similar chunks (e.g. top 5) to use as context
Multi-query — advanced RAG rewrites the question into several variants (e.g., using conversation history) to improve recall
Similarity search results
USER QUERY
"What was the revenue growth in Q3?"
"Q3 revenue grew by 23% year-over-year..." 94%
"Enterprise subscriptions drove Q3 growth..." 88%
"APAC region contributed 41% of new revenue..." 79%
"Customer retention reached 94% in Q3..." 71%
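The scoring step can be sketched in a few lines. A real vector database replaces this linear scan with an approximate nearest-neighbor index; the hand-made two-dimensional vectors are purely for illustration.

```python
def top_k(query_vec, index, k=5):
    """Rank stored chunks by cosine similarity to the query.
    Vectors are assumed unit-normalized, so the dot product
    is exactly the cosine of the angle between them."""
    scored = [(sum(q * v for q, v in zip(query_vec, vec)), meta)
              for vec, meta in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hand-made unit vectors standing in for real embeddings:
index = [
    ([1.0, 0.0], {"text": "Q3 revenue grew 23%..."}),
    ([0.6, 0.8], {"text": "Customer retention reached 94%..."}),
]
hits = top_k([1.0, 0.0], index, k=2)  # revenue chunk ranks first
```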
Stage 05
Generation
The retrieved chunks are assembled into a prompt and sent to the LLM. The model reads the context and generates an answer grounded entirely in what was retrieved — not from training memory.
Prompt construction — chunks are formatted as context blocks with source labels
Instruction — model is told to answer only from the provided context, not from training data
LLM reads chunks — the model synthesizes a coherent answer from multiple passages
Reduced hallucination — if the answer isn't in the retrieved chunks, the model is instructed to say so rather than guess
Prompt construction
RETRIEVED CONTEXT
[1] "Q3 revenue grew 23% YoY..." (report.pdf p.4)
[2] "Enterprise subs drove growth..." (report.pdf p.5)
[3] "APAC contributed 41%..." (report.pdf p.6)
+
USER QUESTION
"What was the revenue growth in Q3?"
↓ LLM generates ↓
"Based on the Q3 report, revenue grew 23% year-over-year, driven by enterprise subscription growth, particularly in the APAC region which contributed 41% of new revenue."
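Prompt assembly is plain string formatting. A sketch of the construction shown above — the instruction wording is illustrative, and exact phrasing varies by system:

```python
def build_prompt(question, hits):
    """Format retrieved chunks as numbered, source-labeled
    context blocks, then append the grounding instruction
    and the user's question."""
    context = "\n".join(
        f"[{i}] {h['text']} ({h['source']} p.{h['page']})"
        for i, h in enumerate(hits, start=1))
    return ("Answer ONLY from the context below. If the answer "
            "is not there, say so.\n\n"
            f"RETRIEVED CONTEXT:\n{context}\n\n"
            f"USER QUESTION: {question}")

prompt = build_prompt(
    "What was the revenue growth in Q3?",
    [{"text": "Q3 revenue grew 23% YoY...",
      "source": "report.pdf", "page": 4}])
```

The numbered labels matter: because each context block carries its source and page, the model can cite `[1]`, `[2]`, ... and every claim stays traceable.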
Stage 06
Response
The final response is returned to the user — not just the answer, but structured metadata about what was retrieved, from where, and how confident the system is.
answer — the generated response text from the LLM
evidence[] — the exact chunks used, with source document and similarity score
evidence_count — how many chunks were retrieved and used
Full traceability — every claim in the answer can be traced to a source page and document
API Response JSON
{
  "answer": "Revenue grew 23% YoY...",
  "evidence_count": 4,
  "vectors_searched": 1247,
  "evidence": [
    {
      "text": "Q3 revenue grew 23%...",
      "score": 0.94,
      "source": "report.pdf"
    }, ...
  ]
}
Every answer is grounded · Every source is cited
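Assembling that payload is the last, simplest step. A sketch mirroring the JSON example above — the field names follow that example and are illustrative, not a standard:

```python
def build_response(answer, evidence, vectors_searched):
    """Bundle the generated answer with the evidence chunks
    that support it, so the caller can verify every claim."""
    return {
        "answer": answer,
        "evidence_count": len(evidence),
        "vectors_searched": vectors_searched,
        "evidence": [
            {"text": e["text"],
             "score": round(e["score"], 2),
             "source": e["source"]}
            for e in evidence
        ],
    }

resp = build_response(
    "Revenue grew 23% YoY...",
    [{"text": "Q3 revenue grew 23%...", "score": 0.94,
      "source": "report.pdf"}],
    vectors_searched=1247)
```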
Comparison

RAG vs Vanilla LLM

Why not just ask ChatGPT directly? Here's exactly where RAG wins.

Without RAG
Vanilla LLM
Answers from training data only
  • No access to your private documents
  • Knowledge frozen at training cutoff date
  • Can hallucinate confident-sounding facts
  • No source citations — unverifiable
  • Can't answer about your internal data
  • One context window, no long-term memory
VS
With RAG
RAG System
Retrieval + generation combined
  • Reads your documents, PDFs, knowledge base
  • Always up to date — just re-index new docs
  • Grounded answers — only says what's in the docs
  • Full source citations with page numbers
  • Works on private, confidential, proprietary data
  • Conversation history improves retrieval quality
Applications

Where RAG Shines

Any domain where you need accurate, cited answers from a private knowledge base.

01
Enterprise Knowledge Base
Query internal wikis, HR policies, SOPs, and company documentation. New employees can ask questions and get precise, cited answers from official docs.
Internal Tools
02
Legal Document Analysis
Search across hundreds of contracts, case files, and regulations. Get specific clause references with exact page numbers and document sources.
Legal Tech
03
Medical Research Assistant
Query clinical studies, drug databases, and research papers. RAG ensures answers are grounded in peer-reviewed sources, not LLM guesses.
Healthcare
04
Customer Support Bot
Answer support tickets using your product documentation, FAQ, and release notes. Responses are accurate and point to the exact help article.
Customer Success
05
Financial Research
Analyze earnings reports, SEC filings, and financial models. RAG retrieves the exact figure from the right document — no hallucinated numbers.
Finance
06
Personal Document Assistant
Index your resume, notes, research papers, and project files. Ask "what did I write about X last year?" and get the exact passage back.
Personal Productivity