# RAG Knowledge Base Explained: WhatsApp AI Context Retrieval
When AI Doesn't Know Your Business
A customer messages your WhatsApp bot: "What's your refund policy?" The AI responds with: "Most companies offer a 30-day refund window. You'll typically need your receipt and the item in original condition."
That answer is confidently wrong. Your actual policy is a 60-day money-back guarantee with no questions asked. The AI didn't lie on purpose. It just doesn't know your business. It generated a plausible-sounding answer based on its training data, which includes thousands of generic refund policies but none of yours.
This is the core problem with using LLMs directly for customer support. They're excellent at language but ignorant about your specific documentation, pricing, policies, and product details. The solution isn't fine-tuning (expensive, slow, requires retraining on every doc update). The solution is RAG: Retrieval-Augmented Generation.
RAG gives your AI access to your actual documents. Instead of guessing, it searches your knowledge base, retrieves the relevant sections, and generates a response grounded in your real content. The same refund question now returns: "We offer a 60-day money-back guarantee with no questions asked. To request a refund, email [email protected] or reply here and I'll connect you with our team."
This post goes deep into how RAG works, from document ingestion to vector search to response generation. You'll understand every stage of the pipeline, compare chunking strategies and embedding models, and see MoltFlow's implementation with working code examples.
What Is RAG?
Think of RAG like a librarian. When you ask a question, the librarian doesn't try to answer from memory. They walk to the relevant shelf, pull the right books, read the relevant sections, and then give you an answer based on what they found. RAG works the same way, in three phases.
Phase 1: Index (happens once). Convert your documents into a searchable format. This is like cataloging every book in the library, creating an index card for each section so you can find it quickly later.
Phase 2: Retrieve (per query). When a question arrives, search your indexed documents for the most relevant sections. This is the librarian walking to the shelf and pulling the right books.
Phase 3: Generate (per query). Feed the retrieved sections plus the original question to the AI model. The model reads the context and crafts an answer based on your actual documentation, not its training data.
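In code, those three phases boil down to something like the sketch below. This is a conceptual sketch with illustrative names, not MoltFlow's implementation; embed and llm stand in for whatever embedding model and chat model you use.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def index(documents, embed):
    # Phase 1: split each document into chunks and embed them once.
    chunks = [p for doc in documents for p in doc.split("\n\n") if p.strip()]
    return [(c, embed(c)) for c in chunks]

def retrieve(question, indexed, embed, top_k=3):
    # Phase 2: rank stored chunks by similarity to the question.
    q = embed(question)
    ranked = sorted(indexed, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def generate(question, chunks, llm):
    # Phase 3: answer grounded in the retrieved context only.
    context = "\n---\n".join(chunks)
    return llm(f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}")
```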
Here's the difference in practice:
Without RAG (standard AI):
User: "What's your refund policy?"
AI: "Most companies offer 30-day refunds..." (generic, possibly wrong)With RAG:
User: "What's your refund policy?"
[System searches knowledge base -> finds refund-policy.pdf, page 3]
AI: "We offer a 60-day money-back guarantee with no questions asked.
To request a refund, email [email protected]." (your actual policy)

The difference is trust. With RAG, every answer traces back to a source document. You can verify it. Your customers can trust it. And when your policies change, you update the document and the AI's answers change immediately, no retraining required.
RAG Pipeline Architecture
Here's the complete pipeline from document upload to AI response:
Document Upload (PDF, TXT, DOCX)
|
v
Text Extraction (PDF -> plain text)
|
v
Chunking (split into 500-token segments with overlap)
|
v
Embedding Generation (text -> 1536-dimension vector)
|
v
Vector Storage (PostgreSQL + pgvector)
|
v
[User Query] -> Query Embedding -> Cosine Similarity Search -> Top 3 Chunks
|
v
Context Injection (chunks + question -> AI prompt)
|
v
AI Model -> Response (grounded in your documents)

Let's walk through each stage.
Stage 1: Document Ingestion
The pipeline starts when you upload a document. MoltFlow accepts PDF, TXT, DOCX, and Markdown files. The ingestion process extracts raw text, normalizes whitespace, strips headers and footers, and prepares the content for chunking.
# MoltFlow ingestion pipeline (simplified)
@app.post("/api/v2/ai/knowledge/ingest")
async def ingest_document(file: UploadFile):
text = extract_text(file) # PDF/DOCX/TXT extraction
chunks = chunk_text(text, max_tokens=500)
embeddings = embed_chunks(chunks)
store_in_db(chunks, embeddings)
return {"chunks": len(chunks), "status": "indexed"}Text extraction sounds simple but has nuances. PDFs with columns, tables, or images require careful parsing to maintain reading order. MoltFlow uses PyPDF2 for standard PDFs and python-docx for Word documents, with fallback to raw text extraction for edge cases.
Stage 2: Chunking
You can't feed an entire 100-page manual into an AI prompt. Context windows are limited, and even with models that support 128k+ tokens, stuffing too much irrelevant text degrades response quality. Chunking solves this by splitting documents into smaller, semantically coherent segments.
MoltFlow defaults to 500 tokens per chunk with 50 tokens of overlap between adjacent chunks. The overlap ensures that concepts split across chunk boundaries still appear together in at least one chunk.
Why 500 tokens? It's a sweet spot. Smaller chunks (200-300 tokens) lose context. Larger chunks (800-1000 tokens) include too much irrelevant content alongside the relevant bits. At 500 tokens, you get roughly 1-2 paragraphs of coherent content, enough to answer most questions without noise.
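To see where your own documents land, count tokens with the same tokenizer family your embedding model uses. A quick sketch with the tiktoken library (cl100k_base is the encoding behind OpenAI's text-embedding-3 models):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by OpenAI's text-embedding-3 models

paragraph = (
    "Our refund policy allows customers to request a full refund within "
    "60 days of purchase. No questions asked."
)
tokens = enc.encode(paragraph)
print(len(tokens))  # prints the token count for this short paragraph
```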
We'll go deeper on chunking strategies in a dedicated section below.
Stage 3: Embedding Generation
An embedding is a vector representation of a text's meaning. It converts human-readable text into a list of numbers (1,536 dimensions for OpenAI's text-embedding-3-small) where similar meanings produce similar vectors.
For example, "How do I reset my password?" and "I forgot my login credentials" would produce vectors that are close together in the 1,536-dimensional space, even though they share zero words. That's the power of semantic similarity over keyword matching.
from openai import OpenAI
client = OpenAI(api_key=API_KEY)
def embed_text(text: str) -> list[float]:
response = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
    return response.data[0].embedding  # [0.023, -0.045, 0.112, ...]

Each chunk gets embedded once during ingestion. User queries get embedded at search time. The embedding model must be the same for both, as vectors from different models aren't comparable.
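You can verify the password example above by embedding both phrasings and comparing them. A small sketch, assuming numpy is installed and reusing embed_text from the snippet above:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q1 = embed_text("How do I reset my password?")
q2 = embed_text("I forgot my login credentials")
q3 = embed_text("What are your shipping rates to Canada?")

print(cosine_similarity(q1, q2))  # noticeably higher...
print(cosine_similarity(q1, q3))  # ...than this unrelated pair
```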
Stage 4: Vector Storage
MoltFlow stores embeddings in PostgreSQL using the pgvector extension. This keeps everything in a single database, no separate vector store to manage. Here's the schema:
CREATE TABLE knowledge_chunks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL,
document_id UUID NOT NULL,
content TEXT NOT NULL,
embedding vector(1536), -- pgvector column type
chunk_index INT,
metadata JSONB,
created_at TIMESTAMP DEFAULT now()
);
-- IVFFlat index for approximate nearest neighbor search
CREATE INDEX ON knowledge_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

The IVFFlat index enables fast approximate nearest neighbor search. For knowledge bases under 100,000 chunks, this delivers sub-10ms search times. For larger datasets, HNSW indexing provides better recall at the cost of more memory.
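Writing a chunk and its embedding into that table from Python is a single INSERT. A sketch using psycopg, reusing embed_text from Stage 3 (the DSN and IDs are placeholders):

```python
import psycopg

def store_chunk(conn: psycopg.Connection, tenant_id: str, document_id: str,
                content: str, embedding: list[float], chunk_index: int) -> None:
    # pgvector accepts a '[x,y,z,...]' literal cast to the vector type.
    vector_literal = "[" + ",".join(f"{x:.6f}" for x in embedding) + "]"
    conn.execute(
        """
        INSERT INTO knowledge_chunks (tenant_id, document_id, content, embedding, chunk_index)
        VALUES (%s, %s, %s, %s::vector, %s)
        """,
        (tenant_id, document_id, content, vector_literal, chunk_index),
    )

with psycopg.connect("postgresql://localhost/moltflow") as conn:  # placeholder DSN
    chunk = "Our refund policy allows customers to request a full refund within 60 days of purchase."
    store_chunk(conn, tenant_id="...", document_id="...",
                content=chunk, embedding=embed_text(chunk), chunk_index=0)
    conn.commit()
```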
Stage 5: Vector Search
When a user asks a question, the system converts the query to a vector using the same embedding model, then finds the closest matching chunks using cosine similarity:
SELECT content,
metadata,
1 - (embedding <=> $2::vector) AS similarity
FROM knowledge_chunks
WHERE tenant_id = $1
AND 1 - (embedding <=> $2::vector) > 0.7
ORDER BY embedding <=> $2::vector
LIMIT 3;

The <=> operator is pgvector's cosine distance operator. We filter for similarity above 0.7 (70% match) to avoid injecting irrelevant context. The query returns the top 3 most relevant chunks, ranked by semantic similarity to the user's question.
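Wired into application code, the same search might look like the sketch below, again using psycopg and the embed_text helper from Stage 3 (names and connection handling are illustrative):

```python
def search_chunks(conn, tenant_id: str, query: str, top_k: int = 3, min_sim: float = 0.7):
    # Embed the query with the SAME model used at ingestion time.
    q = "[" + ",".join(f"{x:.6f}" for x in embed_text(query)) + "]"
    rows = conn.execute(
        """
        SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
        FROM knowledge_chunks
        WHERE tenant_id = %s
          AND 1 - (embedding <=> %s::vector) > %s
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (q, tenant_id, q, min_sim, q, top_k),
    ).fetchall()
    return rows  # list of (content, metadata, similarity) tuples
```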
Stage 6: Context Injection
The retrieved chunks are combined with the user's question in a structured prompt template:
You are a customer support assistant. Answer using ONLY the context below.
If the context doesn't contain the answer, say "I don't have that information"
and offer to connect the customer with a human agent.
Context:
---
[Chunk 1: Our refund policy allows customers to request a full refund
within 60 days of purchase. No questions asked. Contact support@...]
---
[Chunk 2: For returns of physical products, items must be in original
packaging. Digital products are refund-eligible within 14 days...]
---
[Chunk 3: Refund processing takes 5-7 business days. Credit card
refunds appear on your next statement...]
---
Customer question: What's your refund policy?
Answer:

The AI model reads the provided context, understands the question, and generates a response grounded in your actual documentation. The "ONLY the context below" instruction prevents the model from falling back to its training data when the context doesn't cover the question.
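Assembling that prompt is plain string templating. A minimal sketch (the wording mirrors the template above; the helper name is illustrative):

```python
SYSTEM_TEMPLATE = (
    "You are a customer support assistant. Answer using ONLY the context below.\n"
    "If the context doesn't contain the answer, say \"I don't have that information\"\n"
    "and offer to connect the customer with a human agent.\n\n"
    "Context:\n{context}\n\n"
    "Customer question: {question}\nAnswer:"
)

def build_prompt(question: str, chunks: list[str]) -> str:
    # Join retrieved chunks with visible separators so the model can tell them apart.
    context = "\n---\n".join(chunks)
    return SYSTEM_TEMPLATE.format(context=context, question=question)
```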
Chunking Strategies Deep Dive
Chunking is where most RAG implementations succeed or fail. The same document, chunked differently, can produce wildly different retrieval quality. Here are three approaches, from simple to sophisticated.
Fixed-Size Chunking
The simplest approach: split text every N tokens regardless of content boundaries.
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
words = text.split()
chunks = []
for i in range(0, len(words), size - overlap):
chunk = ' '.join(words[i:i + size])
if chunk.strip():
chunks.append(chunk)
    return chunks

Pros: Simple to implement, predictable chunk count, consistent chunk sizes.
Cons: Breaks mid-sentence, splits paragraphs arbitrarily, can separate a question from its answer in FAQ documents.
Best for: Uniform documents like CSV exports, server logs, or structured data where sentence boundaries don't matter.
Sentence-Aware Chunking (MoltFlow Default)
Split at sentence boundaries, accumulating sentences until the chunk reaches the target size. This preserves semantic coherence within each chunk.
import nltk
nltk.download('punkt_tab')
def chunk_sentences(text: str, max_tokens: int = 500) -> list[str]:
sentences = nltk.sent_tokenize(text)
chunks, current = [], []
token_count = 0
for sentence in sentences:
sent_tokens = len(sentence.split())
if token_count + sent_tokens > max_tokens and current:
chunks.append(' '.join(current))
current = [sentence]
token_count = sent_tokens
else:
current.append(sentence)
token_count += sent_tokens
if current:
chunks.append(' '.join(current))
    return chunks

Pros: Preserves sentence integrity, better semantic coherence, natural boundaries.
Cons: Variable chunk sizes (some chunks may end up very small when a long sentence forces an early split), and it requires a sentence detection library.
Best for: Documentation, articles, user manuals, FAQ pages, anything written in prose.
Semantic Chunking (Advanced)
Uses embeddings to detect topic changes within a document. When the cosine similarity between consecutive sentences drops below a threshold, a new chunk begins. This produces topically coherent segments regardless of length.
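A minimal sketch of the idea, not MoltFlow's implementation: embed each sentence, then start a new chunk whenever the similarity between consecutive sentences drops below a threshold. It reuses embed_text from Stage 3 and NLTK's sentence splitter, and assumes numpy is installed:

```python
import numpy as np
import nltk

def chunk_semantic(text: str, threshold: float = 0.75) -> list[str]:
    sentences = nltk.sent_tokenize(text)
    if not sentences:
        return []
    vectors = [np.array(embed_text(s)) for s in sentences]  # one embedding per sentence
    chunks, current = [], [sentences[0]]
    for prev, curr, sent in zip(vectors, vectors[1:], sentences[1:]):
        sim = float(np.dot(prev, curr) / (np.linalg.norm(prev) * np.linalg.norm(curr)))
        if sim < threshold:          # topic shift detected -> close the current chunk
            chunks.append(' '.join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(' '.join(current))
    return chunks
```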
Pros: Highest retrieval quality because each chunk covers a single topic. No information bleed across topics within the same chunk.
Cons: Requires embedding every sentence during ingestion (expensive and slow). Complex to implement. Chunk sizes are unpredictable.
Best for: Multi-topic documents like annual reports, comprehensive guides, or policy manuals where a single page might cover multiple unrelated subjects.
When to Use Each Strategy
| Document Type | Strategy | Reasoning |
|---|---|---|
| FAQ pages | Sentence-aware | Each Q&A pair is standalone |
| User manuals | Sentence-aware | Step-by-step instructions benefit from sentence boundaries |
| Legal contracts | Fixed-size | Consistent structure, exact text preservation matters |
| Knowledge base articles | Semantic | Multiple topics per article need topic-level segmentation |
| Product catalogs | Fixed-size | Structured, repetitive entries with predictable format |
| Support transcripts | Sentence-aware | Conversational flow follows sentence patterns |
MoltFlow uses sentence-aware chunking by default because it works well across the widest range of document types. You can switch strategies per document via the ingestion API.
Embedding Model Comparison
The embedding model you choose directly affects retrieval quality. A poor embedding model means the system retrieves irrelevant chunks, and even the best LLM can't generate good answers from bad context.
Here are the four most practical options for production RAG systems:
text-embedding-3-small (OpenAI) is MoltFlow's default. It produces 1,536-dimensional vectors at $0.02 per 1M tokens, with 62.3% on the MTEB benchmark. Latency is approximately 50ms per batch. This is the best balance of quality, cost, and speed for most production workloads.
text-embedding-3-large (OpenAI) produces 3,072-dimensional vectors at $0.13 per 1M tokens, scoring 64.6% on MTEB. The higher dimensionality captures finer semantic distinctions, making it better for domains where precise retrieval matters (legal, medical, technical). The tradeoff is 6.5x higher cost and larger index sizes.
all-MiniLM-L6-v2 (Sentence Transformers) is open-source and free to self-host. It produces 384-dimensional vectors with 58.8% on MTEB. Lower dimensionality means smaller indexes and faster search, but retrieval quality drops noticeably on nuanced queries. Best for privacy-sensitive deployments or extremely high volume where API costs are prohibitive.
voyage-2 (Voyage AI) produces 1,024-dimensional vectors at $0.12 per 1M tokens, scoring 63.8% on MTEB. It's a strong alternative to OpenAI's models, especially for code-heavy knowledge bases where Voyage's training data gives it an edge.
| Model | Dimensions | Cost/1M Tokens | MTEB Score | Latency (batch) |
|---|---|---|---|---|
| text-embedding-3-small | 1,536 | $0.02 | 62.3% | ~50ms |
| text-embedding-3-large | 3,072 | $0.13 | 64.6% | ~120ms |
| all-MiniLM-L6-v2 | 384 | FREE | 58.8% | ~80ms |
| voyage-2 | 1,024 | $0.12 | 63.8% | ~90ms |
For most WhatsApp automation use cases, text-embedding-3-small is the right choice. You'd only upgrade to text-embedding-3-large if you're seeing retrieval quality issues with domain-specific queries after optimizing chunk size and overlap.
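If you do go the self-hosted route, the sentence-transformers library keeps the code short. A minimal sketch with all-MiniLM-L6-v2 (the example chunks are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # downloads the model weights on first use

chunks = [
    "Our refund policy allows a full refund within 60 days of purchase.",
    "Refund processing takes 5-7 business days.",
]
chunk_vectors = model.encode(chunks)                       # shape: (2, 384)
query_vector = model.encode("What's your refund policy?")  # shape: (384,)

scores = util.cos_sim(query_vector, chunk_vectors)  # cosine similarity per chunk
print(scores)
```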
MoltFlow Knowledge Base Implementation
MoltFlow's RAG system is built-in. You don't need to manage embeddings, vector databases, or retrieval logic separately. Here's how to use it.
Upload a Document
curl -X POST https://apiv2.waiflow.app/api/v2/ai/knowledge/ingest \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-F "[email protected]" \
-F "metadata={\"category\": \"support\", \"version\": \"2.1\"}"Response:
{
"document_id": "doc_abc123",
"chunks_created": 42,
"embedding_model": "text-embedding-3-small",
"processing_time_ms": 3200
}

The document is chunked, embedded, and indexed automatically. Processing time depends on document size: a 10-page PDF typically takes 2-4 seconds.
Search the Knowledge Base
Test retrieval manually before wiring it into your bot:
curl -X POST https://apiv2.waiflow.app/api/v2/ai/knowledge/search \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"query": "refund policy",
"limit": 3,
"min_similarity": 0.7
}'

Response:
{
"results": [
{
"chunk_id": "chunk_xyz789",
"content": "Our refund policy allows 60 days...",
"similarity": 0.89,
"document_id": "doc_abc123",
"metadata": {"category": "support", "page": 12}
}
]
}

The similarity score (0.89 in this case) tells you how closely the chunk matches the query. Anything above 0.8 is a strong match. Between 0.7 and 0.8 is acceptable. Below 0.7, the chunk is likely not relevant enough to use as context.
Generate an AI Response with RAG
This is the endpoint that powers your WhatsApp bot. It combines retrieval and generation in a single call:
curl -X POST https://apiv2.waiflow.app/api/v2/ai/generate \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"session_name": "support-bot",
"message": "What is your refund policy?",
"use_rag": true,
"rag_config": {
"top_k": 3,
"min_similarity": 0.7,
"document_filter": {"category": "support"}
}
}'

Response:
{
"reply": "We offer a 60-day money-back guarantee with no questions asked. To request a refund, email [email protected] or reply here and I'll connect you with our team.",
"rag_sources": [
{"document_id": "doc_abc123", "page": 12, "similarity": 0.89}
],
"model": "gpt-4o",
"tokens_used": 450
}

The rag_sources array lets you audit which documents influenced the response. The document_filter parameter limits retrieval to specific document categories, useful when you have separate knowledge bases for support, sales, and technical documentation.
Common RAG Pitfalls and Solutions
After deploying RAG for hundreds of MoltFlow users, we've identified the five most common failure modes and how to fix them.
Pitfall 1: Chunks Too Small
Symptom: The AI retrieves sentence fragments without enough context to generate useful answers. Responses are vague or incomplete.
Example: A chunk containing just "within 60 days of purchase" gets retrieved for a refund question, but without the surrounding context about how to initiate the refund or what conditions apply.
Fix: Increase chunk size to 500-800 tokens and ensure overlap is at least 50 tokens. Sentence-aware chunking prevents splitting mid-thought.
Pitfall 2: Chunks Too Large
Symptom: Retrieved chunks contain relevant information buried in irrelevant text. The AI either misses the key detail or gets confused by contradictory information within the same chunk.
Fix: Decrease to 300-500 tokens. If your documents cover multiple topics per page, consider semantic chunking to split at topic boundaries.
Pitfall 3: Wrong Embedding Model
Symptom: Search returns chunks that are lexically similar but semantically different. A query like "Apple revenue Q4" matches chunks about an apple-growing business's revenue instead of Apple (the tech company).
Fix: Use text-embedding-3-small for general text. For specialized domains (code, legal, medical), test domain-specific models like voyage-2 or specialized fine-tunes.
Pitfall 4: Stale Knowledge Base
Symptom: AI answers based on outdated policies, old pricing, or deprecated features. Customers receive incorrect information even though you updated the documentation weeks ago.
Fix: Version your documents. Set up a re-ingestion pipeline that runs when source documents change. MoltFlow supports document replacement: upload a new version with the same document_id to overwrite the previous embeddings.
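One simple way to automate re-ingestion, sketched below with the requests library: hash the source file and re-upload through the ingest endpoint only when the content actually changed. The hash storage, token handling, and the source_hash metadata field are all placeholders you'd adapt to your own pipeline:

```python
import hashlib
import json
import requests

def reingest_if_changed(path: str, last_hash: str | None, api_token: str) -> str:
    # Returns the current hash so the caller can persist it for the next run.
    data = open(path, "rb").read()
    current_hash = hashlib.sha256(data).hexdigest()
    if current_hash == last_hash:
        return current_hash  # unchanged -> nothing to do

    requests.post(
        "https://apiv2.waiflow.app/api/v2/ai/knowledge/ingest",
        headers={"Authorization": f"Bearer {api_token}"},
        files={"file": (path, data)},
        data={"metadata": json.dumps({"category": "support", "source_hash": current_hash})},
        timeout=60,
    ).raise_for_status()
    return current_hash
```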
Pitfall 5: No Fallback When RAG Fails
Symptom: When no relevant chunks are found (similarity below threshold), the AI either says nothing useful or falls back to its training data and generates a generic (potentially wrong) answer.
Fix: Implement explicit fallback logic:
results = search_knowledge_base(query, min_similarity=0.7)
if not results or results[0].similarity < 0.7:
response = (
"I don't have specific information on that in our documentation. "
"Let me connect you with our team who can help."
)
escalate_to_human(sender_id)
else:
    response = generate_with_rag(query, results)

The key insight: admitting "I don't know" is always better than confidently stating something wrong. Configure your system prompt to prefer escalation over fabrication, and set the similarity threshold high enough (0.7 minimum) to filter out irrelevant context.
Performance and Scaling
RAG adds latency to every AI response. Here's what to expect and how to optimize.
Embedding generation: 30-80ms per query (depends on embedding model and text length).
Vector search: 5-15ms for knowledge bases under 100,000 chunks with IVFFlat indexing. Search time grows linearly without an index and sub-linearly with one.
Total overhead: 50-100ms added to each response, well within the 3-second acceptable latency for WhatsApp.
For larger knowledge bases (500k+ chunks), consider HNSW indexing instead of IVFFlat. HNSW provides better recall at the cost of higher memory usage, roughly 2-3x more RAM per index. MoltFlow switches to HNSW automatically when chunk count exceeds the IVFFlat sweet spot.
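For reference, switching to HNSW in pgvector (0.5.0 or later) is a single index definition. A sketch run through psycopg, with m and ef_construction at commonly used starting values (the DSN is a placeholder):

```python
import psycopg

HNSW_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS knowledge_chunks_embedding_hnsw
ON knowledge_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""

with psycopg.connect("postgresql://localhost/moltflow") as conn:  # placeholder DSN
    conn.execute(HNSW_INDEX_SQL)
    conn.commit()
```

Building an HNSW index on a large table can take a while, so treat it as a maintenance operation rather than something you run in the request path.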
Further Reading
The RAG paradigm was formally introduced in "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Lewis et al. (2020). The paper demonstrated that combining retrieval with generation significantly outperforms pure generative models on knowledge-intensive benchmarks. Since then, RAG has become the standard architecture for grounding LLMs in domain-specific knowledge, and the core insight, retrieve then generate, remains the foundation of every production RAG system today.
What's Next?
RAG transforms your AI from a generic chatbot into a domain expert grounded in your actual documentation. The key takeaways: chunk at sentence boundaries with 500 tokens as your starting point. Use OpenAI's text-embedding-3-small for production. Version your knowledge base and re-ingest when documents change. Always implement fallback logic for low-confidence retrievals.
Implementation guides:
- Build a Knowledge Base AI — Step-by-step RAG setup with document upload and testing
- AI Auto-Replies Setup Guide — Combine RAG with intelligent auto-responses
- AI Model Comparison for WhatsApp Bots — Choose the right embedding model for your use case
Advanced topics:
- Train AI Writing Style — Combine RAG accuracy with Learn Mode personality
- WhatsApp 2026 AI Compliance — Keep your RAG bot compliant with Meta policies
Ready to eliminate AI hallucinations? MoltFlow's RAG is built-in — upload PDFs and start answering with your actual documentation. No vector database to manage, no embedding pipeline to build, no infrastructure to maintain. Start your 14-day free trial and test RAG with your business documents risk-free.
> Try MoltFlow Free — 100 messages/month