
# RAG Knowledge Base Explained: WhatsApp AI Context Retrieval

When AI Doesn't Know Your Business

A customer messages your WhatsApp bot: "What's your refund policy?" The AI responds with: "Most companies offer a 30-day refund window. You'll typically need your receipt and the item in original condition."

That answer is confidently wrong. Your actual policy is a 60-day money-back guarantee with no questions asked. The AI didn't lie on purpose. It just doesn't know your business. It generated a plausible-sounding answer based on its training data, which includes thousands of generic refund policies but none of yours.

This is the core problem with using LLMs directly for customer support. They're excellent at language but ignorant about your specific documentation, pricing, policies, and product details. The solution isn't fine-tuning (expensive, slow, requires retraining on every doc update). The solution is RAG: Retrieval-Augmented Generation.

RAG gives your AI access to your actual documents. Instead of guessing, it searches your knowledge base, retrieves the relevant sections, and generates a response grounded in your real content. The same refund question now returns: "We offer a 60-day money-back guarantee with no questions asked. To request a refund, email [email protected] or reply here and I'll connect you with our team."

This post goes deep into how RAG works, from document ingestion to vector search to response generation. You'll understand every stage of the pipeline, compare chunking strategies and embedding models, and see MoltFlow's implementation with working code examples.

What Is RAG?

Think of RAG like a librarian. When you ask a question, the librarian doesn't try to answer from memory. They walk to the relevant shelf, pull the right books, read the relevant sections, and then give you an answer based on what they found. RAG works the same way, in three phases.

Phase 1: Index (happens once). Convert your documents into a searchable format. This is like cataloging every book in the library, creating an index card for each section so you can find it quickly later.

Phase 2: Retrieve (per query). When a question arrives, search your indexed documents for the most relevant sections. This is the librarian walking to the shelf and pulling the right books.

Phase 3: Generate (per query). Feed the retrieved sections plus the original question to the AI model. The model reads the context and crafts an answer based on your actual documentation, not its training data.
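
Here's the whole loop as a self-contained toy, about 30 lines of Python. To be clear, this is a sketch rather than MoltFlow's implementation: it calls OpenAI directly, keeps the "index" in a plain Python list, and the two document snippets are made up for illustration.

python
# Toy RAG: index two snippets, retrieve the best match, generate an answer.
# Requires OPENAI_API_KEY in the environment; not production code.
import math
from openai import OpenAI

client = OpenAI()

docs = [
    "We offer a 60-day money-back guarantee with no questions asked.",
    "Shipping takes 3-5 business days within the EU.",
]

def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Phase 1: index (happens once)
kb = [(doc, embed(doc)) for doc in docs]

# Phase 2: retrieve (per query)
question = "What's your refund policy?"
q_vec = embed(question)
best_chunk = max(kb, key=lambda item: cosine(q_vec, item[1]))[0]

# Phase 3: generate (per query)
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Answer using ONLY this context:\n{best_chunk}\n\nQuestion: {question}",
    }],
)
print(reply.choices[0].message.content)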

Here's the difference in practice:

Without RAG (standard AI):

text
User: "What's your refund policy?"
AI: "Most companies offer 30-day refunds..." (generic, possibly wrong)

With RAG:

text
User: "What's your refund policy?"
[System searches knowledge base -> finds refund-policy.pdf, page 3]
AI: "We offer a 60-day money-back guarantee with no questions asked.
     To request a refund, email [email protected]." (your actual policy)

The difference is trust. With RAG, every answer traces back to a source document. You can verify it. Your customers can trust it. And when your policies change, you update the document and the AI's answers change immediately, no retraining required.

RAG Pipeline Architecture

Here's the complete pipeline from document upload to AI response:

text
Document Upload (PDF, TXT, DOCX)
    |
    v
Text Extraction (PDF -> plain text)
    |
    v
Chunking (split into 500-token segments with overlap)
    |
    v
Embedding Generation (text -> 1536-dimension vector)
    |
    v
Vector Storage (PostgreSQL + pgvector)
    |
    v
[User Query] -> Query Embedding -> Cosine Similarity Search -> Top 3 Chunks
    |
    v
Context Injection (chunks + question -> AI prompt)
    |
    v
AI Model -> Response (grounded in your documents)

Let's walk through each stage.

Stage 1: Document Ingestion

The pipeline starts when you upload a document. MoltFlow accepts PDF, TXT, DOCX, and Markdown files. The ingestion process extracts raw text, normalizes whitespace, strips headers and footers, and prepares the content for chunking.

python
# MoltFlow ingestion pipeline (simplified)
@app.post("/api/v2/ai/knowledge/ingest")
async def ingest_document(file: UploadFile):
    text = extract_text(file)       # PDF/DOCX/TXT extraction
    chunks = chunk_text(text, max_tokens=500)
    embeddings = embed_chunks(chunks)
    store_in_db(chunks, embeddings)
    return {"chunks": len(chunks), "status": "indexed"}

Text extraction sounds simple but has nuances. PDFs with columns, tables, or images require careful parsing to maintain reading order. MoltFlow uses PyPDF2 for standard PDFs and python-docx for Word documents, with fallback to raw text extraction for edge cases.
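
A minimal version of such an extractor might look like this, using PyPDF2, python-docx, and a plain-text fallback. This is a sketch of the approach, not MoltFlow's production parser, which additionally handles column order, tables, and header/footer stripping:

python
from pathlib import Path
from PyPDF2 import PdfReader    # PDF text extraction
from docx import Document       # python-docx, for .docx files

def extract_text(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    # TXT, Markdown, and anything else: treat as plain text
    return Path(path).read_text(encoding="utf-8", errors="replace")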

Stage 2: Chunking

You can't feed an entire 100-page manual into an AI prompt. Context windows are limited, and even with models that support 128k+ tokens, stuffing too much irrelevant text degrades response quality. Chunking solves this by splitting documents into smaller, semantically coherent segments.

MoltFlow defaults to 500 tokens per chunk with 50 tokens of overlap between adjacent chunks. The overlap ensures that concepts split across chunk boundaries still appear together in at least one chunk.

Why 500 tokens? It's a sweet spot. Smaller chunks (200-300 tokens) lose context. Larger chunks (800-1000 tokens) include too much irrelevant content alongside the relevant bits. At 500 tokens, you get roughly 1-2 paragraphs of coherent content, enough to answer most questions without noise.
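
The chunking examples later in this post approximate tokens with word counts, which is fine for rough sizing. If you want exact counts against the 500-token target, you can measure with tiktoken, OpenAI's tokenizer library:

python
import tiktoken

# cl100k_base is the encoding used by the text-embedding-3 models
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

print(count_tokens("What's your refund policy?"))  # a short query is only a handful of tokens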

We'll go deeper on chunking strategies in a dedicated section below.

Stage 3: Embedding Generation

An embedding is a vector representation of text meaning. It converts human-readable text into a list of numbers (1,536 dimensions for OpenAI's text-embedding-3-small) where similar meanings produce similar vectors.

For example, "How do I reset my password?" and "I forgot my login credentials" would produce vectors that are close together in the 1,536-dimensional space, even though they share zero words. That's the power of semantic similarity over keyword matching.

python
from openai import OpenAI
client = OpenAI(api_key=API_KEY)

def embed_text(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding  # [0.023, -0.045, 0.112, ...]

Each chunk gets embedded once during ingestion. User queries get embedded at search time. The embedding model must be the same for both, as vectors from different models aren't comparable.
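
Because every chunk is embedded exactly once at ingestion time, it pays to batch the calls: the embeddings endpoint accepts a list of inputs in a single request. A sketch of the embed_chunks() helper from the ingestion snippet above (not MoltFlow's actual code):

python
from openai import OpenAI

client = OpenAI()

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,  # a list of strings is embedded in one request
    )
    # Sort by index so embeddings line up with the input chunk order
    return [item.embedding for item in sorted(response.data, key=lambda d: d.index)]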

Stage 4: Vector Storage

MoltFlow stores embeddings in PostgreSQL using the pgvector extension. This keeps everything in a single database, no separate vector store to manage. Here's the schema:

sql
CREATE TABLE knowledge_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  document_id UUID NOT NULL,
  content TEXT NOT NULL,
  embedding vector(1536),  -- pgvector column type
  chunk_index INT,
  metadata JSONB,
  created_at TIMESTAMP DEFAULT now()
);

-- IVFFlat index for approximate nearest neighbor search
CREATE INDEX ON knowledge_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

The IVFFlat index enables fast approximate nearest neighbor search. For knowledge bases under 100,000 chunks, this delivers sub-10ms search times. For larger datasets, HNSW indexing provides better recall at the cost of more memory.
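
If you self-host pgvector and outgrow IVFFlat, switching the index type is a single statement (pgvector 0.5.0+). Here's a sketch using psycopg, with m and ef_construction left at their defaults; the connection string is a placeholder:

python
import psycopg

with psycopg.connect("postgresql://localhost/knowledge_db") as conn:
    # The with-block commits on successful exit (psycopg 3)
    conn.execute("""
        CREATE INDEX IF NOT EXISTS knowledge_chunks_embedding_hnsw
        ON knowledge_chunks
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """)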

Stage 5: Vector Search

When a user asks a question, the system converts the query to a vector using the same embedding model, then finds the closest matching chunks using cosine similarity:

sql
SELECT content,
       metadata,
       1 - (embedding <=> $2::vector) AS similarity
FROM knowledge_chunks
WHERE tenant_id = $1
  AND 1 - (embedding <=> $2::vector) > 0.7
ORDER BY embedding <=> $2::vector
LIMIT 3;

The <=> operator is pgvector's cosine distance operator. We filter for similarity above 0.7 (a 70% match) to avoid injecting irrelevant context. The query returns the top 3 most relevant chunks, ranked by semantic similarity to the user's question.

Stage 6: Context Injection

The retrieved chunks are combined with the user's question in a structured prompt template:

text
You are a customer support assistant. Answer using ONLY the context below.
If the context doesn't contain the answer, say "I don't have that information"
and offer to connect the customer with a human agent.

Context:
---
[Chunk 1: Our refund policy allows customers to request a full refund
within 60 days of purchase. No questions asked. Contact support@...]
---
[Chunk 2: For returns of physical products, items must be in original
packaging. Digital products are refund-eligible within 14 days...]
---
[Chunk 3: Refund processing takes 5-7 business days. Credit card
refunds appear on your next statement...]
---

Customer question: What's your refund policy?

Answer:

The AI model reads the provided context, understands the question, and generates a response grounded in your actual documentation. The "ONLY the context below" instruction prevents the model from falling back to its training data when the context doesn't cover the question.
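
In code, assembling that template is plain string formatting. A sketch that mirrors the wording above, taking the chunk texts returned by the Stage 5 search:

python
def build_rag_prompt(chunks: list[str], question: str) -> str:
    # Join retrieved chunks with visual separators, as in the template above
    context = "\n---\n".join(f"[Chunk {i + 1}: {chunk}]" for i, chunk in enumerate(chunks))
    return (
        "You are a customer support assistant. Answer using ONLY the context below.\n"
        'If the context doesn\'t contain the answer, say "I don\'t have that information"\n'
        "and offer to connect the customer with a human agent.\n"
        "\n"
        f"Context:\n---\n{context}\n---\n"
        "\n"
        f"Customer question: {question}\n"
        "\n"
        "Answer:"
    )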

Chunking Strategies Deep Dive

Chunking is where most RAG implementations succeed or fail. The same document, chunked differently, can produce wildly different retrieval quality. Here are three approaches, from simple to sophisticated.

Fixed-Size Chunking

The simplest approach: split text every N tokens regardless of content boundaries.

python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunk = ' '.join(words[i:i + size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks

Pros: Simple to implement, predictable chunk count, consistent chunk sizes.

Cons: Breaks mid-sentence, splits paragraphs arbitrarily, can separate a question from its answer in FAQ documents.

Best for: Uniform documents like CSV exports, server logs, or structured data where sentence boundaries don't matter.

Sentence-Aware Chunking (MoltFlow Default)

Split at sentence boundaries, accumulating sentences until the chunk reaches the target size. This preserves semantic coherence within each chunk.

python
import nltk
nltk.download('punkt_tab')

def chunk_sentences(text: str, max_tokens: int = 500) -> list[str]:
    sentences = nltk.sent_tokenize(text)
    chunks, current = [], []
    token_count = 0

    for sentence in sentences:
        sent_tokens = len(sentence.split())
        if token_count + sent_tokens > max_tokens and current:
            chunks.append(' '.join(current))
            current = [sentence]
            token_count = sent_tokens
        else:
            current.append(sentence)
            token_count += sent_tokens

    if current:
        chunks.append(' '.join(current))
    return chunks

Pros: Preserves sentence integrity, better semantic coherence, natural boundaries.

Cons: Variable chunk sizes (some chunks end up very small when a long sentence pushes past the limit), and it requires a sentence detection library.

Best for: Documentation, articles, user manuals, FAQ pages, anything written in prose.

Semantic Chunking (Advanced)

Uses embeddings to detect topic changes within a document. When the cosine similarity between consecutive sentences drops below a threshold, a new chunk begins. This produces topically coherent segments regardless of length.

Pros: Highest retrieval quality because each chunk covers a single topic. No information bleed across topics within the same chunk.

Cons: Requires embedding every sentence during ingestion (expensive and slow). Complex to implement. Chunk sizes are unpredictable.

Best for: Multi-topic documents like annual reports, comprehensive guides, or policy manuals where a single page might cover multiple unrelated subjects.
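
Here's a sketch of the idea, reusing embed_text() from Stage 3 and NLTK's sentence tokenizer. The 0.75 threshold is illustrative and worth tuning per corpus; production implementations usually compare against a rolling window of recent sentences rather than just the previous one.

python
import math
import nltk

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def chunk_semantic(text: str, threshold: float = 0.75) -> list[str]:
    sentences = nltk.sent_tokenize(text)
    if not sentences:
        return []
    vectors = [embed_text(s) for s in sentences]   # one embedding per sentence
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:      # similarity dropped: topic change
            chunks.append(' '.join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(' '.join(current))
    return chunks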

When to Use Each Strategy

| Document Type | Strategy | Reasoning |
|---|---|---|
| FAQ pages | Sentence-aware | Each Q&A pair is standalone |
| User manuals | Sentence-aware | Step-by-step instructions benefit from sentence boundaries |
| Legal contracts | Fixed-size | Consistent structure, exact text preservation matters |
| Knowledge base articles | Semantic | Multiple topics per article need topic-level segmentation |
| Product catalogs | Fixed-size | Structured, repetitive entries with predictable format |
| Support transcripts | Sentence-aware | Conversational flow follows sentence patterns |

MoltFlow uses sentence-aware chunking by default because it works well across the widest range of document types. You can switch strategies per document via the ingestion API.

Embedding Model Comparison

The embedding model you choose directly affects retrieval quality. A poor embedding model means the system retrieves irrelevant chunks, and even the best LLM can't generate good answers from bad context.

Here are the four most practical options for production RAG systems:

text-embedding-3-small (OpenAI) is MoltFlow's default. It produces 1,536-dimensional vectors at $0.02 per 1M tokens, with 62.3% on the MTEB benchmark. Latency is approximately 50ms per batch. This is the best balance of quality, cost, and speed for most production workloads.

text-embedding-3-large (OpenAI) produces 3,072-dimensional vectors at $0.13 per 1M tokens, scoring 64.6% on MTEB. The higher dimensionality captures finer semantic distinctions, making it better for domains where precise retrieval matters (legal, medical, technical). The tradeoff is 6.5x higher cost and larger index sizes.

all-MiniLM-L6-v2 (Sentence Transformers) is open-source and free to self-host. It produces 384-dimensional vectors with 58.8% on MTEB. Lower dimensionality means smaller indexes and faster search, but retrieval quality drops noticeably on nuanced queries. Best for privacy-sensitive deployments or extremely high volume where API costs are prohibitive.

voyage-2 (Voyage AI) produces 1,024-dimensional vectors at $0.12 per 1M tokens, scoring 63.8% on MTEB. It's a strong alternative to OpenAI's models, especially for code-heavy knowledge bases where Voyage's training data gives it an edge.

| Model | Dimensions | Cost/1M Tokens | MTEB Score | Latency (batch) |
|---|---|---|---|---|
| text-embedding-3-small | 1,536 | $0.02 | 62.3% | ~50ms |
| text-embedding-3-large | 3,072 | $0.13 | 64.6% | ~120ms |
| all-MiniLM-L6-v2 | 384 | Free | 58.8% | ~80ms |
| voyage-2 | 1,024 | $0.12 | 63.8% | ~90ms |

For most WhatsApp automation use cases, text-embedding-3-small is the right choice. You'd only upgrade to text-embedding-3-large if you're seeing retrieval quality issues with domain-specific queries after optimizing chunk size and overlap.

MoltFlow Knowledge Base Implementation

MoltFlow's RAG system is built-in. You don't need to manage embeddings, vector databases, or retrieval logic separately. Here's how to use it.

Upload a Document

bash
curl -X POST https://apiv2.waiflow.app/api/v2/ai/knowledge/ingest \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "[email protected]" \
  -F "metadata={\"category\": \"support\", \"version\": \"2.1\"}"

Response:

json
{
  "document_id": "doc_abc123",
  "chunks_created": 42,
  "embedding_model": "text-embedding-3-small",
  "processing_time_ms": 3200
}

The document is chunked, embedded, and indexed automatically. Processing time depends on document size: a 10-page PDF typically takes 2-4 seconds.

Search the Knowledge Base

Test retrieval manually before wiring it into your bot:

bash
curl -X POST https://apiv2.waiflow.app/api/v2/ai/knowledge/search \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "refund policy",
    "limit": 3,
    "min_similarity": 0.7
  }'

Response:

json
{
  "results": [
    {
      "chunk_id": "chunk_xyz789",
      "content": "Our refund policy allows 60 days...",
      "similarity": 0.89,
      "document_id": "doc_abc123",
      "metadata": {"category": "support", "page": 12}
    }
  ]
}

The similarity score (0.89 in this case) tells you how closely the chunk matches the query. Anything above 0.8 is a strong match. Between 0.7 and 0.8 is acceptable. Below 0.7, the chunk is likely not relevant enough to use as context.

Generate an AI Response with RAG

This is the endpoint that powers your WhatsApp bot. It combines retrieval and generation in a single call:

bash
curl -X POST https://apiv2.waiflow.app/api/v2/ai/generate \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "session_name": "support-bot",
    "message": "What is your refund policy?",
    "use_rag": true,
    "rag_config": {
      "top_k": 3,
      "min_similarity": 0.7,
      "document_filter": {"category": "support"}
    }
  }'

Response:

json
{
  "reply": "We offer a 60-day money-back guarantee with no questions asked. To request a refund, email [email protected] or reply here and I'll connect you with our team.",
  "rag_sources": [
    {"document_id": "doc_abc123", "page": 12, "similarity": 0.89}
  ],
  "model": "gpt-4o",
  "tokens_used": 450
}

The rag_sources array lets you audit which documents influenced the response. The document_filter parameter limits retrieval to specific document categories, useful when you have separate knowledge bases for support, sales, and technical documentation.

Common RAG Pitfalls and Solutions

After deploying RAG for hundreds of MoltFlow users, we've identified the five most common failure modes and how to fix them.

Pitfall 1: Chunks Too Small

Symptom: The AI retrieves sentence fragments without enough context to generate useful answers. Responses are vague or incomplete.

Example: A chunk containing just "within 60 days of purchase" gets retrieved for a refund question, but without the surrounding context about how to initiate the refund or what conditions apply.

Fix: Increase chunk size to 500-800 tokens and ensure overlap is at least 50 tokens. Sentence-aware chunking prevents splitting mid-thought.

Pitfall 2: Chunks Too Large

Symptom: Retrieved chunks contain relevant information buried in irrelevant text. The AI either misses the key detail or gets confused by contradictory information within the same chunk.

Fix: Decrease to 300-500 tokens. If your documents cover multiple topics per page, consider semantic chunking to split at topic boundaries.

Pitfall 3: Wrong Embedding Model

Symptom: Search returns chunks that are lexically similar but semantically different. "Apple revenue Q4" matches chunks about Apple (the fruit company) growing revenue instead of Apple (the tech company).

Fix: Use text-embedding-3-small for general text. For specialized domains (code, legal, medical), test domain-specific models like voyage-2 or specialized fine-tunes.

Pitfall 4: Stale Knowledge Base

Symptom: AI answers based on outdated policies, old pricing, or deprecated features. Customers receive incorrect information even though you updated the documentation weeks ago.

Fix: Version your documents. Set up a re-ingestion pipeline that runs when source documents change. MoltFlow supports document replacement: upload a new version with the same document_id to overwrite the previous embeddings.
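
A re-ingestion step you might run from CI or a docs-changed webhook could look like the sketch below. It assumes the ingest endpoint accepts a document_id form field to target the document being replaced, as described above; the field name and helper are illustrative, so check the API reference for the exact contract.

python
import requests

def reingest(path: str, document_id: str, token: str) -> dict:
    """Re-upload a changed document so its chunks and embeddings are rebuilt."""
    with open(path, "rb") as f:
        resp = requests.post(
            "https://apiv2.waiflow.app/api/v2/ai/knowledge/ingest",
            headers={"Authorization": f"Bearer {token}"},
            files={"file": f},
            data={"document_id": document_id},  # assumed field for replacement
        )
    resp.raise_for_status()
    return resp.json()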

Pitfall 5: No Fallback When RAG Fails

Symptom: When no relevant chunks are found (similarity below threshold), the AI either says nothing useful or falls back to its training data and generates a generic (potentially wrong) answer.

Fix: Implement explicit fallback logic:

python
results = search_knowledge_base(query, min_similarity=0.7)

if not results or results[0].similarity < 0.7:
    response = (
        "I don't have specific information on that in our documentation. "
        "Let me connect you with our team who can help."
    )
    escalate_to_human(sender_id)
else:
    response = generate_with_rag(query, results)

The key insight: admitting "I don't know" is always better than confidently stating something wrong. Configure your system prompt to prefer escalation over fabrication, and set the similarity threshold high enough (0.7 minimum) to filter out irrelevant context.

Performance and Scaling

RAG adds latency to every AI response. Here's what to expect and how to optimize.

Embedding generation: 30-80ms per query (depends on embedding model and text length).

Vector search: 5-15ms for knowledge bases under 100,000 chunks with IVFFlat indexing. Without an index, search time grows linearly with chunk count; IVFFlat keeps it sublinear by scanning only the nearest lists.

Total overhead: 50-100ms added to each response, well within the 3-second acceptable latency for WhatsApp.

For larger knowledge bases (500k+ chunks), consider HNSW indexing instead of IVFFlat. HNSW provides better recall at the cost of higher memory usage, roughly 2-3x more RAM per index. MoltFlow switches to HNSW automatically when chunk count exceeds the IVFFlat sweet spot.

Further Reading

The RAG paradigm was formally introduced in "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Lewis et al. (2020). The paper demonstrated that combining retrieval with generation significantly outperforms pure generative models on knowledge-intensive benchmarks. Since then, RAG has become the standard architecture for grounding LLMs in domain-specific knowledge, and its core insight (retrieve first, then generate) remains the foundation of every production RAG system today.

What's Next?

RAG transforms your AI from a generic chatbot into a domain expert grounded in your actual documentation. The key takeaways: chunk at sentence boundaries with 500 tokens as your starting point. Use OpenAI's text-embedding-3-small for production. Version your knowledge base and re-ingest when documents change. Always implement fallback logic for low-confidence retrievals.


Ready to eliminate AI hallucinations? MoltFlow's RAG is built-in — upload PDFs and start answering with your actual documentation. No vector database to manage, no embedding pipeline to build, no infrastructure to maintain. Start your 14-day free trial and test RAG with your business documents risk-free.

> Try MoltFlow Free — 100 messages/month

$ curl https://molt.waiflow.app/pricing
