
RAG Pipeline Development: Architecture, Costs, and Best Practices

Retrieval-Augmented Generation (RAG) is the dominant pattern for enterprise AI. This guide covers RAG pipeline architecture, common failure modes, cost optimization, and what production-grade RAG actually requires.

·6 min read·fdeai.agency

Retrieval-Augmented Generation (RAG) is a pattern where a language model is given relevant context retrieved from a document corpus before generating a response. RAG is the most common architecture for enterprise AI systems in 2026 — search, knowledge bases, customer support, document analysis, and internal Q&A all typically use RAG at their core.

Building a RAG pipeline that works in demos is easy. Building one that works reliably in production at scale — with good retrieval precision, low latency, manageable cost, and monitoring — is hard. This guide covers the architecture, common failure modes, and what production-grade RAG actually requires.

Core RAG Pipeline Architecture

A production RAG pipeline has five components:

1. Ingestion pipeline — processes source documents into chunks, generates embeddings, and stores them in a vector database. This runs continuously as documents are added, updated, or deleted. Document freshness in retrieval depends entirely on ingestion pipeline reliability.

2. Vector database — stores embedding vectors and enables semantic similarity search. Common choices: Pinecone (managed, easy to start), Weaviate (self-hosted or managed, flexible schema), pgvector (in-database via PostgreSQL extension, zero new infrastructure), Qdrant (high-performance self-hosted). Choice depends on scale, data sovereignty requirements, and existing stack.

3. Retrieval layer — takes a user query, embeds it, retrieves the k most relevant chunks, and optionally reranks them using a cross-encoder. The retrieval layer determines what context the LLM sees — it is the most impactful component for output quality.

4. Augmentation layer — formats retrieved context into the LLM prompt, manages context window limits, handles citations, and controls the instruction structure. The prompt template matters significantly — small changes in how context is formatted produce measurable quality differences.

5. Generation layer — the LLM that produces the final response based on the retrieved context. Model choice affects quality, latency, and cost. Different query types may warrant different models (smaller/faster for simple lookups, larger for complex synthesis).

Where RAG Pipelines Fail in Production

Chunking strategy is the most underestimated problem. Naive fixed-size chunking (split every N characters) breaks semantic units and degrades retrieval quality. A 500-character split might cut a table definition in half, leaving both halves useless. Production systems use semantic chunking (split on meaning boundaries), hierarchical chunking (store both a summary and full detail for each section), or domain-specific chunking logic.
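As a minimal sketch of boundary-aware splitting, the function below packs whole paragraphs into chunks instead of cutting every N characters. It uses blank lines as a simple stand-in for meaning boundaries; true semantic chunking would use a model or document structure. The name chunk_by_paragraph and the 500-character default are illustrative assumptions.

```python
def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks, never splitting inside a paragraph.

    Chunks stay at or under max_chars unless a single paragraph is longer,
    in which case that paragraph becomes its own (oversized) chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # ignore empty paragraphs from repeated blank lines
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Start a new chunk with this paragraph, kept whole.
            current = para
    if current:
        chunks.append(current)
    return chunks
```

A table or definition that sits inside one paragraph stays intact here, at the cost of uneven chunk sizes.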

Embedding model mismatch — the embedding model used during ingestion must be exactly the one used at query time. If you update the embedding model without re-embedding the entire corpus, retrieval silently degrades. Version-pin your embedding models and build a re-embedding pipeline before you need it.
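One way to make a mismatch fail loudly instead of silently is to store the embedding model name and dimension count alongside the index and check them on every query path. The sketch below assumes that metadata is available; IndexMetadata, EmbeddingModelMismatch, and assert_compatible are hypothetical names, not part of any vector database API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMetadata:
    embedding_model: str  # model name recorded at ingestion time
    dimensions: int       # vector size recorded at ingestion time


class EmbeddingModelMismatch(RuntimeError):
    """Raised when query-time embeddings do not match the indexed corpus."""


def assert_compatible(index: IndexMetadata, query_model: str, query_dims: int) -> None:
    # Refuse to search if the query embedder differs from the ingestion embedder.
    if index.embedding_model != query_model or index.dimensions != query_dims:
        raise EmbeddingModelMismatch(
            f"index built with {index.embedding_model} ({index.dimensions}d), "
            f"query uses {query_model} ({query_dims}d): re-embed the corpus first"
        )
```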

Retrieval precision failure — cosine similarity between query and chunk embeddings alone is insufficient for many enterprise use cases. Production RAG systems combine dense retrieval (embedding similarity) with sparse retrieval (BM25/keyword matching) in a hybrid search, then rerank results with a cross-encoder. The combination consistently outperforms either approach alone.

Context window management — retrieved chunks that together exceed the model's context window get truncated at the end, which often means the most recent or specific context is dropped. Systems that don't actively manage context window usage produce hallucinations from truncated prompts. Implement dynamic chunk selection that fills the context window optimally given the query.
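A simple form of dynamic chunk selection is a greedy fill: take chunks in descending relevance order and skip any that would exceed the token budget. This is a sketch, not an optimal packing; estimate_tokens uses a rough 4-characters-per-token heuristic as an assumption and should be replaced with the tokenizer of the model in use.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); an assumption, not a tokenizer.
    return max(1, len(text) // 4)


def select_chunks(scored_chunks: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Greedily keep the highest-scoring chunks that fit within budget_tokens."""
    selected: list[str] = []
    used = 0
    for _score, chunk in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # does not fit; a smaller, lower-ranked chunk still might
        selected.append(chunk)
        used += cost
    return selected
```

Because selection happens before the prompt is assembled, nothing is cut off mid-chunk by the model's context limit.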

Staleness — if your ingestion pipeline falls behind or fails silently, the RAG system answers from outdated data. Users encounter answers that contradict what they see in the source system. Monitor ingestion freshness as a first-class metric: how old is the most recently indexed document? Set alerts.
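The freshness metric described above can be computed from a single timestamp. The sketch below assumes the ingestion pipeline records when the most recent document was indexed; index_staleness and is_stale are hypothetical helper names, and the alert threshold is whatever the team's freshness target is.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def index_staleness(last_indexed_at: datetime, now: Optional[datetime] = None) -> timedelta:
    """Return how long ago the most recently indexed document was ingested."""
    now = now or datetime.now(timezone.utc)
    return now - last_indexed_at


def is_stale(last_indexed_at: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    # True when the index is older than the allowed freshness window.
    return index_staleness(last_indexed_at, now) > max_age
```

A scheduled job could call is_stale with a timezone-aware timestamp and page someone when it returns True.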

Hallucination with retrieved context — retrieval doesn't eliminate hallucination. An LLM can retrieve accurate context and then produce an answer that contradicts it, especially under ambiguous prompting. Ground responses with explicit instructions to cite sources and decline to answer if the context doesn't support a claim.
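A grounding instruction of this kind might look like the template below. The wording, the refusal sentence, and the function name build_grounded_prompt are illustrative assumptions; the exact phrasing should be tuned against your own eval set.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Format retrieved chunks as numbered sources with cite-or-decline rules."""
    sources = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered sources below.\n"
        "Cite the source number in brackets after each claim, for example [1].\n"
        "If the sources do not support an answer, reply exactly: "
        '"I cannot answer this from the provided documents."\n\n'
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the sources gives the model something concrete to cite and makes unsupported claims easier to spot in evaluation.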

Production Retrieval Quality Techniques

Hybrid search: Combine dense (embedding) and sparse (BM25) retrieval, then fuse the result lists with Reciprocal Rank Fusion (RRF). Consistently achieves 15–25% better retrieval recall than dense-only.
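Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores. A minimal sketch follows; k=60 is the constant conventionally used with RRF, and the document IDs in the usage line are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_hits = ["a", "b", "c"]    # hypothetical IDs from embedding search
sparse_hits = ["b", "d", "a"]   # hypothetical IDs from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# fused == ["b", "a", "d", "c"]
```

Documents that appear in both lists rise to the top, which is why the fusion works without calibrating dense and sparse scores against each other.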

Cross-encoder reranking: After initial retrieval, use a cross-encoder model to score each retrieved chunk jointly with the query, so the model reads both texts together instead of comparing two independent embeddings. Reranking is expensive but dramatically improves precision. Apply to the top 20–50 candidates, then return the top 5–10.
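The rerank stage itself is a small amount of wiring around whichever scoring model is used. The sketch below takes the scorer as a parameter; keyword_overlap is a toy placeholder so the example runs without a model and is not a cross-encoder. In practice score_fn would wrap a real cross-encoder's prediction call.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and return the top_n chunks."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _score, chunk in scored[:top_n]]


def keyword_overlap(query: str, chunk: str) -> float:
    # Placeholder scorer: fraction of query words present in the chunk.
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0
```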

Query expansion: Generate multiple phrasings of the user query and retrieve against each. Aggregate results. Handles the vocabulary mismatch problem — a user asking about "vendor invoices" needs to retrieve documents that say "supplier payments."

Contextual chunking: Instead of storing raw chunks, add a document-level summary prefix to each chunk before embedding. The prefix provides context that the raw chunk lacks, improving retrieval relevance for queries that reference document-level concepts.
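In its simplest form this is string assembly at ingestion time. The sketch below assumes a title and a short summary already exist for each document (for example, generated once per document); contextualize and the prefix layout are illustrative choices.

```python
def contextualize(chunk: str, doc_title: str, doc_summary: str) -> str:
    """Prepend document-level context to a chunk before it is embedded."""
    return f"Document: {doc_title}\nSummary: {doc_summary}\n\n{chunk}"


text_to_embed = contextualize(
    "Payment is due within 45 days of receipt.",
    doc_title="Supplier Agreement 2025",          # made-up example values
    doc_summary="Terms for supplier payments and invoicing.",
)
```

The prefixed text is what gets embedded; the raw chunk can still be stored separately for display and citation.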

Hypothetical Document Embedding (HyDE): For question-answering use cases, generate a hypothetical answer to the query, embed that answer, and retrieve against it instead of the raw query. Retrieves passages that would answer the question, not just passages that contain the query keywords.

Cost Architecture

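Naive RAG is expensive. At a list price of $0.13 per million tokens for OpenAI text-embedding-3-large, a 10-billion-token corpus costs ~$1,300 for initial ingestion (a 10-million-token corpus costs only about $1.30). At scale, inference adds up quickly and usually dominates the bill.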

Inference cost model:

  • 100,000 queries/day × 4,000 average input tokens × $5/million input tokens (GPT-4o) = $2,000/day = $60,000/month, before output tokens are counted
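The same arithmetic as a small function, so the inputs can be varied. It counts input tokens only and assumes a 30-day month; the function name is hypothetical and the prices are the per-million-token figures quoted in this guide.

```python
def monthly_inference_cost(queries_per_day: int, avg_input_tokens: int,
                           usd_per_million_tokens: float, days: int = 30) -> float:
    """Input-token cost per month; output tokens are not included."""
    daily_tokens = queries_per_day * avg_input_tokens
    daily_cost = daily_tokens / 1_000_000 * usd_per_million_tokens
    return daily_cost * days


baseline = monthly_inference_cost(100_000, 4_000, 5.0)
# baseline == 60000.0  (400M tokens/day x $5 per million = $2,000/day)
```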

Cost reduction stack:

  • Semantic cache (cache responses to semantically similar queries): reduces effective query volume 40–70% for typical enterprise knowledge base use cases
  • Prompt compression: remove redundant context from retrieved chunks, shorter prompts = lower cost
  • Tiered model routing: smaller/cheaper models (GPT-4o-mini at $0.15/million tokens vs $5) for high-confidence retrievals, expensive models only for ambiguous cases
  • Batch processing: non-real-time queries grouped and processed during off-peak hours at reduced API rates
  • Self-hosted embedding: for large ingestion volumes, self-hosted embedding models eliminate per-token API costs
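The first item in the stack, a semantic cache, can be sketched as a lookup over previously seen query embeddings. This in-memory version does a linear scan and assumes query vectors come from the same embedding model used elsewhere; the 0.95 threshold and the class name SemanticCache are illustrative, and a production cache would use an ANN index plus expiry.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a cached response when a new query is close enough to an old one."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    def put(self, query_vec: list[float], response: str) -> None:
        self._entries.append((query_vec, response))

    def get(self, query_vec: list[float]) -> Optional[str]:
        best_score, best_response = 0.0, None
        for vec, response in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        # Only serve from cache when similarity clears the threshold.
        return best_response if best_score >= self.threshold else None
```

A threshold that is too low serves wrong answers to merely related questions, so it is worth validating against real query logs.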

Infrastructure Requirements

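Vector database sizing: Plan for 3,072 dimensions (OpenAI text-embedding-3-large at its default size), 1,536 dimensions (OpenAI text-embedding-3-small), or 768 dimensions (many smaller models). 1 million chunks at 1,536 dimensions stored as float32 ≈ 6.1 GB of raw vector storage (1,000,000 × 1,536 × 4 bytes). Add metadata overhead (3–5x vector size) and ANN index overhead.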
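Raw vector storage can be estimated directly from chunk count and dimensionality. The sketch below assumes uncompressed float32 vectors (4 bytes per dimension) and covers the vectors only, before metadata or ANN index overhead; raw_vector_gb is a hypothetical helper name.

```python
def raw_vector_gb(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal gigabytes (10**9 bytes)."""
    return num_chunks * dimensions * bytes_per_value / 1e9


size_1536 = raw_vector_gb(1_000_000, 1_536)   # 6.144
size_768 = raw_vector_gb(1_000_000, 768)      # 3.072
```

Quantized storage (for example 1 byte per value) can be estimated by changing bytes_per_value.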

Ingestion pipeline: Must handle document creates, updates, and deletes. Deletes are often overlooked — if a document is removed from the source system, the corresponding vectors must be deleted from the vector store or retrieval returns stale results.
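Creates, updates, and deletes can all be derived by diffing the source system against the index. The sketch below assumes each document has a stable ID and a content hash on both sides; plan_sync and the dictionary shapes are illustrative, not a specific vector store's API.

```python
def plan_sync(source: dict[str, str], indexed: dict[str, str]) -> dict[str, set[str]]:
    """Compare {doc_id: content_hash} maps and decide what to re-index or remove."""
    to_add = source.keys() - indexed.keys()
    to_delete = indexed.keys() - source.keys()   # removed at the source
    to_update = {
        doc_id for doc_id in source.keys() & indexed.keys()
        if source[doc_id] != indexed[doc_id]     # content changed
    }
    return {"add": set(to_add), "update": to_update, "delete": set(to_delete)}
```

Every ID in the delete set should have all of its chunk vectors removed from the vector store, not just the document record.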

Observability: Trace every query end-to-end: query received, embedding generated, chunks retrieved (with scores), reranking applied, prompt assembled, response generated. Store traces for quality analysis and debugging.
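A trace record for these stages can be as simple as an append-only list of events serialized to JSON. This sketch uses only the standard library; QueryTrace and its field names are assumptions, and a real deployment would likely ship these records to an existing tracing or logging backend.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class QueryTrace:
    query: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    stages: list = field(default_factory=list)

    def record(self, stage: str, **details) -> None:
        # One entry per pipeline stage, with whatever details that stage has.
        self.stages.append({"stage": stage, "at": time.time(), **details})

    def to_json(self) -> str:
        return json.dumps(asdict(self))


trace = QueryTrace(query="What is our refund policy?")
trace.record("retrieval", chunk_ids=["c12", "c7"], scores=[0.82, 0.79])
trace.record("generation", model="example-model", output_tokens=120)
```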

Frequently Asked Questions

How long does it take to build a production RAG pipeline? With a dedicated forward-deployed engineer (FDE), 6–10 weeks for a well-scoped RAG system. This includes ingestion pipeline, retrieval layer with hybrid search and reranking, eval framework, and integration with your existing systems.

What vector database should we use? For most teams starting out: pgvector if you're already on PostgreSQL (zero new infrastructure), or Pinecone (fully managed, fast to start). For high-scale or self-hosted requirements: Weaviate or Qdrant. Don't over-engineer the vector database choice — they're more interchangeable than they seem, and you can migrate later.

Can RAG work with proprietary or confidential documents? Yes. RAG is the standard architecture for internal document corpora — legal documents, internal knowledge bases, support tickets, proprietary research. Ensure your embedding provider's data handling meets your compliance requirements, or use self-hosted embedding models to keep data entirely within your infrastructure.

How do we measure RAG quality? Core metrics: retrieval precision@k (are the right chunks being retrieved?), answer groundedness (is each claim in the answer supported by the retrieved chunks?), answer faithfulness (does the answer accurately represent the underlying source documents, without distortion?), and end-to-end accuracy on your labeled test set. Tools: Ragas, ARES, or custom eval frameworks.
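Of these metrics, precision@k is the simplest to compute yourself once you have a labeled set of relevant chunk IDs per query. A minimal sketch, dividing by k as is conventional; the IDs in the usage line are made up.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


p_at_4 = precision_at_k(["a", "b", "c", "d"], relevant={"a", "c", "x"}, k=4)
# p_at_4 == 0.5
```

Averaging this over every query in the test set gives a single number to track across retrieval changes.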

When does RAG fail and fine-tuning work better? RAG fails when the use case is narrow and highly repetitive (fine-tuning specializes the model more efficiently), when the source documents are too large to chunk meaningfully (full-document tasks), or when response latency requirements are incompatible with retrieval overhead. Fine-tuning works better for behavior and style adaptation, not for injecting new knowledge.

