Retrieval-Augmented Generation (RAG) is the foundational pattern for building AI applications over private data. A "hello world" RAG pipeline takes an hour to build. A production RAG pipeline — one that is accurate, observable, versioned, and maintainable — takes careful engineering. This guide covers the full stack, from document ingestion through evaluation and CI/CD on GitHub.

Why Most RAG Implementations Fail in Production

The gap between a demo RAG pipeline and a production one is large. Common failure modes include:

  • Naive fixed-size chunking that splits sentences mid-thought
  • Single embedding model that performs poorly on domain-specific text
  • Pure vector search that misses exact-match queries
  • No evaluation — no way to know if a change made the system better or worse
  • No version control of prompts, embeddings, or retrieval logic

This guide addresses all of these.

Step 1: Document Ingestion and Chunking

The quality of your RAG pipeline is largely determined by your chunking strategy. In 2026, fixed-size chunking (splitting every N tokens) is considered an anti-pattern. The recommended approaches are:

Semantic Chunking

LangChain's SemanticChunker splits text at semantic boundaries — places where the topic changes — rather than at arbitrary token counts. This produces chunks that contain coherent, complete ideas and dramatically improves retrieval precision.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    # Split where the embedding distance between adjacent sentence groups
    # exceeds the 95th percentile of all distances in the document
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
chunks = chunker.split_documents(docs)

Structure-Aware Chunking

For documents with clear structure (Markdown, HTML, code), use RecursiveCharacterTextSplitter with document-specific separators that respect headers, paragraphs, and code blocks.
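For Markdown sources, for example, LangChain ships language-aware preset separators. A minimal sketch (the chunk sizes are illustrative starting points, not tuned values):

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Split on Markdown structure (headers, paragraphs, code fences)
# before falling back to raw character counts.
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=1000,
    chunk_overlap=100,
)
md_chunks = md_splitter.split_documents(docs)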

Step 2: Embedding Model Selection

The default OpenAI text-embedding-ada-002 is no longer the best choice in 2026. The MTEB leaderboard consistently shows that task-specific and newer models outperform it:

  • General text: text-embedding-3-large (OpenAI) or nomic-embed-text-v2 (open source)
  • Code: voyage-code-3 (Voyage AI) or jina-embeddings-v3
  • Multilingual: multilingual-e5-large

When in doubt, run your candidate models on a sample of your actual queries and documents using the MTEB evaluation suite before committing to one.
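As a lighter-weight sanity check before a full MTEB run, you can measure recall@k directly on a handful of your own query-document pairs. A sketch, where embed is a stand-in for whichever embedding client you are testing (not a library API):

import numpy as np

def recall_at_k(embed, queries, docs, relevant, k=5):
    # embed: callable mapping list[str] -> np.ndarray of shape (n, dim)
    # relevant: relevant[i] is the index into docs of the correct document
    #           for queries[i]
    q = embed(queries)
    d = embed(docs)
    # Cosine similarity via normalized dot products
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    top_k = np.argsort(-(q @ d.T), axis=1)[:, :k]
    hits = sum(relevant[i] in top_k[i] for i in range(len(queries)))
    return hits / len(queries)

Run this once per candidate model and compare scores on your own data before committing to the heavier benchmark.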

Step 3: Vector Store and Hybrid Search

Pure vector search misses queries with exact-match requirements. In 2026, hybrid search (combining dense vector similarity with BM25 sparse retrieval) is the production standard. Most vector databases now support it natively:

  • Pinecone — built-in hybrid search
  • Qdrant — hybrid search with sparse vectors
  • Weaviate — hybrid search with BM25
  • pgvector + Elasticsearch — self-hosted hybrid

LangChain's EnsembleRetriever makes it easy to combine retrievers with configurable weights:

from langchain.retrievers import EnsembleRetriever

ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)
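For reference, the two base retrievers might be constructed like this (a sketch that assumes the chunks from Step 1 and an already-populated vectorstore):

from langchain_community.retrievers import BM25Retriever

# Dense retriever over the vector store built from the Step 1 chunks
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Sparse BM25 retriever over the same chunks, to catch exact-match queries
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10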

Step 4: Re-Ranking

After retrieval, apply a cross-encoder re-ranker to the top-K results before passing them to the LLM. Re-ranking consistently improves response quality by 10-20% in benchmark evaluations:

from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3"),
    top_n=5
)
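To apply it, wrap a base retriever (for example, the ensemble from Step 3) in a ContextualCompressionRetriever:

from langchain.retrievers import ContextualCompressionRetriever

# Retrieve a wide top-K with the hybrid ensemble, then let the
# cross-encoder keep only the 5 best-scoring chunks.
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=ensemble,
)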

Step 5: Evaluation with RAGAS

RAGAS is the standard evaluation framework for RAG pipelines in 2026. It measures four key dimensions automatically using an LLM as judge:

  • Faithfulness — Is the answer grounded in the retrieved context?
  • Answer Relevancy — Does the answer address the question?
  • Context Precision — Is the retrieved context relevant?
  • Context Recall — Was all relevant context retrieved?

Set up RAGAS evaluation as a GitHub Actions workflow that runs on every PR touching the RAG pipeline, with a quality gate that fails the PR if any metric drops below a threshold.
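A minimal version of that test might look like this. The sketch uses the classic RAGAS evaluate API; the eval_dataset.json schema and the 0.8 threshold are illustrative, not prescribed:

# tests/test_ragas.py
import json

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy, context_precision, context_recall, faithfulness
)

def test_rag_quality():
    # Golden Q&A pairs, each with the pipeline's answer and retrieved contexts.
    # Expected columns: question, answer, contexts (list[str]), ground_truth.
    rows = json.load(open("tests/eval_dataset.json"))
    dataset = Dataset.from_list(rows)

    result = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )

    # Quality gate: fail the PR if any aggregate score drops below threshold.
    # The classic RAGAS Result behaves like a dict of mean metric scores.
    for metric, score in result.items():
        assert score >= 0.8, f"{metric} dropped to {score:.2f}"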

Step 6: GitHub CI/CD Integration

Version control everything in your RAG pipeline — not just code, but prompts, chunking parameters, embedding model choices, and evaluation thresholds. The recommended repository structure:

rag-pipeline/
  config/
    chunking.yaml       # chunking strategy and parameters
    embeddings.yaml     # model selection and settings
    retrieval.yaml      # vector store, hybrid weights, top-K
    prompts/            # all LLM prompt templates (versioned)
  src/
    ingest.py
    retriever.py
    chain.py
  tests/
    eval_dataset.json   # golden Q&A pairs for evaluation
    test_ragas.py       # RAGAS evaluation test suite
  .github/
    workflows/
      eval.yml          # runs RAGAS on every PR

Storing prompts as versioned files (not hardcoded strings) means you can track what changed when quality shifts, and roll back prompt changes with git revert.
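A skeleton for eval.yml might look like this (the paths filter, dependency install, and secret name are placeholders for your setup):

# .github/workflows/eval.yml
name: RAG evaluation
on:
  pull_request:
    paths:
      - "src/**"
      - "config/**"
      - "tests/**"

jobs:
  ragas:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt   # placeholder dependency setup
      - run: pytest tests/test_ragas.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # LLM-as-judge credentials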

Monitoring in Production

The final production requirement is observability. LangSmith (from LangChain) and Langfuse (open source alternative) both provide trace-level visibility into every retrieval and generation step. Set up dashboards to track faithfulness scores, latency, and retrieval coverage over time — and alert when quality metrics degrade.
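With LangSmith, for example, tracing can be enabled entirely through environment variables, with no changes to pipeline code (the project name below is a placeholder):

import os

# Enable LangSmith tracing for every chain and retriever call in this process
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "rag-pipeline-prod"  # placeholder project name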

Building a production RAG pipeline correctly takes more effort than a demo, but the result is a system that actually performs reliably for users — and that you can confidently improve over time using data rather than guesswork.