Building Production-Ready RAG: A Symfony + pgvector Architecture Blueprint
A SaaS team in Munich shipped their first retrieval-augmented generation feature in two weeks last autumn: a "chat with our documentation" panel built from a LangChain quickstart, an OpenAI key, and a managed Pinecone index. Demo day went well. Six weeks later, the same feature was the largest single line item on the AI invoice, the answers had quietly degraded as the docs grew, and the index had drifted out of sync with two product areas no one had noticed. Worse, the legal team had just discovered that customer support tickets, including names, emails, and account IDs, were being embedded and shipped to a US vendor without a Data Processing Agreement covering that flow. The prototype worked. The production RAG architecture did not exist.
This is the recurring shape of RAG projects in 2026. Building a working retrieval pipeline is now a weekend exercise; building one that holds up under real document volume, regulated data, evolving content, and a finance team that asks pointed questions about per-query cost is a different engineering problem. This post walks through a production-grade blueprint we have shipped repeatedly for European clients: PostgreSQL with the pgvector extension, Symfony as the application backbone, hybrid BM25 + vector search, deliberate evaluation, and the operational pieces that decide whether the system still works in six months.
Why Symfony and pgvector Instead of a Dedicated Vector DB
The default 2026 stack diagram for RAG involves a managed vector database — Pinecone, Weaviate Cloud, Qdrant Cloud — and a Python service running LangChain or LlamaIndex. For a lot of European SaaS teams, that diagram is wrong on two counts.
The first is data gravity. The documents most worth retrieving over — product content, customer records, support history, internal knowledge — already live in PostgreSQL. Embedding them and shipping the embeddings to a separate system means you now run two stores, two backup regimes, two access-control models, and a synchronisation job between them that is going to drift. The pgvector extension turns Postgres into a competent vector store with HNSW indexes, cosine and L2 distance, and predictable performance up to tens of millions of rows on commodity hardware. For most B2B SaaS workloads, that ceiling is far above what the product will see.
The second is operational fit. A Symfony application already has the parts you need: Doctrine for the data model, the Messenger component for asynchronous embedding and re-indexing jobs, a mature dependency injection container, the Security component for access control, and a sensible HTTP layer. Wiring vector search into an existing PHP service is faster, cheaper, and easier to operate than introducing a Python sidecar that the rest of the team does not maintain. PHP is no longer a strange place to do RAG; the LLM provider SDKs work over HTTP, the bottleneck is the model, not the language, and the result is one less moving part.
Where this blueprint breaks down: workloads above ~50M chunks, sub-50ms p99 retrieval at very high QPS, or specialised vector operations like multi-vector ColBERT-style retrieval. For everyone else, RAG with PostgreSQL is the boring choice that ships.
The Data Model: Documents, Chunks, Embeddings
A clean schema is the entire foundation. Three tables, each with a clear job.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE rag_document (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL,
    source_uri TEXT NOT NULL,
    source_type VARCHAR(32) NOT NULL, -- 'help_article', 'pdf', 'ticket', ...
    content_hash CHAR(64) NOT NULL,   -- sha256 of normalised source content
    title TEXT,
    metadata JSONB NOT NULL DEFAULT '{}',
    indexed_at TIMESTAMPTZ,
    UNIQUE (tenant_id, source_uri)
);

CREATE TABLE rag_chunk (
    id BIGSERIAL PRIMARY KEY,
    document_id UUID NOT NULL REFERENCES rag_document(id) ON DELETE CASCADE,
    tenant_id UUID NOT NULL,
    chunk_index INT NOT NULL,
    content TEXT NOT NULL,
    content_tsv tsvector
        GENERATED ALWAYS AS (to_tsvector('simple', content)) STORED,
    embedding vector(1536), -- adjust to your model's dimension
    token_count INT NOT NULL,
    embedding_model VARCHAR(64) NOT NULL,
    UNIQUE (document_id, chunk_index)
);

CREATE INDEX rag_chunk_tenant_idx ON rag_chunk (tenant_id);
CREATE INDEX rag_chunk_tsv_idx ON rag_chunk USING GIN (content_tsv);
CREATE INDEX rag_chunk_embedding_idx
    ON rag_chunk USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
Three details earn their keep in production. Storing content_hash on documents lets the indexer skip re-embedding when the source has not changed — a huge cost saving on incremental syncs. Generating content_tsv as a stored column gives you full-text search for free, which is what makes hybrid retrieval cheap to add later. And tagging each chunk with embedding_model lets you run two embedding models side by side during a migration without throwing away the old vectors.
The HNSW parameters above are sensible defaults for a few-million-chunk corpus. For larger workloads, increase ef_construction at index build time and tune hnsw.ef_search per query for the recall/latency curve you want.
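A minimal sketch of that per-query tuning through Doctrine DBAL; $vectorSearchSql and $params stand in for the retrieval query shown later in this post, and 100 is an illustrative starting value, not a recommendation:

$conn = $this->em->getConnection();
$conn->beginTransaction();
try {
    // SET LOCAL is scoped to this transaction, so a widened search breadth
    // cannot leak into other requests sharing a pooled connection.
    $conn->executeStatement('SET LOCAL hnsw.ef_search = 100');
    $rows = $conn->executeQuery($vectorSearchSql, $params)->fetchAllAssociative();
    $conn->commit();
} catch (\Throwable $e) {
    $conn->rollBack();
    throw $e;
}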
Chunking: Where Quality Is Won or Lost
The chunking strategy decides what the retriever can possibly find. It also decides what cannot be found, which is where most RAG quality failures originate.
The default of "512 tokens with 50-token overlap, split on whitespace" is a reasonable baseline and a poor production choice. For technical documentation, splitting on the document's natural structure — headings and paragraphs, with a maximum chunk size as a safety limit — produces chunks the retriever can actually rank cleanly. For support tickets, the right unit is usually one ticket per chunk up to a size cap, with longer threads split on speaker turns. For PDFs of contracts, splitting on clause boundaries beats any token-window approach.
A useful Symfony abstraction:
interface Chunker
{
    /** @return Chunk[] */
    public function chunk(SourceDocument $doc): array;
}

final class StructuralMarkdownChunker implements Chunker
{
    public function __construct(
        private readonly int $maxTokens = 800,
        private readonly int $overlapTokens = 80,
    ) {}

    public function chunk(SourceDocument $doc): array
    {
        // Split on headings first, then pack paragraphs into token-capped
        // windows, so chunk boundaries follow the document's own structure.
        $sections = $this->splitOnHeadings($doc->getContent());
        $chunks = [];
        foreach ($sections as $section) {
            foreach ($this->packParagraphs($section, $this->maxTokens, $this->overlapTokens) as $text) {
                $chunks[] = new Chunk(
                    index: count($chunks),
                    content: $this->prefixWithBreadcrumb($section, $text),
                    metadata: ['heading' => $section->heading],
                );
            }
        }

        return $chunks;
    }
}
Two production tricks are hiding in that chunker. Adding the section heading (or document title) as a breadcrumb at the top of every chunk gives the embedding model context that pure body text lacks, which materially improves retrieval on short queries. And keeping chunks under your re-ranker's context window — typically 512 tokens for a cross-encoder — means you can re-rank without truncation.
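For concreteness, the prefix helper might look like the sketch below; Section is the same assumed value object used in the chunker, and $this->documentTitle is assumed to be captured when chunk() starts:

private function prefixWithBreadcrumb(Section $section, string $text): string
{
    // "Document title > Section heading" at the top of the chunk gives the
    // embedding model the chunk's place in the document, not just its body.
    $breadcrumb = trim($this->documentTitle . ' > ' . $section->heading, ' >');

    return $breadcrumb === '' ? $text : $breadcrumb . "\n\n" . $text;
}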
Hybrid Search: Why BM25 + Vectors Beats Either Alone
A pure vector search excels at semantic similarity ("how do I cancel" finds "subscription termination") and fails on exact identifiers ("error E_AUTH_412" returns nothing useful because the embedding does not encode the token). A pure keyword search is the inverse. Hybrid search, fusing BM25-style lexical ranking with vector similarity, consistently outperforms either method alone in production benchmarks, and pgvector plus Postgres full-text gives you both in one query. (Strictly speaking, Postgres's ts_rank_cd is not BM25, but it fills the same lexical role; the bm25_score naming below is shorthand.)
// src/Repository/ChunkRepository.php
public function hybridSearch(
    string $query,
    array $queryEmbedding,
    string $tenantId,
    int $limit = 50,
): array {
    $sql = <<<SQL
        WITH vector_hits AS (
            SELECT id, 1 - (embedding <=> CAST(:embedding AS vector)) AS vector_score
            FROM rag_chunk
            WHERE tenant_id = :tenant
            ORDER BY embedding <=> CAST(:embedding AS vector)
            LIMIT :limit
        ),
        keyword_hits AS (
            -- ts_rank_cd: Postgres lexical ranking, standing in for BM25
            SELECT id, ts_rank_cd(content_tsv, plainto_tsquery('simple', :query)) AS bm25_score
            FROM rag_chunk
            WHERE tenant_id = :tenant
              AND content_tsv @@ plainto_tsquery('simple', :query)
            ORDER BY bm25_score DESC
            LIMIT :limit
        ),
        unioned AS (
            SELECT id FROM vector_hits
            UNION
            SELECT id FROM keyword_hits
        )
        SELECT
            c.id,
            c.content,
            c.metadata,
            COALESCE(v.vector_score, 0) AS vector_score,
            COALESCE(k.bm25_score, 0) AS bm25_score
        FROM unioned u
        JOIN rag_chunk c ON c.id = u.id
        LEFT JOIN vector_hits v ON v.id = u.id
        LEFT JOIN keyword_hits k ON k.id = u.id
        SQL;

    return $this->em->getConnection()->executeQuery(
        $sql,
        [
            'query' => $query,
            'embedding' => '[' . implode(',', $queryEmbedding) . ']',
            'tenant' => $tenantId,
            'limit' => $limit,
        ],
        ['limit' => ParameterType::INTEGER], // Doctrine\DBAL\ParameterType; LIMIT must bind as int
    )->fetchAllAssociative();
}
The fusion step happens in PHP, not SQL, because that is where it belongs. Reciprocal Rank Fusion is the boringly correct choice — robust to scale differences between the two scores and easy to tune:
public function fuse(array $results, float $k = 60.0): array
{
    // sortBy() must sort descending by the given score while preserving
    // array keys (uasort), so array_keys() yields result indices in rank order.
    $vectorRank = array_flip(array_keys($this->sortBy($results, 'vector_score')));
    $bm25Rank = array_flip(array_keys($this->sortBy($results, 'bm25_score')));

    foreach ($results as $i => &$r) {
        $r['rrf_score'] =
            1 / ($k + ($vectorRank[$i] ?? PHP_INT_MAX)) +
            1 / ($k + ($bm25Rank[$i] ?? PHP_INT_MAX));
    }
    unset($r); // break the foreach reference before sorting

    usort($results, fn ($a, $b) => $b['rrf_score'] <=> $a['rrf_score']);

    return $results;
}
After fusion, a cross-encoder re-ranker on the top 20–40 candidates — Cohere Rerank, BGE-Reranker, or a self-hosted equivalent — tightens the final ordering. This three-stage pipeline (lexical + vector → fusion → re-rank) is what closes the gap between "demo quality" and "people actually use it."
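As an illustration, the re-rank step can be a thin HTTP client wrapper over the fused candidates. The sketch below follows the shape of Cohere's rerank endpoint, but treat the URL, model name, and response fields as assumptions to verify against the provider's current documentation:

public function rerank(string $query, array $candidates, int $topN = 8): array
{
    // Endpoint, model id, and response shape are assumed from Cohere's
    // public rerank API; verify before relying on them.
    $response = $this->httpClient->request('POST', 'https://api.cohere.com/v2/rerank', [
        'auth_bearer' => $this->apiKey,
        'json' => [
            'model' => 'rerank-v3.5',
            'query' => $query,
            'documents' => array_column($candidates, 'content'),
            'top_n' => $topN,
        ],
    ]);

    $ordered = [];
    foreach ($response->toArray()['results'] as $hit) {
        // Each result references the input list by index and carries a score.
        $candidate = $candidates[$hit['index']];
        $candidate['rerank_score'] = $hit['relevance_score'];
        $ordered[] = $candidate;
    }

    return $ordered;
}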
Evaluation: The Step That Makes Everything Else Real
Without evaluation, every RAG change is a vibe. With evaluation, you can compare a chunking strategy, an embedding model, or a re-ranker against the previous one and decide on numbers.
The minimum viable RAG evaluation harness has two parts. First, a curated set of 50–200 representative queries with the chunk IDs (or document IDs) that should ideally be retrieved for each — built once with a domain expert, then maintained. Second, a CI job that runs the retrieval pipeline against every query in the set and reports recall@k, MRR, and an answer-quality score from an LLM judge prompted with a strict rubric.
final class RetrievalEvaluator
{
    public function evaluate(EvalSet $set, Pipeline $pipeline): EvalReport
    {
        $hitsAt5 = $hitsAt10 = 0;
        $reciprocalRankSum = 0.0;

        foreach ($set->cases() as $case) {
            // Simplified to one gold chunk per case; extend to a set of
            // gold IDs if queries have several acceptable sources.
            $retrieved = array_column($pipeline->retrieve($case->query, k: 10), 'id');
            $rank = array_search($case->goldChunkId, $retrieved, true);
            if ($rank !== false) {
                $reciprocalRankSum += 1 / ($rank + 1);
                if ($rank < 5) {
                    $hitsAt5++;
                }
                if ($rank < 10) {
                    $hitsAt10++;
                }
            }
        }

        $n = count($set->cases());

        return new EvalReport(
            recallAt5: $hitsAt5 / $n,
            recallAt10: $hitsAt10 / $n,
            mrr: $reciprocalRankSum / $n,
        );
    }
}
Wire that into your CI. A change that drops recall@5 by more than two points fails the build until somebody has a defensible answer. RAG evaluation is what stops the slow, invisible quality decay that kills these systems six months in.
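A PHPUnit sketch of that gate, assuming an EvalSet::fromJsonFile() loader and a buildPipeline() factory that wires the production retrieval stack; the thresholds are illustrative and should be pinned to your last accepted eval run:

use PHPUnit\Framework\TestCase;

final class RetrievalRegressionTest extends TestCase
{
    private const MIN_RECALL_AT_5 = 0.78; // illustrative; pin to your baseline
    private const MIN_MRR = 0.55;

    public function testRetrievalDoesNotRegress(): void
    {
        $set = EvalSet::fromJsonFile(__DIR__ . '/fixtures/eval_set.json'); // assumed loader
        $report = (new RetrievalEvaluator())->evaluate($set, $this->buildPipeline());

        self::assertGreaterThanOrEqual(
            self::MIN_RECALL_AT_5,
            $report->recallAt5,
            'recall@5 regressed; justify the change or fix the pipeline',
        );
        self::assertGreaterThanOrEqual(self::MIN_MRR, $report->mrr);
    }
}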
The Operational Concerns Tutorials Skip
Three operational topics decide whether the system survives in production.
Index updates. Source documents change constantly. The right pattern is event-driven: a domain event ("article published", "ticket closed") enqueues a ReindexDocument Messenger message, the handler computes a new content hash, and only re-embeds chunks whose content actually changed. A nightly drift-detection job re-checks a sample of documents against their stored hashes to catch missed events. Without this, the index silently rots.
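A sketch of that pattern with Messenger; DocumentRepository, SourceLoader, and Indexer are assumed services standing in for your own ingestion code:

use Symfony\Component\Messenger\Attribute\AsMessageHandler;

// The message carries only the id; the handler re-reads the source of truth.
final class ReindexDocument
{
    public function __construct(public readonly string $documentId) {}
}

#[AsMessageHandler]
final class ReindexDocumentHandler
{
    public function __construct(
        private readonly DocumentRepository $documents, // assumed
        private readonly SourceLoader $sources,         // assumed
        private readonly Indexer $indexer,              // assumed: chunk + embed + persist
    ) {}

    public function __invoke(ReindexDocument $message): void
    {
        $doc = $this->documents->find($message->documentId);
        $source = $this->sources->load($doc->getSourceUri());
        $newHash = hash('sha256', $source->normalisedContent());

        // The content_hash check from the schema: skip the expensive
        // re-embed when the source has not actually changed.
        if ($newHash === $doc->getContentHash()) {
            return;
        }

        $this->indexer->reindex($doc, $source, $newHash);
    }
}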
Privacy and tenant isolation. Every chunk row carries a tenant_id, every retrieval query filters on it, every Doctrine query goes through a tenant filter, and personal data — names, emails, account identifiers — is either redacted before embedding or kept in a separately encrypted lane with stricter retention. For European clients, the embedding provider's data processing terms are part of the architecture review, not an afterthought; for regulated industries this often means a self-hosted embedding model rather than an external API. This is the same boring discipline a serious code quality consulting pass would ask for, applied to a new data flow.
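The tenant filter mentioned above can be a standard Doctrine SQLFilter; a minimal sketch, assuming tenant-scoped entities expose a tenantId field:

use Doctrine\ORM\Mapping\ClassMetadata;
use Doctrine\ORM\Query\Filter\SQLFilter;

final class TenantFilter extends SQLFilter
{
    public function addFilterConstraint(ClassMetadata $targetEntity, $targetTableAlias): string
    {
        if (!$targetEntity->hasField('tenantId')) {
            return ''; // entity is not tenant-scoped; leave the query alone
        }

        // getParameter() returns the value already quoted for safe inlining.
        return sprintf('%s.tenant_id = %s', $targetTableAlias, $this->getParameter('tenant_id'));
    }
}

// Enabled once per request, e.g. in a kernel.request listener:
// $em->getFilters()->enable('tenant')->setParameter('tenant_id', $tenantId);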
Cost tracking. Embedding calls and LLM completion calls are billable per token and trivial to misuse. Wire a metrics middleware into every provider call that records tokens-in, tokens-out, latency, model, and a feature tag. A simple dashboard of cost-per-feature and cost-per-tenant turns an opaque AI bill into a line item the finance team can reason about and the product team can optimise. The teams that skip this find out at the end of the month.
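A sketch of that middleware as a plain decorator; LlmClient, LlmRequest, LlmResponse, and MetricsCollector are assumed interfaces to adapt to whatever SDK wrapper you already have:

final class MeteredLlmClient implements LlmClient
{
    public function __construct(
        private readonly LlmClient $inner,
        private readonly MetricsCollector $metrics, // e.g. a StatsD or Prometheus adapter
    ) {}

    public function complete(LlmRequest $request): LlmResponse
    {
        $start = microtime(true);
        $response = $this->inner->complete($request);

        // One record per provider call: enough to build cost-per-feature
        // and cost-per-tenant dashboards without touching the call sites.
        $this->metrics->record('llm.call', [
            'model' => $request->model,
            'feature' => $request->featureTag, // e.g. 'rag_answer', 'reindex'
            'tenant' => $request->tenantId,
            'tokens_in' => $response->promptTokens,
            'tokens_out' => $response->completionTokens,
            'latency_ms' => (int) round((microtime(true) - $start) * 1000),
        ]);

        return $response;
    }
}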
What to Build First
A pragmatic six-week plan: week one, schema, ingestion job, and structural chunker for one document type. Week two, embeddings and pure vector retrieval, end-to-end. Week three, full-text indexing and the hybrid query. Week four, re-ranking and the evaluation harness with 50 curated queries. Week five, the operational layer — event-driven re-index, tenant filters, cost metrics. Week six, hardening, load testing, and rollout behind a feature flag.
That sequence ships a production RAG system that is measurably good, observable, cost-aware, and built on a stack the existing team can operate. It is also dramatically cheaper to maintain than the LangChain + managed vector DB + Python sidecar version that most teams reach for first.
If you are evaluating RAG architecture choices for a regulated European product, or rescuing a prototype that has crossed the threshold from "cool demo" to "we depend on this and the bill is alarming," this is the kind of work a custom software development engagement is built for. Contact us at hello@wolf-tech.io or visit wolf-tech.io to discuss your RAG system design.