Building Production-Ready RAG: A Symfony + pgvector Architecture Blueprint
A retrieval-augmented generation prototype takes an afternoon. A retrieval-augmented generation system that survives real users, real documents, and real cost discipline takes a quarter. The gap between those two things is where almost every B2B AI feature shipped in 2025 quietly underperformed — wrong chunks retrieved, irrelevant citations, latency spikes when corpora grew past a few thousand documents, and OpenAI bills that could not be explained to finance. None of that is a model problem. It is an architecture problem.
This post is a concrete Symfony + pgvector blueprint for production-ready RAG: the pipeline shape, the database schema, the retrieval logic, the index strategy, and the operational guardrails that separate a demo from a system you can put in front of a paying enterprise customer. The stack is deliberately boring — PostgreSQL with pgvector, Symfony for the application layer, an embeddings model behind an interface, an LLM behind another. Boring scales; clever does not.
What "production-ready" actually means for RAG
Before any code, fix the definition. A production RAG system has four properties that the typical demo does not. It has measurable retrieval quality — a frozen evaluation set with answer-supporting documents, run in CI, with a recall@k number that nobody is allowed to regress without a written justification. It has bounded cost per query — a token budget, a context-size ceiling, and a circuit breaker that refuses to call the LLM when the retrieval step returns nothing useful. It has observability — every query logs the retrieved chunk IDs, the reranker scores, the prompt size, and the model latency, queryable per tenant. And it has a human-readable answer to "why did the model say that?" — citation-tracked outputs the support team can debug without reading vector embeddings.
Anything missing from that list is technical debt that compounds the moment your corpus crosses ten thousand documents or your usage crosses a thousand requests per day. A serious code quality audit of an early-stage RAG feature will almost always find at least three of the four absent.
The reference architecture
The blueprint has four runtime components and one offline pipeline. The ingestion pipeline reads source documents, chunks them, embeds the chunks, and writes them into pgvector. The Symfony application accepts a user query, embeds it, runs hybrid retrieval against pgvector and a full-text index, reranks the results, builds a bounded prompt, calls the LLM, and returns a cited answer. PostgreSQL is the only stateful component — no Pinecone, no Weaviate, no separate search cluster. For corpora up to roughly ten million chunks on commodity hardware, that is a defensible default and a substantial operational simplification.
The schema is small enough to fit on one screen:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
tenant_id UUID NOT NULL,
source_uri TEXT NOT NULL,
title TEXT NOT NULL,
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE chunks (
id BIGSERIAL PRIMARY KEY,
document_id BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
tenant_id UUID NOT NULL,
ord INT NOT NULL,
content TEXT NOT NULL,
content_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
  metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
  embedding vector(1024) NOT NULL,
token_count INT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX chunks_embedding_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
CREATE INDEX chunks_tsv_gin ON chunks USING gin (content_tsv);
CREATE INDEX chunks_tenant ON chunks (tenant_id);
Two things matter in that schema beyond the obvious. The tenant_id column is propagated everywhere because multi-tenancy at the row level is the only sane way to keep one customer's documents from leaking into another's retrieval — application-level filtering is not enough when an indexer query forgets a WHERE clause. And the content_tsv column gives us a free full-text index that lets retrieval combine vector similarity with classic BM25-style keyword matching, which materially improves recall on queries containing product names, error codes, or proper nouns where embeddings alone underperform.
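Postgres row-level security turns that isolation from a coding convention into a database guarantee. A minimal sketch, assuming the application sets an app.tenant_id setting at the start of every transaction:

ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;

CREATE POLICY chunks_tenant_isolation ON chunks
  USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- The application runs this once per transaction, before any retrieval:
-- SET LOCAL app.tenant_id = '<tenant uuid>';

Note that the table owner bypasses RLS unless you also run ALTER TABLE chunks FORCE ROW LEVEL SECURITY, so the application should connect as a non-owner role.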
The ingestion pipeline
Ingestion is where most RAG systems silently fail. Bad chunks at this stage cannot be rescued by any amount of retrieval cleverness downstream. Three decisions dominate.
Chunk size and overlap. Aim for 512–1,024 token chunks with a 10–15% overlap. Smaller chunks improve precision but fragment context; larger chunks waste tokens on irrelevant text. Use a structure-aware splitter — split on headings first, paragraphs second, sentences third, and only fall back to fixed-size windows when the document has no useful structure. Splitting purely by character count is the most common bug in shipped RAG systems.
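That splitting order is simple to express recursively. A minimal sketch; str_word_count stands in for a real tokenizer, and greedy re-merging of small pieces plus overlap handling are omitted for brevity:

/** @return list<string> */
function splitStructured(string $text, int $maxTokens = 800, int $depth = 0): array
{
    // Try heading boundaries first, then paragraphs, then sentences.
    $patterns = ['/\n(?=#{1,6} )/', '/\n\n+/', '/(?<=[.!?])\s+/'];

    if (str_word_count($text) <= $maxTokens) {
        return [$text];
    }
    if ($depth >= count($patterns)) {
        // No structure left: fall back to fixed-size windows.
        return array_map(
            fn (array $words) => implode(' ', $words),
            array_chunk(str_word_count($text, 1), $maxTokens)
        );
    }

    $chunks = [];
    foreach (preg_split($patterns[$depth], $text) ?: [$text] as $part) {
        foreach (splitStructured($part, $maxTokens, $depth + 1) as $chunk) {
            $chunks[] = $chunk;
        }
    }
    return $chunks;
}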
Metadata extraction. Every chunk should carry the parent document title, section heading, source URL, and any tenant-relevant filters (product, region, date) in metadata. These fields drive filtered retrieval later. A chunk without metadata is almost useless in a multi-product corpus.
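With chunk-level metadata in the schema (the metadata column above), filtered retrieval is one extra predicate; the product filter key is purely illustrative:

SELECT id, content
  FROM chunks
 WHERE tenant_id = :tenant
   AND metadata->>'product' = :product
 ORDER BY embedding <=> :v
 LIMIT 20;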
Embedding model choice. Pick one and freeze it. The embedding model is part of your data, not your code — switching it requires re-embedding the entire corpus. For European deployments, the practical 2026 options are OpenAI text-embedding-3-large, Cohere embed-multilingual-v3, or a self-hosted bge-m3 for fully on-prem requirements. Whichever you pick, abstract it behind an interface so the choice is reversible at refactor cost rather than rewrite cost.
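The interface can stay tiny. A sketch matching the two calls used in the code below:

interface EmbeddingClient
{
    /** @return list<float> */
    public function embed(string $text): array;

    /**
     * @param list<string> $texts
     * @return list<list<float>>
     */
    public function embedBatch(array $texts): array;
}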
A Symfony ingestion command looks like this:
use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

#[AsCommand(name: 'rag:ingest')]
final class IngestCommand extends Command
{
public function __construct(
private readonly DocumentLoader $loader,
private readonly Chunker $chunker,
private readonly EmbeddingClient $embeddings,
private readonly ChunkRepository $chunks,
) {
parent::__construct();
}
    protected function configure(): void
    {
        $this->addArgument('source', InputArgument::REQUIRED, 'Source document URI');
        $this->addArgument('tenant', InputArgument::REQUIRED, 'Tenant UUID');
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
{
$sourceUri = $input->getArgument('source');
$tenantId = $input->getArgument('tenant');
$document = $this->loader->load($sourceUri);
$chunks = $this->chunker->split($document, maxTokens: 800, overlap: 100);
        // Batch embed; 96 texts per request stays inside every major provider's limit (it is Cohere's hard cap).
foreach (array_chunk($chunks, 96) as $batch) {
$vectors = $this->embeddings->embedBatch(
array_map(fn ($c) => $c->content, $batch)
);
foreach ($batch as $i => $chunk) {
$this->chunks->save($chunk->withEmbedding($vectors[$i], $tenantId));
}
}
return Command::SUCCESS;
}
}
In production this command runs inside a Messenger queue with retries and idempotency keys; ingestion failures are routine, and a half-ingested document must not become retrievable until the whole document has committed.
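A sketch of that shape; the message class, the claim() idempotency check, and IngestionService are assumptions, not a fixed API:

use Symfony\Component\Messenger\Attribute\AsMessageHandler;

final readonly class IngestDocument
{
    public function __construct(
        public string $sourceUri,
        public string $tenantId,
        public string $idempotencyKey, // e.g. a hash of tenant + source + content version
    ) {}
}

#[AsMessageHandler]
final class IngestDocumentHandler
{
    public function __construct(private readonly IngestionService $ingestion) {}

    public function __invoke(IngestDocument $message): void
    {
        // claim() returns false when this key was already processed, so
        // Messenger redeliveries and manual retries stay safe.
        if (!$this->ingestion->claim($message->idempotencyKey)) {
            return;
        }

        // Chunk, embed, and insert inside one transaction so a failed run
        // never leaves a partially retrievable document behind.
        $this->ingestion->ingest($message->sourceUri, $message->tenantId);
    }
}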
Hybrid retrieval and reranking
Pure vector search loses to hybrid retrieval almost every time on real B2B corpora. The implementation in pgvector is two SQL queries plus a fusion step:
public function retrieve(string $query, string $tenantId, int $k = 20): array
{
    // pgvector expects its text representation, e.g. '[0.12,0.34,...]'.
    $queryVector = '[' . implode(',', $this->embeddings->embed($query)) . ']';
$semantic = $this->db->fetchAllAssociative(
'SELECT id, content, document_id,
1 - (embedding <=> :v) AS score
FROM chunks
WHERE tenant_id = :tenant
ORDER BY embedding <=> :v
LIMIT :k',
['v' => $queryVector, 'tenant' => $tenantId, 'k' => $k * 2]
);
$lexical = $this->db->fetchAllAssociative(
"SELECT id, content, document_id,
ts_rank_cd(content_tsv, plainto_tsquery('english', :q)) AS score
FROM chunks
WHERE tenant_id = :tenant
AND content_tsv @@ plainto_tsquery('english', :q)
ORDER BY score DESC
LIMIT :k",
['q' => $query, 'tenant' => $tenantId, 'k' => $k * 2]
);
// Reciprocal rank fusion — the simplest fusion that consistently works.
$fused = $this->fuse($semantic, $lexical, k: 60);
return $this->reranker->rerank($query, array_slice($fused, 0, $k));
}
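The fuse() call is reciprocal rank fusion: each chunk's fused score is the sum of 1/(k + rank) over every result list it appears in, with k = 60 as the conventional damping constant. A minimal sketch:

/** @return list<array<string, mixed>> rows ordered by fused score, best first */
private function fuse(array $semantic, array $lexical, int $k = 60): array
{
    $scores = [];
    $rows = [];
    foreach ([$semantic, $lexical] as $list) {
        foreach ($list as $rank => $row) {
            $id = $row['id'];
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $rank + 1);
            $rows[$id] = $row;
        }
    }
    arsort($scores); // highest fused score first; chunks found by both queries rise

    return array_map(fn ($id) => $rows[$id], array_keys($scores));
}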
The reranker is a cross-encoder call (Cohere Rerank, Voyage Rerank, or a self-hosted bge-reranker) that scores each candidate chunk against the query individually. Reranking the top 20 candidates down to the top 5 typically lifts retrieval quality by 15–30 points on standard evaluation sets. It adds tens of milliseconds and a fraction of a cent per query, and it is the highest-leverage single change you can make after hybrid search.
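Like the embedding client, the reranker belongs behind a small interface so the provider stays swappable; a sketch matching the call in the retrieval code above:

interface Reranker
{
    /**
     * @param list<array<string, mixed>> $candidates
     * @return list<array<string, mixed>> re-scored by the cross-encoder, best first
     */
    public function rerank(string $query, array $candidates): array;
}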
Index strategy: HNSW versus IVF
pgvector supports two index types, and the choice has real consequences. HNSW (Hierarchical Navigable Small World) gives the best recall-latency tradeoff for corpora up to roughly ten million vectors and is the right default. IVFFlat builds faster and uses less memory, but it gives up meaningful recall, and rebuilding it as the corpus grows is painful. Use HNSW with m = 16 and ef_construction = 200 as a starting point, and raise ef_search per query when the latency budget allows. Always benchmark on your own data; published benchmarks tell you very little about your corpus.
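hnsw.ef_search is a runtime setting (the pgvector default is 40), so it can be raised for a single transaction when a query deserves more recall:

BEGIN;
-- Higher ef_search = better recall, more latency; tune against your own eval set.
SET LOCAL hnsw.ef_search = 100;
-- ...run the semantic retrieval query from above...
COMMIT;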
For corpora above ten million chunks, the conversation shifts. Either you partition the table by tenant or domain (which keeps HNSW manageable per partition), or you move the vector workload off PostgreSQL onto a dedicated store. Both are real options; the lazy choice is to assume you need the dedicated store before measuring whether you do.
Cost, observability, and evaluation
Three operational disciplines turn a working RAG into a defensible product. Cost ceilings belong in code: a per-query maximum context size, a per-tenant daily budget, a fallback that returns a deterministic "no answer with sufficient confidence" response when retrieval scores fall below threshold. Observability belongs at the chunk level: log retrieved chunk IDs, fusion ranks, reranker scores, final prompt token count, and model latency for every query, queryable by tenant and time window. Evaluation belongs in CI: a small frozen set of question-and-supporting-document pairs, run on every embedding-model or chunking-strategy change, with a hard floor on recall@5 that blocks merges.
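The circuit breaker from that list is only a few lines. A sketch, where the threshold value and the Answer, retriever, and generator names are illustrative assumptions:

private const MIN_RERANK_SCORE = 0.35; // calibrate against your evaluation set

public function answer(string $query, string $tenantId): Answer
{
    $chunks = $this->retriever->retrieve($query, $tenantId, k: 5);

    if ($chunks === [] || $chunks[0]['rerank_score'] < self::MIN_RERANK_SCORE) {
        // Refuse to spend LLM tokens on a prompt retrieval cannot support.
        return Answer::insufficientEvidence();
    }

    return $this->generator->generate($query, $chunks);
}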
The teams that get RAG right in 2026 are the ones treating these three things as non-optional from day one rather than retrofitting them after the first procurement question lands. The architecture above is what it looks like when they do — boring database, instrumented pipeline, evaluation in version control, and an LLM behind a thin enough interface that swapping it is a one-day change rather than a quarter.
If you are scoping a RAG feature for a B2B product and want a second set of eyes on the architecture before the first commit, that is exactly the kind of thing our custom software development practice is built for. Contact us at hello@wolf-tech.io or visit wolf-tech.io — eighteen years of European software work, including substantial AI-integration and Symfony architecture engagements, sits behind every recommendation we make.

