Hybrid Search With pgvector and BM25: Better Answers Without Elasticsearch

#hybrid search with pgvector
Sandor Farkas - Founder & Lead Developer at Wolf-Tech

Sandor Farkas

Founder & Lead Developer

Expert in software development and legacy code optimization

If you have shipped a RAG feature on PostgreSQL, you have probably hit this wall: a user searches for an exact invoice number, a product SKU, or a function name, and your semantically brilliant vector search returns five loosely related paragraphs that do not contain the term at all. Hybrid search with pgvector and a BM25-style keyword ranker fixes exactly this failure mode, and you can build it inside the Postgres instance you already run. No Elasticsearch cluster, no sync pipeline, no second source of truth.

This post walks through why single-method retrieval fails, what PostgreSQL actually offers on the keyword side, and how to fuse both result lists with Reciprocal Rank Fusion in plain SQL.

Why pure vector search misses obvious answers

Embedding models compress meaning into a fixed-size vector. That is their strength and their blind spot. Two texts that mean roughly the same thing land close together, which is what makes semantic retrieval feel magical in demos. But the compression discards surface detail, and surface detail is exactly what many real queries are made of:

  • Identifiers: ticket numbers, SKUs, error codes like SQLSTATE[40P01], IBANs. Embeddings treat these as near-noise.
  • Rare proper nouns: a niche library name or a customer-specific term that the embedding model barely saw during training.
  • Negations and exact phrases: "termination without notice" and "termination with notice" embed almost identically, with opposite legal meaning.

Keyword search has the mirrored weakness. It cannot tell that "How do I cancel my plan?" should match a document titled "Ending your subscription". One method matches words, the other matches meaning, and production queries need both. The research backs this up: hybrid retrieval consistently outperforms either method alone on standard benchmarks such as BEIR, which is why every dedicated search vendor now ships it as the default.

If your pipeline already returns plausible-but-wrong answers, retrieval quality is the first place to look. We covered the downstream symptoms in Why your RAG pipeline returns confident garbage; this post is about fixing the retrieval layer itself.

What BM25 is, and what Postgres really gives you

BM25 is the ranking function behind Elasticsearch and most classic search engines. It scores a document by term frequency, dampened so that the tenth occurrence of a word counts less than the second, normalized by document length, and weighted by how rare the term is across the corpus (inverse document frequency).

Honesty checkpoint: vanilla PostgreSQL full-text search is not BM25. ts_rank and ts_rank_cd rank by term frequency and proximity, but they do not use corpus-wide IDF statistics. In practice you have two options:

  1. Native FTS with ts_rank_cd. Ships with every Postgres. For corpora in the tens or hundreds of thousands of documents, the ranking difference against true BM25 is rarely the thing your users notice, especially once you add rank fusion and a reranker on top.
  2. The pg_search extension (ParadeDB). Brings genuine BM25 scoring with its own index type. Worth evaluating if keyword relevance is central to your product, but it is another extension to install, upgrade, and get approved on managed platforms.

Our recommendation for most teams: start with native FTS. It removes the operational argument for Elasticsearch in one step, and you can swap the keyword leg for pg_search later without touching the fusion logic.

Building hybrid search with pgvector: schema and indexes

Assume a typical RAG chunk table. The two additions are a generated tsvector column and the right pair of indexes:

CREATE TABLE chunks (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  document_id BIGINT NOT NULL REFERENCES documents(id),
  tenant_id BIGINT NOT NULL,
  content TEXT NOT NULL,
  embedding VECTOR(1536) NOT NULL,
  content_tsv TSVECTOR GENERATED ALWAYS AS (
    to_tsvector('english', content)
  ) STORED
);

CREATE INDEX chunks_embedding_idx ON chunks
  USING hnsw (embedding vector_cosine_ops);

CREATE INDEX chunks_tsv_idx ON chunks USING gin (content_tsv);

Three details that matter in production:

  • Generated column, not trigger. GENERATED ALWAYS AS ... STORED keeps the tsvector in sync with zero application code and survives bulk imports.
  • HNSW over IVFFlat. Since pgvector 0.5, HNSW gives better recall-latency trade-offs and, unlike IVFFlat, does not need retraining as data grows.
  • Filter columns live in the same table. tenant_id filtering happens in the same query plan. This is the quiet superpower over a separate search engine, where pre-filtering vectors by tenant is either expensive or approximate.

Fusing the two result lists: Reciprocal Rank Fusion in SQL

You now have two ranked lists per query: nearest neighbors by cosine distance and top keyword matches by ts_rank_cd. The scores live on incompatible scales, so do not try to average them. Reciprocal Rank Fusion (RRF) sidesteps the problem by combining ranks instead of scores: each document earns 1 / (k + rank) from every list it appears in, with k = 60 as the standard damping constant.

WITH semantic AS (
  SELECT id, RANK() OVER (ORDER BY embedding <=> $1) AS rnk
  FROM chunks
  WHERE tenant_id = $3
  ORDER BY embedding <=> $1
  LIMIT 20
),
keyword AS (
  SELECT id, RANK() OVER (
    ORDER BY ts_rank_cd(content_tsv, websearch_to_tsquery('english', $2)) DESC
  ) AS rnk
  FROM chunks
  WHERE tenant_id = $3
    AND content_tsv @@ websearch_to_tsquery('english', $2)
  LIMIT 20
)
SELECT COALESCE(s.id, k.id) AS id,
       COALESCE(1.0 / (60 + s.rnk), 0) +
       COALESCE(1.0 / (60 + k.rnk), 0) AS rrf_score
FROM semantic s
FULL OUTER JOIN keyword k USING (id)
ORDER BY rrf_score DESC
LIMIT 8;

That is the entire hybrid engine. One round trip, one transaction, tenant filtering applied to both legs before fusion. websearch_to_tsquery accepts raw user input safely, including quoted phrases and minus-operators, so you do not need a query parser.

Two tuning knobs are worth knowing. The per-leg LIMIT 20 controls how deep each method can reach; raise it to 50 or 100 if your reranker can afford the candidates. And if your users skew heavily toward identifier-style lookups, weight the keyword leg by multiplying its RRF term by 1.2 to 1.5 before summing. Resist the urge to tune further before you have an evaluation set; rank fusion is robust precisely because it ignores absolute scores.

Operational notes from real deployments

Index maintenance. HNSW build time grows noticeably with corpus size. Set maintenance_work_mem generously (several GB if you have it) before CREATE INDEX, and build concurrently on live systems.

Statement timeouts. A hybrid query is two index scans plus a small join. On a few million chunks, expect single-digit to low double-digit milliseconds for the keyword leg and similar for HNSW at ef_search = 40. If you see hundreds of milliseconds, the usual culprit is a filter that the planner applies after the vector scan instead of before; check EXPLAIN ANALYZE for post-filtering.

Connection pressure. Embedding calls are slow (tens to hundreds of ms), so never hold a database connection while you wait for the embedding API. Get the vector first, then run the SQL. Teams on Symfony or Next.js with PgBouncer in transaction mode get this for free if the embedding call lives outside the transaction.

Language handling. to_tsvector('english', ...) stems aggressively. For mixed-language corpora, store a language column and use it as the regconfig argument, or fall back to the simple configuration and accept weaker stemming rather than wrong stemming.

We see these issues most often when auditing AI features that grew out of a prototype. If that sounds familiar, our code quality consulting work regularly covers exactly this retrieval layer, and for teams building the feature from scratch we handle the full stack as part of custom software development.

When Elasticsearch or a dedicated engine still wins

Hybrid search in Postgres is the right default for most SaaS products, but not a universal answer. Reach for a dedicated engine when:

  • Search is the product, with faceting, typo-tolerant autocomplete, and per-field boosting as core UX, not a supporting feature.
  • Corpus size passes the tens of millions of chunks and you need sharded horizontal scale-out that a single Postgres primary cannot offer.
  • You need cross-cluster or multi-region search topologies that Postgres replication was not designed for.

Below those thresholds, the operational math favors Postgres heavily: one system to back up, one access-control model, transactional consistency between your application data and your search index, and one less pipeline that can silently drift. We ran the broader comparison in pgvector vs dedicated vector databases and the conclusion holds for the keyword side too.

FAQ

Does hybrid search replace a reranker? No, they stack. RRF produces a better candidate list; a cross-encoder reranker then orders the top 20-50 candidates with much higher precision. Hybrid retrieval plus reranking is the strongest practical pipeline before you touch fine-tuning.

Can I do this with an ORM like Doctrine or Prisma? Run the fusion query as native SQL. Both Doctrine DBAL and Prisma raw queries handle it cleanly; do not try to express RRF through the ORM query builder.

How do I know it actually improved anything? Build a small evaluation set first: 50-100 real queries with known-good chunks, measured with recall@k. Without it you are tuning blind.

Ship it, then measure it

Hybrid search with pgvector and Postgres full-text ranking is a one-day change for most codebases: one generated column, two indexes, one SQL query. The payoff is retrieval that handles both "cancel my subscription" and "invoice INV-2024-0871" without a second search system on the payroll.

If you want a second pair of eyes on your retrieval architecture, or someone to build it so it survives contact with production traffic, write to hello@wolf-tech.io or visit wolf-tech.io. We are happy to look at what you have before you commit to new infrastructure.