pgvector in Production: A 2026 Reality Check
In 2026, pgvector is finally stable enough for most production AI workloads under ~10M vectors. The 0.7.0 release brought 30× QPS gains, parallel HNSW builds, and binary quantization. But running pgvector in production still breaks in predictable ways: index memory walls, silent seq scans, and pre-filter traps. This guide covers what changed, what didn’t, and what to actually do.
Every pgvector tutorial ends just before things get interesting. You spin it up locally, insert ten thousand embeddings, run a similarity query, and it just works. Then the dataset crosses a few million rows, traffic ramps up, and the easy part ends quickly: index builds eat tens of gigabytes of RAM, queries silently fall back to sequential scans, and a LIMIT 10 with a WHERE clause returns three results instead of ten.
The most-shared critique of pgvector last year argued, fairly, that tutorials skipped most of this. What that critique can’t fully account for is how much has changed since: parallel HNSW builds, halfvec quantization, SIMD distance operations, and pgvectorscale’s StreamingDiskANN backend. The failure modes still exist, but the pgvector production playbook is now real.
This guide looks at pgvector through the same operational lens as our database scaling breakdown: not how to demo it, but what actually starts breaking once it’s under production load.
Table of Contents
Is pgvector production-ready in 2026?
Yes, pgvector is production-ready for vector workloads under ~10M vectors. Since 0.5.0 added HNSW indexing and 0.7.0 introduced parallel builds and quantization, query throughput improved up to 30× over earlier versions. For larger workloads, pgvectorscale extends pgvector to 50M+ vectors with disk-based indexing.
The honest answer has three tiers:
Under 10M vectors:
Vanilla pgvector with HNSW handles most production workloads cleanly: semantic search, RAG retrieval, recommendations. Query latency sits in the 5–20ms range with reasonable RAM provisioning.
10M to 50M vectors:
Still pgvector territory, but tuning becomes a real job. You’ll want halfvec quantization, careful memory budgeting, and likely pgvectorscale’s StreamingDiskANN backend to keep latency predictable.
Above 50M vectors:
pgvector is a genuine engineering investment. Most teams at this scale should at least price out a dedicated vector database, though as the cost section below shows, the math often still favors Postgres.
The shift since 2024 is real. HNSW indexing landed in 0.5.0 and changed everything. Before that, IVFFlat’s cluster-based approach made large-scale recall painful. Version 0.6.0 added parallel index builds. Version 0.7.0 brought halfvec (16-bit floats), binary quantization, and SIMD-accelerated distance operations. Together, these cut HNSW build times by up to 150× at equivalent recall, measured in AWS’s pgvector benchmark on Aurora PostgreSQL.
Independent academic work backs the broader pattern. Purdue’s 2024 ICDE paper benchmarked HNSW, IVFFlat, and IVFPQ on 10M-vector datasets (SIFT10M, Deep10M) and found pgvector’s index choices remain competitive with specialized vector databases at scale.
What hasn’t changed: pgvector still scales vertically, the HNSW index still needs to fit in RAM (or disk via pgvectorscale), and large index builds are memory-intensive. The “you probably don’t need a vector database” argument has become defensible but only if you understand the operational ceiling. That’s the same trade-off behind the managed-vs-self-hosted decision.
What breaks first in pgvector production
The six most common production failures include HNSW index builds running out of RAM, sequential scans replacing index scans silently, implicit type mismatches disabling indexes, pre-filter/post-filter mismatches returning wrong results, concurrent writes contending with index maintenance, and cold-cache p99 spikes after deploys.

Almost every pgvector incident we’ve seen in production traces back to one of six failure modes. They’re predictable enough that you can build them into a pre-launch checklist.
1. Memory wall during HNSW builds
Postgres ships with maintenance_work_mem = 64MB: fine for B-tree indexes, catastrophic for HNSW. Building an HNSW index over a few million 1536-dimensional vectors at the default can run 10–50× slower than necessary, and on larger tables it can simply fail.
Fix: bump maintenance_work_mem to 8–16 GB before the build, then revert.
2. Silent seq scans
This is the single most common pgvector regression. The query runs, results come back, latency is awful, and nothing in your logs says “index missed.” It happens when the query operator doesn’t match the index’s ops class, when there’s an implicit type cast on the vector column, or when no index exists at all.
Fix: run EXPLAIN ANALYZE on every similarity query in CI.
EXPLAIN ANALYZE
SELECT id, content
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
Before the fix, here’s the plan from Google’s published pgvector benchmark on a 30K-row test dataset:
Limit (cost=...)
-> Sort
Sort Key: ((embedding <=> $1))
-> Seq Scan on documents
Execution Time: 226.738 ms
After creating the right index with a matching ops class:
CREATE INDEX documents_embedding_hnsw_idx
ON documents USING hnsw (embedding vector_cosine_ops);
Limit (cost=...)
-> Index Scan using documents_embedding_hnsw_idx on documents
Order By: (embedding <=> $1)
Execution Time: 0.982 ms
Same query, same data: a 230× speedup at 30K rows. The gap widens at scale. At million-row tables, seq scans cross into the multi-second range, while HNSW typically holds at 5–20ms p50 in published 1M-vector benchmarks.
The fix is matching vector_cosine_ops (in the index) to the <=> operator (in the query). If you index with vector_l2_ops and query with <=>, Postgres won’t use the index. It’ll scan every row, silently.
3. Implicit type mismatches
A BIGINT column compared to a NUMERIC parameter quietly disables the index. No error. No warning. Just slow queries.
Fix: assert types explicitly in your ORM or query builder.
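A minimal sketch of the explicit-cast fix (the user_id filter and its type are illustrative assumptions, not columns from the earlier examples):
-- Illustrative only: cast parameters to the column types you expect, so the
-- planner can use the B-tree on user_id and the HNSW index on embedding
SELECT id, content
FROM documents
WHERE user_id = $2::bigint
ORDER BY embedding <=> $1::vector
LIMIT 10;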
4. The pre-filter / post-filter trap
A query like WHERE status = 'published' ORDER BY embedding <=> $1 LIMIT 10 looks reasonable. But if Postgres applies the vector search first and the filter second, you might get 3 published results out of 10 and miss hundreds of better matches further out.
Fix: over-fetch with a CTE, then filter, or use partial indexes for high-selectivity columns.
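If the filter column is stable, a partial HNSW index keeps the filter inside the index itself; a minimal sketch, assuming the status column from the example above:
-- Only published rows enter the graph, so the filter applies before the
-- ANN search instead of after it; queries must repeat the same WHERE clause
CREATE INDEX documents_published_embedding_idx
ON documents USING hnsw (embedding vector_cosine_ops)
WHERE status = 'published';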
5. Write contention
HNSW updates the graph on every insert. Under sustained write load, that lock contention shows up as p95 latency spikes on reads.
Fix: batch inserts, or move write-heavy paths to a replica with eventual indexing.
6. Cold cache after deploys
The first request after a restart pays the cost of warming the index pages from disk. p99 latency can spike 10× temporarily.
Fix: run pg_prewarm in your deploy script for the index relation.
Every one of these has a known fix. Production pgvector is mostly about knowing where to look first.
pgvector Production Tuning: The parameters that actually matter
Five parameters drive 90% of pgvector production performance: maintenance_work_mem (8–16 GB for HNSW builds), ef_construction (256–512 for production-grade indexes), ef_search (10–200, tunable per query), m (8–32, default 16), and shared_buffers (25–40% of total RAM).
Plenty of tuning guides list every HNSW knob in the pgvector docs. Five parameters do almost all the work: three Postgres-level and two pgvector-specific.
| Parameter | Default | Production Target | What It Does |
|---|---|---|---|
| maintenance_work_mem | 64MB | 8–16 GB | Speeds HNSW index builds 10–50× |
| ef_construction | 64 | 256–512 | Higher values improve recall during build, but slow the build |
| ef_search | 40 | 10–200 (dynamic) | Recall vs latency trade-off at query time |
| m | 16 | 8–32 | Graph connections per node, memory vs recall trade-off |
| shared_buffers | 128MB | 25–40% of RAM | Keeps hot index pages in memory |
The two parameters that actually change application behavior are ef_search and m. Everything else has a near-universal answer.
ef_search is your real-time dial.
It’s tunable per query, which means different parts of your application can run with different recall budgets. A common pattern from teams running pgvector at scale:
- ef_search = 10 for product search where speed beats precision
- ef_search = 40 for recommendations where users tolerate ~50ms latency
- ef_search = 100–200 for content similarity or RAG, where missing the right chunk costs more than the latency
You can set this per session:
SET hnsw.ef_search = 100;
Run the query, then revert.
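To scope it to a single query instead of a whole session, SET LOCAL inside a transaction reverts automatically at commit; a small sketch using the documents table from earlier:
BEGIN;
SET LOCAL hnsw.ef_search = 100;   -- only visible inside this transaction
SELECT id, content
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
COMMIT;                           -- ef_search falls back to its previous value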
m controls how dense the HNSW graph is.
Higher m means better recall but more RAM per vector. The default of 16 works for most workloads up to ~5M vectors at 1536 dimensions. Above that, drop to 12 if RAM is tight, or push to 24–32 if accuracy is critical and you have headroom.
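Both m and ef_construction are fixed at build time, so they go in the CREATE INDEX statement; a sketch of an accuracy-leaning build (the specific values are illustrative, not a universal recommendation):
-- Denser graph for higher recall: more RAM per vector and a slower build
CREATE INDEX documents_embedding_hnsw_idx
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 256);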
Recent research even argues m does most of the work. A 2024 paper, “Down with the Hierarchy”, found that HNSW’s hierarchical layers matter less for performance than the algorithm’s name implies.
Three Postgres-level rules worth keeping straight:
- maintenance_work_mem is per-build. Bump it during index creation, then revert. Don’t leave it at 16 GB system-wide.
- Set shared_buffers to 25–40% of RAM. Below that, the HNSW index gets evicted under read load. Above that, the OS page cache loses ground.
- work_mem matters for hybrid queries that combine vector search with sorts or joins. 64–128 MB is usually enough.
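A minimal sketch of how these land in practice, assuming a hypothetical 64 GB instance (the exact values are illustrative, not a recommendation):
-- shared_buffers only takes effect after a restart; work_mem after a reload
ALTER SYSTEM SET shared_buffers = '16GB';   -- ~25% of a 64 GB box
ALTER SYSTEM SET work_mem = '64MB';         -- per-sort budget for hybrid queries
SELECT pg_reload_conf();
-- maintenance_work_mem stays modest globally; bump it per session only while
-- building HNSW indexes (see the checklist at the end of this guide)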
Tune for the workload you actually run, not the one in someone else’s benchmark.
pgvector Production Costs: 1M to 50M+ Vectors
Monthly cost for 1M vectors at 1024 dimensions runs from $30 (Neon serverless) to $260 (RDS) on managed Postgres, versus roughly $50–80 on Pinecone Serverless. At 50M vectors, the gap widens dramatically: self-hosted Postgres with pgvectorscale runs ~$835/month vs Pinecone’s $3,241–3,889/month, a 75–79% savings.
The most defensible argument for pgvector in 2026 isn’t latency. It’s cost.
Here’s what running production vector workloads actually costs across the most common deployment paths.
| Scale | pgvector (managed) | pgvector + pgvectorscale (self-hosted) | Pinecone | Qdrant Cloud |
|---|---|---|---|---|
| 1M @ 1024d | $30 (Neon) to $260 (RDS) | — | $50–80 (Serverless) | $65–102 |
| 10M @ 1024d | ~$300–450 | ~$400–600 (r6g.4xlarge) | ~$70–150 (Serverless) | — |
| 50M @ 768d | — | ~$835 | $3,241 (s1), $3,889 (p2) | — |
Numbers come from public benchmarks, including Timescale’s pgvectorscale post, Supabase, and the Vecstore migration writeup, plus current vendor pricing pages.
The 10M tier is calculated from AWS, Neon, and Pinecone pricing. Actual production cost varies with query volume, replication factor, and instance reservation strategy.
The Pinecone vs pgvector breakeven
Pinecone scales nearly linearly with vector count and query volume. More storage and traffic means more cost.
Self-hosted pgvector amortizes a fixed instance cost across whatever else you run on it. The crossover happens around 5–10M vectors:
- Below ~5M vectors: Pinecone Serverless is usually cheaper than running a production-grade Postgres instance just for vectors.
- Above ~10M vectors: pgvector + pgvectorscale starts winning, and the gap widens linearly with scale.
- Between 5M and 10M: It depends on your QPS, replication factor, and whether the Postgres instance is also serving non-vector workloads, which it usually is.
That last point matters most. Most teams running pgvector aren’t running a separate database just for vectors. They’re adding vector search to an existing application database.
The marginal cost of adding pgvector to an instance you already pay for is close to zero until you outgrow it, which moves the breakeven down sharply.
What changes the cost curve
Three factors explain why pricing diverges across vendors and scales:
- Embedding dimensions: 1536-dim vs 768-dim isn’t just storage. It doubles RAM, halves QPS, and pushes you into a larger instance class. Often the single biggest cost lever. Published 100M-scale production analysis shows HNSW indexes scaling non-linearly past that mark, with index size ballooning from ~30 GB at 10M/768d to unsustainable footprints at 120M/1536d.
- Recall target: Higher ef_search means more graph traversal per query. A workload tuned for 99% recall costs 2–3× more than one tuned for 90%.
- Replication and HA: Every read replica adds compute, but rarely linearly. Managed services charge for it differently than self-hosted clusters do.
The 50M crossover
Self-hosted Postgres + pgvectorscale runs at ~$835/month for 50M vectors.
Pinecone’s storage-optimized s1 tier runs $3,241/month for the same workload, roughly 4× the cost (the ~75% savings quoted above). Performance-optimized p2 widens the gap further.
The same pattern shows up everywhere Postgres scales: RDS economics break early once you cross a few hundred GB.
Two cost levers most teams underuse
- halfvec quantization: Moving from 32-bit vector to 16-bit halfvec halves index size and RAM with ~99% accuracy retained. One published case study found this saved $40K/year at the 50M-vector tier. A minimal index sketch follows this list.
- Dimensionality reduction: Running PCA from 1536 → 768 dimensions retains ~97% accuracy at half the storage. Useful when your embedding model gives you more dimensions than you need.
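One way to adopt halfvec without rewriting the column is an expression index that casts at index time; a minimal sketch, assuming 1536-dimensional embeddings:
-- 16-bit index over the existing 32-bit column; queries must cast the same way
CREATE INDEX documents_embedding_halfvec_idx
ON documents USING hnsw ((embedding::halfvec(1536)) halfvec_cosine_ops);

SELECT id, content
FROM documents
ORDER BY embedding::halfvec(1536) <=> $1::halfvec(1536)
LIMIT 10;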
For deeper breakdowns:
- Our RDS vs self-hosted Postgres cost comparison covers the underlying economics.
- For pgvector specifically, SelfHost vs Neon breaks down the serverless trade-offs.
Hybrid retrieval: pgvector’s underused production superpower
Combining vector similarity search with SQL pre-filters and full-text search inside a single query is pgvector’s biggest advantage over dedicated vector DBs. A two-stage pattern, ANN top-N candidates followed by exact re-ranking with metadata filters, typically beats vector-only search by 10× on filtered workloads.
The single biggest advantage pgvector has in production over dedicated vector databases isn’t latency or cost. It’s that filters and full-text search live in the same query as the vector search.
Most dedicated vector databases primarily rely on post-filtering: search first, filter second. Postgres can do it the other way around.
That distinction matters more than it sounds. A vector-only search over 10M embeddings, post-filtered to tenant_id = 42, can miss the right matches entirely if they fall outside the top-K returned by the vector pass.
Pre-filtering to the relevant tenant first, then searching the smaller candidate set, is both faster and more accurate.
The two-stage pattern
Over-fetch ANN candidates with a metadata filter, then re-rank with exact distance and any business logic such as popularity, recency, or permissions:
-- Stage 1: over-fetch ANN candidates, pre-filtered by tenant and status
WITH candidates AS (
    SELECT id, content, content_tsv, embedding <=> $1 AS distance
    FROM documents
    WHERE tenant_id = $2
      AND status = 'published'
    ORDER BY embedding <=> $1
    LIMIT 100
)
-- Stage 2: re-rank with a weighted blend of vector similarity and FTS rank
SELECT id, content
FROM candidates
WHERE content_tsv @@ plainto_tsquery($3)
ORDER BY
    (1 - distance) * 0.7
    + ts_rank(content_tsv, plainto_tsquery($3)) * 0.3 DESC
LIMIT 10;
This is hybrid search: vector similarity weighted with Postgres full-text search.
For workloads where exact term matching matters, including product SKUs, named entities, or code identifiers, the FTS contribution catches what embeddings miss.
Reciprocal Rank Fusion is a more principled scoring strategy if simple weighting doesn’t fit your data.
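A minimal RRF sketch under the same schema assumptions as above (a content_tsv column; 60 is the conventional smoothing constant, not a tuned value):
WITH vector_hits AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rnk
    FROM documents
    ORDER BY embedding <=> $1
    LIMIT 50
),
text_hits AS (
    SELECT id, ROW_NUMBER() OVER (
               ORDER BY ts_rank(content_tsv, plainto_tsquery($2)) DESC
           ) AS rnk
    FROM documents
    WHERE content_tsv @@ plainto_tsquery($2)
    ORDER BY ts_rank(content_tsv, plainto_tsquery($2)) DESC
    LIMIT 50
)
-- Each list contributes 1 / (60 + rank); a document missing from a list adds 0
SELECT id,
       COALESCE(1.0 / (60 + v.rnk), 0) + COALESCE(1.0 / (60 + t.rnk), 0) AS rrf_score
FROM vector_hits v
FULL OUTER JOIN text_hits t USING (id)
ORDER BY rrf_score DESC
LIMIT 10;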
Multi-tenancy comes free
Postgres Row-Level Security policies apply to vector queries the same way they apply to anything else.
A user can only see their own embeddings without writing a line of authorization logic in your application code.
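A minimal RLS sketch, assuming a tenant_id column and an app.current_tenant setting that your connection layer sets per request (both are illustrative):
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- Applies to every query against the table, including similarity searches
CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = current_setting('app.current_tenant')::bigint);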
Replicating that pattern in a dedicated vector database means application-level filtering, which is post-filtering again, with all the same problems.
The “build it yourself” critique of pgvector hybrid search isn’t wrong. It’s just that “build it yourself” in Postgres means writing a SQL query.
When pgvector production outgrows itself: pgvectorscale and beyond
Most teams should consider migration paths above 10M vectors. The graceful staircase: vanilla pgvector (<5M) → halfvec quantization (5–10M) → pgvectorscale with StreamingDiskANN (10–50M+) → dedicated vector DB or sharding (50M+). pgvectorscale delivers 28× lower p95 latency and 16× higher QPS than Pinecone s1 at 50M scale.
The “stay on pgvector vs leave for a dedicated vector DB” question is usually framed as binary. It isn’t.
There’s a graceful staircase between vanilla pgvector and Pinecone, and most teams climb it instead of jumping.
The staircase
- Vanilla pgvector with HNSW: Handles up to ~5M vectors at 1536-dim cleanly.
- Add halfvec quantization: Covers the 5–10M range. 16-bit floats halve RAM while retaining ~99% accuracy.
- Switch to pgvectorscale: Covers the 10–50M range. StreamingDiskANN keeps performance high without requiring everything in RAM.
- Migrate or shard: Past 50M vectors, billion-scale traffic, or multi-region active-active setups, dedicated vector DBs or Postgres sharding with Citus become genuinely necessary.
Most teams that “outgrow pgvector” outgrow step 1 and never move up.
What pgvectorscale actually is
pgvectorscale is a separate Postgres extension from Timescale that builds on pgvector.
It adds:
- StreamingDiskANN
- Statistical binary quantization
- Smarter query planner integration
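Because it layers on top of pgvector rather than replacing it, switching an index over is a small change; a minimal sketch, assuming your Postgres host can install the vectorscale extension:
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;  -- installs pgvector too if missing

-- Swap the in-RAM HNSW index for a StreamingDiskANN index
DROP INDEX IF EXISTS documents_embedding_hnsw_idx;
CREATE INDEX documents_embedding_diskann_idx
ON documents USING diskann (embedding vector_cosine_ops);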
At 50M vectors, Timescale’s published benchmarks show:
- 28× lower p95 latency
- 16× higher QPS than Pinecone’s storage-optimized s1 tier
One honest caveat
pgvectorscale isn’t available on AWS RDS.
If you’re locked into RDS, your staircase ends at step 2 unless you switch providers. See SelfHost vs Aiven for which managed Postgres options actually ship pgvectorscale.
That’s the trade-off behind BYOC managed pgvector: keep the data plane, gain the extension.
When migration is genuinely the right call
Three signals matter most:
- Sustained write contention that pgvectorscale’s batched index updates can’t absorb
- Multi-region active-active setups where Postgres logical replication lag breaks your SLA
- Billion-scale workloads where horizontal sharding becomes cheaper than vertical scaling
If none of those apply, you probably haven’t outgrown Postgres yet.
The pgvector production checklist (before you ship)
A minimum pgvector production checklist: build indexes with CREATE INDEX CONCURRENTLY, set maintenance_work_mem to 8–16 GB during builds, enable pg_stat_statements, warm caches with pg_prewarm after deploys, version embedding models alongside vectors, and verify index use with EXPLAIN ANALYZE on every similarity query.
If you’re shipping pgvector to production, run this list before launch.
Each item maps to a specific failure mode. Cheap to apply, expensive to skip.
1. Build indexes with CREATE INDEX CONCURRENTLY
CREATE INDEX CONCURRENTLY documents_embedding_idx
ON documents USING hnsw (embedding vector_cosine_ops);
The non-concurrent form locks writes for the entire build. On a busy table, that’s an outage.
2. Bump maintenance_work_mem for the build, then revert
SET maintenance_work_mem = '16GB';
-- CREATE INDEX...
RESET maintenance_work_mem;
The default 64MB setting causes builds to run 10–50× slower.
3. Enable pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
Without it, you can’t tell which similarity query is slow.
4. Warm the index after every deploy
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('documents_embedding_idx');
This skips the cold-cache p99 spike on the first requests after deploy.
5. Version the embedding model alongside the vector
ALTER TABLE documents
ADD COLUMN embedding_model TEXT
NOT NULL DEFAULT 'text-embedding-3-small';
When you swap models in six months, this saves a multi-day re-embed audit.
6. Run EXPLAIN ANALYZE on every similarity query in CI
Confirm you see:
Index Scan using documents_embedding_idx
Not:
Seq Scan
Fix the operator and ops-class mismatch before merge.
7. Test backup and restore on the vector table
HNSW indexes are included in physical base backups, but a logical pg_dump restore rebuilds them from scratch, which can take hours at scale. Verify your restore path before you need it in production.
8. Document ef_search per use case in the migration
What’s optimal for product search isn’t optimal for RAG.
Store the chosen value as a comment in the DDL so the next engineer doesn’t have to guess.
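One low-friction place to record it is a comment on the index itself; a minimal sketch using the example budgets from the tuning section:
COMMENT ON INDEX documents_embedding_idx IS
    'hnsw.ef_search budgets: 10 product search, 40 recommendations, 100–200 RAG';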
Final Thoughts
pgvector in production in 2026 is genuinely viable for most teams.
The criticisms from 2024 mostly aged out. HNSW shipped. Parallel builds shipped. Quantization shipped. pgvectorscale shipped.
What hasn’t aged out is the operational reality: this is still a database extension, and databases require operational discipline.
The difference is that the discipline is now well-understood, and the failure modes have known fixes.
The case for keeping pgvector is straightforward.
Your vectors live next to your application data. Your filters and full-text search live in the same query as your similarity search. Your authorization story is whatever Postgres already gives you. Your cost curve bends sub-linearly past 10M vectors.
Dedicated vector databases solve a different optimization problem.
If you want pgvector’s flexibility without writing the runbook, BYOC pgvector keeps the data plane in your cloud and the operational expertise on someone else’s pager.
Frequently Asked Questions
Is pgvector better than Qdrant or Pinecone for production?
It depends on the constraint.
Qdrant is faster on raw filtered search at large scale. Pinecone gives you zero ops and managed scaling.
pgvector wins when your vectors live alongside structured data, when you need transactional consistency, when SQL filters dominate your query patterns, and in the under-10M vector range, where the latency gap is invisible against LLM generation time.
Does PostgreSQL support horizontal scaling for pgvector?
Not natively.
Postgres scales vertically: bigger instances, more RAM, more CPU.
Sharding HNSW indexes across nodes requires application-level work. The two practical paths are:
- Citus for hash-distributed Postgres
- Partitioning by tenant_id with per-partition indexes
Most teams hit the vertical ceiling well before they need either.
What’s the maximum vector dimension pgvector can handle?
Around 2,000 dimensions practically.
The hard limit comes from Postgres’s 8KB page size. A 1536-dim OpenAI embedding consumes ~6KB per row, leaving little headroom for metadata.
PCA from 1536 to 768 dimensions typically retains ~97% accuracy while halving both memory and storage cost.
Can pgvector replace Pinecone for production RAG?
For most B2B workloads under 10M vectors, yes.
Vector search latency is rarely the bottleneck. Embedding generation runs 100–300ms, and LLM inference runs 500ms–3s, while pgvector with HNSW returns results in 5–20ms.
The 5ms vs 12ms gap between pgvector and Pinecone is invisible in the end-to-end experience.
How do I know when to migrate from pgvector to a dedicated vector DB?
Four signals matter most:
- Sustained write contention that pgvectorscale’s batched updates can’t absorb
- RAM headroom dropping below 20% during peak load
- p95 latency creeping above your SLA
- Scale projections crossing 50M vectors within six months
If you’re worried but none apply yet, you probably haven’t outgrown Postgres. See how SelfHost vs DigitalOcean compares before jumping ship.