lakebase_vector

Important

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.

The lakebase_vector extension adds approximate nearest-neighbor (ANN) vector search to Lakebase via the lakebase_ann index type. It is a drop-in companion to pgvector: the same vector types, distance operators, and query syntax work without modification.

Install

First, enable Lakebase Search in your project settings. Then install the extension:

CREATE EXTENSION IF NOT EXISTS lakebase_vector CASCADE;

The CASCADE keyword automatically installs pgvector as a dependency.
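To confirm that both extensions are present after installation, you can query the standard pg_extension catalog. This is a minimal sketch; it relies only on PostgreSQL's built-in catalog and on the fact that pgvector registers itself under the extension name vector.

```sql
-- List the two extensions and their installed versions.
-- 'vector' is the extension name that pgvector registers.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('lakebase_vector', 'vector');
```

Two rows in the result indicate that lakebase_vector and its pgvector dependency are both installed in the current database.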

Quick start

-- Create a table with a vector column
CREATE TABLE items (id BIGSERIAL PRIMARY KEY, embedding VECTOR(3));

-- Insert sample data
INSERT INTO items (embedding)
SELECT ARRAY[random(), random(), random()]::real[]
FROM generate_series(1, 1000);

-- Create a lakebase_ann index
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops);

-- Query using standard pgvector distance operators
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
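To check whether the planner uses the new index for this query, you can run it through the standard EXPLAIN command. This is a sketch only: the exact plan node label for a lakebase_ann index is not documented here, and on a table as small as this 1,000-row sample the planner may still choose a sequential scan.

```sql
-- Show the chosen plan without cost estimates.
-- An index scan on the lakebase_ann index (rather than a Seq Scan
-- followed by a Sort) indicates that the index is being used.
EXPLAIN (COSTS OFF)
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```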

Configure the index

Set build_mode when you create the index to trade recall against build time:

  • standard (default): optimizes for recall. Use for most workloads.
  • fast: builds faster at lower recall. Use when build time matters more than search quality.

CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops) WITH (build_mode = 'fast');
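To see which options an existing index was created with, you can read its definition from the standard pg_indexes view. This sketch assumes that lakebase_ann stores build_mode like an ordinary index storage parameter, in which case an explicitly set value appears in the WITH clause of indexdef; an index built with the default may show no WITH clause at all.

```sql
-- Show the definition of every index on the items table.
-- Explicitly set storage parameters appear as WITH (...) in indexdef.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'items';
```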

Build indexes concurrently

Use CREATE INDEX CONCURRENTLY to build an index without blocking writes to the table. To rebuild an existing index later without downtime, use REINDEX CONCURRENTLY:

CREATE INDEX CONCURRENTLY items_embedding_ann ON items
  USING lakebase_ann (embedding vector_l2_ops);

REINDEX INDEX CONCURRENTLY items_embedding_ann;
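In PostgreSQL, a concurrent build that fails or is cancelled leaves behind an index marked invalid, which queries ignore. The following sketch uses the standard pg_index and pg_class catalogs to check the state of the indexes on items; it assumes lakebase_ann indexes follow the usual PostgreSQL behavior here.

```sql
-- indisvalid = true means the index is complete and usable by queries.
-- An invalid index left by a failed concurrent build should be
-- dropped and recreated.
SELECT c.relname AS index_name, i.indisvalid, i.indisready
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'items'::regclass;
```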

Tune search accuracy

Before tuning, call lakebase_ann_index_info(index_name) to get the index's lists, default_probes, and default_epsilon values.

Set lakebase_ann.probes at query time to control the accuracy/speed tradeoff. Higher values improve recall but slow queries.

Set one probe value per entry in the lists array:

| lists from index info | probes to set |
| --- | --- |
| [] (empty) | (do not set) |
| [222] | '22' |
| [3333, 33333] | '33, 333' |

Note

The lakebase_ann.probes parameter requires one value per entry in lists. When the lists array is empty (which happens on small tables where the index builder creates no IVF partitions), don't set probes. Setting a value when the lists array is empty causes an error. IVF partitions appear once your dataset is large enough for the index builder to partition it.

-- Check your index's lists length first
SELECT lakebase_ann_index_info('items_embedding_ann');

-- Set probes matching the lists array (example: one partition)
SET lakebase_ann.probes TO '22';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
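A plain SET lasts for the rest of the session. To limit a probes value to a single query, one option is PostgreSQL's SET LOCAL, which reverts at the end of the transaction. This is a sketch that assumes lakebase_ann.probes behaves like an ordinary session-level parameter; the value '22' is carried over from the example above and must match your own lists array.

```sql
BEGIN;
-- Applies only until the end of this transaction.
SET LOCAL lakebase_ann.probes TO '22';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
COMMIT;

-- After COMMIT, the parameter is back to its previous value.
SHOW lakebase_ann.probes;
```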

lakebase_ann.epsilon controls the re-ranking margin. The default of 1.9 works well for most workloads, and valid values range from 0.0 to 4.0. To override it for the current session:

SET lakebase_ann.epsilon TO '1.5';
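When experimenting with epsilon, it helps to inspect the current value and to return to the default afterwards. The sketch below uses only the standard SHOW and RESET commands and assumes lakebase_ann.epsilon behaves like an ordinary session-level parameter.

```sql
-- Inspect the value in effect for this session.
SHOW lakebase_ann.epsilon;

-- Return to the default value.
RESET lakebase_ann.epsilon;
```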

Operator classes

| Distance metric | Operator class | Query operator |
| --- | --- | --- |
| L2 (Euclidean) distance | vector_l2_ops | <-> |
| Negative inner product | vector_ip_ops | <#> |
| Cosine distance | vector_cosine_ops | <=> |

Choose the operator class that matches how your embeddings were trained, and use the same metric for the index and the query:

  • vector_cosine_ops (<=>) is cosine distance (1 minus cosine similarity). Use it for most text embeddings. This is the most common choice.
  • vector_l2_ops (<->) is Euclidean (L2) distance. Use it when absolute spatial distance matters and vectors are not normalized.
  • vector_ip_ops (<#>) is negative inner product. Use it when vectors are pre-normalized to unit length. For unit vectors, ranking by inner product matches ranking by cosine similarity, and inner product is typically faster to compute.
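The relationship between inner product and cosine for unit vectors can be checked directly with the pgvector operators. The two vectors below are illustrative values chosen to have length 1; they are not taken from any real embedding model.

```sql
-- Both vectors have unit length: 0.6^2 + 0.8^2 = 1 and 1^2 + 0^2 = 1.
-- <#> returns the negative inner product; <=> returns cosine distance.
SELECT
  '[0.6,0.8]'::vector <#> '[1,0]'::vector AS negative_inner_product,
  '[0.6,0.8]'::vector <=> '[1,0]'::vector AS cosine_distance;
```

The first column is roughly -0.6 and the second roughly 0.4, so cosine_distance equals 1 + negative_inner_product up to floating-point rounding. Both operators therefore order unit-length vectors identically.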

Index options reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| build_mode | string | 'standard' | Controls the accuracy/speed tradeoff at index build time. 'standard' optimizes for recall; 'fast' builds faster with lower recall. |

GUC reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| lakebase_ann.probes | integer[] | (unset) | Array of per-partition probe counts, one value per entry in lists. Higher values improve recall at the cost of query speed. Check lakebase_ann_index_info for the lists length to determine how many values to set. |
| lakebase_ann.epsilon | float | 1.9 | Re-ranking margin. Valid range: 0.0 to 4.0. |

Utility functions

| Function | Returns | Description |
| --- | --- | --- |
| lakebase_ann_prewarm(regclass) | void | Loads an index into memory to eliminate cold-start latency on the first query. |
| lakebase_ann_index_info(regclass) | text | Returns index metadata as text, including lists, default_probes, and default_epsilon. |
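As a usage sketch, both functions can be called with the index name, which PostgreSQL casts to regclass. The index name items_embedding_ann is the one created in the concurrent-build example; substitute your own.

```sql
-- Load the index into memory, for example after a restart
-- and before serving latency-sensitive queries.
SELECT lakebase_ann_prewarm('items_embedding_ann');

-- Read back the index metadata (lists, default_probes, default_epsilon).
SELECT lakebase_ann_index_info('items_embedding_ann');
```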

Next steps