Important
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.
The lakebase_vector extension adds approximate nearest-neighbor (ANN) vector search to Lakebase via the lakebase_ann index type. It is a drop-in companion to pgvector: the same vector types, distance operators, and query syntax work without modification.
Install
First, enable Lakebase Search in your project settings. Then install the extension:
CREATE EXTENSION IF NOT EXISTS lakebase_vector CASCADE;
The CASCADE keyword automatically installs pgvector as a dependency.
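To confirm that both extensions are present after installation, you can query the standard PostgreSQL catalog. This is a minimal sketch that assumes pgvector is registered under its usual extension name, vector:

```sql
-- List the installed extensions and their versions
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('lakebase_vector', 'vector');
```

If both rows are returned, the CASCADE install succeeded.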
Quick start
-- Create a table with a vector column
CREATE TABLE items (id BIGSERIAL PRIMARY KEY, embedding VECTOR(3));
-- Insert sample data
INSERT INTO items (embedding)
SELECT ARRAY[random(), random(), random()]::real[]
FROM generate_series(1, 1000);
-- Create a lakebase_ann index
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops);
-- Query using standard pgvector distance operators
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
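As a follow-up to the quick start, the sketch below returns the distance next to each row and inspects the query plan. It uses only standard PostgreSQL and pgvector syntax. On a table this small the planner may choose a sequential scan instead of the lakebase_ann index, so treat the plan output as informational:

```sql
-- Return the L2 distance alongside each row
SELECT id, embedding <-> '[3,1,2]' AS l2_distance
FROM items
ORDER BY embedding <-> '[3,1,2]'
LIMIT 5;

-- Inspect the plan to see whether the index is used
EXPLAIN
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```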
Configure the index
Set build_mode at index creation to control the accuracy/speed tradeoff:
- standard (default): optimizes for recall. Use for most workloads.
- fast: builds faster at lower recall. Use when build time matters more than search quality.
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops) WITH (build_mode = 'fast');
Build indexes concurrently
Use CREATE INDEX CONCURRENTLY to build an index without blocking writes to the table. To rebuild an existing index without downtime, use REINDEX INDEX CONCURRENTLY:
CREATE INDEX CONCURRENTLY items_embedding_ann ON items
USING lakebase_ann (embedding vector_l2_ops);
REINDEX INDEX CONCURRENTLY items_embedding_ann;
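In standard PostgreSQL, a concurrent build that fails or is cancelled leaves an invalid index behind. Assuming lakebase_ann indexes follow the same behavior (the source does not state this), a catalog query like the following sketch shows whether an index on items is valid:

```sql
-- indisvalid = false marks an index left over from a failed concurrent build
SELECT c.relname AS index_name, i.indisvalid
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'items'::regclass;

-- Run this only if items_embedding_ann is reported as invalid, then rebuild it
DROP INDEX CONCURRENTLY IF EXISTS items_embedding_ann;
```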
Tune search accuracy
Before tuning, call lakebase_ann_index_info(index_name) to get the index's lists, default_probes, and default_epsilon values.
Set lakebase_ann.probes at query time to control the accuracy/speed tradeoff. Higher values improve recall but slow queries.
Set one probe value per entry in the lists array returned by lakebase_ann_index_info:
| lists from index info | probes to set |
|---|---|
| [] (empty) | (do not set) |
| [222] | '22' |
| [3333, 33333] | '33, 333' |
Note
The lakebase_ann.probes parameter requires one value per entry in lists. When the lists array is empty (which happens on small tables where the index builder creates no IVF partitions), don't set probes. Setting a value when the lists array is empty causes an error. IVF partitions appear once your dataset is large enough for the index builder to partition it.
-- Check your index's lists length first
SELECT lakebase_ann_index_info('items_embedding_ann');
-- Set probes matching the lists array (example: one partition)
SET lakebase_ann.probes TO '22';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
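A plain SET stays in effect for the rest of the session. To limit a probes value to a single query, one option is the standard PostgreSQL SET LOCAL form inside a transaction. This sketch assumes lakebase_ann.probes behaves like any other configuration parameter and that the index's lists array has exactly one entry:

```sql
BEGIN;
-- SET LOCAL lasts only until the end of this transaction
SET LOCAL lakebase_ann.probes TO '22';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
COMMIT;

-- Clear a session-level value set earlier with plain SET
RESET lakebase_ann.probes;
```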
lakebase_ann.epsilon controls the re-ranking margin. The default value of 1.9 works well for most workloads.
SET lakebase_ann.epsilon TO '1.5';
Operator classes
| Distance metric | Operator class | Query operator |
|---|---|---|
| L2 (Euclidean) | vector_l2_ops | <-> |
| Negative inner product | vector_ip_ops | <#> |
| Cosine distance | vector_cosine_ops | <=> |
Choose the operator class that matches how your embeddings were trained, and use the same metric for the index and the query:
- vector_cosine_ops (<=>) is cosine distance (1 minus cosine similarity). Use it for most text embeddings. This is the most common choice.
- vector_l2_ops (<->) is Euclidean (L2) distance. Use it when absolute spatial distance matters and vectors are not normalized.
- vector_ip_ops (<#>) is negative inner product. Use it when vectors are pre-normalized to unit length. For unit vectors, inner product equals cosine similarity and is typically faster.
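The sketch below pairs an operator class with its matching query operator, reusing the items table from the quick start. The index name items_embedding_cos is illustrative, and the arithmetic on the operator results follows pgvector's documented semantics:

```sql
-- Index for cosine distance (the index name is illustrative)
CREATE INDEX items_embedding_cos ON items
USING lakebase_ann (embedding vector_cosine_ops);

-- Query with the matching operator; 1 - distance gives cosine similarity
SELECT id, 1 - (embedding <=> '[3,1,2]') AS cosine_similarity
FROM items
ORDER BY embedding <=> '[3,1,2]'
LIMIT 5;

-- <#> returns the negative inner product, so multiply by -1 to read the value
SELECT id, (embedding <#> '[3,1,2]') * -1 AS inner_product
FROM items
ORDER BY embedding <#> '[3,1,2]'
LIMIT 5;
```

The last query still runs without a vector_ip_ops index, but it cannot use the cosine index, so create a matching index before relying on it.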
Index options reference
| Option | Type | Default | Description |
|---|---|---|---|
| build_mode | string | 'standard' | Controls the accuracy/speed tradeoff at index build time. 'standard' optimizes for recall; 'fast' builds faster with lower recall. |
GUC reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| lakebase_ann.probes | integer[] | (unset) | Array of per-partition probe counts, one value per entry in lists. Higher values improve recall at the cost of query speed. Check lakebase_ann_index_info for the lists length to determine how many values to set. |
| lakebase_ann.epsilon | float | 1.9 | Re-ranking margin. Valid range: 0.0 to 4.0. |
Utility functions
| Function | Returns | Description |
|---|---|---|
| lakebase_ann_prewarm(regclass) | void | Loads an index into memory to eliminate cold-start latency on the first query. |
| lakebase_ann_index_info(regclass) | text | Returns index metadata as text, including lists, default_probes, and default_epsilon. |
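As a usage sketch, prewarming can be called once before serving queries. It assumes the index name items_embedding_ann from the concurrent-build example; the string literal is cast to regclass by PostgreSQL:

```sql
-- Load the index into memory before the first query
SELECT lakebase_ann_prewarm('items_embedding_ann');

-- Subsequent searches avoid the cold-start cost described above
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
```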