Tutorial: Indexing Guide¶

Indexes control the tradeoff between recall, latency, memory, disk usage, and build time.

The default choice¶

Start with a flat metric index:

collection.build_index("FLAT-L2")

Flat search is exhaustive and simple. It is a good baseline for correctness, small collections, and evaluation.

A practical first decision:

Collection size and goal	Start with
development, tests, evaluation	`FLAT-IP`, `FLAT-L2`, or `FLAT-COS`
low-latency online ANN	`HNSW-IP`, `HNSW-L2`, or `HNSW-Cos`
large collections with explicit probe tuning	`IVF-L2`, `SPANN-L2`, `IVF-COS`, or `SPANN-COS`
memory pressure from graph indexes	`DiskANN-IP`, `DiskANN-L2`, or `DiskANN-Cos`
lower memory or disk footprint	SQ8, PQ, RaBitQ, or PolarVec variants
binary vectors	`FLAT-HAMMING-BINARY`, `FLAT-JACCARD-BINARY`, or IVF binary variants
coordinates, profiles, or distributions	start with the matching domain `FLAT-*` metric

Metric names¶

Index suffix	Metric	Best for
`*-IP`	Inner product	normalized embeddings, maximum-score retrieval
`-L2`	Squared L2 distance	Euclidean-distance embeddings
`-COS` or `-Cos`	Cosine similarity	embeddings where angular similarity matters
`-L1`	Manhattan distance	robust numeric and sensor features
`-HAVERSINE`	Haversine distance in meters	GeoJSON-order coordinates
`-CORRELATION`	Pearson correlation distance	aligned profiles and time buckets
`-HELLINGER`	Hellinger distance	non-negative distributions
`-WASSERSTEIN`	Wasserstein-1D distance	equal-width ordered histograms
`-JENSEN-SHANNON`	Jensen–Shannon distance	probability and topic distributions
`-CHEBYSHEV`	Chebyshev/L∞ distance	maximum-deviation constraints
`-CANBERRA`	Canberra distance	spectra and sparse abundance data
`-BRAY-CURTIS`	Bray–Curtis distance	abundance and compositional profiles
`-HAMMING-BINARY`	Hamming distance	binary vectors
`-JACCARD-BINARY`	Jaccard distance	binary sets
`-TANIMOTO-BINARY`	Binary Tanimoto/Jaccard distance	molecular fingerprints
`-DICE-BINARY`	Sørensen-Dice distance	fingerprints and deduplication

Choose the metric that matches how your embedding model was trained.

Metric guidance:

use cosine when your embedding model documentation recommends cosine;
use inner product when embeddings are normalized and maximum score retrieval is desired;
use L2 when Euclidean distance is meaningful for your model;
use Hamming or Jaccard only for binary vectors or binary-set style features.

See Domain-aware distance metrics for input validation, aliases, result units, and the complete compatibility matrix. New domain metrics support Flat and numeric domain metrics also support HNSW; they are not silently routed through IVF, SPANN, DiskANN, or quantizers.

Index families¶

Family	Example	Strength	Notes
Flat	`FLAT-L2`	highest recall and simplest behavior	Latency grows with collection size.
HNSW	`HNSW-L2`	low-latency ANN	Use `nprobe` as search breadth (`ef_search`).
IVF	`IVF-L2`	large collections with tunable probes	Requires `n_clusters`; use `nprobe` at search time.
SPANN	`SPANN-L2`	partition ANN with boundary replicas	Requires `n_clusters`; use `nprobe` at search time.
DiskANN	`DiskANN-L2`	disk-friendly graph ANN	Useful when memory pressure matters.
Quantized flat	`FLAT-IP-SQ8`, `FLAT-L2-PQ`, `FLAT-IP-RABITQ`	lower memory footprint	Some variants use two-pass search to preserve quality.

Build lifecycle¶

Indexes are persisted with the collection. Reopening the collection reloads the index metadata and index files where applicable.

After a large initial load:

with collection.insert_session() as session:
    for ids, vectors, fields in embedding_batches:
        session.add(ids=ids, vectors=vectors, fields=fields, batch_size=1000)

collection.build_index("HNSW-L2")
collection.checkpoint()

After many incremental writes, rebuild or switch index modes when recall or latency has drifted from your target:

collection.build_index("HNSW-L2")
collection.commit()

LynseDB can insert into existing graph indexes for supported paths, but a controlled rebuild after large changes is still the easiest way to re-baseline performance.

Build and remove indexes¶

collection.build_index("HNSW-L2")
print(collection.index_mode)

collection.remove_index()
print(collection.index_mode)

Removing an index returns the collection to flat search.

IVF and SPANN parameters¶

IVF and SPANN split vectors into coarse clusters. n_clusters controls how many clusters are built.

collection.build_index("IVF-L2", n_clusters=256)
collection.build_index("SPANN-L2", n_clusters=256)

Rules:

n_clusters is used only for IVF and SPANN indexes.
For indexes other than IVF/SPANN, n_clusters is allowed and ignored by the Python API.
For IVF and SPANN indexes, n_clusters must be greater than zero.
More clusters usually reduce scanned vectors per query but can require higher nprobe for recall.

Search with:

result = collection.search(query, k=10, nprobe=20)

For IVF and SPANN, nprobe is the number of clusters to scan. Higher values improve recall and increase latency.

Starting values:

n_clusters=64 for small experiments;
n_clusters=256 or 1024 for larger collections;
nprobe=10 as a default search starting point;
increase nprobe until recall is acceptable against a flat baseline.

HNSW search breadth¶

For HNSW, nprobe is used as the search beam width:

collection.build_index("HNSW-L2")
result = collection.search(query, k=10, nprobe=64)

Higher nprobe generally improves recall and increases latency. Flat, PQ, RaBitQ, PolarVec, and named vector-field searches ignore nprobe.

Start with nprobe=32 or 64 for HNSW evaluation, then tune down for latency or up for recall.

Approximate flat distance rounding¶

For flat IP, L2, and cosine paths, approx=True enables metric-specific distance rounding controlled by eps.

result = collection.search(query, k=10, approx=True, eps=1e-4)

Hamming and Jaccard binary metrics ignore approx=True and always use the exact binary-distance path.

Use approx=True only after measuring quality on your own evaluation set.

Named vector indexes¶

Each named vector field has its own metric, dimension, and index:

collection.create_vector_field("image", dim=512, metric="l2")
collection.add_named_vectors("image", image_vectors, ids=image_ids)
collection.build_index("HNSW-L2", field_name="image")

result = collection.search(image_query, k=10, vector_field="image")

Remove only that field's index:

collection.remove_index(field_name="image")

Rules for named fields:

create the field before adding named vectors;
add named vectors only for IDs that already exist in the primary collection;
choose the field metric at creation time;
build or remove the named field index with field_name=...;
where filters still use the row metadata fields.

Quantized indexes¶

Quantized indexes reduce memory or disk pressure. They are most useful when vector bandwidth or index size is the bottleneck.

Family	Examples	When to try
SQ8	`FLAT-L2-SQ8`, `HNSW-L2-SQ8`, `IVF-COS-SQ8`	You want scalar quantization with familiar index families.
PQ	`FLAT-L2-PQ`, `FLAT-IP-PQ8`, `FLAT-IP-PQ16`	You want product quantization and can evaluate recall tradeoffs.
RaBitQ	`FLAT-IP-RABITQ`, `FLAT-L2-RABITQ`	You want aggressive binary-style compression.
PolarVec	`FLAT-IP-POLARVEC4`, `FLAT-L2-POLARVEC`	You want training-free multi-bit quantization.

Quantized indexes should be evaluated against a flat baseline. Use the same queries, filters, and k values your application uses.

Binary indexes¶

Binary indexes are for binary vectors or binary-set style representations:

collection.build_index("FLAT-HAMMING-BINARY")
collection.build_index("IVF-JACCARD-BINARY", n_clusters=256)
collection.build_index("FLAT-TANIMOTO-BINARY")
collection.build_index("FLAT-DICE-BINARY")

For Hamming, Jaccard/Tanimoto, and Dice, lower distance is better. Flat search uses a lazily built one-bit-per-dimension hot representation. approx and eps do not change binary-distance search behavior.

Practical tuning workflow¶

Build FLAT-* first and record quality on an evaluation set.
Try HNSW-* for low-latency online search.
Try IVF-* when you need more explicit recall/latency tuning.
Use quantized variants when memory or disk footprint is the bottleneck.
Always compare recall against the flat baseline before deploying.

Evaluation loop:

def evaluate(index_mode, *, n_clusters=None, nprobe=10):
    collection.build_index(index_mode, n_clusters=n_clusters)
    result = collection.search(query, k=10, nprobe=nprobe)
    return result.ids.tolist()

baseline = evaluate("FLAT-L2")
hnsw = evaluate("HNSW-L2", nprobe=64)
ivf = evaluate("IVF-L2", n_clusters=256, nprobe=20)

print(baseline)
print(hnsw)
print(ivf)

Use your own relevance labels when possible. If you do not have labels yet, measure overlap with the flat baseline as a first recall proxy.

Supported index names¶

All index names are case-insensitive. The examples below show the supported spellings accepted by build_index().

Dense indexes:

collection.build_index("FLAT-IP")
collection.build_index("FLAT-L2")
collection.build_index("FLAT-COS")
collection.build_index("FLAT-COSINE")
collection.build_index("FLAT-IP-SQ8")
collection.build_index("FLAT-L2-SQ8")
collection.build_index("FLAT-COS-SQ8")
collection.build_index("FLAT-COSINE-SQ8")

collection.build_index("HNSW-IP")
collection.build_index("HNSW-L2")
collection.build_index("HNSW-COS")
collection.build_index("HNSW-COSINE")
collection.build_index("HNSW-IP-SQ8")
collection.build_index("HNSW-L2-SQ8")
collection.build_index("HNSW-COS-SQ8")
collection.build_index("HNSW-COSINE-SQ8")

collection.build_index("DiskANN-IP")
collection.build_index("DiskANN-L2")
collection.build_index("DiskANN-COS")
collection.build_index("DiskANN-COSINE")
collection.build_index("DiskANN-IP-SQ8")
collection.build_index("DiskANN-L2-SQ8")
collection.build_index("DiskANN-COS-SQ8")
collection.build_index("DiskANN-COSINE-SQ8")

collection.build_index("IVF-IP", n_clusters=256)
collection.build_index("IVF-L2", n_clusters=256)
collection.build_index("IVF-COS", n_clusters=256)
collection.build_index("IVF-COSINE", n_clusters=256)
collection.build_index("IVF-IP-SQ8", n_clusters=256)
collection.build_index("IVF-L2-SQ8", n_clusters=256)
collection.build_index("IVF-COS-SQ8", n_clusters=256)
collection.build_index("IVF-COSINE-SQ8", n_clusters=256)

Domain numeric indexes:

collection.build_index("FLAT-L1")
collection.build_index("FLAT-HAVERSINE")
collection.build_index("FLAT-CORRELATION")
collection.build_index("FLAT-HELLINGER")
collection.build_index("FLAT-WASSERSTEIN")
collection.build_index("FLAT-JENSEN-SHANNON")
collection.build_index("FLAT-CHEBYSHEV")
collection.build_index("FLAT-CANBERRA")
collection.build_index("FLAT-BRAY-CURTIS")

collection.build_index("HNSW-L1")
collection.build_index("HNSW-HAVERSINE")
collection.build_index("HNSW-CORRELATION")
collection.build_index("HNSW-HELLINGER")
collection.build_index("HNSW-WASSERSTEIN")
collection.build_index("HNSW-JENSEN-SHANNON")
collection.build_index("HNSW-CHEBYSHEV")

Canberra and Bray–Curtis are intentionally Flat-only until metric-specific ANN recall has been validated.

Flat quantized variants:

collection.build_index("FLAT-IP-PQ")
collection.build_index("FLAT-L2-PQ")
collection.build_index("FLAT-COS-PQ")
collection.build_index("FLAT-COSINE-PQ")
collection.build_index("FLAT-IP-PQ8")
collection.build_index("FLAT-IP-PQ16")
collection.build_index("FLAT-L2-PQ8")
collection.build_index("FLAT-COS-PQ8")
collection.build_index("FLAT-IP-RABITQ")
collection.build_index("FLAT-L2-RABITQ")
collection.build_index("FLAT-COS-RABITQ")
collection.build_index("FLAT-COSINE-RABITQ")
collection.build_index("FLAT-IP-POLARVEC")
collection.build_index("FLAT-L2-POLARVEC")
collection.build_index("FLAT-COS-POLARVEC")
collection.build_index("FLAT-COSINE-POLARVEC")
collection.build_index("FLAT-IP-POLARVEC3")
collection.build_index("FLAT-IP-POLARVEC4")
collection.build_index("FLAT-IP-POLARVEC8")

PQ accepts FLAT-{IP,L2,COS,COSINE}-PQ and FLAT-{IP,L2,COS,COSINE}-PQ<N>, where <N> is the requested number of subspaces. If <N> is omitted, LynseDB chooses an automatic subspace count.

PolarVec accepts FLAT-{IP,L2,COS,COSINE}-POLARVEC and FLAT-{IP,L2,COS,COSINE}-POLARVEC<N>, where <N> is a bit width from 1 to 8. If <N> is omitted or invalid, LynseDB uses the default bit width.

Binary variants:

collection.build_index("FLAT-HAMMING-BINARY")
collection.build_index("FLAT-HAMMING")
collection.build_index("FLAT-JACCARD-BINARY")
collection.build_index("FLAT-JACCARD")
collection.build_index("FLAT-TANIMOTO-BINARY")
collection.build_index("FLAT-TANIMOTO")
collection.build_index("FLAT-DICE-BINARY")
collection.build_index("FLAT-DICE")
collection.build_index("IVF-HAMMING-BINARY", n_clusters=256)
collection.build_index("IVF-HAMMING", n_clusters=256)
collection.build_index("IVF-JACCARD-BINARY", n_clusters=256)
collection.build_index("IVF-JACCARD", n_clusters=256)

Troubleshooting index choices¶

Symptom	Try
Results differ too much from expected neighbors	Compare against `FLAT-*`; increase `nprobe`; rebuild index.
IVF recall is low	Increase `nprobe`, reduce `n_clusters`, or improve training data coverage.
HNSW latency is high	Lower `nprobe`, try IVF, or use a more selective `where` filter.
Memory use is high	Try DiskANN or quantized variants; reduce unnecessary named vector fields.
Index build is slow	Build after bulk ingestion, not after every small batch; monitor `/metrics` in server mode.
Binary index scores look inverted	Remember Hamming, Jaccard/Tanimoto, and Dice are lower-is-better distances.