Domain-Aware Distance Metrics¶
LynseDB can search coordinates, binary fingerprints, aligned profiles, and probability distributions without converting them into semantic embeddings. The metric is persisted with the index and applies consistently to exact search, filters, batch search, and range search.
Metric selection¶
| Metric | Accepted names | Input contract | Typical uses |
|---|---|---|---|
| Inner product | ip, dot |
finite numeric vectors | normalized embeddings, maximum-score retrieval |
| Squared L2 | l2, euclidean |
finite numeric vectors | embeddings and geometric features |
| Cosine distance | cos, cosine |
numeric vectors | angular embedding similarity |
| Manhattan | l1, manhattan, cityblock |
finite numeric vectors | anomaly, sensor, and tabular feature matching |
| Haversine | haversine, geo |
[longitude_degrees, latitude_degrees] |
nearby POI, fleet, and device search |
| Correlation | correlation, pearson |
aligned, fixed-length numeric profiles | sensor curves, behavior profiles, gene expression |
| Hellinger | hellinger |
non-negative weights or probabilities | model outputs, topics, preference distributions |
| Wasserstein-1D | wasserstein, emd |
non-negative mass in equal-width ordered bins | histograms, latency and forecast distributions |
| Jensen–Shannon | jensen_shannon, js |
non-negative weights or probabilities | probability models, topics, prediction comparison |
| Chebyshev | chebyshev, linf |
finite numeric vectors | industrial tolerances, maximum-deviation constraints |
| Canberra | canberra |
numeric vectors | spectra, sparse abundance and compositional features |
| Bray–Curtis | bray_curtis |
finite, non-negative abundance vectors | ecology, spectra and compositional profiles |
| Hamming | hamming |
values thresholded at > 0.5 |
bit signatures and error patterns |
| Jaccard/Tanimoto | jaccard, tanimoto |
values thresholded at > 0.5 |
molecular fingerprints, sets, genomic sketches |
| Sørensen-Dice | dice, sorensen |
values thresholded at > 0.5 |
fingerprints and near-duplicate detection |
All metrics except inner product return lower-is-better distances. Cosine
returns 1 - cosine_similarity, so identical non-zero vectors have distance
zero. Inner product retains its conventional higher-is-better score for API
compatibility.
Index compatibility¶
| Metric family | Flat | HNSW | IVF/SPANN/DiskANN | Quantized variants |
|---|---|---|---|---|
| IP, L2, cosine | yes | yes | yes | supported variants |
| L1, Haversine, correlation, Hellinger, Wasserstein-1D | yes | yes | not yet | not yet |
| Jensen–Shannon, Chebyshev | yes | yes | not yet | not yet |
| Canberra, Bray–Curtis | yes | not yet | not yet | not yet |
| Hamming, Jaccard | packed hot path | not yet | IVF binary variants | binary only |
| Tanimoto, Dice | packed hot path | not yet | not yet | binary only |
Start with Flat as the exact quality baseline. Use HNSW for supported domain metrics after measuring recall on representative queries. LynseDB rejects unsupported index/metric combinations instead of silently substituting another metric.
Reusable example setup¶
Every metric uses the same collection workflow: create a collection, add
equal-length vectors, commit, and build an index whose suffix selects the
metric. The examples below share this local-client setup; an HTTP
VectorDBClient uses the same methods.
import numpy as np
import lynse
client = lynse.VectorDBClient(uri="./lynsedb-distance-demo")
db = client.create_database("distance_demo", drop_if_exists=True)
def metric_collection(name, vectors, index_mode):
vectors = np.asarray(vectors, dtype=np.float32)
collection = db.require_collection(
name,
dim=vectors.shape[1],
default_index=None,
drop_if_exists=True,
)
collection.add(
ids=[f"{name}-{i}" for i in range(len(vectors))],
vectors=vectors,
)
collection.commit()
collection.build_index(index_mode)
return collection
Passing default_index=None is important in these examples: it prevents the
new collection from first building the default FLAT-IP index. For an existing
collection, call build_index(...) with the desired mode to replace its active
index.
Core vector distances¶
Inner product¶
Inner product ranks larger scores first. It is commonly used with normalized embeddings, where it produces the same ordering as cosine similarity:
embeddings = metric_collection(
"embeddings",
[
[1.0, 0.0, 0.0],
[0.8, 0.6, 0.0],
[0.0, 0.0, 1.0],
],
"FLAT-IP",
)
result = embeddings.search([1.0, 0.0, 0.0], k=2)
print(result.ids)
print(result.distances) # larger is more similar for IP
Unlike every other metric on this page, inner product is a similarity score, not a lower-is-better distance. Vector magnitude affects the score; normalize vectors first when magnitude should not affect ranking. HNSW, IVF, SPANN, DiskANN, and quantized IP variants are also available.
Squared L2 distance¶
Squared L2 measures squared Euclidean distance:
points = metric_collection(
"points_l2",
[[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]],
"FLAT-L2",
)
result = points.search([0.0, 1.0], k=2)
print(result.ids)
print(result.distances) # 1.0 for both [0, 0] and [1, 1]
LynseDB returns the squared value and does not take the final square root.
Ranking is identical to Euclidean distance, but range-search thresholds must be
squared as well. For example, an Euclidean radius of 2.0 requires a squared
L2 threshold of 4.0.
Cosine distance¶
Cosine distance compares direction while ignoring positive vector magnitude:
directions = metric_collection(
"directions",
[[1.0, 0.0], [10.0, 0.0], [0.0, 1.0]],
"FLAT-COS",
)
result = directions.search([2.0, 0.0], k=3)
print(result.ids)
print(result.distances)
The returned value is 1 - cosine_similarity; identical directions have
distance zero. Use non-zero vectors when cosine direction is meaningful. For
larger collections, HNSW-Cos, IVF-COS, SPANN-COS, and DiskANN-Cos are
available.
Manhattan distance¶
Manhattan, or L1, sums absolute component differences. Compared with squared L2, one large component error has less influence relative to several smaller errors:
features = metric_collection(
"features_l1",
[[1.0, 2.0, 3.0], [1.0, 4.0, 3.0], [3.0, 3.0, 3.0]],
"FLAT-L1",
)
result = features.search([1.0, 3.0, 3.0], k=3)
print(result.ids)
print(result.distances)
The distance is lower-is-better and unbounded. Use HNSW-L1 for approximate
search on larger collections and tune its search breadth with nprobe.
Use the newly added metrics¶
Jensen–Shannon distance¶
Use Jensen–Shannon when each vector represents a probability distribution or non-negative counts. LynseDB normalizes every vector internally, so counts do not have to sum to one.
topic_counts = [
[70, 20, 10], # mostly database
[10, 20, 70], # mostly frontend
[34, 33, 33], # mixed
]
topics = metric_collection(
"topics",
topic_counts,
"FLAT-JENSEN-SHANNON",
)
result = topics.search([60, 30, 10], k=2)
print(result.ids)
print(result.distances)
The result is sqrt(JS divergence) using natural logarithms. Zero means the
normalized distributions are identical, and the maximum is sqrt(ln(2))
(about 0.8326). Negative or non-finite components are invalid. Two all-zero
vectors compare as zero; one all-zero vector and one positive-mass vector are
maximally distant.
For a larger collection, switch to HNSW and tune search breadth with nprobe:
Use FLAT-JENSEN-SHANNON as the exact baseline when measuring HNSW recall.
Chebyshev distance¶
Chebyshev, also called L-infinity distance, is the largest absolute difference in any one component. It is a useful fit when a candidate should be considered far away as soon as one tolerance is badly violated.
measurements = metric_collection(
"measurements",
[
[20.0, 50.0, 100.0],
[20.2, 49.9, 100.1],
[21.5, 50.0, 100.0],
],
"FLAT-CHEBYSHEV",
)
result = measurements.search([20.1, 50.0, 100.0], k=2)
print(result.ids)
print(result.distances)
Distances are in the same units as the vector components and are unbounded. If components use different units or scales, standardize them or divide each component by its accepted tolerance before insertion. HNSW is also available:
measurements.build_index("HNSW-CHEBYSHEV")
result = measurements.search([20.1, 50.0, 100.0], k=10, nprobe=64)
Canberra distance¶
Canberra sums the relative difference of every component. It emphasizes a change near zero, making it useful for spectra and sparse abundance features:
spectra = metric_collection(
"spectra",
[
[0.0, 0.10, 0.00, 0.90],
[0.0, 0.12, 0.00, 0.88],
[0.2, 0.00, 0.30, 0.50],
],
"FLAT-CANBERRA",
)
result = spectra.search([0.0, 0.11, 0.0, 0.89], k=2)
print(result.ids)
print(result.distances)
For each component LynseDB evaluates
abs(a - b) / (abs(a) + abs(b)); a 0 / 0 component contributes zero. For a
d-dimensional finite vector, the distance lies between zero and d.
Canberra currently supports exact Flat search only; HNSW-CANBERRA and
quantized variants are rejected.
Bray–Curtis distance¶
Bray–Curtis compares the total absolute difference with the total abundance. Use non-negative vectors whose dimensions represent the same species, bins, or features:
abundance = metric_collection(
"abundance",
[
[12, 4, 0, 8],
[11, 5, 0, 8],
[1, 0, 15, 2],
],
"FLAT-BRAY-CURTIS",
)
result = abundance.search([12, 5, 0, 7], k=2)
print(result.ids)
print(result.distances)
For non-negative inputs the distance is in the range 0..1: zero means equal
abundance profiles and one means no shared positive abundance. Two all-zero
vectors compare as zero. Bray–Curtis currently supports exact Flat search only;
HNSW-BRAY-CURTIS and quantized variants are rejected.
Use a new metric on a named vector field¶
The lowercase accepted names are useful when defining a named vector field; the uppercase strings select its index implementation:
item_ids = ["item-0", "item-1", "item-2"]
items = db.require_collection(
"items",
dim=2,
default_index=None,
drop_if_exists=True,
)
items.add(
ids=item_ids,
vectors=[[1, 0], [0, 1], [1, 1]],
)
items.commit()
items.create_vector_field(
"topic_distribution",
dim=3,
metric="jensen_shannon",
)
items.add_named_vectors(
"topic_distribution",
vectors=topic_counts,
ids=item_ids,
)
items.build_index(
"FLAT-JENSEN-SHANNON",
field_name="topic_distribution",
)
result = items.search(
[60, 30, 10],
k=10,
vector_field="topic_distribution",
)
The IDs passed to add_named_vectors must already exist in the primary
collection. The field's metric is fixed when the field is created.
Geospatial coordinates¶
Haversine uses GeoJSON coordinate order and returns meters:
places = db.require_collection(
"places",
dim=2,
default_index=None,
drop_if_exists=True,
)
places.add(
ids=["shanghai", "beijing"],
vectors=[
[121.4737, 31.2304],
[116.4074, 39.9042],
],
)
places.commit()
places.build_index("HNSW-HAVERSINE")
nearest = places.search([121.50, 31.24], k=5, nprobe=64)
The dimension must be exactly two. Latitude must be between -90 and 90 degrees. Haversine models Earth as a mean-radius sphere; applications requiring ellipsoidal survey precision should re-rank the returned candidates with their preferred geodesic library.
Binary fingerprints¶
Flat binary search lazily converts stored rows into packed u64 words. The hot
representation uses one bit per dimension—up to 32x smaller than float32—while the
original mmap remains the durable source of truth. The packed cache is rebuilt
after writes and does not create another full-size vector copy.
All binary metrics interpret a component greater than 0.5 as one and any
other component as zero. Supplying explicit zero/one vectors makes that
contract easiest to see:
binary_vectors = [
[1, 1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 1, 0, 1, 1, 0],
]
binary_query = [1, 1, 0, 0, 1, 0, 0, 0]
hamming = metric_collection(
"fingerprints_hamming",
binary_vectors,
"FLAT-HAMMING-BINARY",
)
hamming_result = hamming.search(binary_query, k=3)
jaccard = metric_collection(
"fingerprints_jaccard",
binary_vectors,
"FLAT-JACCARD-BINARY",
)
jaccard_result = jaccard.search(binary_query, k=3)
tanimoto = metric_collection(
"fingerprints_tanimoto",
binary_vectors,
"FLAT-TANIMOTO-BINARY",
)
tanimoto_result = tanimoto.search(binary_query, k=3)
dice = metric_collection(
"fingerprints_dice",
binary_vectors,
"FLAT-DICE-BINARY",
)
dice_result = dice.search(binary_query, k=3)
print(hamming_result.distances)
print(jaccard_result.distances)
print(tanimoto_result.distances)
print(dice_result.distances)
- Hamming counts positions whose bits differ, so its range is
0..dimension. - Jaccard returns
1 - intersection / union. - Tanimoto is the domain name for binary Jaccard and produces the same values.
- Dice returns
1 - 2 * intersection / (count_a + count_b).
For Jaccard/Tanimoto and Dice, two empty bit sets compare at distance zero.
These metrics measure fingerprint similarity, not exact molecular substructure
matching. Hamming and Jaccard also have IVF-HAMMING-BINARY and
IVF-JACCARD-BINARY variants; Tanimoto and Dice are currently Flat-only.
Aligned profiles¶
Correlation distance is 1 - Pearson r. It ignores uniform shifts and scales,
which is useful when profile shape matters more than absolute magnitude:
profiles = metric_collection(
"profiles",
[
[1, 2, 3, 4, 5],
[10, 20, 30, 40, 50],
[5, 4, 3, 2, 1],
],
"FLAT-CORRELATION",
)
result = profiles.search([2, 4, 6, 8, 10], k=3)
print(result.ids)
print(result.distances)
Inputs must represent aligned positions or time buckets. This metric is not
Dynamic Time Warping and does not compensate for time shifts. Two identical
constant profiles have distance zero; other comparisons involving a constant
profile return distance one instead of NaN. Use HNSW-CORRELATION with
nprobe for approximate search on larger collections.
Probability distributions and histograms¶
Hellinger accepts non-negative counts or probabilities and normalizes each row internally. Its result lies between zero and one:
topic_weights = metric_collection(
"topic_weights",
[[70, 20, 10], [10, 20, 70], [34, 33, 33]],
"FLAT-HELLINGER",
)
result = topic_weights.search([60, 30, 10], k=3)
print(result.ids)
print(result.distances)
Counts do not need to sum to one. Negative inputs are invalid. Hellinger is a
good default when square-root geometry for probability distributions is
appropriate; HNSW-HELLINGER provides the approximate alternative.
Wasserstein-1D also normalizes each row, but preserves the order of bins. Moving mass by two bins costs twice as much as moving it by one bin:
latency_histograms = metric_collection(
"latency_histograms",
[
[80, 15, 5, 0],
[10, 70, 15, 5],
[0, 5, 15, 80],
],
"FLAT-WASSERSTEIN",
)
result = latency_histograms.search([70, 20, 10, 0], k=3)
print(result.ids)
print(result.distances)
Wasserstein assumes equal-width ordered bins and reports distance in bin-width
units. Inputs must be non-negative, but do not need to be pre-normalized.
Resample unequal histograms to a shared bin grid before insertion. Use
HNSW-WASSERSTEIN with nprobe when approximate search is appropriate.
Performance guidance¶
- L1 uses native AVX2 or NEON kernels.
- Chebyshev, Canberra, and Bray–Curtis use native AVX2 or NEON kernels and do not allocate per candidate.
- Flat binary metrics use packed words and hardware population counts.
- Haversine performs only the two-dimensional great-circle calculation and is well suited to HNSW when collections grow.
- Correlation, Hellinger, and Wasserstein scan rows without allocating temporary vectors. They favor memory efficiency; Hellinger's square roots make it more compute-intensive than L1, L2, or cosine.
- Jensen–Shannon uses AVX2/NEON vector logarithms. Exact Flat search lazily
caches inverse mass and entropy in two
float32values per row, then applies the entropy identity so the hot scan evaluates only one logarithm per dimension. Its dual-row kernel ranks squared distances and takes square roots only for the final top-k; very small distances are recomputed with the stablefloat64path. The cache adds eight bytes per row and no per-candidate allocation. - Use the reproducible benchmark with, for example,
python benchmarks/flat_search_bench.py --index-mode FLAT-TANIMOTO-BINARY --dim 2048on deployment hardware.