Tutorial: Metadata Filter Cookbook

Metadata filters are SQL-style where strings accepted by the search and query methods. This cookbook shows practical filter shapes and how to store fields so that those filters stay simple.

Example data

import numpy as np
import lynse

client = lynse.VectorDBClient(uri="./filter-cookbook")
db = client.create_database("filters", drop_if_exists=True)
collection = db.require_collection("docs", dim=4, drop_if_exists=True)

collection.add(
    ids=["acme-doc-1", "acme-draft-2", "globex-doc-3"],
    vectors=[
        [0.10, 0.20, 0.30, 0.40],
        [0.11, 0.19, 0.29, 0.39],
        [0.80, 0.10, 0.20, 0.10],
    ],
    fields=[
        {
            "tenant": "acme",
            "lang": "en",
            "rank": 1,
            "published": True,
            "tags": ["vector", "docs"],
            "created_at": "2026-06-01",
        },
        {
            "tenant": "acme",
            "lang": "en",
            "rank": 2,
            "published": False,
            "tags": ["draft"],
            "created_at": "2026-06-03",
        },
        {
            "tenant": "globex",
            "lang": "fr",
            "rank": 3,
            "published": True,
            "tags": ["archive", "docs"],
            "created_at": "2026-06-05",
        },
    ],
)

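# Build a FLAT-L2 index over the three vectors added above.
collection.build_index("FLAT-L2")
# Query vector: identical to the vector stored for acme-doc-1.
query = np.array([0.10, 0.20, 0.30, 0.40], dtype=np.float32)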

Equality

collection.search(query, k=10, where="tenant = 'acme'", return_fields=True)
collection.query(where="lang = 'fr'")

Use equality for tenant, language, source, status, and exact categories.

Numeric ranges

collection.search(query, k=10, where="rank >= 1 AND rank <= 2")
collection.query(where="rank < 3")

Keep numeric fields numeric. Avoid storing numbers as strings if you need range filters.
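
To see why, compare how Python orders numbers and numeric strings. This is a plain-Python illustration of lexicographic string ordering, not a statement about how the filter engine compares values internally; the sample ranks are made up for the demonstration.

```python
# Ranks stored as numbers compare by value.
numeric_ranks = [1, 2, 10]
in_range_numeric = [r for r in numeric_ranks if 1 <= r <= 2]

# The same ranks stored as strings compare character by character,
# so "10" falls between "1" and "2".
string_ranks = ["1", "2", "10"]
in_range_string = [r for r in string_ranks if "1" <= r <= "2"]

print(in_range_numeric)  # [1, 2]
print(in_range_string)   # ['1', '2', '10']
```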

Booleans

collection.search(query, k=10, where="published = true")
collection.query(where="published = false")

Use lowercase true and false in filter strings.
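
When a filter string is built from a Python value, note that str(True) is "True", not "true". A minimal sketch of a formatting helper follows; bool_literal is a hypothetical name, not part of the library.

```python
def bool_literal(value: bool) -> str:
    """Render a Python bool as a lowercase filter literal."""
    if not isinstance(value, bool):
        raise TypeError(f"expected bool, got {type(value).__name__}")
    return "true" if value else "false"


published = True
naive = f"published = {published}"                # "published = True"
where = f"published = {bool_literal(published)}"  # "published = true"
print(where)
```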

Arrays and tags

Use CONTAINS for array membership:

collection.search(query, k=10, where="tags CONTAINS 'docs'")
collection.query(where="tags CONTAINS 'archive'")

This is a good shape for tags, labels, permissions, or feature flags.
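
As a mental model, and assuming CONTAINS tests whether the array holds the given element, the filter tags CONTAINS 'docs' corresponds to the plain-Python check below. The records list only mirrors the ids and tags of the example data; nothing here calls the library.

```python
records = [
    {"id": "acme-doc-1", "tags": ["vector", "docs"]},
    {"id": "acme-draft-2", "tags": ["draft"]},
    {"id": "globex-doc-3", "tags": ["archive", "docs"]},
]

# Python list membership compares whole elements, so "doc" matches nothing.
with_docs = [r["id"] for r in records if "docs" in r["tags"]]
with_doc = [r["id"] for r in records if "doc" in r["tags"]]

print(with_docs)  # ['acme-doc-1', 'globex-doc-3']
print(with_doc)   # []
```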

IN lists

collection.search(query, k=10, where="rank IN (1, 3)")
collection.query(where="lang IN ('en', 'fr')")

Use IN when your application has a short allow-list. For a long list of known IDs, prefer filter_ids.
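
If the allow-list lives in application code, the IN clause can be generated. A minimal sketch: in_clause is a hypothetical helper, not a library function, and it rejects strings containing a single quote rather than guess at the filter syntax's escaping rules.

```python
def in_clause(field: str, values: list) -> str:
    """Build "field IN (...)" from a short list of ints or strings."""
    if not values:
        raise ValueError("IN list must not be empty")
    parts = []
    for v in values:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(v, bool) or not isinstance(v, (int, str)):
            raise TypeError(f"unsupported value: {v!r}")
        if isinstance(v, str):
            if "'" in v:
                raise ValueError(f"cannot safely quote: {v!r}")
            parts.append(f"'{v}'")
        else:
            parts.append(str(v))
    return f"{field} IN ({', '.join(parts)})"


print(in_clause("rank", [1, 3]))        # rank IN (1, 3)
print(in_clause("lang", ["en", "fr"]))  # lang IN ('en', 'fr')
```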

Dates and times

Store dates and times as ISO-8601 strings:

collection.search(
    query,
    k=10,
    where="created_at >= '2026-06-01' AND created_at <= '2026-06-30'",
)

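ISO-8601 strings written in one consistent format (same precision, same time zone) sort lexicographically in chronological order, so string comparisons work as date ranges.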
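
The ordering claim can be checked in plain Python. The sketch below sorts the same three dates as ISO-8601 strings and as day-first strings (a made-up alternative format, used only for comparison); only the ISO form sorts chronologically.

```python
from datetime import date

dates = [date(2026, 6, 5), date(2025, 12, 31), date(2026, 6, 1)]

iso = [d.isoformat() for d in dates]                 # e.g. "2026-06-05"
day_first = [d.strftime("%d/%m/%Y") for d in dates]  # e.g. "05/06/2026"

# Reference order: sort the real date objects, then format them.
chronological = [d.isoformat() for d in sorted(dates)]

print(sorted(iso) == chronological)  # True
print(sorted(day_first))             # ['01/06/2026', '05/06/2026', '31/12/2025']
```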

Compound filters

where = "tenant = 'acme' AND lang = 'en' AND published = true"
result = collection.search(query, k=10, where=where, return_fields=True)

Use AND for common pre-filters. It is usually the most predictable way to reduce candidate sets.

Use OR for simple alternatives:

collection.query(where="lang = 'en' OR lang = 'fr'")
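
When some conditions are optional, an AND filter can be assembled from whichever clauses apply. A minimal sketch: and_all is a hypothetical helper, and it assumes each clause is a simple comparison; combining AND with OR clauses would need grouping rules that this cookbook does not cover.

```python
from typing import Optional


def and_all(*clauses: Optional[str]) -> str:
    """Join the non-empty clauses with AND; skip None and blank strings."""
    kept = [c.strip() for c in clauses if c and c.strip()]
    if not kept:
        raise ValueError("at least one clause is required")
    return " AND ".join(kept)


lang = None  # e.g. the user did not pick a language
where = and_all(
    "tenant = 'acme'",
    f"lang = '{lang}'" if lang else None,
    "published = true",
)
print(where)  # tenant = 'acme' AND published = true
```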

Quoted field names

Simple identifiers can be unquoted:

where = "tenant = 'acme'"

Quote field names that contain punctuation, spaces, or reserved words:

where = "\"document.lang\" = 'en'"
collection.search(query, k=10, where=where)
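
The quoting rule can be applied mechanically when field names come from configuration. A minimal sketch: quote_field is a hypothetical helper, and its keyword set is an assumption drawn from the operators used in this cookbook, not an authoritative list of reserved words.

```python
import re

# Assumed keyword set, taken from the operators shown in this cookbook.
_KEYWORDS = {"and", "or", "in", "contains", "true", "false"}
_SIMPLE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_field(name: str) -> str:
    """Return name unchanged if it is a simple identifier, else double-quoted."""
    if '"' in name:
        # Escaping rules are not documented here, so refuse instead of guessing.
        raise ValueError(f"cannot safely quote: {name!r}")
    if _SIMPLE.fullmatch(name) and name.lower() not in _KEYWORDS:
        return name
    return f'"{name}"'


print(quote_field("tenant"))         # tenant
print(quote_field("document.lang"))  # "document.lang"
print(quote_field("in"))             # "in"
```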

Filter then retrieve vectors

rows = collection.query_vectors(where="tenant = 'acme' AND published = true")
print(rows.ids)
print(rows.vectors.shape)
print(rows.fields)

This is useful for exports, evaluation sets, and offline analysis.

Filter with vector search

result = collection.search(
    query,
    k=5,
    where="tenant = 'acme' AND tags CONTAINS 'docs'",
    return_fields=True,
)
print(result.to_list())

Filter with text and hybrid search

text = collection.bm25_search(
    "docs",
    k=5,
    text_fields=["tags"],
    where="tenant = 'acme'",
    return_fields=True,
)

hybrid = collection.hybrid_search(
    vector=query,
    text="docs",
    text_fields=["tags"],
    where="tenant = 'acme'",
    k=5,
    return_fields=True,
)

`filter_ids` instead of a filter expression

When you already know the IDs, use filter_ids:

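# Use the same IDs that were passed to collection.add() in the example data.
known_ids = ["acme-doc-1", "acme-draft-2", "globex-doc-3"]
rows = collection.query(filter_ids=known_ids)
vectors = collection.query_vectors(filter_ids=known_ids)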

This avoids building a long "id IN (...)" expression.

Empty query behavior

These calls return empty results:

collection.query()
collection.query_vectors()

This prevents accidental full scans. Pass a where expression or explicit filter_ids.
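
Because an unfiltered call returns nothing rather than raising, a forgotten filter can pass unnoticed. One option is to fail loudly in application code. A minimal sketch: require_filter is a hypothetical helper, and whether where and filter_ids may be combined in one call is not covered here; the helper simply forwards whatever was given.

```python
from typing import Optional, Sequence


def require_filter(where: Optional[str] = None,
                   filter_ids: Optional[Sequence] = None) -> dict:
    """Return query keyword arguments, or raise if no filter was given."""
    kwargs = {}
    if where and where.strip():
        kwargs["where"] = where
    if filter_ids:
        kwargs["filter_ids"] = list(filter_ids)
    if not kwargs:
        raise ValueError("pass a where expression or filter_ids")
    return kwargs


print(require_filter(where="lang = 'fr'"))  # {'where': "lang = 'fr'"}
```

The returned dict is meant to be unpacked into a call such as collection.query(**require_filter(where=...)).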

Field design checklist

Use low-to-medium-cardinality fields for frequent filters.
Keep data types stable within each field.
Use booleans for visibility and published flags.
Use ISO date strings for time ranges.
Use arrays plus CONTAINS for tags.
Keep raw text fields short enough for your text-search and result payload needs.
Use search_profile() when filter behavior or latency is surprising.
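
The "stable data types" item can be checked before ingestion. A minimal sketch in plain Python: mixed_type_fields is a hypothetical helper that scans the field dictionaries you are about to add and reports any field that appears with more than one Python type.

```python
from collections import defaultdict


def mixed_type_fields(rows: list) -> dict:
    """Map each field seen with more than one type to its sorted type names."""
    seen = defaultdict(set)
    for row in rows:
        for name, value in row.items():
            seen[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items() if len(types) > 1}


rows = [
    {"tenant": "acme", "rank": 1, "published": True},
    {"tenant": "globex", "rank": "3", "published": True},  # rank is a string here
]
print(mixed_type_fields(rows))  # {'rank': ['int', 'str']}
```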