Cluster Deployment and Maintenance¶
LynseDB cluster mode is a lightweight sharding layer for remote HTTP
deployments. You start several ordinary LynseDB servers as shard nodes, then
start a coordinator in front of them. For high availability, you may run extra
standby coordinators that use the same metadata authority. Applications keep using
VectorDBClient("http://coordinator:7637") just like a single remote server.
Use cluster mode when one server is not enough for your data size or query throughput, or when you want a primary plus replica layout for shard failover. For a first production deployment, start with one coordinator and at least two shard groups. Each shard group should have one primary and, if you need failover, one replica.
Architecture¶
Python client
|
v
Coordinator :7637
|
+-- shard group sg0
| +-- primary http://10.0.0.11:7638
| +-- replica http://10.0.0.12:7638
|
+-- shard group sg1
+-- primary http://10.0.0.21:7638
+-- replica http://10.0.0.22:7638
The coordinator owns cluster metadata and request routing:
- collection metadata is stored on the metadata owner shard(s);
- IDs are mapped to shard groups with stable hash buckets;
- writes are routed to the owning shard group;
- searches fan out to all shard groups and are merged by the coordinator;
- writes are mirrored to active replicas when
write_mirror_replicasis true; - a replica that misses a write is marked
staleand is not auto-promoted.
Shard nodes are normal LynseDB HTTP servers. They also start an internal custom
RPC listener automatically. The coordinator derives the internal RPC port from
the shard HTTP port: http_port + 10000, or http_port - 10000 for high
ports. For example, shard HTTP port 7638 uses internal RPC port 17638.
Search, batch search, add, upsert, delete, and restore use RPC when available
and fall back to HTTP when RPC is unavailable.
Before You Start¶
Install LynseDB on every node:
Plan these pieces before production:
| Item | Recommendation |
|---|---|
| Coordinator | Run one active coordinator for the simplest setup. For HA, run standby coordinators with the same metadata owner shard(s). |
| Shards | Run one LynseDB server per primary or replica data directory. |
| Data directories | Give every shard process its own persistent directory. Do not share one directory between nodes. |
| Network | Let clients reach the coordinator. Let the coordinator reach every shard HTTP port and derived RPC port. |
| Auth | Use --api-key on shards and --shard-api-key on the coordinator. Protect the coordinator with a private network or reverse proxy. |
| Backups | Back up every shard data directory. Metadata owner shard data directories include coordinator metadata. |
The coordinator currently does not enforce client authentication itself. If the coordinator is reachable outside a trusted network, put it behind a reverse proxy, gateway, firewall, or service mesh that provides authentication.
Coordinator Metadata Mental Model¶
Coordinator metadata includes routing tables, generated ID allocation, shard promotions, replica state, and the active coordinator lease. It is stored on metadata owner shard(s), so cluster mode does not need a shared disk.
By default, the coordinator infers metadata owners from cluster.json. If the
config has three or more shard primaries, the first three primaries are used as
replicated metadata owners with majority writes and automatic repair. Smaller
clusters use the first primary as a single metadata owner.
Use --metadata-owners only when you want explicit control:
| Owners | Behavior |
|---|---|
| omitted | Use the first three shard primaries when available; otherwise use the first primary. |
| one URI | Store coordinator metadata on that shard through internal RPC. |
| three or more URIs | Store coordinator metadata with majority CAS and repair stale or missing owners from the committed majority. |
--cluster-state is only a local cache path for the coordinator process. Each
coordinator can use its own local path, even if the path text is the same on
different machines. The authoritative metadata lives on the metadata owner
shard(s).
Quick Local Cluster¶
This example starts two shard groups on one machine. It is the easiest way to learn the moving parts before deploying on separate hosts.
Create a cluster config:
{
"bucket_count": 4096,
"write_mirror_replicas": true,
"shard_groups": [
{
"name": "sg0",
"primary": "http://127.0.0.1:7638",
"replicas": [
"http://127.0.0.1:7639"
]
},
{
"name": "sg1",
"primary": "http://127.0.0.1:7640",
"replicas": [
"http://127.0.0.1:7641"
]
}
]
}
Start the shard nodes in separate terminals:
lynse serve --host 127.0.0.1 --port 7638 --data-dir ./data/sg0-primary
lynse serve --host 127.0.0.1 --port 7639 --data-dir ./data/sg0-replica
lynse serve --host 127.0.0.1 --port 7640 --data-dir ./data/sg1-primary
lynse serve --host 127.0.0.1 --port 7641 --data-dir ./data/sg1-replica
Start the coordinator:
lynse serve \
--role coordinator \
--host 127.0.0.1 \
--port 7637 \
--cluster-config ./cluster.json \
--cluster-state ./cluster_state.cache.json
Check that the coordinator is running:
Use the coordinator from Python:
from lynse import VectorDBClient
client = VectorDBClient("http://127.0.0.1:7637")
db = client.create_database("demo", drop_if_exists=True)
collection = db.require_collection("docs", dim=4)
collection.add(
ids=["doc-1", "doc-2"],
vectors=[
[0.1, 0.2, 0.3, 0.4],
[0.4, 0.3, 0.2, 0.1],
],
fields=[
{"title": "first"},
{"title": "second"},
],
)
collection.commit()
result = collection.search([0.1, 0.2, 0.3, 0.4], k=2, return_fields=True)
print(result)
Production Layout¶
In production, run each shard primary and replica on separate machines or separate failure domains.
Example:
{
"bucket_count": 4096,
"write_mirror_replicas": true,
"shard_groups": [
{
"name": "sg0",
"primary": "http://10.0.0.11:7638",
"replicas": [
"http://10.0.0.12:7638"
]
},
{
"name": "sg1",
"primary": "http://10.0.0.21:7638",
"replicas": [
"http://10.0.0.22:7638"
]
}
]
}
Start each shard:
lynse serve \
--host 0.0.0.0 \
--port 7638 \
--data-dir /var/lib/lynsedb/shard \
--api-key "${LYNSE_SHARD_API_KEY}"
Start the coordinator:
lynse serve \
--role coordinator \
--host 0.0.0.0 \
--port 7637 \
--cluster-config /etc/lynsedb/cluster.json \
--cluster-state /var/lib/lynsedb/coordinator/cluster_state.cache.json \
--shard-api-key "${LYNSE_SHARD_API_KEY}" \
--health-interval-secs 1.0 \
--health-failures 3
For coordinator HA, run another coordinator with the same cluster.json. If
--metadata-owners is omitted, every coordinator infers the same metadata owner
set from the shard primaries:
lynse serve \
--role coordinator \
--host 0.0.0.0 \
--port 7637 \
--cluster-config /etc/lynsedb/cluster.json \
--cluster-state /var/lib/lynsedb/coordinator/cluster_state.cache.json \
--coordinator-uri http://node-a:7637 \
--shard-api-key "${LYNSE_SHARD_API_KEY}"
lynse serve \
--role coordinator \
--host 0.0.0.0 \
--port 7637 \
--cluster-config /etc/lynsedb/cluster.json \
--cluster-state /var/lib/lynsedb/coordinator/cluster_state.cache.json \
--coordinator-uri http://node-b:7637 \
--shard-api-key "${LYNSE_SHARD_API_KEY}"
In this mode, cluster_state.cache.json is only a local cache on each
coordinator. The authoritative metadata is stored on the inferred metadata
owner shard(s). With the two-shard example above, the first primary
http://10.0.0.11:7638 is the single metadata owner. With three or more shard
primaries, the first three primaries are used automatically and a stale or
restarted metadata owner is repaired from the committed majority.
If you prefer to pin a single metadata owner explicitly, add:
If you prefer to pin replicated metadata owners explicitly, pass three or more owners:
lynse serve \
--role coordinator \
--host 0.0.0.0 \
--port 7637 \
--cluster-config /etc/lynsedb/cluster.json \
--cluster-state /var/lib/lynsedb/coordinator/cluster_state.cache.json \
--metadata-owners http://10.0.0.11:7638,http://10.0.0.12:7638,http://10.0.0.21:7638 \
--coordinator-uri http://node-a:7637 \
--shard-api-key "${LYNSE_SHARD_API_KEY}"
On later coordinator restarts, keep using the same cluster.json and
--metadata-owners value if you set one. --cluster-state remains only the
local cache path. You may pass both --cluster-config and --cluster-state,
but existing metadata on the owner shard(s) is the source of truth after the
first start.
Configuration Reference¶
Cluster config fields:
| Field | Default | Description |
|---|---|---|
bucket_count |
4096 |
Number of routing buckets assigned to shard groups when a collection is created. Keep this value stable after data exists. |
write_mirror_replicas |
true |
Mirror writes to replicas whose state is active. |
shard_groups |
required | List of shard groups. Each group needs a primary URI and may have replicas. |
state_path |
cluster_state.cache.json |
Optional local coordinator metadata cache path when --cluster-state is not provided. |
shard_api_key |
none | Optional key used by the coordinator when forwarding requests to shards. |
metadata_owners or metadata.owners |
inferred from shard primaries | Optional metadata owner shard URI(s). Omit to use the first three primaries when available, or the first primary for small clusters. Use one URI to force single-owner metadata, or 3+ URIs for replicated metadata. |
Replica entries can be simple strings:
Or objects with an explicit state:
Coordinator CLI flags:
| Flag | Description |
|---|---|
--role coordinator |
Start coordinator mode instead of a normal shard server. |
--cluster-config |
JSON config used to seed metadata on first start and infer the default metadata owners. |
--cluster-state |
Local coordinator metadata cache path. It does not need to be shared between machines. |
--shard-api-key |
API key sent to shard nodes as Authorization: Bearer .... |
--metadata-owners |
Optional comma-separated metadata owner shard HTTP URIs. Omit to infer owners from shard primaries; provide 3+ URIs for replicated metadata. |
--coordinator-uri |
Advertised URI for this coordinator. Other coordinators proxy to this address when this process is leader. |
--coordinator-id |
Stable coordinator ID. Defaults to --coordinator-uri; most deployments do not need to set it. |
--coordinator-lease-secs |
Leader lease duration. Lower values detect coordinator failure faster but are more sensitive to storage or scheduling stalls. |
--request-timeout-secs |
Timeout for coordinator-to-shard requests. |
--health-interval-secs |
Interval between shard health probes. |
--health-failures |
Consecutive failed probes before a node is considered unhealthy. |
Environment variables are also supported:
export LYNSE_ROLE=coordinator
export LYNSE_CLUSTER_CONFIG=/etc/lynsedb/cluster.json
export LYNSE_CLUSTER_STATE=/var/lib/lynsedb/coordinator/cluster_state.cache.json
export LYNSE_SHARD_API_KEY=your_shard_key
export LYNSE_CLUSTER_METADATA_OWNERS=http://10.0.0.11:7638
export LYNSE_COORDINATOR_URI=http://node-a:7637
export LYNSE_HEALTH_INTERVAL_SECS=1.0
export LYNSE_HEALTH_FAILURES=3
How Routing Works¶
When a collection is created, the coordinator stores its routing table in the metadata authority. Each bucket points to one shard group.
For explicit string or integer IDs, LynseDB hashes the external ID and routes
the item to the owning shard group. For records that need generated integer
IDs, the coordinator allocates IDs and then routes them. This means clients can
continue to use normal add, upsert, delete, restore, search, and
query APIs through the coordinator.
Search requests are sent to every shard group. The coordinator asks one healthy node per group, merges the per-shard top results, and returns one result set to the client.
Health and Failover¶
Shard Failover¶
The coordinator probes every primary and replica. A node is marked unhealthy
after --health-failures consecutive failed probes.
If a primary is unhealthy and an active healthy replica exists in the same shard
group, the coordinator promotes that replica. The old primary is moved into the
replica list with state stale.
Check cluster state:
Look for:
primary: current primary URI for each shard group;primary_epoch: increments when a new primary is promoted;- replica
state:activereplicas can receive mirrored writes and can be promoted;stalereplicas cannot be promoted automatically; meta_epoch: increments when coordinator metadata changes.
Important behavior:
- If a primary write fails, the client request fails.
- If a replica write fails, the request can still succeed, but that replica is
marked
stale. - A stale replica may be missing data and must be rebuilt before it is marked active again.
- If a shard group has no healthy active node, reads and writes for that group will fail until a node is restored.
Coordinator Failover¶
Coordinator failover protects the coordinator process itself. It does not copy or repair shard data; shard primary promotion is handled separately by the active coordinator.
When coordinator failover is enabled, one coordinator holds the leader lease and serves as leader. Other coordinators stay online as standbys. A standby can accept client traffic and proxy it to the leader. If the leader stops renewing its lease, a standby takes over, reloads the latest metadata, and starts serving requests directly.
Check coordinator status:
Look for:
role:leader,standby, orsingle;coordinator_uri: the URI this coordinator advertises;leader.leader_uri: the current leader address;leader.lease_epoch: increments when leadership moves to another coordinator;metadata.mode:singleorreplicated;metadata.degraded: whether replicated metadata has unavailable or stale owners.
Maintenance Tasks¶
Restart a Shard¶
For a short restart of a healthy node:
- Stop the shard process.
- Start it again with the same
--data-dir, host, port, and API key. - Check
curl http://shard-host:7638/healthz. - Check
curl http://coordinator:7637/cluster_info.
If the node did not miss writes, it can continue serving. If it missed writes
and is marked stale, rebuild it before relying on it.
Restart the Coordinator¶
The coordinator keeps only a local metadata cache. Restart it with the same
cluster.json and the same --metadata-owners value if you set one:
lynse serve \
--role coordinator \
--host 0.0.0.0 \
--port 7637 \
--cluster-config /etc/lynsedb/cluster.json \
--cluster-state /var/lib/lynsedb/coordinator/cluster_state.cache.json \
--shard-api-key "${LYNSE_SHARD_API_KEY}"
Existing metadata is loaded from the metadata owner shard(s). The cache file can be deleted and rebuilt by the coordinator.
Rebuild a Stale Replica¶
A stale replica should be treated as out of date. Rebuild it from the current primary for that shard group.
Recommended process:
- Stop the stale replica.
- Keep it out of service while rebuilding.
- Take a consistent backup or filesystem copy from the current primary data directory during a maintenance window.
- Restore that copy into the replica data directory.
- Start the replica with the same port and API key.
- Verify the replica health endpoint.
- Update coordinator metadata so that replica state changes from
staletoactive. - Restart or refresh the coordinator so it observes the updated metadata.
Only mark a replica active after you are confident it has the same data as the
current primary.
Add Shards¶
New shard groups are used by collections created after the routing table
includes them. Existing collections keep their original bucket_to_group
mapping and are not automatically rebalanced.
For a new deployment, choose the shard group count before loading large data. For an existing deployment, the safest scaling path is:
- Start a new cluster config with the desired shard groups.
- Create a new database or collection.
- Re-ingest or migrate the data through the coordinator.
- Switch application traffic after validation.
Do not edit bucket_to_group for an existing collection unless you are also
moving the corresponding data and have tested the migration offline.
Back Up a Cluster¶
Back up these items together:
- every shard primary data directory;
- every replica data directory if you want faster replica restore;
- metadata owner shard data directories, which include coordinator metadata;
- the cluster config and service files.
Before a planned backup, run checkpoints through the coordinator:
from lynse import VectorDBClient
client = VectorDBClient("http://coordinator:7637")
db = client.get_database("demo")
collection = db.get_collection("docs")
collection.checkpoint()
For multiple collections, checkpoint each collection. Then snapshot or copy the metadata owner shard data directories and shard data directories as one backup set.
Upgrade¶
Use a maintenance window for upgrades:
- Back up all shard data directories, including metadata owner shards.
- Stop writes at the application layer.
- Checkpoint active collections.
- Upgrade replicas first and start them.
- Upgrade primaries.
- Restart the coordinator.
- Run smoke tests through the coordinator.
- Resume writes.
For large clusters, test the exact version upgrade on a staging copy first.
Monitoring¶
Monitor each shard as a normal LynseDB server:
curl http://10.0.0.11:7638/healthz
curl http://10.0.0.11:7638/readyz
curl http://10.0.0.11:7638/metrics
Monitor the coordinator:
curl http://coordinator:7637/
curl http://coordinator:7637/cluster_info
curl http://coordinator:7637/coordinator_status
Alert on:
- shard health or readiness failures;
- coordinator process down;
- unexpected coordinator leader changes;
- replica state changing to
stale; - unexpected
primary_epochchanges; - disk usage on shard data volumes;
- high query latency or write failures on shards.
Troubleshooting¶
| Symptom | What to check |
|---|---|
| Coordinator will not start | Pass --cluster-config, or pass --metadata-owners explicitly when no config is available. |
cluster config requires at least one shard group |
The config must contain shard_groups or shards with at least one entry. |
| Shard requests return unauthorized | Make sure shards use --api-key and the coordinator uses the same --shard-api-key. |
| Metadata owner is unreachable | Open the derived RPC port from each coordinator to each metadata owner shard. Metadata reads, writes, lease renewals, and owner repair use internal RPC. |
| Data RPC is not used | Open the derived RPC port between coordinator and shard. Ordinary shard data requests can fall back to HTTP, but metadata owner traffic requires RPC. |
metadata.degraded is true |
A replicated metadata owner is unavailable or behind. The coordinator repairs stale owners from the committed majority when they are reachable again. |
Replica becomes stale |
It missed a mirrored write. Rebuild it from the primary before marking it active. |
| Search misses recent writes | Confirm writes were sent to the coordinator, check shard health, then inspect cluster_info for stale or promoted nodes. |
| Existing data does not rebalance after adding a shard | Existing collection routing is fixed. Create a new collection and migrate or re-ingest. |
Operational Checklist¶
Before accepting production traffic:
- shard nodes have persistent data directories;
- metadata owner shard data directories are on persistent storage;
- coordinator-to-shard HTTP and derived RPC ports are reachable;
- shard authentication is configured if the shard network is not fully trusted;
- coordinator access is protected by network policy or a proxy;
/cluster_infoshows the expected primaries and active replicas;- a backup and restore procedure has been tested;
- application clients connect only to the coordinator.