3-Node Cluster Setup

Run a production-shaped 3-node Ferrosa cluster. This guide covers prerequisites, per-node configuration, formation, verification, and expected cold-start behavior after a full restart.

Developer Preview — Clustering durability caveat: Ferrosa clustering is in active hardening. Data written during cluster formation may have reduced durability until the Raft state machine commits topology and bootstrap streaming completes data redistribution across all replicas. Do not use this configuration for production data until you have run your own durability validation with a representative workload. See Operational Notes for general durability considerations.

Prerequisites

Three machines (or containers) with stable hostnames or IP addresses that can reach each other on TCP port 7000 (internode protocol).
Each node must be able to resolve the other nodes' FERROSA_INTERNODE_BROADCAST addresses at startup and on reconnect. Use hostnames where possible — they survive container IP churn. Static IPs are acceptable for bare-metal deployments.
Port 9042 (CQL) and port 9090 (web console / readiness probe) open for client and operator access on each node.
Shared S3-compatible storage (or a local MinIO instance) configured identically on all three nodes. All nodes must point at the same bucket.
Clocks synchronized (NTP or equivalent). Ferrosa tolerates up to 5 seconds of skew by default (FERROSA_CLOCK_MAX_SKEW_SECS). Larger skew causes Accord transaction validation failures.

How 3-node clustering works

A Ferrosa cluster uses two layers:

Raft consensus (via openraft) for cluster metadata: membership, schema, token assignments, and DDL replication. One node acts as the Raft leader; all DDL and membership changes are serialized through it.
CQL coordinator with a Murmur3 token ring for data routing. Reads and writes are routed to the subset of nodes that own the token range for each partition key, subject to the configured consistency level.
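
To make the routing layer concrete, here is a minimal sketch of a token ring lookup. It is not Ferrosa's implementation. The hash is a SHA-256 stand-in rather than Murmur3 (so the sketch runs on the standard library alone), each node owns a single token, and the names token_for, TokenRing, and replicas_for are hypothetical.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Stand-in hash: Ferrosa uses Murmur3; SHA-256 is used here only so the
    sketch runs with the standard library.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class TokenRing:
    """A ring where each node owns one token (an assumption of this sketch)."""

    def __init__(self, node_tokens: dict[str, int]) -> None:
        # Sort (token, node) pairs so lookups can use binary search.
        self._ring = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]

    def replicas_for(self, partition_key: str, rf: int) -> list[str]:
        """Return up to `rf` distinct nodes, walking clockwise from the owner."""
        # The owner is the first node whose token is >= the key's token;
        # the modulo below wraps around to the start of the ring.
        start = bisect.bisect_left(self._tokens, token_for(partition_key))
        replicas: list[str] = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in replicas:
                replicas.append(node)
            if len(replicas) == rf:
                break
        return replicas


ring = TokenRing({"node1": -(2**62), "node2": 0, "node3": 2**62})
print(ring.replicas_for("ok", rf=3))  # all three nodes, starting at the owner
print(ring.replicas_for("ok", rf=1))  # only the owner of the key's token range
```

With three nodes and a replication factor of 3, every node holds a replica of every partition, so the ring only decides the order in which replicas are listed.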

Formation proceeds through three states:

State	Condition	CQL ready?
Standalone	No seeds configured or seeds unreachable	Yes
Forming	Peers connected; waiting for Raft leader election	No
Cluster	Raft leader elected; data routing active	Yes

A Raft quorum of 2 out of 3 nodes is required for leader election and for DDL operations. With a replication factor of 3, data reads and writes at LOCAL_QUORUM also require 2 of 3 replicas.
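
The 2-of-3 figure follows from the usual majority rule. The short sketch below works through the arithmetic for a few cluster sizes; the function names are illustrative only and are not part of any Ferrosa API.

```python
def quorum(members: int) -> int:
    """Smallest strict majority of `members` voters or replicas."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be down while a quorum is still reachable."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} down")
# 3 members: quorum 2, tolerates 1 down
```

A 3-node cluster therefore keeps electing leaders and serving quorum reads and writes with one node down, but not with two.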

Per-node configuration

The following environment variables are required on every node. Values differ per node where indicated.

Variable	node1	node2	node3
`FERROSA_INTERNODE_BIND`	`0.0.0.0:7000` (same on all)
`FERROSA_INTERNODE_BROADCAST`	`node1:7000`	`node2:7000`	`node3:7000`
`FERROSA_SEED`	`node2:7000,node3:7000`	`node1:7000,node3:7000`	`node1:7000,node2:7000`
`FERROSA_CLUSTER_NAME`	`my-cluster` (must match across all nodes)
`FERROSA_DATA_DIR`	`/var/lib/ferrosa` (local per-node volume)
`FERROSA_S3_ENDPOINT`	Same S3 endpoint on all nodes
`FERROSA_S3_BUCKET`	Same bucket on all nodes

Set the seeds list to include the other nodes — never list a node as its own seed. Each entry is a hostname:port or ip:port on the internode port (default 7000).
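
If you template node configuration, the seed rule is easy to encode. The following is a minimal sketch, assuming every node uses the same internode port; seed_lists is a hypothetical helper, not a Ferrosa tool.

```python
def seed_lists(hosts: list[str], port: int = 7000) -> dict[str, str]:
    """Build a FERROSA_SEED value for each host, excluding the host itself."""
    seeds = {}
    for host in hosts:
        others = [f"{h}:{port}" for h in hosts if h != host]
        seeds[host] = ",".join(others)
    return seeds


for host, value in seed_lists(["node1", "node2", "node3"]).items():
    print(f"{host}: FERROSA_SEED={value}")
# node1: FERROSA_SEED=node2:7000,node3:7000
# node2: FERROSA_SEED=node1:7000,node3:7000
# node3: FERROSA_SEED=node1:7000,node2:7000
```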

Minimal per-node environment (bare-metal example)

# node1
FERROSA_DATA_DIR=/var/lib/ferrosa
FERROSA_CLUSTER_NAME=my-cluster
FERROSA_INTERNODE_BIND=0.0.0.0:7000
FERROSA_INTERNODE_BROADCAST=node1.internal:7000
FERROSA_SEED=node2.internal:7000,node3.internal:7000
FERROSA_S3_ENDPOINT=https://s3.us-east-1.amazonaws.com
FERROSA_S3_BUCKET=my-ferrosa-cluster
FERROSA_S3_REGION=us-east-1

# node2
FERROSA_DATA_DIR=/var/lib/ferrosa
FERROSA_CLUSTER_NAME=my-cluster
FERROSA_INTERNODE_BIND=0.0.0.0:7000
FERROSA_INTERNODE_BROADCAST=node2.internal:7000
FERROSA_SEED=node1.internal:7000,node3.internal:7000
FERROSA_S3_ENDPOINT=https://s3.us-east-1.amazonaws.com
FERROSA_S3_BUCKET=my-ferrosa-cluster
FERROSA_S3_REGION=us-east-1

# node3
FERROSA_DATA_DIR=/var/lib/ferrosa
FERROSA_CLUSTER_NAME=my-cluster
FERROSA_INTERNODE_BIND=0.0.0.0:7000
FERROSA_INTERNODE_BROADCAST=node3.internal:7000
FERROSA_SEED=node1.internal:7000,node2.internal:7000
FERROSA_S3_ENDPOINT=https://s3.us-east-1.amazonaws.com
FERROSA_S3_BUCKET=my-ferrosa-cluster
FERROSA_S3_REGION=us-east-1

Broadcast address: FERROSA_INTERNODE_BROADCAST is the address other nodes use to connect back to this node. Use a hostname (not an IP) wherever possible so the address remains valid if the container or VM gets a new IP. Ferrosa re-resolves broadcast hostnames on each connection attempt.
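
Before starting the nodes, it can help to lint the three environments against the rules in this section. The sketch below checks only two of them: values that must match on every node, and a node listing itself as a seed. The helpers check_cluster_env and example_env are hypothetical, and the check is a plain dictionary comparison rather than anything Ferrosa ships.

```python
# Variables this guide says must be identical on every node.
SHARED_KEYS = ("FERROSA_CLUSTER_NAME", "FERROSA_S3_ENDPOINT", "FERROSA_S3_BUCKET")


def check_cluster_env(envs: dict[str, dict[str, str]]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for key in SHARED_KEYS:
        values = {env.get(key) for env in envs.values()}
        if len(values) > 1:
            problems.append(f"{key} differs across nodes: {sorted(map(str, values))}")
    for name, env in envs.items():
        own = env.get("FERROSA_INTERNODE_BROADCAST", "")
        seeds = [s.strip() for s in env.get("FERROSA_SEED", "").split(",") if s.strip()]
        if own and own in seeds:
            problems.append(f"{name} lists itself ({own}) as a seed")
    return problems


def example_env(n: int) -> dict[str, str]:
    """Environment for node n, mirroring the bare-metal example in this guide."""
    others = [m for m in (1, 2, 3) if m != n]
    return {
        "FERROSA_CLUSTER_NAME": "my-cluster",
        "FERROSA_INTERNODE_BROADCAST": f"node{n}.internal:7000",
        "FERROSA_SEED": ",".join(f"node{m}.internal:7000" for m in others),
        "FERROSA_S3_ENDPOINT": "https://s3.us-east-1.amazonaws.com",
        "FERROSA_S3_BUCKET": "my-ferrosa-cluster",
    }


envs = {f"node{n}": example_env(n) for n in (1, 2, 3)}
print(check_cluster_env(envs))  # []
```

An empty result does not prove the configuration is correct; it only rules out the two mistakes the sketch looks for.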

Formation procedure

Ferrosa cluster formation is automatic once all three nodes can reach each other. Start all three nodes simultaneously, or at least within the election timeout window (about 6 seconds at default settings; see Expected cold-start behavior):

Start node1, node2, and node3 with the configuration above.
Each node connects to its seeds and transitions from Standalone to Forming.
Raft leader election runs automatically among the three nodes. No operator action is required.
Once a leader is elected, Ferrosa transitions to Cluster mode and begins routing CQL traffic through the token ring.
Bootstrap streaming redistributes any data written in standalone mode to the correct token owners.

Simultaneous start: Starting all three nodes within a few seconds of each other gives the fastest convergence. If node3 is delayed by more than the election timeout window (~6 seconds at default settings), the other two may elect a leader before node3 joins — which is fine, but node3 will receive a snapshot from the leader on its first connection.

There is no explicit "join" command. Set the seeds, start the node, and formation happens automatically.

Verification

Readiness probe

Check the /readyz endpoint on each node's web console port (default 9090). It returns 200 OK with {"ready":true} once a Raft leader is present and the node is serving CQL traffic:

# Check node1
curl -s http://node1:9090/readyz | python3 -m json.tool

# Expected output when ready:
{"ready": true}

# Expected output while Raft is still converging:
{"ready": false, "waiting_for": "raft_leader", "detail": "no raft leader elected yet"}

HTTP status: /readyz returns 200 when ready, 503 when not ready. Orchestrators should check the HTTP status code, not just the body.
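
For scripts that cannot rely on an orchestrator, a status-code-based wait might look like the sketch below. It assumes only what this guide states about /readyz (200 when ready, 503 otherwise). The function names, the 5-second poll interval, and the 300-second deadline are illustrative choices, and the fetch, sleep, and clock parameters exist only to make the loop testable without a network.

```python
import time
import urllib.error
import urllib.request


def readyz_status(url: str, timeout_s: float = 2.0) -> int:
    """Return the HTTP status of a GET to `url`, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while the node is not ready
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, or timeout


def wait_until_ready(url, deadline_s=300.0, interval_s=5.0,
                     fetch=readyz_status, sleep=time.sleep, clock=time.monotonic):
    """Poll until `fetch(url)` returns 200 or `deadline_s` has elapsed."""
    start = clock()
    while True:
        if fetch(url) == 200:
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

Calling wait_until_ready("http://node1:9090/readyz") returns True as soon as the node answers 200 and False once the deadline passes, so a caller can decide whether to keep waiting or escalate.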

Cluster mode via API

curl -s http://node1:9090/api/cluster/status | python3 -m json.tool
# Shows: {"mode": "Cluster", "role": null, "host_id": "..."}

Cluster status via ferrosa-ctl

ferrosa-ctl status
ferrosa-ctl topology   # Shows token ring with all 3 nodes in Normal state

Peer list via CQL

cqlsh node1 9042
cqlsh> SELECT peer, data_center, rack FROM system.peers;

# Should show 2 rows — the other two nodes

Write a row and read it back across nodes

# On node1: create keyspace with RF=3 for full data distribution
cqlsh node1 9042 -e "
  CREATE KEYSPACE demo WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
  };
  CREATE TABLE demo.ping (id text PRIMARY KEY, ts timestamp);
  INSERT INTO demo.ping (id, ts) VALUES ('ok', toTimestamp(now()));
"

# Read from node3 at QUORUM to confirm replication
cqlsh node3 9042 -e "
  CONSISTENCY QUORUM;
  SELECT * FROM demo.ping;
"

Expected cold-start behavior

After a full cluster restart (all three nodes stopped and restarted), Ferrosa nodes go through this sequence:

Listeners bind — CQL (:9042), web (:9090), and internode (:7000) listeners accept connections. The node is not yet ready to serve CQL.
Peer connections form — Nodes dial their seeds and establish internode connections. This typically takes 5–30 seconds depending on container startup ordering.
Raft leader election — openraft pre-vote rounds run until one node wins a quorum of votes. In normal conditions this converges in 3–15 seconds after peer connections form.
DDL path activates — The winning leader swaps the DDL path to Cluster mode. Schema is re-applied from the Raft log. This produces the log line: raft leader elected, swapping DDL path to Cluster.
Cluster mode active — /readyz returns 200. CQL clients can connect and execute queries.

Pre-vote convergence can take up to several minutes in some conditions. A known edge case: if all three nodes have a non-empty Raft log from before the restart, pre-vote rounds may back off for up to ~3 minutes before a leader is elected. This is caused by openraft 0.9’s pre-vote quorum check interacting with the full-cluster restart sequence. During this window, /readyz returns 503 and CQL requests return errors.

The TCP port is bound and containers appear “healthy” by naive TCP probes immediately — this is why the readiness probe exists. Scripts that wait for “all containers healthy” should use /readyz, not a TCP check.
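
The difference between the two probe styles can be reproduced locally. The sketch below starts a throwaway HTTP server that always answers 503, as a stand-in for a node whose listeners are bound but which has no Raft leader yet, and then runs both probes against it. Nothing here talks to Ferrosa; NotReadyHandler, tcp_probe, and readyz_probe are hypothetical names.

```python
import http.server
import socket
import threading
import urllib.error
import urllib.request


class NotReadyHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for a node that has bound its port but is not ready."""

    def do_GET(self):
        body = b'{"ready": false, "waiting_for": "raft_leader"}'
        self.send_response(503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def tcp_probe(host: str, port: int) -> bool:
    """Naive check: does the port accept a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False


def readyz_probe(url: str) -> bool:
    """Readiness check: does a GET return HTTP 200?"""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # includes HTTP 503, which urllib raises as HTTPError


# Bind to port 0 so the OS picks a free port for the demo server.
server = http.server.HTTPServer(("127.0.0.1", 0), NotReadyHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

tcp_ok = tcp_probe("127.0.0.1", port)
http_ok = readyz_probe(f"http://127.0.0.1:{port}/readyz")
print("tcp probe:", tcp_ok)      # tcp probe: True
print("readyz probe:", http_ok)  # readyz probe: False

server.shutdown()
server.server_close()
```

The TCP probe reports success even though the server is refusing to serve, which is the failure mode a /readyz-based healthcheck avoids.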

If convergence takes longer than 5 minutes, restart the cluster again. A second restart from the same persisted state usually converges quickly.

Readiness probe reference

The /readyz endpoint is available on the web console port (default 9090) without authentication.

Mode	HTTP status	Condition
Standalone	200	Always ready (no peers)
Pair / Degraded	200	Ready (pair HA or stale reads)
Forming (no Raft)	503	Raft not yet initialized
Forming / Cluster (no leader)	503	Leader election in progress
Cluster (leader present)	200	Ready to serve CQL

Response body when not ready:

{"ready": false, "waiting_for": "raft_leader", "detail": "no raft leader elected yet"}

The waiting_for field always names the blocking condition so log-scraping scripts and operators can distinguish “Raft not initialized” from “election in progress”.
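
A script can turn the status code and body into a one-line summary for logs. The sketch below assumes only the fields shown in this guide (ready, waiting_for, detail); raft_leader is the only waiting_for value documented here, so any other value is passed through as-is. describe_readiness is a hypothetical helper.

```python
import json


def describe_readiness(status: int, body: str) -> str:
    """Summarize a /readyz response as a single line."""
    if status == 200:
        return "ready"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return f"not ready (HTTP {status}): unparseable body"
    if not isinstance(payload, dict):
        return f"not ready (HTTP {status}): unexpected body"
    waiting_for = payload.get("waiting_for", "unknown")
    detail = payload.get("detail", "")
    suffix = f" ({detail})" if detail else ""
    return f"not ready (HTTP {status}): waiting for {waiting_for}{suffix}"


body = '{"ready": false, "waiting_for": "raft_leader", "detail": "no raft leader elected yet"}'
print(describe_readiness(503, body))
# not ready (HTTP 503): waiting for raft_leader (no raft leader elected yet)
```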

Docker Compose quickstart

The repository’s docker-compose.yml runs a 3-node cluster with a local RustFS (S3-compatible) backend. Use it for local integration testing. It uses /readyz as the healthcheck:

# Start all services
docker compose up -d

# Wait for all three nodes to report ready
for port in 9090 9091 9092; do
  echo -n "node (port $port): "
  until curl -sf http://127.0.0.1:$port/readyz >/dev/null 2>&1; do
    sleep 5; printf ".";
  done
  echo " ready"
done

# Connect to node1
cqlsh 127.0.0.1 9042

Auth in Docker Compose: The default docker-compose.yml runs with FERROSA_AUTH_DISABLED=true for local development convenience. To enable auth, use the included overlay:

docker compose -f docker-compose.yml -f docker-compose.secure.yml up -d

After formation, rotate the default ferrosa_admin password via cqlsh:

cqlsh -u ferrosa_admin -p ferrosa_admin 127.0.0.1 9042
cqlsh> ALTER ROLE ferrosa_admin WITH PASSWORD='your-strong-password';

Troubleshooting

/readyz returns 503 for more than 5 minutes

Check the logs for the blocking condition:

docker logs <node> | grep -E "pre-vote|leader|forming|raft"

Repeated “pre-vote round did not reach quorum” lines: nodes cannot reach each other on port 7000. Check firewall rules and DNS resolution of the FERROSA_INTERNODE_BROADCAST addresses.
“raft not yet initialized” in the /readyz body: the background Raft init task has not completed yet. Wait 10–30 seconds after peer connections appear in the logs.
No log output from one or more nodes: at least one node has not started. Check docker compose ps.
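
The log checks above can be scripted as a rough first pass. The sketch below matches on the message fragments quoted in this guide; the exact log format (timestamps, levels, field order) is not specified here, so it uses plain substring matching. The triage function and its threshold of three repeats are assumptions made for illustration.

```python
# Message fragments taken from the log lines quoted in this guide.
PRE_VOTE_FAILED = "pre-vote round did not reach quorum"
LEADER_ELECTED = "raft leader elected"


def triage(log_text: str, repeat_threshold: int = 3) -> str:
    """Suggest a next step from one node's log output."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    if not lines:
        return "no log output: check that the node has started"
    if any(LEADER_ELECTED in line for line in lines):
        return "leader elected: the election is not the blocker"
    failures = sum(PRE_VOTE_FAILED in line for line in lines)
    if failures >= repeat_threshold:
        return "repeated pre-vote failures: check port 7000 reachability and DNS"
    return "no known pattern: keep waiting and re-check"


sample = "\n".join(["WARN pre-vote round did not reach quorum"] * 4)
print(triage(sample))
# repeated pre-vote failures: check port 7000 reachability and DNS
```

Feed it the output of the docker logs command for each node in turn. The sketch does not consider line order, so a log that spans several restarts may need manual inspection.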

One node is stuck in Forming after the other two have a leader

The lagging node will receive an InstallSnapshot from the leader and catch up automatically. If it has not caught up after 2 minutes:

docker compose restart node3   # or whichever node is lagging

The restarted node re-dials its seeds, loads the Raft snapshot, and joins the cluster. It does not need a clean data directory.

Cluster mode but reads return errors

Verify that bootstrap streaming has completed by checking that all three nodes are in Normal state:

ferrosa-ctl topology

If a node shows Joining state, bootstrap streaming is still in progress. CQL reads at QUORUM will succeed once all replicas are in Normal state.

Next steps

Getting Started guide: single-node setup, CQL drivers, configuration reference, and architecture overview.

CQL Compatibility reference: full list of supported statements, types, and functions.