3-Node Cluster Setup

Run a production-shaped 3-node Ferrosa cluster. This guide covers prerequisites, per-node configuration, formation, verification, and expected cold-start behavior after a full restart.

Developer Preview — Clustering durability caveat: Ferrosa clustering is in active hardening. Data written during cluster formation may have reduced durability until the Raft state machine commits topology and bootstrap streaming completes data redistribution across all replicas. Do not use this configuration for production data until you have run your own durability validation with a representative workload. See Operational Notes for general durability considerations.

Prerequisites

How 3-node clustering works

A Ferrosa cluster uses two layers:

  1. Raft consensus (via openraft) for cluster metadata: membership, schema, token assignments, and DDL replication. One node acts as the Raft leader; all DDL and membership changes are serialized through it.
  2. CQL coordinator with a Murmur3 token ring for data routing. Reads and writes are routed to the subset of nodes that own the token range for each partition key, subject to the configured consistency level.

Formation proceeds through three states:

State       Condition                                          CQL ready?
Standalone  No seeds configured or seeds unreachable           Yes
Forming     Peers connected; waiting for Raft leader election  No
Cluster     Raft leader elected; data routing active           Yes

A Raft quorum of 2 out of 3 nodes is required for leader election and for DDL operations. With a replication factor of 3, data reads and writes at LOCAL_QUORUM likewise require 2 of 3 replicas to respond.
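
The 2-of-3 figure is ordinary majority arithmetic: floor(n / 2) + 1. A minimal sketch of that calculation follows; the function names are illustrative and not part of any Ferrosa tooling.

```shell
# Majority quorum for n voters or replicas: floor(n / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can be down while a majority is still reachable.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

echo "3 nodes: quorum=$(quorum 3), tolerated failures=$(tolerated_failures 3)"
```

With three nodes, one can be down and both the Raft layer and quorum-level reads and writes still have a majority.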

Per-node configuration

The following environment variables are required on every node. Values differ per node where indicated.

Variable                     node1                  node2                  node3
FERROSA_INTERNODE_BIND       0.0.0.0:7000 (same on all nodes)
FERROSA_INTERNODE_BROADCAST  node1:7000             node2:7000             node3:7000
FERROSA_SEED                 node2:7000,node3:7000  node1:7000,node3:7000  node1:7000,node2:7000
FERROSA_CLUSTER_NAME         my-cluster (must match across all nodes)
FERROSA_DATA_DIR             /var/lib/ferrosa (local per-node volume)
FERROSA_S3_ENDPOINT          Same S3 endpoint on all nodes
FERROSA_S3_BUCKET            Same bucket on all nodes

Set the seeds list to include the other nodes — never list a node as its own seed. Each entry is a hostname:port or ip:port on the internode port (default 7000).
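
One way to avoid hand-editing seed lists is to derive each node's list from a single list of cluster members, dropping the node itself. This is a sketch only; seeds_for is a hypothetical helper, and the node names and port are the ones used in the table above.

```shell
# seeds_for SELF PORT NODE...
# Prints a comma-separated seed list containing every NODE except SELF.
seeds_for() {
  self=$1; port=$2; shift 2
  seeds=""
  for node in "$@"; do
    if [ "$node" != "$self" ]; then
      # Append with a comma only when the list is already non-empty.
      seeds="${seeds:+$seeds,}$node:$port"
    fi
  done
  echo "$seeds"
}

for n in node1 node2 node3; do
  echo "$n: FERROSA_SEED=$(seeds_for "$n" 7000 node1 node2 node3)"
done
```

Generating the value this way makes it harder to list a node as its own seed by accident.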

Minimal per-node environments (bare-metal example)

# node1
FERROSA_DATA_DIR=/var/lib/ferrosa
FERROSA_CLUSTER_NAME=my-cluster
FERROSA_INTERNODE_BIND=0.0.0.0:7000
FERROSA_INTERNODE_BROADCAST=node1.internal:7000
FERROSA_SEED=node2.internal:7000,node3.internal:7000
FERROSA_S3_ENDPOINT=https://s3.us-east-1.amazonaws.com
FERROSA_S3_BUCKET=my-ferrosa-cluster
FERROSA_S3_REGION=us-east-1

# node2
FERROSA_DATA_DIR=/var/lib/ferrosa
FERROSA_CLUSTER_NAME=my-cluster
FERROSA_INTERNODE_BIND=0.0.0.0:7000
FERROSA_INTERNODE_BROADCAST=node2.internal:7000
FERROSA_SEED=node1.internal:7000,node3.internal:7000
FERROSA_S3_ENDPOINT=https://s3.us-east-1.amazonaws.com
FERROSA_S3_BUCKET=my-ferrosa-cluster
FERROSA_S3_REGION=us-east-1

# node3
FERROSA_DATA_DIR=/var/lib/ferrosa
FERROSA_CLUSTER_NAME=my-cluster
FERROSA_INTERNODE_BIND=0.0.0.0:7000
FERROSA_INTERNODE_BROADCAST=node3.internal:7000
FERROSA_SEED=node1.internal:7000,node2.internal:7000
FERROSA_S3_ENDPOINT=https://s3.us-east-1.amazonaws.com
FERROSA_S3_BUCKET=my-ferrosa-cluster
FERROSA_S3_REGION=us-east-1

Broadcast address: FERROSA_INTERNODE_BROADCAST is the address other nodes use to connect back to this node. Use a hostname (not an IP) wherever possible so the address remains valid if the container or VM gets a new IP. Ferrosa re-resolves broadcast hostnames on each connection attempt.
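
A small pre-flight check can catch the two configuration mistakes described above before a node starts: an IP literal used as the broadcast address, and a node that appears in its own seed list. This is a sketch under stated assumptions: check_node_env and its helpers are hypothetical names, and the IPv4 test is a simple pattern match rather than full address validation.

```shell
# host_of ADDR: strip the trailing :port from a host:port address.
host_of() {
  echo "${1%:*}"
}

# is_ipv4_literal HOST: succeeds if HOST looks like a dotted-quad IPv4 address.
is_ipv4_literal() {
  echo "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# check_node_env BROADCAST SEEDS
# Prints one message per problem found and returns non-zero if there is any.
check_node_env() {
  broadcast=$1; seeds=$2; rc=0
  if is_ipv4_literal "$(host_of "$broadcast")"; then
    echo "warning: broadcast address $broadcast is an IP literal; prefer a hostname"
    rc=1
  fi
  # Wrap the list in commas so that only whole entries can match.
  case ",$seeds," in
    *",$broadcast,"*)
      echo "error: $broadcast appears in its own seed list"
      rc=1
      ;;
  esac
  return $rc
}

check_node_env "node1.internal:7000" "node2.internal:7000,node3.internal:7000" \
  && echo "node1 configuration looks consistent"
```

In practice the two arguments would come from the node's own FERROSA_INTERNODE_BROADCAST and FERROSA_SEED values.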

Formation procedure

Ferrosa cluster formation is automatic once all three nodes can reach each other. Start all three nodes simultaneously (or within the election timeout window — see cold-start behavior):

  1. Start node1, node2, and node3 with the configuration above.
  2. Each node connects to its seeds and transitions from Standalone to Forming.
  3. Raft leader election runs automatically among the three nodes. No operator action is required.
  4. Once a leader is elected, Ferrosa transitions to Cluster mode and begins routing CQL traffic through the token ring.
  5. Bootstrap streaming redistributes any data written in standalone mode to the correct token owners.

Simultaneous start: Starting all three nodes within a few seconds of each other gives the fastest convergence. If node3 is delayed by more than the election timeout window (~6 seconds at default settings), the other two may elect a leader before node3 joins. That is fine, but node3 will then receive a snapshot from the leader on its first connection.

There is no explicit "join" command. Set the seeds, start the node, and formation happens automatically.

Verification

Readiness probe

Check the /readyz endpoint on each node's web console port (default 9090). It returns 200 OK with {"ready":true} once a Raft leader is present and the node is serving CQL traffic:

# Check node1
curl -s http://node1:9090/readyz | python3 -m json.tool

# Expected output when ready:
{"ready": true}

# Expected output while Raft is still converging:
{"ready": false, "waiting_for": "raft_leader", "detail": "no raft leader elected yet"}

HTTP status: /readyz returns 200 when ready, 503 when not ready. Orchestrators should check the HTTP status code, not just the body.
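
For scripts that need the status code rather than the body, curl can print it directly. The sketch below assumes the default web console port and uses hypothetical helper names; the three verdict strings are illustrative only.

```shell
# readyz_code HOST:PORT: print only the HTTP status code returned by /readyz.
# curl prints 000 when the connection itself fails.
readyz_code() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://$1/readyz"
}

# classify_readyz CODE: map a status code to a probe verdict.
classify_readyz() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "not-ready" ;;
    *)   echo "unreachable-or-unexpected" ;;
  esac
}

# Usage against a live node (not run here):
#   classify_readyz "$(readyz_code node1:9090)"
```

Treating anything other than 200 or 503 as a separate case helps distinguish a node that is still converging from one that is not reachable at all.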

Cluster mode via API

curl -s http://node1:9090/api/cluster/status | python3 -m json.tool
# Shows: {"mode": "Cluster", "role": null, "host_id": "..."}

Cluster status via ferrosa-ctl

ferrosa-ctl status
ferrosa-ctl topology   # Shows token ring with all 3 nodes in Normal state

Peer list via CQL

cqlsh node1 9042
cqlsh> SELECT peer, data_center, rack FROM system.peers;

# Should show 2 rows — the other two nodes

Write a row and read it back across nodes

# On node1: create keyspace with RF=3 for full data distribution
cqlsh node1 9042 -e "
  CREATE KEYSPACE demo WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
  };
  CREATE TABLE demo.ping (id text PRIMARY KEY, ts timestamp);
  INSERT INTO demo.ping (id, ts) VALUES ('ok', toTimestamp(now()));
"

# Read from node3 at QUORUM to confirm replication
cqlsh node3 9042 -e "
  CONSISTENCY QUORUM;
  SELECT * FROM demo.ping;
"

Expected cold-start behavior

After a full cluster restart (all three nodes stopped and restarted), Ferrosa nodes go through this sequence:

  1. Listeners bind — CQL (:9042), web (:9090), and internode (:7000) listeners accept connections. The node is not yet ready to serve CQL.
  2. Peer connections form — Nodes dial their seeds and establish internode connections. This typically takes 5–30 seconds depending on container startup ordering.
  3. Raft leader election — openraft pre-vote rounds run until one node wins a quorum of votes. In normal conditions this converges in 3–15 seconds after peer connections form.
  4. DDL path activates — The winning leader swaps the DDL path to Cluster mode. Schema is re-applied from the Raft log. This produces the log line: raft leader elected, swapping DDL path to Cluster.
  5. Cluster mode active: /readyz returns 200. CQL clients can connect and execute queries.

Pre-vote convergence can take up to several minutes in some conditions. A known edge case: if all three nodes have a non-empty Raft log from before the restart, pre-vote rounds may back off for up to ~3 minutes before a leader is elected. This is caused by openraft 0.9’s pre-vote quorum check interacting with the full-cluster restart sequence. During this window, /readyz returns 503 and CQL requests return errors.

The TCP port is bound and containers appear “healthy” by naive TCP probes immediately — this is why the readiness probe exists. Scripts that wait for “all containers healthy” should use /readyz, not a TCP check.

If convergence takes longer than 5 minutes, restart the cluster again. A second restart from the same persisted state usually converges quickly.
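
The 5-minute rule can be applied in scripts by polling /readyz with a deadline instead of looping forever. The following is a sketch; wait_until and node_ready are hypothetical helpers, and the 300-second timeout simply mirrors the guidance above.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 once the deadline has passed.
wait_until() {
  timeout_s=$1; interval_s=$2; shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
  return 0
}

# node_ready HOST:PORT: succeeds only when /readyz answers with a 2xx status.
node_ready() {
  curl -sf --max-time 5 "http://$1/readyz" >/dev/null 2>&1
}

# Usage against a live node (not run here):
#   wait_until 300 5 node_ready node1:9090 || echo "node1 not ready after 5 minutes"
```

A non-zero return after the timeout is the signal to collect logs and consider restarting the cluster.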

Readiness probe reference

The /readyz endpoint is available on the web console port (default 9090) without authentication.

Mode                           HTTP status  Condition
Standalone                     200          Always ready (no peers)
Pair / Degraded                200          Ready (pair HA or stale reads)
Forming (no Raft)              503          Raft not yet initialized
Forming / Cluster (no leader)  503          Leader election in progress
Cluster (leader present)       200          Ready to serve CQL

Response body when not ready:

{"ready": false, "waiting_for": "raft_leader", "detail": "no raft leader elected yet"}

The waiting_for field always names the blocking condition so log-scraping scripts and operators can distinguish “Raft not initialized” from “election in progress”.
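
A log-scraping script could pull the blocking condition out of the body along these lines. This is a sketch, not a JSON parser: it assumes the body is a single line and that the value contains no escaped quotes, and the function name waiting_for is chosen here for illustration.

```shell
# waiting_for: read a /readyz response body on stdin and print the value of
# its "waiting_for" field. Prints nothing if the field is absent.
waiting_for() {
  sed -n 's/.*"waiting_for"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

body='{"ready": false, "waiting_for": "raft_leader", "detail": "no raft leader elected yet"}'
echo "$body" | waiting_for
```

For the sample body above this prints raft_leader. A ready response has no waiting_for field, so the function prints nothing.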

Docker Compose quickstart

The repository’s docker-compose.yml runs a 3-node cluster with a local RustFS (S3-compatible) backend. Use it for local integration testing. It uses /readyz as the healthcheck:

# Start all services
docker compose up -d

# Wait for all three nodes to report ready
for port in 9090 9091 9092; do
  echo -n "node (port $port): "
  until curl -sf http://127.0.0.1:$port/readyz >/dev/null 2>&1; do
    sleep 5; printf ".";
  done
  echo " ready"
done

# Connect to node1
cqlsh 127.0.0.1 9042

Auth in Docker Compose: The default docker-compose.yml runs with FERROSA_AUTH_DISABLED=true for local development convenience. To enable auth, use the included overlay:

docker compose -f docker-compose.yml -f docker-compose.secure.yml up -d

After formation, rotate the default ferrosa_admin password via cqlsh:

cqlsh -u ferrosa_admin -p ferrosa_admin 127.0.0.1 9042
cqlsh> ALTER ROLE ferrosa_admin WITH PASSWORD = 'your-strong-password';

Troubleshooting

/readyz returns 503 for more than 5 minutes

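Check the logs for the blocking condition. docker logs writes the container's stderr stream to its own stderr, so merge it into the pipe before filtering:

docker logs <node> 2>&1 | grep -E "pre-vote|leader|forming|raft"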

One node is stuck in Forming after the other two have a leader

The lagging node will receive an InstallSnapshot from the leader and catch up automatically. If it has not caught up after 2 minutes:

docker compose restart node3   # or whichever node is lagging

The restarted node re-dials its seeds, loads the Raft snapshot, and joins the cluster. It does not need a clean data directory.

Cluster mode but reads return errors

Verify that bootstrap streaming has completed by checking that all three nodes are in Normal state:

ferrosa-ctl topology

If a node shows Joining state, bootstrap streaming is still in progress. CQL reads at QUORUM will succeed once all replicas are in Normal state.
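
A script can gate on the same condition by scanning the topology output for the Joining state. The exact output format of ferrosa-ctl topology is not shown in this guide, so this sketch assumes only that the word Joining appears somewhere in the output while a node is bootstrapping; bootstrap_done is a hypothetical helper name.

```shell
# bootstrap_done: read topology output on stdin.
# Fails if the output is empty (the command probably failed) or mentions Joining.
bootstrap_done() {
  out=$(cat)
  if [ -z "$out" ]; then
    return 1
  fi
  case "$out" in
    *Joining*) return 1 ;;
  esac
  return 0
}

# Usage against a live cluster (not run here):
#   ferrosa-ctl topology | bootstrap_done && echo "bootstrap streaming complete"
```

Treating empty input as a failure avoids reporting success when the topology command itself could not reach the node.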

Next steps

Getting Started guide: single-node setup, CQL drivers, configuration reference, and architecture overview.

CQL Compatibility reference: full list of supported statements, types, and functions.