Push custom logic into the database with WebAssembly-powered UDFs and UDAs. Compute confidence-weighted averages and temperature histograms across millions of IoT rows without shipping raw data to your application.
|
Note
|
Complete the 3-Node Cluster Setup tutorial first, or have a running Ferrosa cluster accessible on localhost:9042.
|
Overview
What are UDFs and UDAs?
A User-Defined Function (UDF) is a scalar function you register with Ferrosa. It takes one or more column values from a single row and returns a single result. Every row in a query result can pass through the function before being returned to the client.
A User-Defined Aggregate (UDA) folds a set of rows down to a single value.
Ferrosa calls a state function once per row to accumulate running state, then calls a final function once to produce the result from that state.
Standard CQL aggregates like COUNT, SUM, and AVG work this way internally; UDAs let you define your own.
In Ferrosa, both UDFs and UDAs are compiled to WebAssembly using the WASM Component Model.
You write your function in Rust, compile it to a .wasm binary with cargo-component, hex-encode the bytes, and register them with a CREATE FUNCTION or CREATE AGGREGATE statement.
The database executes the WASM inside a sandboxed runtime, so custom logic cannot access the host filesystem or network.
Why this matters
Without UDFs and UDAs, any aggregation beyond COUNT, SUM, MIN, MAX, and AVG requires pulling raw rows out of the database and computing in application code.
For an IoT deployment with millions of readings per day, that means transferring megabytes of data over the network for every dashboard refresh.
UDFs and UDAs move the computation to where the data lives. Ferrosa runs the logic on each node, aggregates the partial results across partitions, and returns a single compact answer to your client.
This tutorial demonstrates four functions:
| Name | Type | What it does |
|---|---|---|
|
UDF |
Converts a Fahrenheit temperature to Celsius |
|
UDF |
Returns the first non-null argument; substitutes per-sensor defaults |
|
UDA |
Confidence-weighted average of temperature readings |
|
UDA |
Buckets temperatures into 10-degree Fahrenheit bands |
Prerequisites
You need the following tools installed before building the guest crates.
Rust toolchain
Install Rust from rustup.rs if you do not have it:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup update stable
The WASM target is required:
rustup target add wasm32-unknown-unknown
cargo-component
cargo-component is the build tool for the WASM Component Model.
Install it from crates.io:
cargo install cargo-component
Verify the installation:
cargo component --version
Design the Schema
The IoT sensor scenario
You have three sensors reporting temperature readings at hourly intervals. Each reading carries:
-
A confidence score between 0 and 1, representing the sensor’s self-reported measurement quality.
-
An optional
prioritycolumn that may be null when the sensor firmware version predates the priority field. -
A
default_prioritycolumn that always has a value, used as the fallback.
The goal is to demonstrate UDF and UDA logic against realistic data that has nulls, varying confidence scores, and a range of temperatures that exercises the histogram buckets.
Create the keyspace and table
Connect to Ferrosa and run the schema:
CREATE KEYSPACE IF NOT EXISTS sensor_analytics WITH replication = {
'class': 'SimpleStrategy', 'replication_factor': 1
};
USE sensor_analytics;
CREATE TABLE IF NOT EXISTS sensor_readings (
sensor_id text,
ts timestamp,
temperature double,
confidence double,
priority int,
default_priority int,
PRIMARY KEY (sensor_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);
The sensor_id column is the partition key, so all readings from one sensor are stored together on the same nodes.
The ts clustering column is ordered descending so the most recent reading is always at the top of the partition.
Register the UDFs
Before running the examples, you must build the guest crates and register the functions.
The schema file contains the full CREATE FUNCTION statements with a placeholder for the WASM hex bytes.
Build the guest crate
From the examples/advanced-aggregation/ directory:
cargo component build --release -p advanced-aggregation-udf
This produces a .wasm file under target/wasm32-unknown-unknown/release/.
Hex-encode the binary
xxd -p -c 0 target/wasm32-unknown-unknown/release/advanced_aggregation_udf.wasm \
> advanced_aggregation_udf.hex
Register to_celsius
Replace 0x<hex> in the statement below with the content of advanced_aggregation_udf.hex:
CREATE OR REPLACE FUNCTION sensor_analytics.to_celsius(f double)
CALLED ON NULL INPUT
RETURNS double
LANGUAGE wasm
AS '0x<hex-encoded wasm bytes>';
CALLED ON NULL INPUT means the function runs even when f is null.
The implementation returns null when given null, preserving the null semantics your application expects.
Alternative: inline AssemblyScript
Instead of building and hex-encoding a precompiled binary, you can write a scalar
function as inline AssemblyScript (TypeScript-like) source and let the server
compile it at definition time with LANGUAGE assemblyscript:
CREATE OR REPLACE FUNCTION sensor_analytics.to_celsius(f double)
CALLED ON NULL INPUT
RETURNS double
LANGUAGE assemblyscript
AS 'export function to_celsius(f: f64): f64 { return (f - 32.0) * 5.0 / 9.0; }';
The exported function name must match the UDF name. Inline AssemblyScript
supports numeric (int, bigint, float, double, smallint, tinyint),
text/ascii (AS string), and blob (AS Uint8Array) argument and return
types. It requires the server to be built with the asc-udf feature and the asc
toolchain bundle to be available (see examples/asc-poc/build-bundle.sh);
collection and temporal types still require the precompiled LANGUAGE wasm form.
Register coalesce
CREATE OR REPLACE FUNCTION sensor_analytics.coalesce(val int, fallback int)
CALLED ON NULL INPUT
RETURNS int
LANGUAGE wasm
AS '0x<hex-encoded wasm bytes>';
Register weighted_avg
CREATE OR REPLACE AGGREGATE sensor_analytics.weighted_avg(double, double)
SFUNC weighted_avg_state
STYPE tuple<double, double>
FINALFUNC weighted_avg_final
INITCOND (0.0, 0.0)
LANGUAGE wasm
AS '0x<hex-encoded wasm bytes>';
The state type is a tuple<double, double> holding (weighted_sum, weight_total).
Each call to weighted_avg_state adds temperature * confidence to the first element and confidence to the second.
weighted_avg_final divides the two to produce the result.
Register histogram
CREATE OR REPLACE AGGREGATE sensor_analytics.histogram(double)
SFUNC histogram_state
STYPE map<text, int>
FINALFUNC histogram_final
INITCOND {}
LANGUAGE wasm
AS '0x<hex-encoded wasm bytes>';
The state type is a map<text, int>.
Each call to histogram_state increments the count for the 10-degree band that contains the given temperature.
Insert Sample Data
Load the sample readings:
USE sensor_analytics;
-- sensor-001: outdoor temperature probe, high-confidence readings
-- priority column is always populated for this sensor
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-001', '2026-03-20 06:00:00', 42.1, 0.97, 2, 3);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-001', '2026-03-20 07:00:00', 44.6, 0.98, 2, 3);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-001', '2026-03-20 08:00:00', 51.3, 0.99, 1, 3);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-001', '2026-03-20 09:00:00', 57.8, 0.96, 1, 3);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-001', '2026-03-20 10:00:00', 63.2, 0.95, 2, 3);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-001', '2026-03-20 11:00:00', 68.4, 0.97, 2, 3);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-001', '2026-03-20 12:00:00', 72.9, 0.98, 3, 3);
-- sensor-002: indoor HVAC probe, some readings have null priority
-- to demonstrate the coalesce() UDF falling back to default_priority
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-002', '2026-03-20 06:00:00', 68.0, 0.91, null, 2);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-002', '2026-03-20 07:00:00', 69.5, 0.89, 2, 2);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-002', '2026-03-20 08:00:00', 71.1, 0.92, null, 2);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-002', '2026-03-20 09:00:00', 73.4, 0.88, 2, 2);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-002', '2026-03-20 10:00:00', 75.0, 0.87, null, 2);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-002', '2026-03-20 11:00:00', 76.3, 0.90, 1, 2);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-002', '2026-03-20 12:00:00', 77.8, 0.93, null, 2);
-- sensor-003: rooftop weather station, lower confidence due to exposure
-- mix of null and populated priority values
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-003', '2026-03-20 06:00:00', 40.5, 0.75, null, 4);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-003', '2026-03-20 07:00:00', 45.8, 0.72, 4, 4);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-003', '2026-03-20 08:00:00', 53.3, 0.78, null, 4);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-003', '2026-03-20 09:00:00', 62.7, 0.70, 3, 4);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-003', '2026-03-20 10:00:00', 74.1, 0.73, null, 4);
INSERT INTO sensor_readings (sensor_id, ts, temperature, confidence, priority, default_priority)
VALUES ('sensor-003', '2026-03-20 12:00:00', 89.9, 0.74, null, 4);
The data covers three sensors across a single day:
-
sensor-001— outdoor probe, high confidence (0.95–0.99), priority always set -
sensor-002— indoor HVAC probe, moderate confidence (0.87–0.93), several rows with null priority -
sensor-003— rooftop weather station, lower confidence (0.70–0.78), many rows with null priority
Run the Queries
Scalar UDF: to_celsius
SELECT sensor_id, ts, temperature, confidence
FROM sensor_readings
WHERE sensor_id = 'sensor-001';
Ferrosa calls to_celsius(temperature) for every row before returning the result set.
This follows the same shape as calling a scalar function in a relational database — one invocation per row without a separate client round trip.
Expected output (rows ordered newest-first due to ts DESC):
sensor_id | temperature | sensor_analytics.to_celsius(temperature)
------------+-------------+-----------------------------------------
sensor-001 | 72.9 | 22.722222222222225
sensor-001 | 68.4 | 20.222222222222225
sensor-001 | 63.2 | 17.333333333333332
sensor-001 | 57.8 | 14.333333333333334
sensor-001 | 51.3 | 10.722222222222223
sensor-001 | 44.6 | 7.000000000000001
sensor-001 | 42.1 | 5.611111111111111
The conversion formula is (f - 32) * 5 / 9.
Scalar UDF: coalesce
FROM sensor_readings
WHERE sensor_id = 'sensor-001';
For sensor-001, priority is never null, so coalesce always returns the priority column directly.
This query is more interesting against sensor-002 or sensor-003, where null rows fall back to default_priority.
Expected output:
sensor_id | sensor_analytics.coalesce(priority, default_priority)
------------+------------------------------------------------------
sensor-001 | 3
sensor-001 | 2
sensor-001 | 2
sensor-001 | 1
sensor-001 | 1
sensor-001 | 2
sensor-001 | 2
Run the same query against sensor-002 to see the fallback in action.
Rows with a null priority will show 2 (the default_priority) instead of null.
Aggregate UDA: weighted_avg
-- Built-in aggregation: temperature range statistics across all sensors.
SELECT min(temperature), max(temperature), avg(temperature), count(temperature)
Ferrosa accumulates the state tuple (weighted_sum, weight_total) across all seven rows for sensor-001, then calls weighted_avg_final once to divide.
Expected output:
sensor_analytics.weighted_avg(temperature, confidence)
--------------------------------------------------------
57.154411764705884
Compare this to the unweighted average of the same readings:
(42.1 + 44.6 + 51.3 + 57.8 + 63.2 + 68.4 + 72.9) / 7 ≈ 57.19.
The weighted average is very close here because the confidence scores are all in a narrow band (0.95–0.99); wider variation in confidence would produce a more pronounced difference.
Aggregate UDA: histogram
This query aggregates across all three sensors (all partitions).
Ferrosa runs histogram_state on every row in the table and merges the partial maps from each partition.
Expected output:
sensor_analytics.histogram(temperature)
-----------------------------------------
{'40-50': 4, '50-60': 3, '60-70': 5, '70-80': 7, '80-90': 1}
Each key is a 10-degree Fahrenheit band. Reading the histogram at a glance: most readings landed in the 70–80 F range (7 of 20), with a healthy spread across the 40–70 range and a single outlier above 80 from the rooftop sensor at midday.
How Each Function Works
to_celsius
The conversion formula is:
celsius = (fahrenheit - 32.0) * 5.0 / 9.0
The WASM function receives a 64-bit float (f64 in Rust) and returns a 64-bit float.
When called with CALLED ON NULL INPUT and the argument is null, the function receives a sentinel and returns null, preserving null propagation.
coalesce
The implementation checks whether val is null.
If it is not null, it returns val.
If it is null, it returns fallback.
Because CALLED ON NULL INPUT is specified, the function receives the raw null state and makes its own decision rather than short-circuiting before the function body runs.
This is more general than a fixed-column fallback: you can pass any two expressions to coalesce, including literals and other function results.
weighted_avg
The state function signature in Rust looks like:
fn weighted_avg_state(state: (f64, f64), temperature: f64, confidence: f64) -> (f64, f64) {
let (weighted_sum, weight_total) = state;
(weighted_sum + temperature * confidence, weight_total + confidence)
}
The final function:
fn weighted_avg_final(state: (f64, f64)) -> f64 {
let (weighted_sum, weight_total) = state;
if weight_total == 0.0 { 0.0 } else { weighted_sum / weight_total }
}
The INITCOND (0.0, 0.0) clause seeds the state before the first row is processed, so the state function never receives an uninitialized accumulator.
histogram
The state function maps each temperature to a bucket label:
fn histogram_state(mut state: BTreeMap<String, i32>, temperature: f64) -> BTreeMap<String, i32> {
let lower = (temperature / 10.0).floor() as i32 * 10;
let upper = lower + 10;
let key = format!("{}-{}", lower, upper);
*state.entry(key).or_insert(0) += 1;
state
}
The final function returns the map unchanged. Because the WASM Component Model serializes the map to a canonical wire format, Ferrosa can merge partial results from multiple nodes when the aggregate spans partitions.
Real-World Considerations
Performance of cross-partition aggregates
The histogram query in this tutorial aggregates across all partitions in the table.
Ferrosa fans the query out to every node that owns a partition, runs histogram_state locally on each node’s rows, and merges the partial maps in the coordinator.
For tables with many partitions, this can be expensive.
Prefer per-partition aggregates (with a WHERE sensor_id = '…' clause) for latency-sensitive paths, and reserve cross-partition aggregates for background analytics jobs.
Updating a registered function
Registered functions are part of the schema and stored in the system keyspace.
To update a function after rebuilding the WASM binary, use CREATE OR REPLACE FUNCTION with the new hex bytes.
The change propagates to all nodes via schema gossip within a few seconds.
Null handling
The CALLED ON NULL INPUT vs RETURNS NULL ON NULL INPUT choice is significant.
-
RETURNS NULL ON NULL INPUT— Ferrosa skips calling the function and returns null immediately if any argument is null. Use this for pure mathematical functions where a null input can only produce a null output. -
CALLED ON NULL INPUT— Ferrosa always calls the function, passing a representation of null. Use this for functions likecoalescethat have meaningful behavior with null inputs.
Version pinning for reproducibility
The AS '0x<hex>' bytes embedded in CREATE FUNCTION are the complete, self-contained WASM binary.
Because the binary is stored in the schema, the exact function logic is preserved in the cluster even if the source code changes.
Pin your guest crate to a specific Rust toolchain version in rust-toolchain.toml to ensure reproducible builds across environments.
What You Learned
-
UDFs extend CQL with custom scalar logic. Write a Rust function, compile it to WASM, and call it from any
SELECTstatement. -
UDAs aggregate across rows with custom state. A state function folds one row at a time; a final function extracts the result. Ferrosa handles partial-aggregate merging across partitions.
-
WASM provides a safe execution sandbox. Guest functions cannot access the host filesystem, network, or other keyspaces. Ferrosa enforces resource limits at the runtime level.
-
Null handling is explicit. Choose between
CALLED ON NULL INPUTandRETURNS NULL ON NULL INPUTbased on whether your function has meaningful behavior for null arguments. -
Cross-partition aggregates have a cost. Ferrosa fans the query to every node; keep latency-sensitive aggregates within a single partition where possible.
Next, try the IoT Sensor Data tutorial to explore the time-series patterns these sensors build on, or jump to the Real-Time Analytics tutorial to see pre-computed aggregations for dashboard workloads.