Nodes

A node in Rime is a function over dataframes.

You write what the function computes — the body. Rime’s runtime owns everything else: reading the inputs, materializing them into a native value (pandas DataFrame, R tibble, DuckDB table, JS object array), running your code, capturing the return value, content-addressing it, caching it, and handing it to the next node.

You never write read_csv() at the top of a node. You never write to_parquet() at the bottom. The runtime does both. Your function signature is the contract.

How that differs from Airflow / Prefect / Dagster

The workflow orchestrators in the ETL world (Airflow, Prefect, Dagster) treat each step as a task: a Python function that reads from somewhere, transforms, and writes somewhere else. The function body is responsible for its own I/O. Tasks coordinate by writing artifacts that downstream tasks happen to read — the side effect is the contract.

Rime inverts this. Side effects are the runtime’s job; functions just compute.

| In Airflow / Prefect / Dagster | In Rime |
| --- | --- |
| Task reads from S3, writes to S3 | Function takes a dataframe, returns a dataframe |
| Each task owns its own I/O | Runtime owns I/O |
| Coordination via storage paths | Coordination via typed dataframe ports |
| Reproducibility requires hand-rolled idempotency | Caching is automatic (content-addressed) |
| Multi-language = hand-wiring separate task runtimes | Multi-language = kind: r in YAML; dataframes cross through Rime artifacts |
| You write the boilerplate | The runtime owns the boilerplate |

This is the same intuition behind dbt’s “you write the SELECT, we handle materialization” — extended past SQL into Python, R, and JavaScript.

scripts/cohort.py
def run(patients):
    # `patients` arrives as a pandas DataFrame.
    # You did not open a file. You did not pick a serializer.
    return patients[patients["age"] >= 18]

- id: cohort
  kind: python
  source: scripts/cohort.py
  in:
    patients: raw_patients  # upstream node ID

That’s the whole node. The runtime:

  1. Reads the upstream raw_patients output from disk (or cache),
  2. Decodes it as a pandas DataFrame,
  3. Calls run(patients=<the dataframe>),
  4. Captures the returned dataframe,
  5. Hashes the (source code + inputs) pair into a content address,
  6. Writes the result to outputs/cohort/default.parquet,
  7. Makes it available to any downstream node that references cohort.
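
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Rime's implementation: `content_address`, `execute_node`, the in-memory `store` dict, and the JSON-based hashing are hypothetical stand-ins, and rows are plain dicts rather than a pandas DataFrame so the sketch needs only the standard library.

```python
import hashlib
import json


def content_address(source_code, inputs):
    """Hash the (source code + inputs) pair into a hex digest."""
    h = hashlib.sha256()
    h.update(source_code.encode("utf-8"))
    # sort_keys makes the serialization independent of dict key order.
    h.update(json.dumps(inputs, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


def execute_node(source_code, fn, inputs, store):
    """Hypothetical runtime step: reuse a cached result or compute one."""
    key = content_address(source_code, inputs)
    if key not in store:           # cache miss: run the user function
        store[key] = fn(**inputs)  # e.g. run(patients=<rows>)
    return key, store[key]


# Stand-in for the text of the script file; only used for hashing here.
SOURCE = "def run(patients): return [p for p in patients if p['age'] >= 18]"

calls = []  # records each real execution, to make cache hits visible


def run(patients):
    calls.append(1)
    return [p for p in patients if p["age"] >= 18]


store = {}
rows = [{"id": 1, "age": 17}, {"id": 2, "age": 42}]
key1, adults = execute_node(SOURCE, run, {"patients": rows}, store)
key2, _ = execute_node(SOURCE, run, {"patients": rows}, store)  # served from store
```

The second call finds the same key in `store` and never invokes `run`, which is the behavior the user-written function gets without any code of its own.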

Switch kind: python to kind: r and write the same function in R — same protocol, same caching, no glue code between them.

For common shapes, most pipelines don’t need a custom function at all. Rime ships 14 built-in kinds that cover the things you’d otherwise rewrite for every project:

| Kind | What it does |
| --- | --- |
| source | Read a CSV / JSON / NDJSON / Parquet file into a tabular value |

- id: patients
  kind: source
  path: data/patients.csv

| Kind | What it does |
| --- | --- |
| filter | Keep rows matching a boolean expression |
| derive | Add a computed column |
| select | Keep specific columns |
| sort | Order rows by one or more expressions |
| aggregate | Group + reduce, with named metrics |

These nodes share Rime’s expression language. The useful pattern is to keep data-shaping logic visible as small formulas instead of hiding every operation inside a script node:

- id: risk_index
  kind: derive
  inputs: [patient_lab_wide]
  as: risk_index
  expr: "coalesce([crp_mean], 0) * 2.0 + coalesce([ldl_max], 0) * 0.05"
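
For readers who think in code, here is roughly what that formula computes for each row. It is a sketch that assumes `coalesce` returns its first non-null argument (the usual SQL meaning) and that `[name]` reads a column; the `coalesce` and `risk_index` helpers and the sample rows are made up for illustration and are not part of Rime.

```python
def coalesce(*values):
    """Return the first argument that is not None (SQL-style coalesce)."""
    for value in values:
        if value is not None:
            return value
    return None


def risk_index(row):
    # Mirrors: coalesce([crp_mean], 0) * 2.0 + coalesce([ldl_max], 0) * 0.05
    return coalesce(row["crp_mean"], 0) * 2.0 + coalesce(row["ldl_max"], 0) * 0.05


rows = [
    {"crp_mean": 3.0, "ldl_max": 120.0},
    {"crp_mean": None, "ldl_max": 100.0},  # missing CRP contributes 0
]
# Append the computed column to every row, as a derive node would.
derived = [{**row, "risk_index": risk_index(row)} for row in rows]
```

Keeping the formula in YAML rather than in a function like this is what makes it visible to anyone reading the pipeline definition.
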
| Kind | What it does |
| --- | --- |
| join | Two-input inner / left join on column keys |
| concat | Stack tables row-wise with a label column |
| pivot | Wide-format aggregation |

The statistical kinds listed next return a small JSON-shaped result (test statistic, p-value, etc.) rather than a table. Reports render them as stat-style key-value output cells.

| Kind | What it does |
| --- | --- |
| t_test | Welch / equal-variance two-sample t-test |
| anova | One-way ANOVA across N groups |
| mann_whitney_u | Non-parametric two-sample test |
| chi_square | Categorical independence test |
| correlation | Pearson / Spearman correlation between two columns |
| linear_regression | Single-feature OLS, optional train/test split |

Statistical nodes also emit assumption warnings. Those warnings show up in reports and the editor review surfaces because they are often as important as the p-value: low expected cell counts for chi-square, small or skewed groups for t-tests/ANOVA, Pearson/Spearman disagreement for correlation, and high-residual observations for linear regression.
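
To make "a JSON-shaped result plus assumption warnings" concrete, the sketch below computes a Welch t statistic with the standard library. It is not Rime's implementation: the field names `statistic`, `df`, and `warnings`, the `min_n` threshold, and the warning text are all assumptions, and the p-value is left out because it needs a t-distribution CDF that the standard library does not provide.

```python
import math
import statistics


def welch_t_test(a, b, min_n=10):
    """Welch two-sample t-test returning a small JSON-shaped dict."""
    n_a, n_b = len(a), len(b)
    # Per-group squared standard errors, from sample variances.
    se_a = statistics.variance(a) / n_a
    se_b = statistics.variance(b) / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    warnings = []
    if min(n_a, n_b) < min_n:  # hypothetical threshold for "small group"
        warnings.append("small group: t-test assumptions may not hold")
    return {"statistic": t, "df": df, "warnings": warnings}


result = welch_t_test([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

With five observations per group the sketch attaches a warning next to the statistic, which is the kind of context a reader needs before trusting the number.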

| Kind | What it does |
| --- | --- |
| subgraph | Embed an external .dag.yaml file with named bindings + outputs |

Subgraphs are opaque from the outside. Their bindings field maps outer node refs to inner slot names, and their outputs field maps exposed names to inner refs.

Anything you can’t express with the built-ins is a language node. Same functional contract — you write a function, declare its inputs as named slots, return a dataframe (or a dict of named dataframes):

- id: features
  kind: python
  source: scripts/features.py
  in:
    cohort: upstream_node        # dataframe slot
    threshold: params.threshold  # scalar slot
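
A script for that node might look like the following. This is a hypothetical `scripts/features.py`: the `score` column, the output names `features` and `excluded`, and the way the threshold is applied are assumptions for illustration. Only the shape (named slots in, a dataframe or dict of named dataframes out) comes from the contract described above.

```python
import pandas as pd


def run(cohort: pd.DataFrame, threshold: float) -> dict:
    """Split the cohort on a score threshold; no file I/O anywhere."""
    above = cohort["score"] >= threshold
    features = cohort[above].assign(high_score=True)
    excluded = cohort[~above]
    # Returning a dict exposes several named dataframes from one node.
    return {"features": features, "excluded": excluded}


# Local check outside Rime: the function is callable like any other.
demo = pd.DataFrame({"id": [1, 2, 3], "score": [0.2, 0.9, 0.5]})
out = run(cohort=demo, threshold=0.5)
```

Because the function never touches a path or a serializer, it can be unit-tested with a hand-built DataFrame exactly as shown.
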

Native values per language: pandas DataFrame (Python), data.frame/tibble-style table (R), row arrays (JS), temp table (SQL). See Polyglot runtime for the per-language details.

metadata:
label: "Friendly node label" # used in reports and visualizations
group: "feature_engineering" # logical grouping
visual_stats: ["row_count"] # engine emits these on each run
cache: false # boolean or { policy: ttl, seconds: N }
  • Move scripts between languages without rewriting glue. Switch kind: python to kind: r; the function signature stays the same.
  • No serialization decisions in user code. Arrow IPC and Parquet are runtime concerns, not yours.
  • Caching is automatic. Change a script and only that node and its downstream nodes re-run. The same holds when an input changes.
  • Reproducibility is a side effect of the model, not extra work. The cache key is hash(source + inputs); same key = same result, every time.
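
The "only it and its downstream" behavior follows from folding each node's input keys into its own key. The sketch below shows that propagation on a three-node chain. It assumes a SHA-256 digest and uses short strings as stand-ins for script sources; `node_key`, `pipeline_keys`, and the `raw_patients -> cohort -> features` wiring are hypothetical, not Rime's actual hashing scheme.

```python
import hashlib


def node_key(source_code, upstream_keys):
    """Cache key = hash of this node's source plus its inputs' keys."""
    h = hashlib.sha256(source_code.encode("utf-8"))
    for key in upstream_keys:
        h.update(key.encode("utf-8"))
    return h.hexdigest()


def pipeline_keys(sources):
    """Keys for a three-node chain: raw_patients -> cohort -> features."""
    raw = node_key(sources["raw_patients"], [])
    cohort = node_key(sources["cohort"], [raw])
    features = node_key(sources["features"], [cohort])
    return {"raw_patients": raw, "cohort": cohort, "features": features}


before = pipeline_keys({
    "raw_patients": "data/patients.csv@v1",
    "cohort": "age >= 18",
    "features": "add risk flag",
})
after = pipeline_keys({
    "raw_patients": "data/patients.csv@v1",
    "cohort": "age >= 21",  # only this script changed
    "features": "add risk flag",
})
# Nodes whose key changed must re-run; the rest can be served from cache.
stale = [name for name in before if before[name] != after[name]]
```

Here `stale` contains `cohort` and `features` but not `raw_patients`, so the upstream read is reused while the edited node and everything after it is recomputed.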

Per-kind field reference lives under Node Reference in the sidebar.