DAG specification
pipeline.dag.yaml is the source of truth for graph shape, node configuration,
source paths, report inclusion, and optional interpreter/param defaults. For
per-node fields, use the Node Reference.
Top level
    specification_version: "2.1"   # required
    nodes:                         # required, >= 1, acyclic, unique ids
      - ...

Optional top-level blocks:

    params:
      threshold: { type: float, default: 0.5 }
      cohort_date: { type: date, required: true }

    interpreters:
      python: .venv/bin/python
      r: /usr/local/bin/Rscript

CLI flags and environment variables override inline interpreter and param defaults at run time.
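The override rule can be pictured as a layered lookup. The sketch below is only an illustration, not Rime's implementation: the function name `resolve_value` and the three dictionaries are hypothetical, and the relative order of CLI flags versus environment variables is an assumption (the text above only says that both override inline defaults).

```python
from typing import Any, Mapping, Optional


def resolve_value(
    name: str,
    inline_defaults: Mapping[str, Any],
    env_overrides: Mapping[str, Any],
    cli_overrides: Mapping[str, Any],
) -> Optional[Any]:
    """Return the effective value for a param or interpreter name.

    Assumed precedence (hypothetical): CLI flag, then environment
    variable, then the inline default from pipeline.dag.yaml.
    """
    for layer in (cli_overrides, env_overrides, inline_defaults):
        if name in layer:
            return layer[name]
    return None  # not set in any layer


# Inline defaults as they might be read from the DAG file.
inline = {"threshold": 0.5, "python": ".venv/bin/python"}
env = {"python": "/usr/bin/python3"}  # hypothetical environment override
cli = {"threshold": 0.8}              # hypothetical CLI override

print(resolve_value("threshold", inline, env, cli))  # 0.8
print(resolve_value("python", inline, env, cli))     # /usr/bin/python3
```

How flags and environment variables are actually named is not covered here, so the sketch takes already-parsed dictionaries rather than guessing at names.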
Node shape (shared)
    - id: my_node                     # identifier-shaped, unique within DAG
      kind: <discriminator>           # source, filter, python, sql, etc.
      inputs: [upstream_id, ...]      # core nodes use positional inputs
      metadata:                       # optional; closed schema
        label: "Friendly label"
        group: "ingest"
        report: false                 # omit or true to include in generated reports
        visual_stats: ["row_count"]
      cache: false                    # boolean | { policy: ttl, seconds: N }

Kind-specific fields live at the top level on the node: expr, groupBy,
metrics, source, path, columnA, and so on. There is no per-node
params: bag. Unknown fields are rejected.
Core Node Inputs
Core nodes use the inputs: field:

    - id: adults
      kind: filter
      inputs: [patients]
      expr: "[age] >= 18"

Use nodeId.outputName when reading a non-default output from a multi-output
language node:

    - id: train_summary
      kind: aggregate
      inputs: [split.train]
      groupBy: ["[site]"]
      metrics:
        - "[n] = [score].count()"

Language Node Inputs
Language nodes use named slots through in:. The slot key becomes the function
argument name for Python/R/JavaScript, or the temp table name for SQL.

    - id: features
      kind: python
      source: scripts/features.py
      in:
        cohort: adults
        threshold: params.threshold

Input ref grammar
Wherever a node references an upstream output:

- node_id: the default output of the named node
- node_id.output_name: a named output of a multi-output node

Used in inputs:, in:, subgraph.bindings, and subgraph.outputs.
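
The two ref forms are simple enough to parse with one regular expression. The following is a minimal sketch, not Rime's parser: `InputRef` and `parse_input_ref` are hypothetical names, and the identifier pattern is an assumption about what "identifier-shaped" means.

```python
import re
from typing import NamedTuple, Optional

# Assumption: an identifier is a letter or underscore followed by letters,
# digits, or underscores. The real rule may be stricter or looser.
_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
_REF = re.compile(rf"({_IDENT})(?:\.({_IDENT}))?")


class InputRef(NamedTuple):
    node_id: str
    output_name: Optional[str]  # None means the node's default output


def parse_input_ref(ref: str) -> InputRef:
    """Split 'node_id' or 'node_id.output_name' into its parts."""
    match = _REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a valid input ref: {ref!r}")
    return InputRef(match.group(1), match.group(2))


print(parse_input_ref("adults"))       # InputRef(node_id='adults', output_name=None)
print(parse_input_ref("split.train"))  # InputRef(node_id='split', output_name='train')
```

A slot value such as params.threshold has the same shape as node_id.output_name, so a real resolver would need some way to tell the two apart; this sketch does not attempt that.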
Validation
validateDagSpec(parsed) runs in this order:

- Zod discriminated union validation per node; fails with [V2_DAG_SCHEMA pipeline.dag.yaml:nodes[i].field]
- Unique node ids; fails with [V2_DAG_GRAPH ...]
- All input refs resolve to a known node (kind: source has no inputs:; its path: is checked at run time); fails with [V2_DAG_GRAPH ...]
- Acyclic graph (cycle detection includes the offending node id)
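
The three graph-level checks can be sketched in a few lines. This is an illustration under stated assumptions, not the validateDagSpec code: `DagGraphError` and `check_graph` are hypothetical names, the message wording is invented apart from the V2_DAG_GRAPH prefix, and only positional inputs: are inspected (named in: slots would be handled the same way).

```python
from typing import Dict, List


class DagGraphError(ValueError):
    """Hypothetical error type; only the V2_DAG_GRAPH prefix comes from the docs."""


def check_graph(nodes: List[dict]) -> None:
    # Check 1: node ids are unique.
    deps: Dict[str, List[str]] = {}
    for node in nodes:
        if node["id"] in deps:
            raise DagGraphError(f"[V2_DAG_GRAPH] duplicate node id: {node['id']}")
        # A ref is node_id or node_id.output_name; only the node part matters here.
        deps[node["id"]] = [ref.split(".", 1)[0] for ref in node.get("inputs", [])]

    # Check 2: every input ref points at a known node.
    for node_id, upstream in deps.items():
        for target in upstream:
            if target not in deps:
                raise DagGraphError(f"[V2_DAG_GRAPH] {node_id}: unknown input {target}")

    # Check 3: depth-first search. Reaching a node that is still in progress
    # means the current path loops back on itself.
    in_progress, done = 1, 2
    state: Dict[str, int] = {}

    def visit(node_id: str) -> None:
        if state.get(node_id) == done:
            return
        if state.get(node_id) == in_progress:
            raise DagGraphError(f"[V2_DAG_GRAPH] cycle through node: {node_id}")
        state[node_id] = in_progress
        for target in deps[node_id]:
            visit(target)
        state[node_id] = done

    for node_id in deps:
        visit(node_id)


# A valid two-node graph passes silently.
check_graph([
    {"id": "patients", "kind": "source"},
    {"id": "adults", "kind": "filter", "inputs": ["patients"]},
])
```

Running the reference check before the cycle check keeps the traversal simple, since every edge is then known to lead to an existing node.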
Project layout
A pipeline can live as a single pipeline.dag.yaml file (no marker required) or inside a folder with a rime.project.yaml marker. With the marker, the project is laid out as:

    my-project/
    ├── rime.project.yaml    # marker + optional config
    ├── pipeline.dag.yaml    # the DAG
    ├── data/                # raw data files (relative paths in DAG)
    ├── scripts/             # python/r/js/sql script files
    ├── outputs/             # generated outputs (gitignore this)
    └── .rime/               # cache + state (gitignore this)

For one-off DAGs, drop the marker; everything resolves relative to the DAG file.
Node Kinds
source, filter, derive, aggregate, select, sort, join, pivot,
concat, t_test, anova, mann_whitney_u, chi_square, correlation,
linear_regression, subgraph, plus language kinds python, r,
javascript, and sql.
Each has its own reference page under Node Reference.
Expression language
filter.expr, derive.expr, aggregate.metrics[], aggregate.groupBy[], sort.by[].expr, and expression join keys use Rime’s small expression language. Column refs go in [brackets]; literals are plain values.

    expr: "[age] >= 18 and [status] == 'active'"
    metrics:
      - "[mean_age] = [age].mean()"
      - "[max_score] = [score].max()"

Supported operators include arithmetic (+, -, *, /), comparison (==, !=, <, >, <=, >=), boolean (and, or, not), membership (in (...)), function calls like coalesce(...), and column methods like .mean(), .sum(), .count(), .lowercase(), and .to_float().
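
Because column refs are always bracketed, it is easy to see which columns an expression touches. The helper below is a rough sketch for illustration; `column_refs` is a hypothetical name and this is not how Rime parses expressions. It is a plain regex scan, so a bracketed word inside a quoted string literal would also be reported.

```python
import re
from typing import List

# One bracketed name with no nested brackets, e.g. [age].
_COLUMN_REF = re.compile(r"\[([^\[\]]+)\]")


def column_refs(expr: str) -> List[str]:
    """Return bracketed column names in order of first appearance."""
    seen: List[str] = []
    for name in _COLUMN_REF.findall(expr):
        if name not in seen:
            seen.append(name)
    return seen


print(column_refs("[age] >= 18 and [status] == 'active'"))  # ['age', 'status']
print(column_refs("[mean_age] = [age].mean()"))             # ['mean_age', 'age']
```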
Report Inclusion
Reports include every node unless you opt out:

    - id: raw_orders
      kind: source
      path: data/orders.csv
      metadata:
        report: false

Use this for raw source nodes and noisy intermediate staging nodes. The outputs still exist on disk and remain available to downstream nodes.
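
The opt-out rule amounts to "included unless metadata.report is explicitly false". A minimal sketch of that selection, assuming nodes have already been loaded into plain dictionaries (`nodes_in_report` is a hypothetical helper, not part of Rime):

```python
from typing import List


def nodes_in_report(nodes: List[dict]) -> List[str]:
    """Ids of nodes whose metadata.report is omitted or true."""
    return [
        node["id"]
        for node in nodes
        # A missing or empty metadata block means the node is included.
        if (node.get("metadata") or {}).get("report", True) is not False
    ]


nodes = [
    {"id": "raw_orders", "kind": "source", "metadata": {"report": False}},
    {"id": "adults", "kind": "filter", "inputs": ["raw_orders"]},
    {"id": "summary", "kind": "aggregate", "metadata": {"report": True}},
]
print(nodes_in_report(nodes))  # ['adults', 'summary']
```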