
Penguin classifier

This example teaches the core-node path with the Palmer penguins data. It is small enough to read in one sitting and still shows the essential DAG pattern: source data, narrow the cohort, summarize the result, then build a report.

  • source loads a CSV with inferred column types.
  • filter keeps one species.
  • aggregate groups the filtered rows by island.
  • rime build renders the node outputs as an HTML report.
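The filter and aggregate steps above can be sketched in plain Python. This is only an illustration of the data flow, not how rime implements its nodes. The rows are hand-picked for the example rather than read from data/penguins.csv, and the helper names (make_row, filter_rows, aggregate_by_island) are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def make_row(species, island, bill_length_mm, flipper_length_mm):
    """Build one illustrative record (hypothetical helper)."""
    return {
        "species": species,
        "island": island,
        "bill_length_mm": bill_length_mm,
        "flipper_length_mm": flipper_length_mm,
    }


# Assumption: illustrative rows only, not the contents of data/penguins.csv.
ROWS = [
    make_row("Adelie", "Torgersen", 39.1, 181.0),
    make_row("Adelie", "Torgersen", 39.5, 186.0),
    make_row("Adelie", "Biscoe", 40.3, 195.0),
    make_row("Adelie", "Dream", 38.8, 178.0),
    make_row("Adelie", "Dream", 39.5, 182.0),
    make_row("Gentoo", "Biscoe", 46.1, 211.0),
    make_row("Chinstrap", "Dream", 46.5, 192.0),
]


def filter_rows(rows, species):
    """Mirror of the filter node: keep rows for a single species."""
    return [r for r in rows if r["species"] == species]


def aggregate_by_island(rows):
    """Mirror of the aggregate node: group by island, then summarize."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["island"]].append(r)
    summary = []
    for island in sorted(groups):
        members = groups[island]
        summary.append({
            "island": island,
            "mean_bill_length": round(mean(m["bill_length_mm"] for m in members), 2),
            "mean_flipper_length": round(mean(m["flipper_length_mm"] for m in members), 2),
            # Counts rows; assumes no missing bill_length_mm values.
            "n": len(members),
        })
    return summary


adelie_only = filter_rows(ROWS, "Adelie")
by_island = aggregate_by_island(adelie_only)
for row in by_island:
    print(row)
```

Each function takes the previous step's output as its only input, which is the same dependency shape the inputs key expresses in the pipeline file.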

Use the checked-in single-file example:

examples/single-file/
├── data/
│   └── penguins.csv
└── pipeline.dag.yaml

The fixture-level penguin parity test under packages/core/test/fixtures/experiments/penguin/ uses the same idea in its smallest form: penguins_source -> adelie_only.

specification_version: "2.1"
nodes:
  - id: penguins
    kind: source
    path: data/penguins.csv
  - id: adelie_only
    kind: filter
    inputs: [penguins]
    expr: '[species] == "Adelie"'
  - id: by_island
    kind: aggregate
    inputs: [adelie_only]
    groupBy: ["[island]"]
    metrics:
      - "[mean_bill_length] = [bill_length_mm].mean()"
      - "[mean_flipper_length] = [flipper_length_mm].mean()"
      - "[n] = [bill_length_mm].count()"

Run these commands in a terminal:

git clone https://github.com/danielsjoo/rime
cd rime
rime validate examples/single-file/pipeline.dag.yaml
rime run examples/single-file/pipeline.dag.yaml
rime build examples/single-file/pipeline.dag.yaml

by_island/default.parquet contains:

island     mean_bill_length  mean_flipper_length  n
Biscoe     40.30             195.00               1
Dream      39.15             180.00               2
Torgersen  39.30             183.50               2

After the first run, try these small edits:

  • Add metadata.report: false to the raw penguins source so only the filtered and aggregate nodes appear in the report.
  • Add a derive node before by_island to compute a simple ratio:
    - id: bill_to_flipper
      kind: derive
      inputs: [adelie_only]
      as: bill_to_flipper
      expr: "[bill_length_mm] / [flipper_length_mm]"
  • Point by_island at the new node (inputs: [bill_to_flipper]) so the derived column is available, then change the aggregate metrics to summarize it:
    metrics:
      - "[mean_ratio] = [bill_to_flipper].mean()"
      - "[n] = [bill_to_flipper].count()"
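
To see what the derive edit changes, here is a minimal Python sketch of the same computation. It assumes the aggregate reads the derive node's output, uses made-up Adelie rows rather than the checked-in CSV, and the function names (derive_ratio, mean_ratio_by_island) are hypothetical. It does not reflect rime's internals.

```python
from statistics import mean

# Assumption: illustrative Adelie rows, not taken from data/penguins.csv.
adelie_only = [
    {"island": "Torgersen", "bill_length_mm": 36.0, "flipper_length_mm": 180.0},
    {"island": "Torgersen", "bill_length_mm": 45.0, "flipper_length_mm": 180.0},
    {"island": "Biscoe", "bill_length_mm": 38.0, "flipper_length_mm": 190.0},
]


def derive_ratio(rows):
    """Mirror of the derive node: add bill_to_flipper to a copy of each row."""
    return [
        {**r, "bill_to_flipper": r["bill_length_mm"] / r["flipper_length_mm"]}
        for r in rows
    ]


def mean_ratio_by_island(rows):
    """Mirror of the edited aggregate: mean ratio and row count per island."""
    groups = {}
    for r in rows:
        groups.setdefault(r["island"], []).append(r["bill_to_flipper"])
    return {
        island: {"mean_ratio": mean(values), "n": len(values)}
        for island, values in sorted(groups.items())
    }


with_ratio = derive_ratio(adelie_only)
summary = mean_ratio_by_island(with_ratio)
for island, stats in summary.items():
    print(island, stats)
```

The derive step returns new rows instead of mutating its input, so the filtered rows stay unchanged for any other node that reads them.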