Penguin classifier
This example teaches the core-node path with the Palmer penguins data. It is small enough to read in one sitting and still shows the essential DAG pattern: source data, narrow the cohort, summarize the result, then build a report.
What It Teaches
- `source` loads a CSV with inferred column types.
- `filter` keeps one species.
- `aggregate` groups the filtered rows by island.
- `rime build` renders the node outputs as an HTML report.
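The three data steps above can be sketched in plain Python to show what each node contributes. This is an illustration only, not how rime executes a DAG: the rows are made-up stand-ins for `data/penguins.csv`, and `n` is a simple row count that ignores any null handling the real `count()` may apply.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical stand-in rows; the real pipeline reads data/penguins.csv.
penguins = [
    {"species": "Adelie", "island": "Dream", "bill_length_mm": 39.0, "flipper_length_mm": 180.0},
    {"species": "Adelie", "island": "Dream", "bill_length_mm": 41.0, "flipper_length_mm": 190.0},
    {"species": "Adelie", "island": "Biscoe", "bill_length_mm": 38.0, "flipper_length_mm": 185.0},
    {"species": "Gentoo", "island": "Biscoe", "bill_length_mm": 47.0, "flipper_length_mm": 215.0},
]

# filter: keep one species.
adelie_only = [row for row in penguins if row["species"] == "Adelie"]

# aggregate: bucket the filtered rows by island, then compute each metric.
groups = defaultdict(list)
for row in adelie_only:
    groups[row["island"]].append(row)

by_island = [
    {
        "island": island,
        "mean_bill_length": fmean(r["bill_length_mm"] for r in rows),
        "mean_flipper_length": fmean(r["flipper_length_mm"] for r in rows),
        "n": len(rows),
    }
    for island, rows in sorted(groups.items())
]

for record in by_island:
    print(record)
```

Each stage consumes only the output of the previous one, which is the same dependency shape the DAG declares with `inputs`.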
Project Layout
Use the checked-in single-file example:

    examples/single-file/
    ├── data/
    │   └── penguins.csv
    └── pipeline.dag.yaml

The fixture-level penguin parity test under `packages/core/test/fixtures/experiments/penguin/` uses the same idea in its smallest form: `penguins_source -> adelie_only`.

    specification_version: "2.1"
    nodes:
      - id: penguins
        kind: source
        path: data/penguins.csv

      - id: adelie_only
        kind: filter
        inputs: [penguins]
        expr: '[species] == "Adelie"'

      - id: by_island
        kind: aggregate
        inputs: [adelie_only]
        groupBy: ["[island]"]
        metrics:
          - "[mean_bill_length] = [bill_length_mm].mean()"
          - "[mean_flipper_length] = [flipper_length_mm].mean()"
          - "[n] = [bill_length_mm].count()"

Run It

    git clone https://github.com/danielsjoo/rime
    cd rime
    rime validate examples/single-file/pipeline.dag.yaml
    rime run examples/single-file/pipeline.dag.yaml
    rime build examples/single-file/pipeline.dag.yaml

Expected Output
`by_island/default.parquet` contains:
| island | mean_bill_length | mean_flipper_length | n |
|---|---|---|---|
| Biscoe | 40.30 | 195.00 | 1 |
| Dream | 39.15 | 180.00 | 2 |
| Torgersen | 39.30 | 183.50 | 2 |
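
One way to confirm a run against the table above is a small tolerance check. The sketch below assumes the output has already been loaded into a list of row dictionaries (for example with `pandas.read_parquet(...).to_dict("records")`; where the file lands on disk is not stated here). The helper name `matches_expected` is hypothetical and not part of rime.

```python
import math

# Expected rows copied from the table above: (mean bill, mean flipper, n).
EXPECTED = {
    "Biscoe": (40.30, 195.00, 1),
    "Dream": (39.15, 180.00, 2),
    "Torgersen": (39.30, 183.50, 2),
}


def matches_expected(rows, tolerance=0.005):
    """Return True when rows agree with EXPECTED to two decimal places."""
    actual = {row["island"]: row for row in rows}
    if set(actual) != set(EXPECTED):
        return False
    for island, (bill, flipper, n) in EXPECTED.items():
        row = actual[island]
        if row["n"] != n:
            return False
        if not math.isclose(row["mean_bill_length"], bill, abs_tol=tolerance):
            return False
        if not math.isclose(row["mean_flipper_length"], flipper, abs_tol=tolerance):
            return False
    return True
```

The absolute tolerance of 0.005 reflects that the table shows means rounded to two decimals, so the stored values may differ slightly in later digits.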
Extend It
After the first run, try these small edits:

- Add `metadata.report: false` to the raw `penguins` source so only the filtered and aggregate nodes appear in the report.
- Add a `derive` node before `by_island` to compute a simple ratio:

      - id: bill_to_flipper
        kind: derive
        inputs: [adelie_only]
        as: bill_to_flipper
        expr: "[bill_length_mm] / [flipper_length_mm]"

- Change the aggregate metrics to summarize the new column:

      metrics:
        - "[mean_ratio] = [bill_to_flipper].mean()"
        - "[n] = [bill_to_flipper].count()"
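
To see what the derive-then-aggregate edit computes, here is a rough plain-Python equivalent. The rows are hypothetical, and the sketch assumes the aggregate reads the derived rows (in the DAG, that would presumably mean `by_island` takes `bill_to_flipper` as its input, which the steps above do not spell out).

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical rows standing in for the adelie_only node output.
adelie_only = [
    {"island": "Dream", "bill_length_mm": 36.0, "flipper_length_mm": 180.0},
    {"island": "Dream", "bill_length_mm": 38.0, "flipper_length_mm": 190.0},
    {"island": "Biscoe", "bill_length_mm": 40.0, "flipper_length_mm": 200.0},
]

# derive: add one new column, computed row by row.
with_ratio = [
    {**row, "bill_to_flipper": row["bill_length_mm"] / row["flipper_length_mm"]}
    for row in adelie_only
]

# aggregate: collect the derived values per island, then summarize them.
ratios = defaultdict(list)
for row in with_ratio:
    ratios[row["island"]].append(row["bill_to_flipper"])

by_island = [
    {"island": island, "mean_ratio": fmean(values), "n": len(values)}
    for island, values in sorted(ratios.items())
]

for record in by_island:
    print(record)
```

The derived column exists only on rows that passed through the derive step, which is why the aggregate has to sit downstream of it.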