Build a first pipeline
This workshop builds the same shape as the single-file example, but starts from an empty directory. It is intentionally small: the goal is to learn the file layout, node syntax, run artifacts, and report flow.
1. Create a Project Folder
    mkdir rime-penguins
    cd rime-penguins
    mkdir data

Create data/penguins.csv:
    species,island,bill_length_mm,flipper_length_mm,body_mass_g
    Adelie,Torgersen,39.1,181,3750
    Adelie,Torgersen,39.5,186,3800
    Adelie,Biscoe,40.3,195,3250
    Gentoo,Biscoe,46.1,217,4500
    Gentoo,Biscoe,50.0,222,5550
    Chinstrap,Dream,46.5,192,3500
    Chinstrap,Dream,50.0,196,3900
    Adelie,Dream,37.2,178,3900
    Adelie,Dream,41.1,182,3525
    Gentoo,Biscoe,45.2,210,4300

2. Declare the DAG
Create pipeline.dag.yaml:
    specification_version: "2.1"
    nodes:
      - id: penguins
        kind: source
        path: data/penguins.csv
        metadata:
          report: false

      - id: adelie_only
        kind: filter
        inputs: [penguins]
        expr: '[species] == "Adelie"'

      - id: by_island
        kind: aggregate
        inputs: [adelie_only]
        groupBy: ["[island]"]
        metrics:
          - "[mean_bill_length] = [bill_length_mm].mean()"
          - "[mean_flipper_length] = [flipper_length_mm].mean()"
          - "[n] = [bill_length_mm].count()"

The graph is:

    penguins -> adelie_only -> by_island

metadata.report: false hides the raw source table from the HTML report, but the source still runs and its output still lands on disk.
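The dependency chain above can be checked outside Rime with a few lines of Python. This is a minimal sketch, not Rime's scheduler: it copies the `inputs` lists from pipeline.dag.yaml into a plain dictionary (the name `inputs_by_node` is made up for illustration) and uses the standard-library `graphlib` module to derive an order in which every node runs after its inputs.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory copy of the `inputs` fields from pipeline.dag.yaml.
# Each key is a node id; each value is the set of nodes it reads from.
inputs_by_node = {
    "penguins": set(),
    "adelie_only": {"penguins"},
    "by_island": {"adelie_only"},
}

# static_order() yields every node after all of its predecessors.
run_order = list(TopologicalSorter(inputs_by_node).static_order())
print(" -> ".join(run_order))  # penguins -> adelie_only -> by_island
```

Because the graph is a single chain, there is only one valid order. A branching DAG could have several.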
3. Validate
    rime validate pipeline.dag.yaml

Validation catches:
- invalid YAML
- unknown node kinds or fields
- duplicate node IDs
- missing source files
- input references that do not resolve
- graph cycles
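To make the structural checks in this list concrete, here is a rough Python sketch of three of them: duplicate node IDs, unresolved input references, and cycles. It is not how Rime implements validation. The `check_graph` function and its message wording are hypothetical, and the nodes are plain dictionaries rather than parsed YAML.

```python
from collections import Counter
from graphlib import CycleError, TopologicalSorter


def check_graph(nodes):
    """Return a list of structural problems found in a list of node dicts."""
    problems = []
    ids = [node["id"] for node in nodes]

    # Duplicate node IDs.
    for node_id, count in Counter(ids).items():
        if count > 1:
            problems.append(f"duplicate node id: {node_id}")

    # Input references that do not resolve to a declared node.
    known = set(ids)
    for node in nodes:
        for ref in node.get("inputs", []):
            if ref not in known:
                problems.append(f"{node['id']}: unknown input {ref}")

    # Graph cycles, detected by attempting a topological sort.
    graph = {node["id"]: set(node.get("inputs", [])) & known for node in nodes}
    try:
        TopologicalSorter(graph).prepare()
    except CycleError as err:
        problems.append("cycle: " + " -> ".join(err.args[1]))
    return problems


valid = [
    {"id": "penguins", "kind": "source"},
    {"id": "adelie_only", "kind": "filter", "inputs": ["penguins"]},
    {"id": "by_island", "kind": "aggregate", "inputs": ["adelie_only"]},
]
print(check_graph(valid))  # []

# Re-declare by_island with an input that does not exist.
broken = valid + [{"id": "by_island", "kind": "aggregate", "inputs": ["missing"]}]
print(check_graph(broken))
# ['duplicate node id: by_island', 'by_island: unknown input missing']
```

The remaining checks (YAML syntax, unknown kinds or fields, missing source files) depend on Rime's schema and the file system, so they are left out of the sketch.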
4. Run
    rime run pipeline.dag.yaml

Rime creates:

    outputs/
    ├── manifest.json
    ├── penguins/
    │   └── default.parquet
    ├── adelie_only/
    │   └── default.parquet
    └── by_island/
        └── default.parquet

The final table is:
| island | mean_bill_length | mean_flipper_length | n |
|---|---|---|---|
| Biscoe | 40.30 | 195.00 | 1 |
| Dream | 39.15 | 180.00 | 2 |
| Torgersen | 39.30 | 183.50 | 2 |
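The numbers in this table can be reproduced by hand. The sketch below does so with the Python standard library only. It is an independent cross-check, not Rime's expression engine, and it assumes that `.count()` counts rows with a value present (every row has one here). The variable names are illustrative.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# The same rows as data/penguins.csv, inlined so the sketch is self-contained.
PENGUINS_CSV = """\
species,island,bill_length_mm,flipper_length_mm,body_mass_g
Adelie,Torgersen,39.1,181,3750
Adelie,Torgersen,39.5,186,3800
Adelie,Biscoe,40.3,195,3250
Gentoo,Biscoe,46.1,217,4500
Gentoo,Biscoe,50.0,222,5550
Chinstrap,Dream,46.5,192,3500
Chinstrap,Dream,50.0,196,3900
Adelie,Dream,37.2,178,3900
Adelie,Dream,41.1,182,3525
Gentoo,Biscoe,45.2,210,4300
"""

rows = list(csv.DictReader(io.StringIO(PENGUINS_CSV)))

# filter step: keep rows where species == "Adelie"
adelie_only = [row for row in rows if row["species"] == "Adelie"]

# aggregate step: group by island, then compute the three metrics
groups = defaultdict(list)
for row in adelie_only:
    groups[row["island"]].append(row)

lines = []
for island, members in sorted(groups.items()):
    mean_bill = mean(float(r["bill_length_mm"]) for r in members)
    mean_flipper = mean(float(r["flipper_length_mm"]) for r in members)
    lines.append(f"{island} {mean_bill:.2f} {mean_flipper:.2f} {len(members)}")

print("\n".join(lines))
# Biscoe 40.30 195.00 1
# Dream 39.15 180.00 2
# Torgersen 39.30 183.50 2
```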
5. Build the Report
    rime build pipeline.dag.yaml

Open:

    outputs/run_report.html

The report includes adelie_only and by_island. It does not include penguins, because that source node opted out with metadata.report: false.
6. Force a Recompute
The second run should mostly hit the cache:

    rime run pipeline.dag.yaml

To recompute the graph but still write the fresh results to the cache:

    rime run pipeline.dag.yaml --no-cache-read

To run without reading or writing the cache:

    rime run pipeline.dag.yaml --lean

7. Extend the DAG
Add a derive node between adelie_only and by_island:

      - id: bill_ratio
        kind: derive
        inputs: [adelie_only]
        as: bill_to_flipper
        expr: "[bill_length_mm] / [flipper_length_mm]"

Then point by_island at bill_ratio and add a metric:

      - id: by_island
        kind: aggregate
        inputs: [bill_ratio]
        groupBy: ["[island]"]
        metrics:
          - "[mean_bill_length] = [bill_length_mm].mean()"
          - "[mean_flipper_length] = [flipper_length_mm].mean()"
          - "[mean_bill_to_flipper] = [bill_to_flipper].mean()"
          - "[n] = [bill_length_mm].count()"

Run again:

    rime validate pipeline.dag.yaml
    rime build pipeline.dag.yaml

Rime reuses unchanged upstream artifacts where the cache key still matches and recomputes the changed branch.
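This workshop does not describe how Rime derives its cache keys, so the following is only an illustration of the general idea, under a stated assumption: a node's key is a hash of its own configuration plus the keys of its inputs. The `cache_key` function and the dictionary layout are hypothetical. Under that assumption, the edit above leaves penguins and adelie_only untouched and invalidates only the new and rewired nodes.

```python
import hashlib
import json


def cache_key(node_id, nodes):
    """Hash a node's own config together with the keys of its inputs."""
    node = nodes[node_id]
    upstream = [cache_key(ref, nodes) for ref in node.get("inputs", [])]
    payload = json.dumps({"config": node, "upstream": upstream}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


metrics = [
    "[mean_bill_length] = [bill_length_mm].mean()",
    "[mean_flipper_length] = [flipper_length_mm].mean()",
    "[n] = [bill_length_mm].count()",
]

# Graph before the edit. A real cache would likely also hash the contents
# of data/penguins.csv; that is omitted here.
before = {
    "penguins": {"kind": "source", "path": "data/penguins.csv"},
    "adelie_only": {"kind": "filter", "inputs": ["penguins"],
                    "expr": '[species] == "Adelie"'},
    "by_island": {"kind": "aggregate", "inputs": ["adelie_only"],
                  "metrics": metrics},
}

# Graph after the edit: a new derive node, and by_island rewired to read it.
after = dict(before)
after["bill_ratio"] = {"kind": "derive", "inputs": ["adelie_only"],
                       "as": "bill_to_flipper",
                       "expr": "[bill_length_mm] / [flipper_length_mm]"}
after["by_island"] = {"kind": "aggregate", "inputs": ["bill_ratio"],
                      "metrics": metrics[:2]
                      + ["[mean_bill_to_flipper] = [bill_to_flipper].mean()"]
                      + metrics[2:]}

status = {}
for node_id in after:
    unchanged = node_id in before and (
        cache_key(node_id, before) == cache_key(node_id, after)
    )
    status[node_id] = "reused" if unchanged else "recomputed"
    print(node_id, status[node_id])
# penguins reused
# adelie_only reused
# by_island recomputed
# bill_ratio recomputed
```

Including the upstream keys in each hash is what makes a change propagate downstream: editing adelie_only would change its key and therefore the keys of bill_ratio and by_island as well.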