The intelligent ML Pipeline Framework that chains actions into powerful workflows!
Urartu is a framework for building machine learning workflows by chaining Actions into Pipelines. Each Action is a self-contained, reusable component with built-in caching, and Pipelines orchestrate multiple Actions with automatic data flow.
Install from PyPI:

```bash
pip install urartu
```
Or from source:
```bash
git clone git@github.com:tamohannes/urartu.git
cd urartu
pip install -e .
```
```bash
# Run a pipeline (the pipeline name is the first argument)
urartu my_pipeline

# With config group selectors (unquoted = config group, quoted = string override)
urartu my_pipeline machine=local aim=local slurm=no_slurm debug=true

# With string overrides (quoted values)
urartu my_pipeline machine="custom" descr="my experiment"
```
A typical project is laid out like this:

```
my_project/
├── actions/                 # Action implementations
│   └── my_action.py
├── pipelines/               # Pipeline implementations
│   └── my_pipeline.py
└── configs/
    ├── action/              # Action configurations
    │   └── my_action.yaml
    └── pipeline/            # Pipeline configurations
        └── my_pipeline.yaml
```
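One way to scaffold this layout (a sketch; the directory and file names match the tree above):

```bash
mkdir -p my_project/{actions,pipelines,configs/action,configs/pipeline}
touch my_project/actions/my_action.py \
      my_project/pipelines/my_pipeline.py \
      my_project/configs/action/my_action.yaml \
      my_project/configs/pipeline/my_pipeline.yaml
```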
Actions are self-contained components that perform specific ML tasks:
```python
from urartu.common import Action


class MyAction(Action):
    def run(self):
        # Your ML task here
        data = self.load_data()
        results = self.process(data)

        # Save machine-readable data to the cache using the unified API
        cache_dir = self.get_cache_entry_dir("my_data")

        # Save human-readable outputs (plots) to the run directory (always regenerated)
        plots_dir = self.get_run_dir("plots")

    def get_outputs(self):
        return {
            "results_path": str(self.get_cache_entry_dir("results")),
            "run_dir": str(self.get_run_dir()),
        }
```
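For instance, `run()` might persist its results like this (a minimal sketch, assuming the directory helpers return `pathlib.Path` objects and that `results` is JSON-serializable; the file names are illustrative):

```python
import json

from urartu.common import Action


class MyAction(Action):
    def run(self):
        results = {"accuracy": 0.93}  # stand-in for real computation

        # Machine-readable data -> cache entry (reused on later runs)
        cache_dir = self.get_cache_entry_dir("my_data")
        with open(cache_dir / "results.json", "w") as f:
            json.dump(results, f)

        # Human-readable artifacts -> run directory (regenerated every run)
        plots_dir = self.get_run_dir("plots")
        (plots_dir / "summary.txt").write_text(f"accuracy: {results['accuracy']}\n")
```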
Pipelines chain Actions together with automatic dependency resolution:
```yaml
# configs/pipeline/my_pipeline.yaml
pipeline_name: my_pipeline
pipeline:
  device: cuda
  seed: 42
  actions:
    - action_name: data_preprocessing
      dataset:
        source: "data.csv"
    - action_name: model_training
      depends_on:
        data_preprocessing:
          processed_data: dataset.data_path
      model:
        architecture: "transformer"
```
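Reading the `depends_on` block: each entry maps an output key of the upstream action to a dotted path in this action's config. Hypothetically, if `data_preprocessing`'s `get_outputs()` returned `{"processed_data": "/cache/data_preprocessing/abc123/processed"}`, then `model_training` would effectively run with:

```yaml
# Effective model_training config after injection (illustrative path)
action_name: model_training
dataset:
  data_path: /cache/data_preprocessing/abc123/processed
model:
  architecture: "transformer"
```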
Each Action has its own config file:

```yaml
# configs/action/my_action.yaml
action_name: my_action
action:
  experiment_name: "My Experiment"
  device: cuda
  dataset:
    source: "data.csv"
```
Pipelines reference actions by name in their own config:

```yaml
# configs/pipeline/my_pipeline.yaml
pipeline_name: my_pipeline
pipeline:
  experiment_name: "My Pipeline"
  device: cuda
  actions:
    - action_name: action1
    - action_name: action2
```
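With both configs in place, the pipeline runs by name, exactly as in the CLI examples above:

```bash
urartu my_pipeline
# or with overrides, e.g. a different machine config and a description
urartu my_pipeline machine=local descr="two-action demo"
```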
Actions automatically cache results. Use the unified APIs:
```python
# For machine-readable cached data
cache_dir = self.get_cache_entry_dir("subdirectory")
# Structure: cache/{action_name}/{cache_hash}/{subdirectory}/

# For human-readable outputs (plots, reports)
run_dir = self.get_run_dir("plots")
# Structure: .runs/{pipeline_name}/{timestamp}/{subdirectory}/
```
**Important:** plots should always be saved to `run_dir` and regenerated from cached data.
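A sketch of that pattern, assuming matplotlib is installed, the directory helpers return `pathlib.Path` objects, and a `metrics.json` was cached earlier (all file names here are illustrative):

```python
import json

import matplotlib.pyplot as plt


def save_plots(action):
    # Read machine-readable metrics from the (possibly reused) cache entry
    cache_dir = action.get_cache_entry_dir("results")
    with open(cache_dir / "metrics.json") as f:
        metrics = json.load(f)

    # Regenerate the figure in the fresh run directory on every execution
    plots_dir = action.get_run_dir("plots")
    plt.plot(metrics["epoch"], metrics["loss"])
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.savefig(plots_dir / "loss_curve.png")
```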
Pipelines automatically inject outputs from previous actions:
```yaml
- action_name: model_training
  depends_on:
    data_preprocessing:
      processed_data: dataset.data_path
      stats: model.feature_stats
```
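The left-hand keys (`processed_data`, `stats`) presumably correspond to entries returned by the upstream action's `get_outputs()`; a hypothetical sketch of the matching `data_preprocessing` side:

```python
# In the data_preprocessing action (illustrative key names and paths;
# assumes get_cache_entry_dir returns a pathlib.Path)
def get_outputs(self):
    return {
        "processed_data": str(self.get_cache_entry_dir("processed")),
        "stats": str(self.get_cache_entry_dir("stats") / "feature_stats.json"),
    }
```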
Caching is configured per action:

```yaml
action:
  cache_enabled: true
  force_rerun: false
  cache_max_age_days: 7
```
and per pipeline:

```yaml
pipeline:
  cache_enabled: true
  force_rerun: false
  cache_max_age_days: 7
```
Execute workflows on remote machines:
```yaml
# configs_tamoyan/machine/remote.yaml
type: remote
host: "cluster.example.com"
username: "user"
ssh_key: "~/.ssh/id_rsa"
remote_workdir: "/path/to/workspace"
project_name: "my_project"
```
```bash
urartu my_pipeline machine=remote slurm=slurm
```
**Note:** multirun/sweep functionality is not yet implemented in the new CLI. For now, use nested loops in your pipeline code or launch multiple runs manually.
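Until multirun lands, a shell loop over string overrides (such as `descr`, shown in the CLI examples above) can stand in for a sweep:

```bash
# Run the same pipeline several times with distinct descriptions
for i in 1 2 3; do
  urartu my_pipeline descr="trial_${i}"
done
```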
If you find Urartu helpful in your research, please cite it:
```bibtex
@software{Tamoyan_Urartu_2023,
  author  = {Hovhannes Tamoyan},
  license = {Apache-2.0},
  month   = {8},
  title   = {Urartu},
  url     = {https://github.com/tamohannes/urartu},
  year    = {2023}
}
```