Urartu 🦁

The intelligent ML Pipeline Framework that chains actions into powerful workflows!

Urartu is a framework for building machine learning workflows by chaining Actions into Pipelines. Each Action is a self-contained, reusable component with built-in caching, and Pipelines orchestrate multiple Actions with automatic data flow.

Installation

pip install urartu

Or from source:

git clone git@github.com:tamohannes/urartu.git
cd urartu
pip install -e .

Quick Start

Running Pipelines

# Run a pipeline (pipeline name is the first argument)
urartu my_pipeline

# With config group selectors (unquoted = config group, quoted = string override)
urartu my_pipeline machine=local aim=local slurm=no_slurm debug=true

# With string overrides (quoted values)
urartu my_pipeline machine="custom" descr="my experiment"

Project Structure

my_project/
├── actions/              # Action implementations
│   └── my_action.py
├── pipelines/            # Pipeline implementations
│   └── my_pipeline.py
└── configs/
    ├── action/           # Action configurations
    │   └── my_action.yaml
    └── pipeline/         # Pipeline configurations
        └── my_pipeline.yaml

Core Concepts

Actions

Actions are self-contained components that perform specific ML tasks:

import json
from pathlib import Path

from urartu.common import Action

class MyAction(Action):
    def run(self):
        # Your ML task here
        data = self.load_data()
        results = self.process(data)

        # Save machine-readable data to the cache using the unified API,
        # e.g. as JSON
        cache_dir = self.get_cache_entry_dir("my_data")
        with open(Path(cache_dir) / "results.json", "w") as f:
            json.dump(results, f)

        # Save plots to the run directory (always regenerated, never cached)
        plots_dir = self.get_run_dir("plots")
        # ... write human-readable outputs such as figures here

    def get_outputs(self):
        # Outputs declared here can be injected into downstream actions via depends_on
        return {
            "results_path": str(self.get_cache_entry_dir("results")),
            "run_dir": str(self.get_run_dir())
        }

Pipelines

Pipelines chain Actions together with automatic dependency resolution:

# configs/pipeline/my_pipeline.yaml
pipeline_name: my_pipeline

pipeline:
  device: cuda
  seed: 42
  actions:
    - action_name: data_preprocessing
      dataset:
        source: "data.csv"
    
    - action_name: model_training
      depends_on:
        data_preprocessing:
          processed_data: dataset.data_path
      model:
        architecture: "transformer"

Configuration

Action Config

# configs/action/my_action.yaml
action_name: my_action

action:
  experiment_name: "My Experiment"
  device: cuda
  dataset:
    source: "data.csv"

Pipeline Config

# configs/pipeline/my_pipeline.yaml
pipeline_name: my_pipeline

pipeline:
  experiment_name: "My Pipeline"
  device: cuda
  actions:
    - action_name: action1
    - action_name: action2

Key Features

Unified Caching

Actions automatically cache results. Use the unified APIs:

# For machine-readable cached data
cache_dir = self.get_cache_entry_dir("subdirectory")
# Structure: cache/{action_name}/{cache_hash}/{subdirectory}/

# For human-readable outputs (plots, reports)
run_dir = self.get_run_dir("plots")
# Structure: .runs/{pipeline_name}/{timestamp}/{subdirectory}/

Important: Plots should always be saved to run_dir and regenerated from cached data.
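
For instance, a minimal sketch of that split inside an action, assuming numpy and matplotlib are available (the TrainingAction name and file names are illustrative):

import numpy as np
import matplotlib.pyplot as plt

from urartu.common import Action

class TrainingAction(Action):
    def run(self):
        losses = [0.9, 0.5, 0.3]  # stand-in for real training metrics

        # Machine-readable artifact: saved under the action's cache entry
        cache_dir = self.get_cache_entry_dir("metrics")
        np.save(f"{cache_dir}/losses.npy", np.asarray(losses))

        # Human-readable plot: written to the run directory, regenerated every run
        plots_dir = self.get_run_dir("plots")
        plt.figure()
        plt.plot(losses)
        plt.savefig(f"{plots_dir}/loss_curve.png")
        plt.close()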

Dependency Resolution

Pipelines automatically inject outputs from previous actions:

- action_name: model_training
  depends_on:
    data_preprocessing:
      processed_data: dataset.data_path
      stats: model.feature_stats
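
Conceptually, each output_key: config.path entry copies a value from the upstream action's get_outputs() into the downstream action's config before it runs. A rough sketch of the effect (values and paths below are illustrative, not the framework's real internals):

# Illustrative only: what the depends_on mapping above conceptually does
upstream_outputs = {  # returned by data_preprocessing.get_outputs()
    "processed_data": "/cache/data_preprocessing/<hash>/processed_data",
    "stats": "/cache/data_preprocessing/<hash>/stats",
}

# Each output is injected at the dotted config path inside model_training's config
model_training_config = {
    "dataset": {"data_path": upstream_outputs["processed_data"]},  # processed_data -> dataset.data_path
    "model": {"feature_stats": upstream_outputs["stats"]},         # stats -> model.feature_stats
}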

Caching Configuration

action:
  cache_enabled: true
  force_rerun: false
  cache_max_age_days: 7

pipeline:
  cache_enabled: true
  force_rerun: false
  cache_max_age_days: 7
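
To illustrate what cache_max_age_days means (a standalone sketch of the idea, not the framework's implementation): an entry older than the configured age is considered stale and recomputed.

import time
from pathlib import Path

def is_cache_fresh(entry_dir: Path, max_age_days: int) -> bool:
    # Sketch only: treat a cache entry as stale once it exceeds max_age_days
    age_seconds = time.time() - entry_dir.stat().st_mtime
    return age_seconds < max_age_days * 24 * 3600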

Advanced Usage

Remote Execution

Execute workflows on remote machines:

# configs_tamoyan/machine/remote.yaml
type: remote
host: "cluster.example.com"
username: "user"
ssh_key: "~/.ssh/id_rsa"
remote_workdir: "/path/to/workspace"
project_name: "my_project"

Then launch with:

urartu my_pipeline machine=remote slurm=slurm

Multi-run

Multirun/sweep functionality is not yet implemented in the new CLI. For now, use nested loops in your pipeline code or run the pipeline multiple times manually.
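
For example, a manual sweep can shell out to the CLI shown in Quick Start from a small driver script. The descr override below stands in for whatever string override your configs actually consume; the values are placeholders:

import subprocess

# Manual sweep sketch: rerun the same pipeline with different string overrides
for seed in (1, 2, 3):
    subprocess.run(f'urartu my_pipeline descr="seed {seed}"', shell=True, check=True)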

Citation

If you find Urartu helpful in your research, please cite it:

@software{Tamoyan_Urartu_2023,
  author = {Hovhannes Tamoyan},
  license = {Apache-2.0},
  month = {8},
  title = {Urartu},
  url = {https://github.com/tamohannes/urartu},
  year = {2023}
}