🚀 Latest Enhancements - Performance & Memory Powerhouse! We're excited to share major performance and memory management improvements! 🚀
✨ Revolutionary Features:

- 🔗 Explicit `depends_on` dependency declarations between actions
- ⚡ Batch processing & parallelization
- 🧠 Intelligent memory management
Ready to build next-generation ML pipelines? Let's dive in! ❤️
The intelligent ML Pipeline Framework that chains actions into powerful workflows!
Welcome to Urartu, the revolutionary framework that transforms how you build machine learning workflows. At its core is the Pipeline System - a breakthrough approach that lets you chain individual Actions into sophisticated, automated workflows.
🎯 Core Improvements: Pipelines = Sequences of Actions
With a `.yaml` file-based configuration system and seamless `slurm` job submission capabilities on clusters, Urartu removes the technical hassle so you can focus on building impactful ML workflows! 🚀
Getting started with Urartu is super easy! 🚀 Just run:
```bash
pip install urartu
```
Or, if you prefer to install directly from the source:
```bash
git clone git@github.com:tamohannes/urartu.git
cd urartu
pip install -e .
```
And just like that, you're all set! ✨ Use the following command anywhere in your system to access Urartu:
```bash
urartu --help
```
Urartu's breakthrough feature: transform sequences of ML operations into intelligent, automated workflows!
A Pipeline is a sequence of Actions that automatically manage data flow, caching, and execution order. Each Action is a self-contained component with built-in caching that can be chained together to create sophisticated ML workflows.
```text
┌──────────────┐   outputs   ┌──────────────┐   outputs   ┌──────────────┐
│   Action 1   │ ──────────▶ │   Action 2   │ ──────────▶ │   Action 3   │
│  Data Prep   │             │ Model Train  │             │  Evaluation  │
│  💾 cached   │             │  💾 cached   │             │  💾 cached   │
└──────────────┘             └──────────────┘             └──────────────┘
```
🔧 **Actions**: Self-contained, reusable components that:

- Run a single, well-defined step and expose results through `get_outputs()`
- Cache their outputs automatically

🔗 **Pipelines**: Orchestrators that:

- Chain Actions and manage execution order
- Route each Action's outputs into downstream configs via `depends_on`

💾 **Universal Caching**: Every Action and Pipeline:

- Hashes its configuration and skips execution when a valid cache entry exists
- Reloads cached outputs instantly on reruns
To jump right in with Urartu's Pipeline System:
```bash
# Copy the starter template to begin your project
cp -r starter_template my_ml_project
cd my_ml_project
```
Think of Urartu as providing the foundational framework for your ML workflows:

- A `.yaml` file-based configuration system
- Seamless `slurm` cluster deployment using Submitit
- A Pipeline System that chains cached Actions, configured like this:

```yaml
# config/action_config/my_pipeline.yaml
action_name: my_pipeline

pipeline_config:
  device: cuda
  actions:
    - action_name: data_preprocessing
      # ... data prep config ...

    - action_name: model_training
      depends_on:
        data_preprocessing:
          processed_data: dataset.data_files
      # ... training config ...
```
By following these steps, you can efficiently build powerful, automated ML workflows with Urartu's Pipeline System.
Once you've cloned the `starter_template`, head over to that directory in your terminal:

```bash
cd starter_template
```
To launch a single run with predefined configurations, execute the following command:
```bash
urartu action_config=generate aim=aim slurm=slurm
```
If you're looking to perform multiple runs, simply use the `--multirun` flag. To configure multiple runs, add a sweeper at the end of your `generate.yaml` config file like this:
```yaml
...
hydra:
  sweeper:
    params:
      action_config.task.model.generate.num_beams: 1,5,10
```
This setup initiates 3 separate runs, each utilizing different `num_beams` settings to adjust the model's behavior.
Then, start your multi-run session with the same command:
```bash
urartu action_config=generate aim=aim slurm=slurm
```
With these steps, you can effortlessly kickstart your machine learning experiments with Urartu, whether for a single test or comprehensive multi-run analyses!
Dive into the structured world of Urartu, where managing NLP components becomes straightforward and intuitive.
Set up your environment effortlessly with our configuration templates found in the `urartu/config` directory:

- `urartu/config/main.yaml`: This primary configuration file lays the groundwork with default settings for all system keys.
- `urartu/config/action_config`: This space is dedicated to configurations specific to various actions.

Configuring Urartu to meet your specific needs is straightforward. You have two easy options:
- **Custom Config Files**: Store your custom configuration files in the `configs` directory to adjust the settings. This directory aligns with `urartu/config`, allowing you to maintain project-specific settings in files like `generate.yaml` for your `starter_template` project. You can also create a `configs_{username}` directory at the same level as `configs`, replacing `{username}` with your system username; this setup automatically loads and overrides default settings without extra steps. ✨ Configuration files are prioritized in the following order: `urartu/config`, `starter_template/configs`, `starter_template/configs_{username}`, ensuring your custom settings take precedence.
- **CLI Approach**: If you prefer using the command-line interface (CLI), Urartu supports enhancing commands with key-value pairs directly in the CLI, such as:
```bash
urartu action_config=example action_config.experiment_name=NAME_OF_EXPERIMENT
```
Select the approach that best fits your workflow and enjoy the customizability that Urartu offers.
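Under the hood this layering follows standard Hydra/OmegaConf merge semantics (Urartu is built on Hydra, as the sweeper config above shows). As a rough mental model, later sources simply override earlier ones; the file roles and keys below are illustrative, not Urartu's actual defaults:

```python
# A minimal sketch of layered config precedence using OmegaConf,
# the config library Hydra builds on. Values are illustrative.
from omegaconf import OmegaConf

base = OmegaConf.create({"device": "cpu", "seed": 42})  # like urartu/config
project = OmegaConf.create({"device": "cuda"})          # like starter_template/configs
user = OmegaConf.create({"seed": 7})                    # like starter_template/configs_{username}

# Later sources win: user overrides project, which overrides base.
merged = OmegaConf.merge(base, project, user)
print(merged.device, merged.seed)  # cuda 7
```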
At the heart of Urartu is the `Action` class - individual, self-contained components that:

- Run a single, well-defined step of your workflow
- Cache their results automatically
- Expose those results through the `get_outputs()` method

The Pipeline System is Urartu's game-changing innovation that chains Actions into intelligent workflows:
- Data flows between Actions through explicit `depends_on` declarations
- Execution order and caching are managed automatically
Example ML Pipeline (completely flexible - chain any number of actions):
```text
┌─────────────────┐  outputs  ┌─────────────────┐  outputs  ┌─────────────────┐
│      Data       │ ────────▶ │      Model      │ ────────▶ │   Evaluation    │
│  Preprocessing  │           │    Training     │           │     Metrics     │
└─────────────────┘           └─────────────────┘           └─────────────────┘
                                                                    │ outputs
                                                                    ▼
                                                            ┌─────────────────┐
                                                            │   Inference &   │
                                                            │   Deployment    │
                                                            └─────────────────┘
```
Each action in a pipeline:

- Inherits from `urartu.common.Action` and implements the `get_outputs()` method
- Receives its inputs through the `depends_on` configuration

```yaml
# Action 1: Data Constructor (with caching)
- action_name: data_constructor
  seed: 42
  dataset:
    entity_types: [player, movie, city]
  # 💾 Caches outputs: {"data_files": "/path/to/data", "sample_count": 1000}

# Action 2: Model Trainer (with caching + dependencies)
- action_name: model_trainer
  device: cuda  # Overrides pipeline device
  depends_on:
    data_constructor:
      data_files: dataset.data_files      # Map their output to my config
      sample_count: training.num_samples  # Flexible dot-notation paths
  # 💾 Caches outputs: {"model_path": "/path/to/model.pt", "accuracy": 0.95}
```
```python
# What the Pipeline automatically does:

# 1. Check if data_constructor cached results exist
if cache_exists("data_constructor_config_hash"):
    outputs1 = load_from_cache()       # ⚡ Instant loading
else:
    outputs1 = data_constructor.run()  # 🚀 Run and cache
    save_to_cache(outputs1)

# 2. Inject outputs into the next action's config
model_trainer.config.dataset.data_files = outputs1["data_files"]      # "/path/to/data"
model_trainer.config.training.num_samples = outputs1["sample_count"]  # 1000

# 3. Check if model_trainer cached results exist
if cache_exists("model_trainer_config_hash"):
    outputs2 = load_from_cache()    # ⚡ Instant loading
else:
    outputs2 = model_trainer.run()  # 🚀 Run and cache
    save_to_cache(outputs2)
```
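The dot-notation targets in `depends_on` (like `dataset.data_files` above) are just paths into the downstream action's config. Here is a minimal, self-contained sketch of how such an injection step could work; `set_by_path`, the mapping, and the configs are illustrative stand-ins, not Urartu's actual internals:

```python
# Hypothetical helper: write a value into a nested config dict
# following a dot-notation path such as "dataset.data_files".
def set_by_path(config: dict, path: str, value) -> None:
    keys = path.split(".")
    for key in keys[:-1]:
        config = config.setdefault(key, {})
    config[keys[-1]] = value

# Outputs published by the upstream action via get_outputs()
upstream_outputs = {"data_files": "/path/to/data", "sample_count": 1000}

# depends_on mapping: upstream output name -> dot-path in my config
mapping = {"data_files": "dataset.data_files", "sample_count": "training.num_samples"}

my_config: dict = {}
for output_name, target_path in mapping.items():
    set_by_path(my_config, target_path, upstream_outputs[output_name])

print(my_config)
# {'dataset': {'data_files': '/path/to/data'}, 'training': {'num_samples': 1000}}
```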
The key innovation: Pipelines inherit from `Action`, making them fully composable building blocks!
```yaml
# Use pipelines inside other pipelines
pipeline_config:
  actions:
    - action_name: data_preprocessing
    - action_name: ml_training_pipeline  # This is a pipeline!
    - action_name: evaluation_pipeline   # This is also a pipeline!
    - action_name: deployment
```
```yaml
# Create reusable pipeline building blocks

# data_processing_pipeline.yaml
action_name: data_processing_pipeline
pipeline_config:
  actions:
    - action_name: data_cleaning
    - action_name: feature_engineering
    - action_name: data_validation
```

```yaml
# main_workflow.yaml - Reuse the data processing pipeline
action_name: main_workflow
pipeline_config:
  actions:
    - action_name: data_processing_pipeline  # Reuse!
    - action_name: model_training
    - action_name: evaluation_pipeline       # Another reusable component
```
Build sophisticated multi-level architectures:
```yaml
pipeline_config:
  actions:
    - action_name: simple_action          # Regular action
    - action_name: data_pipeline          # Pipeline as action
    - action_name: another_simple_action  # Regular action
    - action_name: complex_pipeline       # Another pipeline
```
Every pipeline action must implement the `get_outputs()` method:
```python
from urartu.common import Action


class DataPreprocessing(Action):
    def run(self):
        # Preprocess raw data
        self.processed_data_path = self.preprocess_dataset()
        self.feature_stats = self.compute_statistics()

    def get_outputs(self):
        """Return outputs for pipeline consumption."""
        return {
            "processed_data": str(self.processed_data_path),
            "feature_statistics": self.feature_stats,
            "num_samples": len(self.dataset),
        }


class ModelTraining(Action):
    def run(self):
        # Train model using preprocessed data
        self.model_path = self.train_model()
        self.training_metrics = self.evaluate_training()

    def get_outputs(self):
        """Return outputs for pipeline consumption."""
        return {
            "model_checkpoint": str(self.model_path),
            "training_accuracy": self.training_metrics["accuracy"],
            "loss_history": self.training_metrics["loss_history"],
        }
```
Configure pipelines using YAML files that define the action sequence and dependencies:
```yaml
# config/action_config/ml_pipeline.yaml
action_name: ml_pipeline

pipeline_config:
  experiment_name: "Complete ML Pipeline"
  device: cuda  # Inherited by all actions unless overridden
  seed: 42

  # Pipeline caching configuration
  cache_enabled: true
  force_rerun: false
  cache_max_age_hours: 24

  # Memory management (NEW!)
  memory_management:
    auto_cleanup: true       # Clean up after each action
    force_cpu_offload: true  # Move models to CPU when not in use
    aggressive_gc: true      # Force garbage collection

  # Define the pipeline workflow
  actions:
    # Step 1: Data Preprocessing
    - action_name: data_preprocessing
      dataset:
        source: "raw_data.csv"
        validation_split: 0.2
        normalize: true
      preprocessing:
        remove_outliers: true
        feature_scaling: "standard"
      model:
        batch_size: 16  # Parallelization support

    # Step 2: Model Training (NEW: Explicit dependencies!)
    - action_name: model_training
      device: cuda  # Override pipeline device if needed
      # NEW: Explicit dependency declaration
      depends_on:
        data_preprocessing:
          processed_data: dataset.data_path   # Map outputs to config paths
          feature_stats: model.feature_stats  # Can map multiple outputs
      model:
        architecture: "transformer"
        hidden_size: 768
        num_layers: 12
        batch_size: 32  # Batch processing optimization
      training:
        epochs: 10
        learning_rate: 1e-4
      # NEW: Action-specific memory management
      memory_management:
        offload_to_cpu: true
        clear_cache_after_batch: true
        max_feature_cache_size: 100

    # Step 3: Evaluation
    - action_name: model_evaluation
      depends_on:
        model_training:
          model_checkpoint: model.path
          training_accuracy: validation.baseline
        data_preprocessing:
          processed_data: dataset.test_data
      metrics: ["accuracy", "f1_score", "auc"]

    # Step 4: Deployment
    - action_name: model_deployment
      depends_on:
        model_training:
          model_checkpoint: deployment.model_path
        model_evaluation:
          accuracy: deployment.performance_score
      deployment:
        performance_threshold: 0.85
        target: "production"
```
Execute pipelines just like individual actions:
```bash
# Run the complete ML pipeline
urartu action_name=ml_pipeline

# Force rerun without cache
urartu action_name=ml_pipeline +pipeline_config.force_rerun=true

# Override specific configurations
urartu action_name=ml_pipeline ++pipeline_config.actions[1].training.epochs=20

# Run with multirun for hyperparameter sweeps
urartu --multirun action_config=ml_pipeline pipeline_config.actions[1].training.learning_rate=1e-3,1e-4,1e-5
```
🔗 Dynamic Dependency System (NEW!):
```yaml
# Explicitly declare what each action needs from previous actions
- action_name: model_training
  depends_on:
    data_preprocessing:
      processed_data: dataset.data_path   # Map any output to any config path
      feature_stats: model.feature_stats  # Multiple mappings supported
      sample_count: training.num_samples  # Flexible dot-notation paths
```
⚡ Batch Processing & Parallelization (NEW!):
```yaml
# Enable high-performance batch processing
model:
  batch_size: 32                # Process multiple samples simultaneously
  use_parallel: true            # Parallel entity processing
  max_workers: 4                # Number of parallel workers
  use_parallel_templates: true  # Parallel template construction
```
🧠 Intelligent Memory Management (NEW!):
```yaml
# Automatic memory management for large models
memory_management:
  auto_cleanup: true               # Clean up after each action
  force_cpu_offload: true          # Move models to CPU when not in use
  aggressive_gc: true              # Force garbage collection

  # Action-specific settings:
  offload_to_cpu: true             # Offload features to CPU
  clear_cache_after_batch: true    # Clear cache frequently
  layer_by_layer_processing: true  # Fallback for OOM situations
  max_feature_cache_size: 100      # Limit cache growth
```
🔄 Device Configuration Inheritance:
```yaml
pipeline_config:
  device: auto  # Default for all actions
  actions:
    - action_name: data_prep        # Inherits device: auto
    - action_name: gpu_training
      device: cuda                  # Overrides to use GPU
    - action_name: cpu_postprocess
      device: cpu                   # Overrides to use CPU
```
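Conceptually, resolution is a plain fallback: an action uses its own `device` when set, otherwise it inherits the pipeline default. A small illustrative sketch (not Urartu's actual code):

```python
# Illustrative only: resolve each action's device with pipeline-level fallback.
pipeline_config = {
    "device": "auto",
    "actions": [
        {"action_name": "data_prep"},
        {"action_name": "gpu_training", "device": "cuda"},
        {"action_name": "cpu_postprocess", "device": "cpu"},
    ],
}

for action in pipeline_config["actions"]:
    device = action.get("device", pipeline_config["device"])  # action wins, else inherit
    print(f'{action["action_name"]}: {device}')
# data_prep: auto
# gpu_training: cuda
# cpu_postprocess: cpu
```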
**Smart Caching**: each action's outputs are keyed by a hash of its configuration, so unchanged actions reload instantly from cache.
**Configuration Inheritance**:
```yaml
# Import base configurations and extend them
defaults:
  - /action_config/base_model@pipeline.model
  - /action_config/datasets/image_classification@pipeline.dataset

# Then override specific fields as needed
pipeline_config:
  dataset:
    batch_size: 64  # Override just the batch size
```
In every pattern below, `depends_on` wires each stage's outputs into the next:
**Data Science Workflow**:
Data Collection → Cleaning → Feature Engineering → Model Training → Evaluation → Deployment

**NLP Pipeline**:
Text Preprocessing → Tokenization → Model Training → Fine-tuning → Inference → Analysis

**Computer Vision Pipeline**:
Image Augmentation → Model Training → Validation → Test Evaluation → Model Optimization

**Research Pipeline**:
Experiment Setup → Multiple Model Training → Comparative Analysis → Visualization → Report Generation
The Pipeline System transforms Urartu from a single-action executor into a comprehensive workflow orchestration platform, perfect for end-to-end machine learning projects! 🚀
Every Action in Urartu automatically provides intelligent caching - the foundation of efficient ML workflows!
Each Action automatically:

- Hashes its configuration to identify a run
- Saves its `get_outputs()` results to the cache
- Reloads those results instantly when rerun with the same config
```python
# Example: What happens when you run an Action
@cached_action  # Automatic - no extra code needed!
class ModelTraining(Action):
    def run(self):
        # Your expensive ML training code
        self.model = train_large_model()  # Takes 2 hours

    def get_outputs(self):
        return {"model_path": str(self.model_path)}

# First run: Takes 2 hours, saves to cache
# Second run with same config: Loads in 0.1 seconds! ⚡
```
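Caching of this kind is typically keyed on a hash of the action's configuration, as the `*_config_hash` names earlier suggest. The sketch below shows the general idea, with an age check mirroring `cache_max_age_hours`; the helper names, hashing scheme, and cache layout are illustrative, not Urartu's exact implementation:

```python
import hashlib
import json
import pickle
import time
from pathlib import Path

CACHE_DIR = Path(".runs/action_cache")  # illustrative cache location

def config_hash(config: dict) -> str:
    """Stable hash of a config: same settings -> same cache key."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def load_or_run(config: dict, run_fn, max_age_hours: float = 24.0):
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{config_hash(config)}.pkl"
    # Cache hit: reuse results if fresh enough (cf. cache_max_age_hours)
    if path.exists() and (time.time() - path.stat().st_mtime) < max_age_hours * 3600:
        return pickle.loads(path.read_bytes())
    outputs = run_fn(config)  # cache miss: do the expensive work
    path.write_bytes(pickle.dumps(outputs))
    return outputs

outputs = load_or_run({"seed": 42, "epochs": 10},
                      lambda cfg: {"model_path": "/tmp/model.pt"})
```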
```yaml
# Individual Action caching
action_config:
  cache_enabled: true      # Enable/disable caching (default: true)
  force_rerun: false       # Force rerun even if cached (default: false)
  cache_max_age_hours: 24  # Cache validity in hours (default: no expiry)

# Pipeline-level caching
pipeline_config:
  cache_enabled: true      # Enable pipeline-level caching
  force_rerun: false       # Force rerun entire pipeline
  cache_max_age_hours: 24  # Pipeline cache validity
```
- Cache directories (`.runs/action_cache/`, `.runs/pipeline_cache/`) survive across runs
- Human-readable `.json` metadata files sit alongside the `.pkl` cache files for easy inspection

```bash
# First run: All Actions execute and cache
urartu action=ml_pipeline  # Takes 3 hours

# Change only training hyperparameters
# Second run: Only model_training reruns, data preprocessing loads from cache!
urartu action=ml_pipeline  # Takes 1 hour (2 hours saved!)

# Force rerun specific action
urartu action=ml_pipeline ++pipeline_config.force_rerun=true

# Force rerun a single action (ignores cache)
urartu action=my_action ++action.force_rerun=true

# Force rerun entire pipeline (ignores cache)
urartu action=my_pipeline ++pipeline.force_rerun=true

# Clear cache manually (nuclear option)
rm -rf .runs/action_cache .runs/pipeline_cache
```
🎯 Result: Never waste compute cycles on identical configurations - focus on what's actually changing!
Urartu includes state-of-the-art performance optimizations and memory management features designed for large-scale ML workloads.
**Automatic Batch Inference**: multiple samples are processed per forward pass instead of one at a time (controlled by `batch_size`).

**Parallel Entity Processing**: independent entities are processed concurrently by worker threads (see `use_parallel` and `max_workers`).

**Configuration Example**:
```yaml
action_config:
  model:
    batch_size: 16                # Batch size for inference
    use_parallel: true            # Enable parallelization
    max_workers: 4                # Number of parallel workers
    use_parallel_templates: true  # Parallel template construction
    template_max_workers: 4       # Workers for template construction
```
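To make these knobs concrete, here is a minimal sketch of batched work distributed over a thread pool, in the spirit of `batch_size` and `max_workers`; `process_batch` and the data are hypothetical stand-ins, not Urartu's API:

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 16   # cf. model.batch_size
MAX_WORKERS = 4   # cf. model.max_workers

def process_batch(batch: list) -> list:
    # Hypothetical stand-in for one batched inference call.
    return [item.upper() for item in batch]

entities = [f"entity_{i}" for i in range(100)]
batches = [entities[i:i + BATCH_SIZE] for i in range(0, len(entities), BATCH_SIZE)]

# Parallel entity processing: independent batches run on worker threads.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = [item for batch_out in pool.map(process_batch, batches) for item in batch_out]

print(len(results))  # 100
```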
**Intelligent OOM Prevention**:
```yaml
# Comprehensive memory management configuration
memory_management:
  auto_cleanup: true               # Automatic cleanup after each action
  force_cpu_offload: true          # Move models to CPU when not in use
  aggressive_gc: true              # Force garbage collection

  # Action-specific memory management
  offload_to_cpu: true             # Offload features to CPU to save GPU memory
  clear_cache_after_batch: true    # Clear cache after each batch
  layer_by_layer_processing: true  # Process layers individually on OOM
  max_feature_cache_size: 100      # Limit feature cache growth
```
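For intuition, options like `force_cpu_offload` and `aggressive_gc` map onto standard PyTorch housekeeping. A hedged sketch of what such a cleanup step might do between actions (illustrative, not Urartu's code):

```python
import gc
from typing import Optional

import torch

def cleanup_after_action(model: Optional[torch.nn.Module] = None,
                         cpu_offload: bool = True,
                         aggressive: bool = True) -> None:
    """Free GPU memory between pipeline steps (illustrative sketch)."""
    if model is not None and cpu_offload:
        model.to("cpu")  # cf. force_cpu_offload: park idle models on CPU
    if aggressive:
        gc.collect()     # cf. aggressive_gc: drop dangling Python references
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached GPU blocks to the allocator
```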
**Multi-Level OOM Protection**: on out-of-memory errors, processing falls back to smaller units, down to `layer_by_layer_processing` as a last resort.

**Expected Performance Gains**:

**Graceful Degradation**: features are offloaded to CPU and caches are cleared (`offload_to_cpu`, `clear_cache_after_batch`) rather than failing outright.

**Resource Monitoring**:
Urartu is equipped with a comprehensive logging system to ensure no detail of your project's execution is missed. Here's how it works:
- Standard runs are logged under `.runs/${action_name}/${now:%Y-%m-%d}_${now:%H-%M-%S}`
- Debug runs are logged under `.runs/debug/${action_name}/${now:%Y-%m-%d}_${now:%H-%M-%S}`
- Multi-run sessions use the same debug layout with a `_multirun` suffix to differentiate them: `.runs/debug/${action_name}/${now:%Y-%m-%d}_${now:%H-%M-%S}_multirun`

Each run directory is organized to contain essential files, and additional files may be included depending on the type of run, ensuring you have all the data you need at your fingertips.
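If you want to grab the newest run directory programmatically, a small helper along these lines works with the layout described above (the function is illustrative, not part of Urartu's API):

```python
from pathlib import Path
from typing import Optional

def latest_run_dir(action_name: str, root: str = ".runs") -> Optional[Path]:
    """Return the most recent run directory for an action, if any exists."""
    run_root = Path(root) / action_name
    candidates = [p for p in run_root.glob("*") if p.is_dir()]
    # Timestamped names like 2024-01-31_12-00-00 sort chronologically.
    return max(candidates, default=None)

print(latest_run_dir("ml_pipeline"))
```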
Launching with Urartu is a breeze, offering you two launch options:

- **Local**: run directly on your machine.
- **Cluster (Slurm)**: submit jobs to a cluster via Submitit.

Toggle `slurm.use_slurm` in `config_{username}/slurm/slurm.yaml` to switch between local and cluster executions.

Choose your adventure and launch your projects with ease! 🚀
Encountered any issues or have suggestions? Feel free to open an issue for support.
Unveil insights with ease using Urartu in partnership with Aim, the intuitive and powerful open-source AI metadata tracker. To access a rich trove of metrics captured by Aim, simply run:

```bash
aim up
```
Watch as Aim brings your experiments into sharp relief, providing the clarity needed to drive informed decisions and pioneering efforts in machine learning. 🚀
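Urartu wires Aim in for you (note the `aim=aim` flag in the launch commands above), but for reference, tracking a metric directly with Aim's Python SDK looks like this; the experiment name and values are illustrative:

```python
from aim import Run

# Create a run and attach some hyperparameters (names are illustrative).
run = Run(experiment="urartu_demo")
run["hparams"] = {"learning_rate": 1e-4, "num_beams": 5}

# Track a metric over steps; these show up in the `aim up` UI.
for step, loss in enumerate([0.9, 0.6, 0.4]):
    run.track(loss, name="loss", step=step)
```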