Think of UrarTU as the foundational framework for your projects, similar to an abstract class in object-oriented programming (OOP).
Your project acts as the implementation, where UrarTU provides the scaffolding with high-level abstractions, `.yaml` configuration, and slurm job management.
It also includes key NLP features such as dataset readers, model loaders, and device handlers.
Here's how to get started; by following these steps, you can efficiently set up and customize your machine learning projects with UrarTU.
The first step is to create a project structure similar to UrarTU's. Here is the structure of the `starter_template` project we are aiming for, which contains a `generate` action and the configs for a basic autoregressive generation task:
```
starter_template
├── actions
│   ├── generate.py
│   └── __init__.py
├── configs
│   ├── action_config
│   │   └── generate.yaml
│   └── __init__.py
├── configs_tamoyan
│   ├── aim
│   │   └── aim.yaml
│   ├── __init__.py
│   └── slurm
│       ├── no_slurm.yaml
│       └── slurm.yaml
└── __init__.py
```
It is a basic module that contains an `actions` directory, general configs in `configs`, and user-specific configs in `configs_tamoyan`. Simply copy this structure into your `starter_template` project.
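If you prefer working from the terminal, the following sketch recreates this exact skeleton (the paths are taken directly from the tree above):

```bash
# Recreate the starter_template skeleton by hand.
mkdir -p starter_template/actions \
         starter_template/configs/action_config \
         starter_template/configs_tamoyan/aim \
         starter_template/configs_tamoyan/slurm

touch starter_template/__init__.py \
      starter_template/actions/__init__.py \
      starter_template/actions/generate.py \
      starter_template/configs/__init__.py \
      starter_template/configs/action_config/generate.yaml \
      starter_template/configs_tamoyan/__init__.py \
      starter_template/configs_tamoyan/aim/aim.yaml \
      starter_template/configs_tamoyan/slurm/no_slurm.yaml \
      starter_template/configs_tamoyan/slurm/slurm.yaml
```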
Create a configuration template. Here's a basic structure for the `generate.yaml` configuration file:
```yaml
# @package _global_

action_name: generate
debug: false

action_config:
  experiment_name: "Example - next token prediction"
  device: "gpu"  # auto, cuda, cpu (default)

  task:
    model:
      type:
        _target_: urartu.models.model_causal_language.ModelCausalLanguage
      name: gpt2
      dtype: torch.float32
      cache_dir: ""
      generate:
        max_length: 100
        num_beams: 5
        no_repeat_ngram_size: 2

    dataset:
      type:
        _target_: urartu.datasets.hf.dataset_from_hub.DatasetFromHub
      name: truthfulqa/truthful_qa
      subset: generation
      split: validation
      input_key: "question"
```
The `task` contains two main configs: `model` and `dataset`. Pay attention to their `_target_` argument, which points to a `urartu` class. These classes are instantiated using the rest of the configs in the body, e.g. the "validation" split of the "generation" subset of the `truthfulqa/truthful_qa` dataset from the Hugging Face Hub.
The `generate` config will be passed to the model's `generate` function.
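Under the hood this is the standard Hydra instantiation mechanism. Here is a minimal, hypothetical sketch of how a `_target_` entry is resolved; UrarTU's own loaders may wire the remaining keys differently (e.g. nesting `_target_` under `type` as above):

```python
from hydra.utils import instantiate

# Illustration: instantiate() imports the class named in _target_
# and calls it, passing the remaining keys as keyword arguments.
node = {
    "_target_": "urartu.datasets.hf.dataset_from_hub.DatasetFromHub",
    "name": "truthfulqa/truthful_qa",
    "subset": "generation",
    "split": "validation",
}
dataset = instantiate(node)
# Roughly equivalent to:
# DatasetFromHub(name="truthfulqa/truthful_qa", subset="generation", split="validation")
```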
This is a general configuration file for the next-token prediction project. However, if multiple team members are working on the same project and have their own specific configurations, follow these steps:

1. Duplicate the `configs` directory and name it `configs_{username}`, where `{username}` is your OS username.
2. Place your user-specific configs in the `configs_{username}` directory.

Following this convention, I've created a custom configuration that's unique to my setup:
```yaml
# @package _global_

use_aim: true
repo: aim://0.0.0.0:43800
log_system_params: true
```
With just a few straightforward slurm configuration parameters, we can seamlessly submit our action to the slurm system. To achieve this, fill in the `slurm` configuration in `starter_template/configs_{username}/slurm/slurm.yaml`.
Setting the `use_slurm` argument to `true` activates slurm job submission. The other arguments align with familiar `sbatch` command options.
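As a rough illustration, a filled-in `slurm.yaml` might look like the sketch below; the exact key names here are my assumptions and should be checked against UrarTU's documentation, but they mirror common `sbatch` flags:

```yaml
# Illustrative sketch only; verify the exact keys against UrarTU's docs.
use_slurm: true       # submit the action as a slurm job
name: "generate_job"  # job name, as in sbatch --job-name
partition: "gpu"      # sbatch --partition
nodes: 1              # sbatch --nodes
gpus_per_node: 1      # number of GPUs per node
mem_gb: 20            # memory per node, in GB
time_min: 1440        # wall-clock limit in minutes
```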
We have also added a `no_slurm.yaml` file at `starter_template/configs_{username}/slurm/no_slurm.yaml` that simply contains `use_slurm: false`.
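For reference, that file's entire body is the single line:

```yaml
use_slurm: false
```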
Next, let's create the action file that will use the parsed configuration to kickstart your work. Inside `generate.py`, define a `main` method with the following arguments:
```python
from aim import Run, Text
from omegaconf import DictConfig


def main(cfg: DictConfig, aim_run: Run):
    action = Generate(cfg, aim_run)
    action.main()
```
The `cfg` parameter will contain the overridden parameters, and `aim_run` is an instance of our Aim run for tracking progress.
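Since `cfg` is an OmegaConf `DictConfig`, nested values can be read with either attribute or key access. A quick, self-contained illustration (not UrarTU-specific):

```python
from omegaconf import OmegaConf

# A DictConfig behaves like a nested dict with attribute access.
cfg = OmegaConf.create({"action_config": {"task": {"model": {"name": "gpt2"}}}})

print(cfg.action_config.task.model.name)              # gpt2
print(cfg["action_config"]["task"]["model"]["name"])  # gpt2
```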
### Generate Class

Now, let's create the `Generate` class:
```python
from aim import Run
from omegaconf import DictConfig

from urartu.common.action import Action


class Generate(Action):
    def __init__(self, cfg: DictConfig, aim_run: Run) -> None:
        super().__init__(cfg, aim_run)

    def main(self):
        # Your code goes here
        pass
```
Ensure that `Generate` inherits from the abstract `Action` class. From this point forward, you have full control to implement your text generation logic. Here is the final script:
```python
from aim import Run, Text
from omegaconf import DictConfig
from tqdm import tqdm

from urartu.common.action import Action
from urartu.common.dataset import Dataset
from urartu.common.model import Model


class Generate(Action):
    def __init__(self, cfg: DictConfig, aim_run: Run) -> None:
        super().__init__(cfg, aim_run)

    def main(self):
        # Instantiate the model and dataset from their _target_ configs
        model = Model.get_model(self.task_cfg.model)
        dataset = Dataset.get_dataset(self.task_cfg.dataset)

        for idx, sample in tqdm(enumerate(dataset.dataset)):
            # Read the prompt from the configured input column
            prompt = sample[self.task_cfg.dataset.get("input_key")]
            self.aim_run.track(Text(prompt), name="input")

            # Continue the sequence and log the generation
            output = model.generate(prompt)
            self.aim_run.track(Text(output), name="output")


def main(cfg: DictConfig, aim_run: Run):
    action = Generate(cfg, aim_run)
    action.main()
```
Here, we utilize the Hugging Face causal language model to continue a given token sequence from `truthful_qa`. We then track the model's inputs and outputs with Aim.
Let's navigate to the project directory in the terminal:

```bash
cd starter_template  # or wherever your project lives
```
After this, you can easily run the `generate` action from the command line by specifying `generate` as the `action_config`:
```bash
urartu action_config=generate aim=aim slurm=no_slurm
```
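Any value from the configuration can also be overridden directly on the command line using Hydra's dot notation. The specific overrides below are just illustrations:

```bash
# Hypothetical overrides: swap the model checkpoint and shorten generation.
urartu action_config=generate aim=aim slurm=no_slurm \
    action_config.task.model.name=gpt2-medium \
    action_config.task.model.generate.max_length=50
```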
You can streamline your experimentation by using Hydra's `--multirun` flag, which lets you submit multiple runs with different parameters all at once. For example, if you need to run the same script with various model `dtype`s, add the following sweep to your config:
```yaml
hydra:
  sweeper:
    params:
      ++action_config.task.model.dtype: torch.float16, torch.float32, torch.bfloat16
```
The double plus sign (`++`) will append this configuration to the existing one, resulting in three runs with `action_config.task.model.dtype` set to `torch.float16`, `torch.float32`, and `torch.bfloat16`. Then launch the sweep with the `--multirun` flag:
```bash
urartu --multirun action_config=generate aim=aim slurm=no_slurm
```
This approach simplifies the process of running experiments with various configurations, making it easier to explore and optimize your models.
To monitor your experiment's progress and view tracked metadata, simply initiate Aim with the following command:
```bash
aim up
```
You can expect a similar experience as demonstrated in the following image:
https://github.com/tamohannes/urartu/assets/23078323/11705f35-e3df-41f0-b0d1-42eb846a5921
`UrarTU` is built upon a straightforward combination of three widely recognized libraries. For more in-depth information on how each of these libraries operates, please consult their respective GitHub repositories:

- Hydra: [GitHub Repository](https://github.com/facebookresearch/hydra), [Getting started | Hydra](https://hydra.cc/docs/1.3/intro/)

These repositories provide detailed insights into the inner workings of each library.