This repository contains a distributed training system that can be used to train models on different kinds of data across multiple GPUs.
Note
Currently, this system is explicitly designed to run on genomic data.
This codebase has been used to train EvaReg models. It is a distributed training platform for training models on genomic data on a single-machine, multi-GPU system.
This pipeline requires some packages to be installed before running. Follow these steps:
- Create a conda environment using the following command: conda create -n <your-env-name> python=3.11
- Activate the environment: conda activate <your-env-name>
- Run the install.sh bash script: bash install.sh
The codebase has three main groups of modules that work together to form the complete pipeline.
This section acts as the dataloader for model training. In this implementation, the module is written to process text data stored in h5 format. The create_dataloader function creates the dataloaders for the training, validation, and test steps.
Note
Since we employ DDP (Distributed Data Parallel) during training, the dataloader samples the data using DistributedSampler.
The genomeloader section has been designed specifically to parse the h5 files and convert the bases of the sequences into their one-hot encoded form. The following one-hot encoding is used:
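To illustrate what sampling with DistributedSampler means in practice, here is a minimal plain-Python sketch of the index partitioning idea: each process (rank) reads a disjoint, equally sized slice of the dataset. The function name shard_indices is hypothetical, shuffling is left out, and the sketch assumes the dataset has at least as many samples as there are processes. It mimics the idea only and is not the PyTorch implementation.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Return the sample indices one rank would read (no shuffling)."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    # Pad by repeating leading indices so every rank gets the same count.
    indices += indices[: total - dataset_len]
    # Strided slice: rank r takes positions r, r + world_size, ...
    return indices[rank:total:world_size]


# 10 samples spread over 4 processes: 3 samples each, 2 samples repeated.
shards = [shard_indices(10, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Because every rank receives the same number of samples, all processes run the same number of steps per epoch, which keeps collective operations in step.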
| A | C | G | T |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
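As a minimal sketch of the encoding in the table above, the following plain-Python function maps each base to a 4-element row in A, C, G, T order. The function name one_hot_encode is hypothetical, and the handling of any other symbol (for example N) as an all-zero row is an assumption made for this example, not something taken from the genomeloader code.

```python
BASES = "ACGT"  # column order used in the table above


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order."""
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1
        # Assumption: symbols outside A/C/G/T (e.g. N) stay all zeros.
        encoded.append(row)
    return encoded


print(one_hot_encode("ACGT"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```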
Note
Based on how we processed our dataset, the dataloader outputs only two values stored in the h5 files: "sequence" and "cDNA_counts". If you process your own data, the resulting files must contain these two entries under exactly these names.
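Before training on your own data, it can help to check that the two required names are present. The sketch below is one possible check, assuming the names are stored as top-level entries; the helper missing_keys is hypothetical, and a plain dict stands in for an opened file so the example runs without any extra packages.

```python
REQUIRED_KEYS = ("sequence", "cDNA_counts")


def missing_keys(container, required=REQUIRED_KEYS):
    """Return the required names that are absent from a mapping-like object."""
    return [name for name in required if name not in container]


# A plain dict stands in for an opened h5 file in this example.
good = {"sequence": ["ACGT"], "cDNA_counts": [12.0]}
bad = {"sequence": ["ACGT"]}

print(missing_keys(good))  # []
print(missing_keys(bad))   # ['cDNA_counts']
```

An open h5py.File supports the same membership test on its top-level names, so it could be passed to such a helper directly.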
This contains the core training engine that will be used for training the model. The core engine has two main components: TrainEngine and DistributedSetup.
This is the main training engine, which we used to train, validate, and test the model under different data configurations. Because training is distributed, the gradients have to be aggregated across all devices and the result (after the loss operations) distributed back to the GPUs. This is done with dist.all_reduce. During the validation and testing phases, the engine calculates the correlation between the measured scores and the predicted scores.
This is the main class that starts the distributed training. It checks how many devices have been requested (via the config.py file) and, if they exist, initializes them and synchronizes all processes.
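The aggregate-then-distribute pattern behind an all-reduce can be sketched without any GPUs. The plain-Python simulation below is an illustration only: a list holds one value per process, and the helper names all_reduce_sum and all_reduce_mean are hypothetical. In PyTorch, the corresponding call is torch.distributed.all_reduce, which updates each process's tensor in place.

```python
def all_reduce_sum(per_rank_values):
    """Simulate a SUM all-reduce: every rank ends up holding the same total."""
    total = sum(per_rank_values)
    return [total for _ in per_rank_values]


def all_reduce_mean(per_rank_values):
    """Sum across ranks, then divide by the number of ranks on each rank."""
    world_size = len(per_rank_values)
    return [value / world_size for value in all_reduce_sum(per_rank_values)]


# One local loss value per simulated GPU.
local_losses = [0.5, 0.75, 0.25, 0.5]
synced = all_reduce_mean(local_losses)
print(synced)  # [0.5, 0.5, 0.5, 0.5]
```

After the operation every rank holds the same number, so logging or scheduling decisions based on it stay consistent across processes.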
# ==============================
# DATA SPECIFICATIONS
# ==============================
SEQDATA = "./seq_boaec_into_boaec"
TOTAL_CHROMOSOMES = 29
# ==============================
# MODEL SPECIFICATIONS
# ==============================
SEQUENCE_LEN = 600
NUM_CLASSES = 1
FINETUNE_MODE = False
FINETUNE_MODEL_WEIGHTS = "cattle_model.pth"
# ==============================
# MODEL TRAINING SPECIFICATIONS
# ==============================
N_EPOCHS = 10
WARMUP_EPOCHS = 2
K_FOLD = 3
N_TRIALS = 10
# ==============================
# TEST DATA SPECIFICATIONS
# ==============================
TEST_CHROMOSOMES = [8, 13]
# ==============================
# GPU SPECIFICATIONS
# ==============================
NODES = 1
GPUS = 1
# ==============================
# OPTUNA SPECIFICATIONS
# ==============================
LOW_MIN_LR, HIGH_MIN_LR = 1e-8, 1e-4
LOW_MAX_LR, HIGH_MAX_LR = 1e-4, 1e-1
LOW_DECAY, HIGH_DECAY = 1e-8, 1e-2
LOW_BATCH, HIGH_BATCH = 256, 1024
BATCH_STEP = 32
For the data configuration, the pipeline uses the config.py file. This file contains all the settings needed to run the model.
- SEQDATA: location of the dataset
- TOTAL_CHROMOSOMES: total number of chromosomes/shards in the dataset
- SEQUENCE_LEN: total length of the sequence used for model training
- NUM_CLASSES: the total number of outputs from the model
- FINETUNE_MODE: whether the pipeline is to be used for finetuning
- FINETUNE_MODEL_WEIGHTS: weights to be used for model training
- N_EPOCHS: total number of epochs for model training
- WARMUP_EPOCHS: total number of epochs the model should run before the schedulers kick in
- K_FOLD: total folds of data ($k$-fold cross-validation)
- N_TRIALS: total number of trials for the Optuna hyperparameter search
- TEST_CHROMOSOMES: the chromosomes that are left out for testing
- NODES: number of nodes in the system
- GPUS: number of GPUs in the node
- LOW_MIN_LR, HIGH_MIN_LR: range of minimum LR for Optuna
- LOW_MAX_LR, HIGH_MAX_LR: range of maximum LR for Optuna
- LOW_DECAY, HIGH_DECAY: range of decay for Optuna
- LOW_BATCH, HIGH_BATCH: range of batch sizes for Optuna
- BATCH_STEP: step size used for increasing the batch size
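To make the search ranges concrete, the sketch below lists the batch sizes implied by LOW_BATCH, HIGH_BATCH and BATCH_STEP and checks that each range is well formed. The helpers batch_candidates and check_range are hypothetical and not part of the pipeline; the example also assumes that the upper bound of the batch range is inclusive.

```python
# Values copied from the example config.py above.
LOW_MIN_LR, HIGH_MIN_LR = 1e-8, 1e-4
LOW_MAX_LR, HIGH_MAX_LR = 1e-4, 1e-1
LOW_BATCH, HIGH_BATCH = 256, 1024
BATCH_STEP = 32


def batch_candidates(low, high, step):
    """List every batch size from low to high (inclusive) in increments of step."""
    return list(range(low, high + 1, step))


def check_range(name, low, high):
    """Raise if a search range is empty or inverted."""
    if not low < high:
        raise ValueError(f"{name}: expected low < high, got {low} and {high}")


check_range("min LR", LOW_MIN_LR, HIGH_MIN_LR)
check_range("max LR", LOW_MAX_LR, HIGH_MAX_LR)
check_range("batch size", LOW_BATCH, HIGH_BATCH)

sizes = batch_candidates(LOW_BATCH, HIGH_BATCH, BATCH_STEP)
print(len(sizes), sizes[0], sizes[-1])  # 25 256 1024
```

Learning rates spanning several orders of magnitude, as here, are commonly sampled on a log scale; whether this pipeline does so is not stated in this README.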
To run the pipeline, you only need to configure the config.py file properly. You can then start the pipeline using the following command:
python main.py
Note
Since no dedicated output folder is configured, the output from the model is saved in your current working directory.
In our tuning pipeline, we have focused on reducing bias as much as possible. One of the key things integrated into the system is cross-validation over folds of the data during hyperparameter tuning.
The tuning process proceeds as follows:
- From the very start of the process, the test dataset is separated out. This ensures that the test data stays out of the training data at every step of the hyperparameter search/tuning and training processes.
- The rest of the data is split into $n$ folds.
- For each fold that is left out for validation, the model is trained with one set of hyperparameters for a total of $p$ epochs, as set by the user.
- The model is then tested on the validation dataset (the fold that was left out). The next fold is then chosen as the validation fold and training proceeds as described above. At each stage, we collect the correlations.
- After every fold has been part of the validation data exactly once, we take the mean of all the correlations. The mean correlation becomes the metric that tells the Optuna pipeline how well the hyperparameters did for that round of training.
- The steps are repeated for $k$ trials.
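The steps above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the pipeline's code: the names pearson, cross_validate and dummy_model are hypothetical, the folds are formed by simple striding, the correlation is assumed to be Pearson's (the README does not say which one is used), and the "model" just echoes its input in place of real training.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def cross_validate(samples, n_folds, train_and_predict):
    """Hold each fold out once and return the mean validation correlation."""
    folds = [samples[i::n_folds] for i in range(n_folds)]
    correlations = []
    for i, val_fold in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        predictions = train_and_predict(train, val_fold)
        measured = [target for _, target in val_fold]
        correlations.append(pearson(measured, predictions))
    return sum(correlations) / len(correlations)


# Toy (input, target) pairs; the first two act as the held-out test set.
data = [(x, 2 * x + 1) for x in range(12)]
test_set, train_pool = data[:2], data[2:]


def dummy_model(train, val):
    # Stand-in for real training: predicts the raw input value.
    return [x for x, _ in val]


mean_corr = cross_validate(train_pool, n_folds=3, train_and_predict=dummy_model)
print(round(mean_corr, 6))  # 1.0
```

In a hyperparameter search, one such mean would be computed per trial and reported as that trial's score, while test_set stays untouched until the final evaluation.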
Note
This increases the tuning/training time considerably, but it also keeps validation and test data out of training at every step, which reduces bias in the reported performance. If you want to integrate the system into your own code, you may want to implement strategies that make it faster for your dataset.
We thank the open-source community for their invaluable contributions to science and making knowledge free for all.
Warning
This package has mainly been developed for EIDF (https://edinburgh-international-data-facility.ed.ac.uk/) jobs, which run on a Kubernetes cluster. If you have access to another Kubernetes-based cluster, the same setup can be applied directly. In other cases, the training code can still be used directly, but there may be differences in how the training jobs are scheduled.
