evotools/EvaReg-training

EvaReg training pipeline

Python Version PyTorch PyTorch Lightning CUDA Ubuntu

This repository contains a distributed training system that can be used to train models on different kinds of data across multiple GPUs.

Note

Currently, this system is explicitly designed to run on genomic data.

Introduction to codebase

This codebase has been used to train EvaReg models. It is a distributed training platform for training models on genomic data on a single-machine, multi-GPU system.

Installation

This pipeline requires some packages to be installed before running. You can follow these steps:

  • Create a conda environment using the following command:
    conda create -n <your-env-name> python=3.11
    
  • Activate the environment
    conda activate <your-env-name>
    
  • Run the install.sh bash script
    bash install.sh
    

Modules

The entire codebase has three main sections of modules that work together to form the complete pipeline.

Data Utilities

data loader

This section acts as the dataloader for model training. In this implementation, the module processes text data stored in h5 format. The create_dataloader function creates the dataloaders for the training, validation and test steps.

Note

Since we employ DDP (Distributed Data Parallel) during training, the dataloader samples the data using DistributedSampler.
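As a rough illustration of how DistributedSampler shards data between DDP processes, the sketch below builds one loader per rank over a toy in-memory dataset. The toy tensors, the `make_loader` helper and the explicit `num_replicas`/`rank` arguments are assumptions made for illustration; the repository's create_dataloader reads h5 files and may be structured differently.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_loader(dataset, rank, world_size, batch_size=4, shuffle=False):
    """Hypothetical helper: build a DataLoader that only sees this rank's shard."""
    # Passing num_replicas and rank explicitly avoids needing an initialized
    # process group, which keeps this sketch runnable on a single CPU.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=shuffle
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


# Toy dataset: 10 samples whose single feature is the sample index.
toy = TensorDataset(torch.arange(10).unsqueeze(1))

# Simulate two ranks and record which sample indices each one receives.
shards = []
for rank in range(2):
    loader = make_loader(toy, rank=rank, world_size=2)
    seen = [int(x) for (batch,) in loader for x in batch.flatten()]
    shards.append(seen)
```

Each rank receives a disjoint half of the samples. When shuffling is enabled, calling `sampler.set_epoch(epoch)` at the start of every epoch makes the shuffle order change between epochs.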

genomeloader: parsing and encoding genomic sequences

The genomeloader section has been designed specifically to parse the h5 files and convert the bases of the sequences into their one-hot encoded form. The following one-hot encoding is used:

Base   A  C  G  T
A      1  0  0  0
C      0  1  0  0
G      0  0  1  0
T      0  0  0  1
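The mapping in the table can be sketched in a few lines of plain Python. This is only an illustration: `one_hot_encode` is a hypothetical name, and the handling of characters other than A, C, G and T (left as all zeros here) is an assumption, since the text does not say how genomeloader treats them.

```python
BASES = "ACGT"  # column order used in the table above


def one_hot_encode(sequence):
    """Return one [A, C, G, T] row per base of the input sequence."""
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1
        # Assumption: any other symbol (for example N) stays all zeros.
        rows.append(row)
    return rows


encoded = one_hot_encode("ACGTN")
```

A sequence of SEQUENCE_LEN bases therefore becomes a SEQUENCE_LEN x 4 matrix.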

Note

Based on how we have processed our dataset, the dataloader outputs only two values stored in the h5 files: "sequence" and "cDNA_counts". If you process your own data, the h5 files must contain these two attributes, named exactly as above.
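Before training on your own data, a quick pre-flight check along these lines might help. It is a sketch under assumptions: `missing_keys` is a hypothetical helper, and it assumes both names sit at the top level of each file. It accepts any container that supports the `in` operator; an open `h5py.File` does, so it could be passed in place of the dictionaries used here.

```python
REQUIRED_KEYS = ("sequence", "cDNA_counts")


def missing_keys(container, required=REQUIRED_KEYS):
    """Return the required names that are absent from the container."""
    return [key for key in required if key not in container]


# Plain dictionaries stand in for opened h5 files in this example.
good_file = {"sequence": ["ACGT"], "cDNA_counts": [12.0]}
bad_file = {"sequence": ["ACGT"], "counts": [12.0]}

problems = missing_keys(bad_file)
if problems:
    print(f"Missing entries: {problems}")
```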

Core Engine section

This contains the core training engine that will be used for training the model. The core engine has two main components: TrainEngine and DistributedSetup.

TrainEngine

This is the main training engine used to train, validate and test the model with different data configurations. Since we opt for distributed training, we have to aggregate the gradients across all devices and distribute the result (after performing the loss operations) back to the GPUs. To do so, we use the dist.all_reduce functionality. During the validation and testing phases, the engine calculates the correlation between the measured scores and the predicted scores.
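The two operations mentioned above can be sketched with PyTorch as follows. `average_across_ranks` and `pearson_r` are hypothetical helper names, not the TrainEngine API, and Pearson correlation is an assumption, since the text does not say which correlation the engine computes. The all_reduce helper falls back to returning its input when no process group is initialized, so the sketch also runs in a single process.

```python
import torch
import torch.distributed as dist


def average_across_ranks(value):
    """Average a tensor over all processes (no-op outside distributed runs)."""
    if not (dist.is_available() and dist.is_initialized()):
        return value
    reduced = value.clone()
    # Sum the tensor over every rank, then divide by the number of ranks.
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    return reduced / dist.get_world_size()


def pearson_r(predicted, measured):
    """Pearson correlation between two 1-D score tensors (assumed metric)."""
    p = predicted.double().flatten()
    m = measured.double().flatten()
    p_centered = p - p.mean()
    m_centered = m - m.mean()
    # Result is NaN if either input is constant (zero variance).
    return torch.dot(p_centered, m_centered) / (p_centered.norm() * m_centered.norm())


loss = average_across_ranks(torch.tensor(0.25))
r = pearson_r(torch.tensor([1.0, 2.0, 3.0]), torch.tensor([2.0, 4.0, 6.0]))
```

With the NCCL backend the tensor passed to all_reduce has to live on the GPU of the calling process.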

Distributed Setup

This is the main class that starts the distributed training. It checks how many devices have been requested (via the config.py file) and, if they exist, initializes them and synchronizes all processes.
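The check, initialize and synchronize sequence might look roughly like the sketch below. It is not the DistributedSetup class itself: the function name, the NCCL default, and the rendezvous address and port are all assumptions chosen for a single-machine run.

```python
import os

import torch
import torch.distributed as dist


def setup_distributed(rank, nodes, gpus_per_node, backend="nccl"):
    """Hypothetical setup: validate the request, join the group, synchronize."""
    available = torch.cuda.device_count()
    if gpus_per_node > available:
        raise RuntimeError(
            f"Requested {gpus_per_node} GPUs per node but found {available}."
        )
    world_size = nodes * gpus_per_node
    # Rendezvous address; these defaults only make sense on a single machine.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    # Bind this process to its local GPU.
    torch.cuda.set_device(rank % gpus_per_node)
    dist.barrier()  # wait until every process has reached this point
    return world_size
```

Each training process would call this once with its own rank before building the model and dataloaders, and `dist.destroy_process_group()` would normally be called when training ends.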

Configuration

# ==============================
# DATA SPECIFICATIONS
# ==============================

SEQDATA="./seq_boaec_into_boaec"
TOTAL_CHROMOSOMES=29

# ==============================
# MODEL SPECIFICATIONS
# ==============================

SEQUENCE_LEN = 600
NUM_CLASSES = 1
FINETUNE_MODE = False
FINETUNE_MODEL_WEIGHTS = "cattle_model.pth"

# ==============================
# MODEL TRAINING SPECIFICATIONS
# ==============================

N_EPOCHS = 10
WARMUP_EPOCHS = 2
K_FOLD = 3
N_TRIALS = 10

# ==============================
# MODEL TESTING SPECIFICATIONS
# ==============================

TEST_CHROMOSOMES = [8, 13]


# ==============================
# GPU SPECIFICATIONS
# ==============================

NODES = 1
GPUS = 1

# ==============================
# OPTUNA SPECIFICATIONS
# ==============================
LOW_MIN_LR, HIGH_MIN_LR = 1e-8, 1e-4
LOW_MAX_LR, HIGH_MAX_LR = 1e-4, 1e-1
LOW_DECAY, HIGH_DECAY = 1e-8, 1e-2
LOW_BATCH, HIGH_BATCH = 256, 1024
BATCH_STEP = 32

For configuration, the pipeline uses the config.py file, shown above. This file has all the necessary settings for running the model:

  • SEQDATA: location of the dataset
  • TOTAL_CHROMOSOMES: total number of chromosomes/shards in the dataset
  • SEQUENCE_LEN: total length of the sequence used for model training
  • NUM_CLASSES: the total number of outputs from the model
  • FINETUNE_MODE: whether the pipeline is to be used for finetuning
  • FINETUNE_MODEL_WEIGHTS: model weights to load when finetuning
  • N_EPOCHS: total number of epochs for model training
  • WARMUP_EPOCHS: number of epochs the model should run before the schedulers kick in
  • K_FOLD: total folds of data ($k$-fold cross-validation)
  • N_TRIALS: total number of trials for Optuna hyperparameter search
  • TEST_CHROMOSOMES: chromosomes that are left out for testing
  • NODES: number of nodes in the system
  • GPUS: number of GPUs in each node
  • LOW_MIN_LR, HIGH_MIN_LR: range of minimum LR for Optuna
  • LOW_MAX_LR, HIGH_MAX_LR: range of maximum LR for Optuna
  • LOW_DECAY, HIGH_DECAY: range of decay for Optuna
  • LOW_BATCH, HIGH_BATCH: range of batch sizes for Optuna
  • BATCH_STEP: step size to be used for increasing batch size
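To make the batch-size settings concrete, the sketch below lists the values that a stepped search between LOW_BATCH and HIGH_BATCH could visit. It assumes that both ends of the range are included, which matches the semantics of Optuna's `trial.suggest_int(name, low, high, step=...)`; whether the pipeline samples batch sizes with that call is an assumption, and `candidate_batch_sizes` is a hypothetical helper.

```python
LOW_BATCH, HIGH_BATCH = 256, 1024
BATCH_STEP = 32


def candidate_batch_sizes(low, high, step):
    """List every batch size from low to high (inclusive) in increments of step."""
    if low > high or step <= 0:
        raise ValueError("expected low <= high and step > 0")
    return list(range(low, high + 1, step))


sizes = candidate_batch_sizes(LOW_BATCH, HIGH_BATCH, BATCH_STEP)
print(len(sizes), sizes[0], sizes[-1])
```

With the default values this gives 25 candidates, from 256 to 1024.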

Running the pipeline

To run the pipeline, configure the config.py file properly, and that's it! You can then run the pipeline using the following command:

python main.py

Note

Since no dedicated output folder is configured, the output from the model is saved in your current working directory.

Tuning principle

In our tuning pipeline, we have focused on making the process as unbiased as possible. One of the key things we have integrated is $n$-fold cross-validation (K_FOLD in config.py), which makes sure that the model doesn't rely too much on only one section of the data and generalises over all the data at its disposal.

The tuning process proceeds as follows:

  • From the very start of the process, the test dataset is separated out. This ensures that the test data stays out of the training data at every step of the hyperparameter search/tuning and training processes.
  • The rest of the data is split into $n$ folds.
  • For each fold that is left out for validation, the model is trained with one set of hyperparameters for a total of $p$ epochs, as set by the user.
  • The model is then evaluated on the validation dataset (the fold that was left out). Then the next fold is chosen as the validation fold and training proceeds as described above. At each stage, we collect the correlations.
  • After every fold has been part of the validation data exactly once, we take the mean of all the correlations. The mean correlation becomes the metric that tells the Optuna pipeline how well the hyperparameters did for that round of training.
  • The steps continue for $k$ trials (N_TRIALS in config.py).
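The steps above can be sketched for a single trial as follows. Several details are assumptions made for illustration: folds are formed over chromosomes numbered from 1 to TOTAL_CHROMOSOMES, they are dealt out round-robin, and `dummy_train_and_score` is a placeholder for training the model and returning its validation correlation. The actual pipeline may split the data differently.

```python
from statistics import mean

TOTAL_CHROMOSOMES = 29
TEST_CHROMOSOMES = [8, 13]
K_FOLD = 3


def make_folds(items, n_folds):
    """Deal the items round-robin into n_folds lists."""
    return [items[i::n_folds] for i in range(n_folds)]


def cross_validate(train_items, n_folds, train_and_score):
    """Use each fold as validation exactly once and return the mean score."""
    folds = make_folds(train_items, n_folds)
    scores = []
    for i, val_fold in enumerate(folds):
        train_fold = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train_fold, val_fold))
    return mean(scores)


def dummy_train_and_score(train_fold, val_fold):
    # Placeholder: a real version would train for N_EPOCHS on train_fold
    # and return the correlation measured on val_fold.
    return 0.5


# Step 1: hold out the test chromosomes before anything else happens.
remaining = [
    c for c in range(1, TOTAL_CHROMOSOMES + 1) if c not in TEST_CHROMOSOMES
]

# Remaining steps: the mean validation score is what one trial reports.
trial_score = cross_validate(remaining, K_FOLD, dummy_train_and_score)
```

In a hyperparameter search, a value like `trial_score` would be returned from the objective function of each trial, so that trials are compared on their mean validation correlation rather than on a single split.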

Note

This increases the tuning/training time a lot, but it also helps keep the model's training and evaluation as unbiased as possible. If you want to integrate the system into your own code, you may want to implement strategies that make it faster for your dataset.

Acknowledgements

We thank the open-source community for their invaluable contributions to science and making knowledge free for all.


Warning

This package has been developed mainly for EIDF (https://edinburgh-international-data-facility.ed.ac.uk/) jobs, which run on a Kubernetes cluster. If you have access to another Kubernetes-based cluster, the same setup can be applied directly. In other cases, the training code can still be used directly; however, there might be differences in how the training jobs are scheduled.
