EvaReg training pipeline

This repository contains a distributed training system for training models on different kinds of data across multiple GPUs.

Note

Currently, this system is designed specifically to run on genomic data.

Introduction to codebase

This codebase has been used to train EvaReg models. It is a distributed training platform for training models on genomic data on a single-machine, multi-GPU system.

Installation

This pipeline requires some packages to be installed before running. Follow these steps:

Create a conda environment using the following command:
```
conda create -n <your-env-name> python=3.11
```
Activate the environment:
```
conda activate <your-env-name>
```
Run the install.sh bash script:
```
bash install.sh
```

Modules

The entire codebase has three main sections of modules that work together to form the complete pipeline.

Data Utilities

data loader

This section acts as the dataloader for model training. In this implementation, the module processes text data stored in h5 format. The create_dataloader function can be used to create dataloaders for the training, validation and test steps.

Note

Since we employ DDP (Distributed Data Parallel) during training, the dataloader samples the data using DistributedSampler.
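To make the effect of DistributedSampler concrete, the sketch below mimics in plain Python how it partitions sample indices across ranks when shuffling is off and no samples are dropped. The function name `shard_indices` is hypothetical and not part of this repository; the real pipeline relies on `torch.utils.data.DistributedSampler` itself.

```python
import math


def shard_indices(dataset_len: int, world_size: int, rank: int) -> list:
    """Return the sample indices one rank would see in an epoch (no shuffling)."""
    # Every rank must receive the same number of samples, so round up.
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size

    indices = list(range(dataset_len))
    # Pad by repeating indices from the start until the list divides evenly.
    padding = total - dataset_len
    indices += [indices[i % dataset_len] for i in range(padding)]

    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank:total:world_size]


for rank in range(4):
    print(rank, shard_indices(10, world_size=4, rank=rank))
```

With 10 samples and 4 processes, every rank receives 3 indices, and two samples are repeated so that all ranks run the same number of steps.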

`genomeloader`: parsing and encoding genomic sequences

The genomeloader section is designed specifically to parse the h5 files and convert the bases of the sequences into their one-hot encoded form. The following one-hot encoding is used:

| A | C | G | T |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
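As an illustration of the encoding in the table above, here is a minimal pure-Python sketch. The name `one_hot_encode`, the upper-casing of the input and the all-zero row for unknown bases such as N are assumptions made for this example; the actual genomeloader may handle these cases differently.

```python
# Column order follows the table above: A, C, G, T.
BASES = "ACGT"


def one_hot_encode(sequence: str) -> list:
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = BASES.find(base)
        if index >= 0:
            row[index] = 1
        # Assumption: anything outside A/C/G/T (for example N) stays all-zero.
        encoded.append(row)
    return encoded


print(one_hot_encode("ACGT"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```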

Note

Based on how we processed our dataset, the dataloader outputs only two values stored in the h5 files: "sequence" and "cDNA_counts". If you process your own data, the h5 files must contain these two attributes explicitly.

Core Engine section

This contains the core training engine that will be used for training the model. The core engine has two main components: TrainEngine and DistributedSetup.

TrainEngine

This is the main training engine, used to train, validate and test the model with different data configurations. Because training is distributed, the gradients have to be aggregated across all devices and the result (after performing the loss operations) distributed back to the GPUs. This is done with dist.all_reduce. During the validation and testing phases, the engine calculates the correlation between the measured scores and the predicted scores.
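The text does not say which correlation measure the engine computes. Assuming it is the Pearson correlation, the validation metric could be sketched as follows; `pearson_correlation` is a hypothetical helper written for this example, not a function from the repository.

```python
import math


def pearson_correlation(measured, predicted):
    """Pearson correlation between two equally long sequences of numbers."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n

    # Covariance and variances, left unnormalised because the factors cancel.
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)

    if var_m == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_m * var_p)


print(pearson_correlation([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```

A value close to 1 means the predicted scores track the measured scores closely; a value near 0 means there is no linear relationship.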

Distributed Setup

This is the main class that starts distributed training. It checks how many devices have been requested (via the config.py file) and, if they exist, initializes them and synchronizes all processes.

Configuration

```
# ==============================
# DATA SPECIFICATIONS
# ==============================

SEQDATA="./seq_boaec_into_boaec"
TOTAL_CHROMOSOMES=29

# ==============================
# MODEL SPECIFICATIONS
# ==============================

SEQUENCE_LEN = 600
NUM_CLASSES = 1
FINETUNE_MODE = False
FINETUNE_MODEL_WEIGHTS = "cattle_model.pth"

# ==============================
# MODEL TRAINING SPECIFICATIONS
# ==============================

N_EPOCHS = 10
WARMUP_EPOCHS = 2
K_FOLD = 3
N_TRIALS = 10

# ==============================
# MODEL TRAINING SPECIFICATIONS
# ==============================

TEST_CHROMOSOMES = [8, 13]


# ==============================
# GPU SPECIFICATIONS
# ==============================

NODES = 1
GPUS = 1

# ==============================
# OPTUNA SPECIFICATIONS
# ==============================
LOW_MIN_LR, HIGH_MIN_LR = 1e-8, 1e-4
LOW_MAX_LR, HIGH_MAX_LR = 1e-4, 1e-1
LOW_DECAY, HIGH_DECAY = 1e-8, 1e-2
LOW_BATCH, HIGH_BATCH = 256, 1024
BATCH_STEP = 32
```

The pipeline reads its configuration from the config.py file. This file has all the information needed to run the model:

SEQDATA: location of the dataset
TOTAL_CHROMOSOMES: total number of chromosomes/shards in the dataset
SEQUENCE_LEN: length of the sequences used for model training
NUM_CLASSES: total number of outputs from the model
FINETUNE_MODE: whether the pipeline is used for finetuning
FINETUNE_MODEL_WEIGHTS: weights to be used for finetuning
N_EPOCHS: total number of epochs for model training
WARMUP_EPOCHS: number of epochs the model runs before the schedulers kick in
K_FOLD: number of folds for $k$-fold cross-validation
N_TRIALS: total number of trials for the Optuna hyperparameter search
TEST_CHROMOSOMES: chromosomes left out for testing
NODES: number of nodes in the system
GPUS: number of GPUs per node
LOW_MIN_LR, HIGH_MIN_LR: range of minimum LR for Optuna
LOW_MAX_LR, HIGH_MAX_LR: range of maximum LR for Optuna
LOW_DECAY, HIGH_DECAY: range of decay for Optuna
LOW_BATCH, HIGH_BATCH: range of batch sizes for Optuna
BATCH_STEP: step size to be used for increasing batch size
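The Optuna ranges above could be turned into a search space roughly as sketched below. The function `suggest_hyperparameters`, the parameter names (`min_lr`, `max_lr`, `decay`, `batch_size`) and the choice of log-scale sampling are assumptions for illustration. `suggest_float` and `suggest_int` are the standard Optuna trial methods, and a small stand-in trial is included so the sketch runs without Optuna installed.

```python
from types import SimpleNamespace

# Search ranges copied from the config.py listing.
CONFIG = SimpleNamespace(
    LOW_MIN_LR=1e-8, HIGH_MIN_LR=1e-4,
    LOW_MAX_LR=1e-4, HIGH_MAX_LR=1e-1,
    LOW_DECAY=1e-8, HIGH_DECAY=1e-2,
    LOW_BATCH=256, HIGH_BATCH=1024, BATCH_STEP=32,
)


def suggest_hyperparameters(trial, cfg):
    """Draw one hyperparameter set; `trial` behaves like an Optuna trial."""
    return {
        # Assumption: rates and decay span orders of magnitude, so use log scale.
        "min_lr": trial.suggest_float("min_lr", cfg.LOW_MIN_LR, cfg.HIGH_MIN_LR, log=True),
        "max_lr": trial.suggest_float("max_lr", cfg.LOW_MAX_LR, cfg.HIGH_MAX_LR, log=True),
        "decay": trial.suggest_float("decay", cfg.LOW_DECAY, cfg.HIGH_DECAY, log=True),
        # Batch sizes move from LOW_BATCH to HIGH_BATCH in BATCH_STEP increments.
        "batch_size": trial.suggest_int(
            "batch_size", cfg.LOW_BATCH, cfg.HIGH_BATCH, step=cfg.BATCH_STEP
        ),
    }


class LowestValueTrial:
    """Stand-in for an Optuna trial that always returns the lower bound."""

    def suggest_float(self, name, low, high, *, step=None, log=False):
        return low

    def suggest_int(self, name, low, high, *, step=1, log=False):
        return low


print(suggest_hyperparameters(LowestValueTrial(), CONFIG))
```

With Optuna installed, the same function could be called from the objective passed to `study.optimize(objective, n_trials=N_TRIALS)`, with the trial object supplied by Optuna instead of the stand-in.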

Running the pipeline

To run the pipeline, configure the config.py file properly, and that's it! You can then run the pipeline using the following command:

```
python main.py
```

Note

Since no dedicated output folder is configured, the output from the model is saved in your current working directory.
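Because a run depends entirely on config.py, it can help to sanity-check the values before launching. The sketch below is not part of the repository: `check_config` and the rules it enforces are assumptions about what a consistent configuration looks like, and the example values are the ones from the configuration listing.

```python
from types import SimpleNamespace


def check_config(cfg):
    """Return a list of problems found in the configuration (empty if none)."""
    problems = []
    if cfg.K_FOLD < 2:
        problems.append("K_FOLD should be at least 2 for cross-validation")
    if cfg.WARMUP_EPOCHS > cfg.N_EPOCHS:
        problems.append("WARMUP_EPOCHS is larger than N_EPOCHS")
    if len(set(cfg.TEST_CHROMOSOMES)) >= cfg.TOTAL_CHROMOSOMES:
        problems.append("TEST_CHROMOSOMES leaves no chromosomes for training")

    # Every Optuna range needs a lower bound below its upper bound.
    ranges = {
        "MIN_LR": (cfg.LOW_MIN_LR, cfg.HIGH_MIN_LR),
        "MAX_LR": (cfg.LOW_MAX_LR, cfg.HIGH_MAX_LR),
        "DECAY": (cfg.LOW_DECAY, cfg.HIGH_DECAY),
        "BATCH": (cfg.LOW_BATCH, cfg.HIGH_BATCH),
    }
    for name, (low, high) in ranges.items():
        if not low < high:
            problems.append(f"LOW_{name} must be smaller than HIGH_{name}")

    if (cfg.HIGH_BATCH - cfg.LOW_BATCH) % cfg.BATCH_STEP != 0:
        problems.append("batch range is not a multiple of BATCH_STEP")
    return problems


example = SimpleNamespace(
    TOTAL_CHROMOSOMES=29, TEST_CHROMOSOMES=[8, 13],
    N_EPOCHS=10, WARMUP_EPOCHS=2, K_FOLD=3,
    LOW_MIN_LR=1e-8, HIGH_MIN_LR=1e-4,
    LOW_MAX_LR=1e-4, HIGH_MAX_LR=1e-1,
    LOW_DECAY=1e-8, HIGH_DECAY=1e-2,
    LOW_BATCH=256, HIGH_BATCH=1024, BATCH_STEP=32,
)
print(check_config(example))  # []
```

Since config.py is a plain Python module, the same check could presumably be run on it directly with `import config` followed by `check_config(config)`.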

Tuning principle

In our tuning pipeline, we have focused on making it as unbiased as possible. One of the key things we have integrated into the system is $n$-fold cross-validation, which ensures that the model doesn't rely too much on one section of the data and generalises over all the data at its disposal.

The tuning process proceeds as follows:

From the very start of the process, the test dataset is separated out. This ensures that the test data stays out of the training data at every step of the hyperparameter search/tuning and training processes.
The rest of the data is split into $n$ folds (K_FOLD in config.py).
For each fold that is left out for validation, the model is trained with one set of hyperparameters for a total of $p$ epochs, as set by the user (N_EPOCHS).
The model is then evaluated on the validation dataset (the fold that was left out). The next fold is then chosen as the validation fold and training proceeds as described above. At each stage, we collect the correlations.
After every fold has been part of the validation data exactly once, we take the mean of all the correlations. This mean correlation is the metric that tells the Optuna pipeline how well the hyperparameters did for that round of training.
The steps are repeated for $k$ trials (N_TRIALS).
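The procedure for a single trial can be sketched as below. This is a simplified illustration under stated assumptions: folds are formed per chromosome by round-robin assignment, chromosomes are numbered 1 to 29, and `train_and_validate` is a hypothetical callback that trains on one set of chromosomes and returns the validation correlation. The actual pipeline may split the data differently.

```python
from statistics import mean


def split_into_folds(items, n_folds):
    """Deal items round-robin into n_folds lists."""
    return [items[i::n_folds] for i in range(n_folds)]


def cross_validated_score(chromosomes, test_chromosomes, n_folds, train_and_validate):
    """Mean validation correlation over all folds for one hyperparameter set."""
    # Separate the test chromosomes first so they never reach training.
    held_out = set(test_chromosomes)
    trainable = [c for c in chromosomes if c not in held_out]

    folds = split_into_folds(trainable, n_folds)
    correlations = []
    for i, validation_fold in enumerate(folds):
        # Train on every fold except the one left out for validation.
        training = [c for j, fold in enumerate(folds) if j != i for c in fold]
        correlations.append(train_and_validate(training, validation_fold))

    # Each fold has been the validation fold exactly once; report the mean.
    return mean(correlations)


def dummy_train_and_validate(training, validation):
    """Placeholder that returns a fixed correlation instead of training."""
    return 0.5


score = cross_validated_score(list(range(1, 30)), [8, 13], 3, dummy_train_and_validate)
print(score)  # 0.5
```

The returned mean is the value an Optuna objective would report for the trial, so a study that maximises it favours hyperparameters that do well across all folds.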

Note

This increases the tuning/training time considerably, but it also reduces the risk that the chosen hyperparameters are biased towards one part of the data. If you want to integrate the system into your own code, you may want to implement some strategies that make it faster for your dataset.

Acknowledgements

We thank the open-source community for their invaluable contributions to science and making knowledge free for all.

Warning

This package has mainly been developed for EIDF (https://edinburgh-international-data-facility.ed.ac.uk/) jobs, which run on a Kubernetes cluster. If you have access to another Kubernetes-based cluster, the same setup can be applied directly. In other cases, the training code can still be used directly, but there may be differences in how the training jobs are scheduled.

