BioAutoML-FAST

BioAutoML-FAST: Empowering Breakthroughs in Life Sciences with End-to-End Machine Learning

Awards

⭐ 2025 Google PhD Fellowship in Health Research, awarded to support outstanding and innovative research in computer science and related fields, with total funding of USD 30,000 over two years.

⭐ ISME Scholar Mobility Fund, awarded with funding of €2,300 for a research period in July 2026 at the Helmholtz Centre for Environmental Research (UFZ) in Leipzig, Germany.

Abstract

The prediction of biological sequence properties has traditionally relied on alignment-based methods that assume evolutionary homology and depend on curated reference databases. This, in turn, limits scalability and sensitivity for large or heterogeneous datasets, remote homologs, short sequences, and rapidly evolving genomic regions. Although Machine-Learning (ML) approaches offer alignment-free alternatives, their broader adoption is limited by: (i) the lack of standardized, externally validated benchmark models across diverse datasets, and (ii) the technical expertise required for feature engineering, model selection, and evaluation. Automated machine learning (AutoML) alleviates these challenges by systematically optimizing representations and models with minimal user intervention. However, most existing frameworks prioritize task-specific model construction and lack mechanisms for preserving trained models as persistent, comparable benchmarks. We introduce BioAutoML-FAST, an end-to-end web platform for automated ML analysis of nucleotide and amino acid sequences. It supports both classification and regression tasks and automates feature extraction, model training, and evaluation without requiring prior user expertise. Uniquely, it serves as a community benchmarking resource, hosting a continuously expanding repository of reusable, standardized models (currently 60) for genomic, transcriptomic, and proteomic applications. Extensive validation on independent datasets demonstrates performance comparable to or exceeding that of state-of-the-art methods, including protein language models such as ESM-2. BioAutoML-FAST is available at https://bioautoml.icmc.usp.br/. This website is free and open to all users, and there is no login requirement.

Overview

Authors

Publication

Silva de Almeida, B. L., Bonidia, R., Bole, M., Avila-Santos, A., Stadler, P. F., Nunes da Rocha, U., & de Carvalho, A. C. P. L. F. (2026). BioAutoML-FAST: an automated machine-learning platform for reusable and benchmarked biological sequence models. bioRxiv, 2026-04. DOI

Installing dependencies and package

To use BioAutoML-FAST locally, clone the repository and initialize its submodules:

git clone https://github.com/Bonidia/BioAutoML-FAST.git BioAutoML-FAST

cd BioAutoML-FAST

git submodule init

git submodule update

uv (Linux/Mac/Windows)

1 - Install uv

If using Linux or Mac:

curl -LsSf https://astral.sh/uv/install.sh | sh

If using Windows, use irm to download the script and execute it with iex:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

2 - Prepare the virtual environment

With uv installed, run the following command inside the repository folder to synchronize the virtual environment with the required dependencies:

uv sync

3 - Activate environment

After preparing the environment, you can activate the environment on Linux or Mac with:

source .venv/bin/activate

Using Windows:

.venv\Scripts\activate

4 - Deactivate environment

You can deactivate the environment using:

deactivate
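To confirm that the environment is active before running the scripts, a small check like the one below can help. It is only a sketch: `in_virtualenv` is a hypothetical helper name, and the check relies on the standard behaviour that `sys.prefix` differs from `sys.base_prefix` inside a virtual environment.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If the first line prints `False`, the commands in the following sections would run against the system interpreter rather than the environment created by `uv sync`.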

How to use

BioAutoML-FAST consists of two main scripts, engineering.py and generation.py. engineering.py is the first step and selects the optimal descriptors. generation.py is the second step and performs hyperparameter optimization.

Modules

engineering.py

The engineering.py script performs the first step of BioAutoML-FAST. It extracts sequence descriptors from the input FASTA files, performs automated feature engineering/descriptor selection, and then automatically calls generation.py for model generation and hyperparameter optimization.

| Option | Description | Default |
| --- | --- | --- |
| -fasta_train, --fasta_train | One or more training FASTA files. | Required |
| -fasta_label_train, --fasta_label_train | Labels associated with each training FASTA file. The order must match -fasta_train. | Required |
| -fasta_test, --fasta_test | One or more testing FASTA files. | Optional |
| -fasta_label_test, --fasta_label_test | Labels associated with each testing FASTA file. The order must match -fasta_test. | Optional |
| -dtype, --dtype | Type of input data. Supported values: DNA/RNA or Protein. | DNA/RNA |
| -task, --task | Machine learning task. Use 0 for classification and 1 for regression. | 0 |
| -estimations, --estimations | Number of estimations used during automated feature engineering. | 200 |
| -patience, --patience | Number of trials without improvement before early stopping. | 80 |
| -tuning, --tuning | Number of trials used for hyperparameter optimization in generation.py. | 150 |
| -difference, --difference | Minimum improvement required before early stopping. | 0.001 |
| -n_cpu, --n_cpu | Number of CPU cores to use. Use -1 to use all available cores. | -1 |
| -output, --output | Output directory where results will be saved. | Required |
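The -patience and -difference options interact: a trial only counts as an improvement if it beats the best score by more than the minimum difference, and the search stops once enough trials in a row fail to do so. The sketch below illustrates that general idea under stated assumptions. It is not the BioAutoML-FAST implementation: `early_stopping_trial` is a hypothetical function, and it assumes that higher scores are better.

```python
from typing import Iterable


def early_stopping_trial(
    scores: Iterable[float],
    patience: int = 80,
    difference: float = 0.001,
) -> int:
    """Return the index of the trial at which the search would stop.

    If the stopping rule never triggers, the index of the last trial is
    returned (-1 for an empty sequence).
    """
    best = float("-inf")
    stale = 0  # consecutive trials without a sufficient improvement
    last = -1
    for i, score in enumerate(scores):
        last = i
        if score > best + difference:
            # Improvement larger than the minimum difference: reset the counter.
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return last


if __name__ == "__main__":
    history = [0.50, 0.60, 0.6001, 0.6002, 0.6003, 0.90]
    print(early_stopping_trial(history, patience=3, difference=0.001))
```

In this toy history the gains after the second trial are all below 0.001, so with a patience of 3 the search stops before it ever reaches the final, much better score. That is the trade-off a small patience value makes.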

Example: DNA/RNA (nucleotide) classification

python engineering.py \
  -fasta_train train/ncRNA.fasta train/lncRNA.fasta train/circRNA.fasta \
  -fasta_label_train ncRNA lncRNA circRNA \
  -fasta_test test/ncRNA.fasta test/lncRNA.fasta test/circRNA.fasta \
  -fasta_label_test ncRNA lncRNA circRNA \
  -dtype DNA/RNA \
  -task 0 \
  -output results

Example: Protein (amino acid) regression

python engineering.py \
  -fasta_train train/enzyme.fasta \
  -fasta_label_train enzyme \
  -fasta_test test/enzyme.fasta \
  -fasta_label_test enzyme \
  -dtype Protein \
  -task 1 \
  -output results
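Because the order of labels must match the order of FASTA files, it can be convenient to assemble the command programmatically and check the pairing first. The following is a minimal sketch: `build_engineering_cmd` is a hypothetical helper, it uses only the flags listed in the table above, and it assumes that a test set is always passed together with its labels.

```python
import shlex
import sys
from typing import List, Optional, Sequence


def build_engineering_cmd(
    fasta_train: Sequence[str],
    labels_train: Sequence[str],
    output: str,
    fasta_test: Optional[Sequence[str]] = None,
    labels_test: Optional[Sequence[str]] = None,
    dtype: str = "DNA/RNA",
    task: int = 0,
) -> List[str]:
    """Assemble the argument list for engineering.py (hypothetical helper)."""
    if len(fasta_train) != len(labels_train):
        raise ValueError("each training FASTA file needs exactly one label")
    cmd = [sys.executable, "engineering.py",
           "-fasta_train", *fasta_train,
           "-fasta_label_train", *labels_train]
    if fasta_test:
        # Assumption: test files are always supplied together with labels.
        if labels_test is None or len(fasta_test) != len(labels_test):
            raise ValueError("each testing FASTA file needs exactly one label")
        cmd += ["-fasta_test", *fasta_test, "-fasta_label_test", *labels_test]
    cmd += ["-dtype", dtype, "-task", str(task), "-output", output]
    return cmd


if __name__ == "__main__":
    cmd = build_engineering_cmd(
        ["train/ncRNA.fasta", "train/lncRNA.fasta"],
        ["ncRNA", "lncRNA"],
        output="results",
    )
    # Printed rather than executed; pass cmd to subprocess.run(cmd, check=True)
    # from the repository folder to actually start the run.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting problems with file names that contain spaces.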

generation.py

The generation.py script performs the second step of BioAutoML-FAST. It trains and optimizes machine learning models using the descriptors generated during the feature engineering step. The module supports both classification and regression tasks, including hyperparameter optimization and external test evaluation.

| Option | Description | Default |
| --- | --- | --- |
| -path_model, --path_model | Path to a previously trained model to be reused for prediction or evaluation. | '' |
| -task, --task | Machine learning task. Use 0 for classification and 1 for regression. | 0 |
| -tuning, --tuning | Number of hyperparameter optimization trials. | 150 |
| -train, --train | Training feature matrix in CSV format. | Required |
| -train_label, --train_label | Training labels in CSV format. | Required |
| -train_nameseq, --train_nameseq | CSV file containing sequence names/identifiers for the training set. | Required |
| -test, --test | Test feature matrix in CSV format. | Optional |
| -test_label, --test_label | Test labels in CSV format. | Optional |
| -test_nameseq, --test_nameseq | CSV file containing sequence names/identifiers for the test set. | Optional |
| -n_cpu, --n_cpu | Number of CPU cores to use. Use -1 to use all available cores. | -1 |
| -output, --output | Output directory where models and results will be saved. | Required |

Note: This script can also be run directly on structured data, without running the first step.
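A direct call to generation.py could be prepared along the following lines. This is a sketch under assumptions: `build_generation_cmd` and `missing_files` are hypothetical helpers, the file names in the usage example are placeholders, and the layout of the CSV files themselves is not specified here, so it is left untouched.

```python
import sys
from pathlib import Path
from typing import Iterable, List, Optional


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


def build_generation_cmd(
    train: str,
    train_label: str,
    train_nameseq: str,
    output: str,
    test: Optional[str] = None,
    test_label: Optional[str] = None,
    test_nameseq: Optional[str] = None,
    task: int = 0,
    tuning: int = 150,
    n_cpu: int = -1,
) -> List[str]:
    """Assemble the argument list for generation.py (hypothetical helper)."""
    cmd = [sys.executable, "generation.py",
           "-train", train,
           "-train_label", train_label,
           "-train_nameseq", train_nameseq]
    # The test options are optional, so only the ones provided are added.
    optional = (("-test", test), ("-test_label", test_label),
                ("-test_nameseq", test_nameseq))
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    cmd += ["-task", str(task), "-tuning", str(tuning),
            "-n_cpu", str(n_cpu), "-output", output]
    return cmd


if __name__ == "__main__":
    inputs = ["train.csv", "train_labels.csv", "train_names.csv"]  # placeholders
    print("missing input files:", missing_files(inputs))
    print(build_generation_cmd(*inputs, output="results", task=1))
```

Checking for missing files up front gives a clearer error than a failure partway through a long optimization run.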

Citation

If you use this code in a scientific publication, we would appreciate citations to the following paper:

Silva de Almeida, B. L., Bonidia, R., Bole, M., Avila-Santos, A., Stadler, P. F., Nunes da Rocha, U., & de Carvalho, A. C. P. L. F. (2026). BioAutoML-FAST: an automated machine-learning platform for reusable and benchmarked biological sequence models. bioRxiv, 2026-04. DOI

@article{silva2026bioautoml,
  title={BioAutoML-FAST: an automated machine-learning platform for reusable and benchmarked biological sequence models},
  author={Silva de Almeida, Breno Livio and Bonidia, Robson and Bole, Martin and Avila-Santos, Anderson and Stadler, Peter F and Nunes da Rocha, Ulisses and de Carvalho, Andre CP L F},
  journal={bioRxiv},
  pages={2026--04},
  year={2026},
  publisher={Cold Spring Harbor Laboratory}
}
