BioAutoML-FAST: Empowering Breakthroughs in Life Sciences with End-to-End Machine Learning

Awards

⭐ 2025 Google PhD Fellowship in Health Research, awarded to support outstanding and innovative research in computer science and related fields, with total funding of USD 30,000 over two years.

⭐ ISME Scholar Mobility Fund, awarded with funding of €2,300 for a research period in July 2026 at the Helmholtz Centre for Environmental Research (UFZ) in Leipzig, Germany.

Abstract

The prediction of biological sequence properties has traditionally relied on alignment-based methods that assume evolutionary homology and depend on curated reference databases. This, in turn, limits scalability and sensitivity for large or heterogeneous datasets, remote homologs, short sequences, and rapidly evolving genomic regions. Although machine learning (ML) approaches offer alignment-free alternatives, their broader adoption is limited by: (i) the lack of standardized, externally validated benchmark models across diverse datasets, and (ii) the technical expertise required for feature engineering, model selection, and evaluation. Automated machine learning (AutoML) alleviates these challenges by systematically optimizing representations and models with minimal user intervention. However, most existing frameworks prioritize task-specific model construction and lack mechanisms for preserving trained models as persistent, comparable benchmarks. We introduce BioAutoML-FAST, an end-to-end web platform for automated ML analysis of nucleotide and amino acid sequences. It supports both classification and regression tasks and automates feature extraction, model training, and evaluation without requiring prior user expertise. Uniquely, it serves as a community benchmarking resource, hosting a continuously expanding repository of reusable, standardized models (currently 60) for genomic, transcriptomic, and proteomic applications. Extensive validation on independent datasets demonstrates performance comparable to or exceeding that of state-of-the-art methods, including protein language models such as ESM-2. BioAutoML-FAST is available at https://bioautoml.icmc.usp.br/. This website is free and open to all users, and there is no login requirement.

Authors

Breno L. S. de Almeida, Robson P. Bonidia, Martin Bole, Anderson P. Avila-Santos, Peter F. Stadler, Ulisses Rocha, André C. P. L. F. de Carvalho
Correspondence: brenoslivio@usp.br, bonidia@utfpr.edu.br or ulisses.rocha@ufz.de

Publication

Silva de Almeida, B. L., Bonidia, R., Bole, M., Avila-Santos, A., Stadler, P. F., Nunes da Rocha, U., & de Carvalho, A. C. L. F. (2026). BioAutoML-FAST: an automated machine-learning platform for reusable and benchmarked biological sequence models. bioRxiv, 2026-04. DOI

Installing dependencies and package

To use BioAutoML-FAST locally, clone the repository and initialize its submodules:

git clone https://github.com/Bonidia/BioAutoML-FAST.git BioAutoML-FAST

cd BioAutoML-FAST

git submodule init

git submodule update

uv (Linux/Mac/Windows)

1 - Install uv

If using Linux or Mac:

curl -LsSf https://astral.sh/uv/install.sh | sh

If using Windows, use irm to download the script and execute it with iex:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

2 - Prepare the virtual environment

With uv installed, run the following command inside the repository folder to synchronize the virtual environment with the necessary dependencies:

uv sync

3 - Activate environment

After preparing the environment, activate it on Linux or Mac with:

source .venv/bin/activate

On Windows:

.venv\Scripts\activate

4 - Deactivate environment

You can deactivate the environment using:

deactivate

How to use

BioAutoML-FAST consists of two main scripts, engineering.py and generation.py. engineering.py is the first step, which selects the optimal descriptors, and generation.py is the second step, which performs hyperparameter optimization.

`engineering.py`

The engineering.py script performs the first step of BioAutoML-FAST. It extracts sequence descriptors from the input FASTA files, performs automated feature engineering/descriptor selection, and then automatically calls generation.py for model generation and hyperparameter optimization.
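Before launching a run, it can help to sanity-check the input FASTA files. The sketch below is not part of BioAutoML-FAST: `read_fasta` and `guess_dtype` are hypothetical helpers that parse FASTA lines and guess whether the sequences look like nucleotides or amino acids, so the value passed to `-dtype` can be double-checked. The nucleotide alphabet used here (A, C, G, T, U, N) is an assumption and ignores the other IUPAC ambiguity codes.

```python
from typing import Dict, Iterable

# Assumed nucleotide alphabet; other IUPAC ambiguity codes are deliberately ignored.
NUCLEOTIDES = set("ACGTUN")


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {header: sequence}."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequences may span several lines; join them in upper case.
            records[header] += line.upper()
    return records


def guess_dtype(records: Dict[str, str]) -> str:
    """Return 'DNA/RNA' if every residue is in NUCLEOTIDES, else 'Protein'."""
    residues = set("".join(records.values()))
    return "DNA/RNA" if residues <= NUCLEOTIDES else "Protein"


example = [">seq1", "ACGTAC", "GGTA", ">seq2", "acgu"]
records = read_fasta(example)
print(len(records), guess_dtype(records))  # prints: 2 DNA/RNA
```

The guess is only a heuristic: a short peptide made up solely of A, C, G, and T would be reported as `DNA/RNA`, so treat the result as a prompt to check the file rather than as a verdict.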

| Option | Description | Default |
| --- | --- | --- |
| `-fasta_train`, `--fasta_train` | One or more training FASTA files. | Required |
| `-fasta_label_train`, `--fasta_label_train` | Labels associated with each training FASTA file. The order must match `-fasta_train`. | Required |
| `-fasta_test`, `--fasta_test` | One or more testing FASTA files. | Optional |
| `-fasta_label_test`, `--fasta_label_test` | Labels associated with each testing FASTA file. The order must match `-fasta_test`. | Optional |
| `-dtype`, `--dtype` | Type of input data. Supported values: `DNA/RNA` or `Protein`. | `DNA/RNA` |
| `-task`, `--task` | Machine learning task. Use `0` for classification and `1` for regression. | `0` |
| `-estimations`, `--estimations` | Number of estimations used during automated feature engineering. | `200` |
| `-patience`, `--patience` | Number of trials without improvement before early stopping. | `80` |
| `-tuning`, `--tuning` | Number of trials used for hyperparameter optimization in `generation.py`. | `150` |
| `-difference`, `--difference` | Minimum improvement required before early stopping. | `0.001` |
| `-n_cpu`, `--n_cpu` | Number of CPU cores to use. Use `-1` to use all available cores. | `-1` |
| `-output`, `--output` | Output directory where results will be saved. | Required |
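The `-patience` and `-difference` options can be read together as a patience-based early-stopping rule. The sketch below is a generic illustration of such a rule, not the code used in `engineering.py`. The function name `trials_before_stop` is hypothetical, and it assumes that higher scores are better and that a trial counts as an improvement only when it beats the best score so far by more than `difference`.

```python
from typing import Sequence


def trials_before_stop(scores: Sequence[float], patience: int = 80,
                       difference: float = 0.001) -> int:
    """Return how many trials run before a patience-based rule stops.

    Assumption: higher scores are better, and an improvement must exceed
    the best score so far by more than `difference` to reset the counter.
    """
    best = float("-inf")
    stale = 0  # consecutive trials without a sufficient improvement
    for done, score in enumerate(scores, start=1):
        if score > best + difference:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return done  # stop early after this trial
    return len(scores)  # the rule never triggered


scores = [0.70, 0.75, 0.7505, 0.7506, 0.7507, 0.80]
print(trials_before_stop(scores, patience=3, difference=0.001))  # prints: 5
```

In this toy run the three scores after 0.75 each gain less than 0.001, so the search stops after the fifth trial and never reaches the 0.80 result. This is the trade-off a small `-patience` value makes.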

Example: DNA/RNA (nucleotide) classification

python engineering.py \
  -fasta_train train/ncRNA.fasta train/lncRNA.fasta train/circRNA.fasta \
  -fasta_label_train ncRNA lncRNA circRNA \
  -fasta_test test/ncRNA.fasta test/lncRNA.fasta test/circRNA.fasta \
  -fasta_label_test ncRNA lncRNA circRNA \
  -dtype DNA/RNA \
  -task 0 \
  -output results

Example: Protein (amino acid) regression

python engineering.py \
  -fasta_train train/enzyme.fasta \
  -fasta_label_train enzyme \
  -fasta_test test/enzyme.fasta \
  -fasta_label_test enzyme \
  -dtype Protein \
  -task 1 \
  -output results
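Because labels are matched to FASTA files by position, the two lists are easy to misalign when typed by hand. As a minimal sketch, the command can instead be assembled from label-to-path mappings so that both lists stay in the same order. The helper `build_engineering_cmd` is hypothetical and not shipped with the repository; it uses only the flags documented for `engineering.py`.

```python
from typing import Dict, List, Optional


def build_engineering_cmd(train: Dict[str, str], output: str,
                          test: Optional[Dict[str, str]] = None,
                          dtype: str = "DNA/RNA", task: int = 0) -> List[str]:
    """Assemble an argv list for engineering.py from {label: fasta_path} dicts."""
    if dtype not in ("DNA/RNA", "Protein"):
        raise ValueError("dtype must be 'DNA/RNA' or 'Protein'")
    if task not in (0, 1):
        raise ValueError("task must be 0 (classification) or 1 (regression)")
    cmd = ["python", "engineering.py"]
    # Dicts preserve insertion order, so paths and labels stay aligned.
    cmd += ["-fasta_train", *train.values()]
    cmd += ["-fasta_label_train", *train.keys()]
    if test:
        cmd += ["-fasta_test", *test.values()]
        cmd += ["-fasta_label_test", *test.keys()]
    cmd += ["-dtype", dtype, "-task", str(task), "-output", output]
    return cmd


cmd = build_engineering_cmd(
    train={"ncRNA": "train/ncRNA.fasta", "lncRNA": "train/lncRNA.fasta"},
    test={"ncRNA": "test/ncRNA.fasta", "lncRNA": "test/lncRNA.fasta"},
    output="results",
)
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` from the repository root, with the environment activated, would launch the run.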

`generation.py`

The generation.py script performs the second step of BioAutoML-FAST. It trains and optimizes machine learning models using the descriptors generated during the feature engineering step. The module supports both classification and regression tasks, including hyperparameter optimization and external test evaluation.

| Option | Description | Default |
| --- | --- | --- |
| `-path_model`, `--path_model` | Path to a previously trained model to be reused for prediction or evaluation. | `''` |
| `-task`, `--task` | Machine learning task. Use `0` for classification and `1` for regression. | `0` |
| `-tuning`, `--tuning` | Number of hyperparameter optimization trials. | `150` |
| `-train`, `--train` | Training feature matrix in CSV format. | Required |
| `-train_label`, `--train_label` | Training labels in CSV format. | Required |
| `-train_nameseq`, `--train_nameseq` | CSV file containing sequence names/identifiers for the training set. | Required |
| `-test`, `--test` | Test feature matrix in CSV format. | Optional |
| `-test_label`, `--test_label` | Test labels in CSV format. | Optional |
| `-test_nameseq`, `--test_nameseq` | CSV file containing sequence names/identifiers for the test set. | Optional |
| `-n_cpu`, `--n_cpu` | Number of CPU cores to use. Use `-1` to use all available cores. | `-1` |
| `-output`, `--output` | Output directory where models and results will be saved. | Required |

Note: This script can also be run directly on structured data, without running the first step.
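A direct invocation of `generation.py` might be assembled as sketched below. The helper `build_generation_cmd` and all file names are hypothetical and serve only as illustration; the flags are the ones documented for `generation.py`. The sketch also assumes that the three test-set files are supplied together, which the option descriptions do not state explicitly.

```python
from typing import List, Optional, Tuple


def build_generation_cmd(train: str, train_label: str, train_nameseq: str,
                         output: str,
                         test: Optional[Tuple[str, str, str]] = None,
                         task: int = 0, tuning: int = 150,
                         n_cpu: int = -1) -> List[str]:
    """Assemble an argv list for generation.py using the documented flags."""
    cmd = ["python", "generation.py",
           "-train", train,
           "-train_label", train_label,
           "-train_nameseq", train_nameseq]
    if test is not None:
        # Assumption: features, labels and names of the test set go together.
        features, labels, names = test
        cmd += ["-test", features, "-test_label", labels, "-test_nameseq", names]
    cmd += ["-task", str(task), "-tuning", str(tuning),
            "-n_cpu", str(n_cpu), "-output", output]
    return cmd


# Hypothetical file names, used only for illustration.
cmd = build_generation_cmd("train.csv", "train_labels.csv", "train_names.csv",
                           output="results_generation", task=1)
print(" ".join(cmd))
```

The expected column layout of the CSV files is not described here, so check it against the repository's example files before a real run.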

Citation

If you use this code in a scientific publication, we would appreciate citations to the following paper:

Silva de Almeida, B. L., Bonidia, R., Bole, M., Avila-Santos, A., Stadler, P. F., Nunes da Rocha, U., & de Carvalho, A. C. L. F. (2026). BioAutoML-FAST: an automated machine-learning platform for reusable and benchmarked biological sequence models. bioRxiv, 2026-04. DOI

@article{silva2026bioautoml,
  title={BioAutoML-FAST: an automated machine-learning platform for reusable and benchmarked biological sequence models},
  author={Silva de Almeida, Breno Livio and Bonidia, Robson and Bole, Martin and Avila-Santos, Anderson and Stadler, Peter F and Nunes da Rocha, Ulisses and de Carvalho, Andre CP L F},
  journal={bioRxiv},
  pages={2026--04},
  year={2026},
  publisher={Cold Spring Harbor Laboratory}
}

