feat(fasta): add lightweight FASTA file format support by behroozazarkhalili · Pull Request #7923 · huggingface/datasets

behroozazarkhalili · 2025-12-31T19:33:00Z

Summary

This PR adds support for loading FASTA files directly with load_dataset(), addressing feedback from #7851.

FASTA is a text-based format for representing nucleotide sequences (DNA/RNA) or peptide sequences (proteins), widely used in bioinformatics.

Key Features

Zero external dependencies - Uses a lightweight pure Python parser based on readfq.py by Heng Li
Streaming support - Generator-based parsing for memory efficiency with large genomic files
Compression support - Automatic detection and handling of gzip, bzip2, and xz compressed files via magic bytes
Large sequence support - Uses the large_string Arrow type, which has 64-bit offsets, to handle viral genomes and other long sequences (fixes the UTF-8 overflow hit with the 32-bit offsets of the regular string type)
Adaptive batching - max_batch_bytes parameter (default 256MB) prevents Parquet page size errors with very large sequences
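
The streaming and compression features above can be illustrated with a minimal sketch. It is not the PR's code and only loosely follows the spirit of readfq.py: the names `open_text` and `read_fasta` are hypothetical, and the magic-byte table covers just the three formats listed above.

```python
import bz2
import gzip
import io
import lzma

# Leading "magic" bytes that identify each supported compression format.
_MAGIC_OPENERS = {
    b"\x1f\x8b": gzip.open,        # gzip
    b"BZh": bz2.open,              # bzip2
    b"\xfd7zXZ\x00": lzma.open,    # xz
}


def open_text(path):
    """Open a plain or compressed file as text by sniffing its first bytes."""
    with open(path, "rb") as raw:
        head = raw.read(6)
    for magic, opener in _MAGIC_OPENERS.items():
        if head.startswith(magic):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def read_fasta(lines):
    """Yield (header, sequence) pairs one record at a time (generator-based)."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Because `read_fasta` only holds one record in memory at a time, it can be fed directly from the handle returned by `open_text`, for example `for header, seq in read_fasta(open_text("sequences.fa.gz")): ...`.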

Technical Decisions (Addressing #7851 Feedback)

| Concern | Solution |
| --- | --- |
| Long sequences → UTF-8 overflow (@apcamargo, @UriNeri) | Uses `pa.large_string()` for sequence column |
| BioPython is overkill (@apcamargo) | Pure Python parser based on Heng Li's readfq.py |
| Parquet page size limit i32::MAX (@UriNeri) | Adaptive dual-threshold batching with `max_batch_bytes` |
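
One way to read "adaptive dual-threshold batching" is that a batch is flushed when it reaches either a row count or a byte budget, whichever comes first. The sketch below works under that assumption; the function name `batch_records` and the `batch_size` parameter are hypothetical, and only the 256MB default for `max_batch_bytes` comes from the description above.

```python
def batch_records(records, batch_size=10_000, max_batch_bytes=256 * 1024 * 1024):
    """Group (id, description, sequence) tuples into size-bounded batches."""
    batch, batch_bytes = [], 0
    for rec in records:
        rec_bytes = sum(len(field.encode("utf-8")) for field in rec)
        # Byte threshold: flush before adding a record that would exceed the budget.
        if batch and batch_bytes + rec_bytes > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(rec)
        batch_bytes += rec_bytes
        # Row threshold: flush once the batch holds enough records.
        if len(batch) >= batch_size:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        yield batch
```

A record larger than the byte budget still gets emitted, alone in its own batch, so nothing is dropped; whether the PR handles oversized records the same way is not stated here.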

Columns

| Column | Type | Description |
| --- | --- | --- |
| `id` | string | Sequence identifier (first word after `>`) |
| `description` | string | Full description line (everything after id) |
| `sequence` | large_string | The biological sequence (DNA/RNA/protein) |
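
To make the `id` / `description` split concrete, here is a small sketch of how a header line could be divided according to the table above. The helper name `split_header` is hypothetical and not taken from the PR.

```python
def split_header(header_line):
    """Split a FASTA header into (id, description).

    id          -> first whitespace-delimited word after '>'
    description -> everything after the id (empty if there is nothing else)
    """
    header = header_line[1:] if header_line.startswith(">") else header_line
    parts = header.strip().split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description
```

For example, a header such as `>seq1 Homo sapiens chromosome 1` would give `seq1` as the id and `Homo sapiens chromosome 1` as the description.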

Supported Extensions

.fa, .fasta, .fna, .ffn, .faa, .frn (and compressed variants)
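
As a rough illustration of matching these extensions together with their compressed variants, the sketch below strips one trailing compression suffix before checking the FASTA extension. The helper name and the assumption that compressed variants are spelled `.gz`, `.bz2`, and `.xz` are mine, not the PR's.

```python
from pathlib import PurePosixPath

FASTA_EXTENSIONS = {".fa", ".fasta", ".fna", ".ffn", ".faa", ".frn"}
# Assumed suffixes for the gzip, bzip2, and xz variants.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz"}


def looks_like_fasta(filename):
    """Return True if the filename ends in a FASTA extension, optionally compressed."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    return bool(suffixes) and suffixes[-1] in FASTA_EXTENSIONS
```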

Usage

from datasets import load_dataset

# Load FASTA file
dataset = load_dataset("fasta", data_files="sequences.fasta")

# Load with column filtering
dataset = load_dataset("fasta", data_files="sequences.fa", columns=["id", "sequence"])

# Load gzipped file
dataset = load_dataset("fasta", data_files="sequences.fa.gz")

# Configure batching for very large genomes
dataset = load_dataset("fasta", data_files="genome.fasta", max_batch_bytes=128*1024*1024)

Test Plan

All 22 tests passing.

cc: @georgia-hf

Add native support for loading FASTA biological sequence files with zero external dependencies. This addresses feedback from PR huggingface#7851.

Key features:
- Pure Python parser based on Heng Li's readfq.py (no BioPython dependency)
- Uses pa.large_string() for sequences to handle UTF-8 overflow with long genomes
- Adaptive byte-based batching (max_batch_bytes=256MB) prevents Parquet page size errors with very large sequences like complete viral genomes
- Supports gzip, bzip2, and xz compression via magic byte detection
- Column filtering: select subset of [id, description, sequence]

Supported extensions: .fa, .fasta, .fna, .ffn, .faa, .frn

FastaConfig was missing the __post_init__ method that calls super().__post_init__(). This is required to inherit BuilderConfig's validation for:
- Invalid config name characters (InvalidConfigName)
- data_files type validation (ValueError)

This aligns with the pattern used in ArrowConfig, MmcifFolderConfig, and other packaged module configs. Also includes minor style formatting from ruff.
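
The chaining pattern described in this commit can be sketched with stand-in classes. `BaseConfig` below is a simplified placeholder for `BuilderConfig`, not its real implementation, and `FastaLikeConfig` with its fields and checks is hypothetical. The point is only that once a dataclass subclass defines its own `__post_init__`, the parent's validation runs only if it is called explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BaseConfig:
    """Stand-in for a base builder config with name validation."""

    name: str = "default"

    def __post_init__(self):
        # Placeholder check; the real base class has its own rules.
        if any(ch in self.name for ch in '<>:/\\|?*'):
            raise ValueError(f"invalid config name: {self.name!r}")


@dataclass
class FastaLikeConfig(BaseConfig):
    """Hypothetical config that adds its own fields and checks."""

    columns: Optional[List[str]] = None
    max_batch_bytes: int = 256 * 1024 * 1024

    def __post_init__(self):
        # Run the parent's validation first, then the subclass-specific one.
        super().__post_init__()
        if self.max_batch_bytes <= 0:
            raise ValueError("max_batch_bytes must be positive")
```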

lhoestq · 2026-06-19T15:09:01Z

As for the other PR I think we can have biopython as an optional dependency for the parsing

Also I mentioned in #7851 that maybe it would be nice to have a dedicated type for this kind of data and decode the data as SeqRecord objects from biopython, which are nicer than simple python dictionaries for practitioners. wdyt ?

apcamargo · 2026-06-19T15:20:28Z

Is there any advantage in using biopython for parsing? It tends to be much slower than Heng Li's native parser (on top of adding an extra dependency)

lhoestq · 2026-06-19T15:44:56Z

Mostly convenience for users, since they get python objects they are used to.
And since the biopython open source community seems to be growing it will also provide access to future developments / features.

But let me know if you think otherwise

behroozazarkhalili · 2026-06-19T15:55:00Z

Thanks! I agree with decoding to BioPython objects (SeqRecord here) rather than plain dicts — it gives practitioners something they can process/convert directly, and aligns with where the community already contributes.

Proposed approach for this PR (and the sibling FASTQ/PDB/mmCIF ones):

Add biopython as an optional dependency. Parsing/decoding returns SeqRecord objects when it's installed.
Keep a lightweight fallback (the existing zero-dependency parser) for when biopython isn't present, so the base install stays lean; this addresses @apcamargo's and @UriNeri's point on #7851 that BioPython is heavy/slow for plain FASTA reads.
On the dedicated feature type: I'd be glad to add one (decoding to SeqRecord) — should it live alongside the existing decodable features (like Audio/Image), or would you prefer to start with biopython optional + objects and add the dedicated type as a follow-up?

Also noted your suggestion to merge #7923 + #7924 since FASTA/FASTQ are so similar — happy to consolidate them into one PR if you prefer. Will hold the PDB/mmCIF "one row = one structure" restructuring for those PRs specifically.

apcamargo · 2026-06-19T16:04:12Z

Two additional suggestions:

I'm not a big fan of requiring specific extensions. Is this needed? If so, you should include .afa, which is a common extension for alignments.
Support for ZSTD (which is becoming very popular for FASTA files) can be added when the user has Python 3.14 or higher. You can find an example here: https://gist.github.com/apcamargo/d039aa04a2cbbcbb14e2d34a0963b862#file-fancy_fasta_reader-py-L7-L35

behroozazarkhalili · 2026-06-19T17:23:25Z

Thanks @apcamargo, both good points.

Extensions: I'll add .afa. On dropping the allowlist entirely — the extension list also feeds the packaged-module auto-detection (how load_dataset picks the FASTA builder for a folder of files), so I can't remove it outright without breaking that inference. What I can do is relax it (broaden the set, and not hard-fail on unlisted extensions when the builder is selected explicitly). This also ties into the open question above re: decoding via BioPython — if we go that route, format detection leans on content/handle rather than the filename, which makes the extension list less load-bearing. I'll align the final behavior with whatever we land on there.

ZSTD: Agreed — it's increasingly common for FASTA. Python 3.14 added compression.zstd to the stdlib, so I can add it as a version-gated branch in _open_file (alongside the existing gzip/bz2/xz magic-byte detection) without a new dependency. Thanks for the gist reference.
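
A version-gated ZSTD branch could look roughly like the sketch below. The helper names are hypothetical and this is not the PR's `_open_file`. It relies on the Zstandard frame magic bytes `28 B5 2F FD` and on `compression.zstd.open`, which is part of the standard library from Python 3.14.

```python
import sys

# Zstandard frames start with these four bytes.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def is_zstd(head):
    """Return True if the leading bytes look like a Zstandard frame."""
    return head.startswith(ZSTD_MAGIC)


def open_zstd_text(path):
    """Open a .zst file as text, available only on Python 3.14 or newer."""
    if sys.version_info < (3, 14):
        raise RuntimeError(
            "Zstandard-compressed FASTA needs Python 3.14+ (compression.zstd)"
        )
    from compression import zstd  # stdlib module added in Python 3.14

    return zstd.open(path, "rt", encoding="utf-8")
```

Raising a clear error on older interpreters, rather than falling through to the plain-text path, avoids silently parsing compressed bytes as sequence data.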

lhoestq · 2026-07-01T15:56:37Z

Your proposed approach looks good !

Add biopython as an optional dependency. Parsing/decoding returns SeqRecord objects when it's installed.

yes sounds good

Keep a lightweight fallback (the existing zero-dependency parser) for when biopython isn't present, so the base install stays lean — this addresses @apcamargo's and @UriNeri's point on #7851 that BioPython is heavy/slow for plain FASTA reads.
On the dedicated feature type: I'd be glad to add one (decoding to SeqRecord) — should it live alongside the existing decodable features (like Audio/Image), or would you prefer to start with biopython optional + objects and add the dedicated type as a follow-up?

Users can disable decoding anyway, and in this case they get the raw file bytes that they can parse themselves. Let's use biopython and have the dedicated type

behroozazarkhalili · 2026-07-03T16:19:07Z

Great — I'll start implementing the biopython-optional + dedicated-type approach across the bio-format PRs. Three things I'd like your call on so I name the public API right the first time:

Feature-type naming. I'm planning a decodable feature (like Audio/Image) that stores raw bytes and decodes to a biopython object when biopython is installed, with decode=False giving raw bytes. Since FASTA/FASTQ decode to Bio.SeqRecord and PDB/mmCIF decode to Bio.PDB.Structure, do you prefer (a) two types, e.g. BioSequence (SeqRecord) and Structure (Bio.PDB), or (b) one general Biopython/SeqRecord-style type parameterized by format? I'll follow whichever you prefer.
Merge #7923 + #7924? You mentioned FASTA and FASTQ are similar enough to be one PR; happy to consolidate them into #7923 and close #7924, or keep them separate. Your call.
Where should the feature type live — added to core src/datasets/features/ now (registered in features/__init__.py), or start with biopython-optional decoding inside the packaged modules and add the dedicated feature type as a follow-up PR?

I'll wire the shared foundation (optional biopython dep + availability flag) meanwhile since that's name-independent.

@apcamargo

Per review feedback (@apcamargo), add .afa — a common extension for multiple sequence alignments in FASTA format. Wire it into the extension→builder auto-detection map so .afa files load as FASTA, and add it to the builder's EXTENSIONS list. Covered by a new test asserting both the mapping and that an .afa file loads.

behroozazarkhalili added 2 commits June 9, 2026 21:45

behroozazarkhalili force-pushed the feat/fasta-support branch from 570485f to 980fdec Compare June 10, 2026 05:15

behroozazarkhalili wants to merge 3 commits into huggingface:main from behroozazarkhalili:feat/fasta-support
