feat(fasta): add lightweight FASTA file format support #7923

Open
behroozazarkhalili wants to merge 3 commits into huggingface:main from behroozazarkhalili:feat/fasta-support

@behroozazarkhalili behroozazarkhalili commented Dec 31, 2025

Summary

This PR adds support for loading FASTA files directly with load_dataset(), addressing feedback from #7851.

FASTA is a text-based format for representing nucleotide sequences (DNA/RNA) or peptide sequences (proteins), widely used in bioinformatics.

Key Features

  • Zero external dependencies - Uses a lightweight pure Python parser based on readfq.py by Heng Li
  • Streaming support - Generator-based parsing for memory efficiency with large genomic files
  • Compression support - Automatic detection and handling of gzip, bzip2, and xz compressed files via magic bytes
  • Large sequence support - Uses large_string Arrow type to handle viral genomes and long sequences (fixes UTF-8 overflow)
  • Adaptive batching - max_batch_bytes parameter (default 256MB) prevents Parquet page size errors with very large sequences
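
For illustration, magic-byte detection of the three compression formats listed above could be sketched as follows. This is a minimal stdlib-only sketch, not the PR's actual code; `open_maybe_compressed` and `MAGIC_OPENERS` are hypothetical names.

```python
import bz2
import gzip
import lzma

# Leading bytes that identify each compression format.
MAGIC_OPENERS = {
    b"\x1f\x8b": gzip.open,        # gzip
    b"BZh": bz2.open,              # bzip2
    b"\xfd7zXZ\x00": lzma.open,    # xz
}


def open_maybe_compressed(path):
    """Open `path` as text, decompressing gzip/bz2/xz input transparently.

    Hypothetical helper: the format is chosen from the file's first bytes,
    not from its extension.
    """
    with open(path, "rb") as fh:
        head = fh.read(6)  # the longest magic number above is 6 bytes
    for magic, opener in MAGIC_OPENERS.items():
        if head.startswith(magic):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")
```

Sniffing the content rather than trusting the suffix means a gzipped file named `sequences.fa` would still be read correctly under this sketch.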

Technical Decisions (Addressing #7851 Feedback)

Concern | Solution
Long sequences → UTF-8 overflow (@apcamargo, @UriNeri) | Uses pa.large_string() for sequence column
BioPython is overkill (@apcamargo) | Pure Python parser based on Heng Li's readfq.py
Parquet page size limit i32::MAX (@UriNeri) | Adaptive dual-threshold batching with max_batch_bytes
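
The dual-threshold batching in the last row could look roughly like the sketch below: a batch is flushed when it reaches either a row count or a byte budget. Only `max_batch_bytes` and its 256MB default come from this PR; `batch_records` and the `batch_size` default are assumptions made for illustration.

```python
def batch_records(records, batch_size=10_000, max_batch_bytes=256 * 1024 * 1024):
    """Group records into batches bounded by row count and by sequence bytes.

    Hypothetical helper. `records` is an iterable of dicts with a "sequence"
    key. Sequences are assumed to be ASCII, so len() stands in for byte size.
    """
    batch, batch_bytes = [], 0
    for record in records:
        record_bytes = len(record["sequence"])
        too_many_rows = len(batch) >= batch_size
        too_many_bytes = batch_bytes + record_bytes > max_batch_bytes
        # Flush before adding, so a non-empty batch never exceeds either limit.
        if batch and (too_many_rows or too_many_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += record_bytes
    if batch:
        yield batch
```

In this sketch a single record larger than `max_batch_bytes` is still emitted, alone in its own batch, rather than being dropped or split.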

Columns

Column | Type | Description
id | string | Sequence identifier (first word after >)
description | string | Full description line (everything after id)
sequence | large_string | The biological sequence (DNA/RNA/protein)
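
As a rough illustration of how a header line maps onto these three columns, here is a simplified generator-based parser. It is a sketch in the spirit of readfq.py, not a copy of it or of this PR's parser, and it handles FASTA only; `iter_fasta` and `_split_header` are hypothetical names.

```python
def _split_header(header):
    """Split a header (without the leading '>') into (id, description)."""
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


def iter_fasta(lines):
    """Yield (id, description, sequence) tuples from an iterable of lines.

    Records are produced one at a time, so the whole file never has to be
    held in memory. Multi-line sequences are joined into a single string.
    """
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield _split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line.strip())
    if header is not None:
        yield _split_header(header) + ("".join(chunks),)
```

Because it accepts any iterable of lines, the same function works on an open file handle or on a decompressing text stream.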

Supported Extensions

.fa, .fasta, .fna, .ffn, .faa, .frn (and compressed variants)

Usage

from datasets import load_dataset

# Load FASTA file
dataset = load_dataset("fasta", data_files="sequences.fasta")

# Load with column filtering
dataset = load_dataset("fasta", data_files="sequences.fa", columns=["id", "sequence"])

# Load gzipped file
dataset = load_dataset("fasta", data_files="sequences.fa.gz")

# Configure batching for very large genomes
dataset = load_dataset("fasta", data_files="genome.fasta", max_batch_bytes=128*1024*1024)

Test Plan

  • Basic FASTA loading (3 sequences, multi-line)
  • Multiple extension support (.fa, .fasta, .fna, .ffn, .faa, .frn)
  • Compression formats (gzip, bz2, xz)
  • Long sequences with large_string type
  • Column filtering
  • Batch size configuration
  • Byte-based batching (max_batch_bytes)
  • Large genome handling (simulated 50KB sequences)
  • Empty description handling
  • Multiple files loading
  • Custom feature casting

All 22 tests passing.

cc: @georgia-hf

Add native support for loading FASTA biological sequence files with zero
external dependencies. This addresses feedback from PR huggingface#7851.

Key features:
- Pure Python parser based on Heng Li's readfq.py (no BioPython dependency)
- Uses pa.large_string() for sequences to handle UTF-8 overflow with long genomes
- Adaptive byte-based batching (max_batch_bytes=256MB) prevents Parquet page
  size errors with very large sequences like complete viral genomes
- Supports gzip, bzip2, and xz compression via magic byte detection
- Column filtering: select subset of [id, description, sequence]

Supported extensions: .fa, .fasta, .fna, .ffn, .faa, .frn

FastaConfig was missing the __post_init__ method that calls
super().__post_init__(). This is required to inherit BuilderConfig's
validation for:
- Invalid config name characters (InvalidConfigName)
- data_files type validation (ValueError)

This aligns with the pattern used in ArrowConfig, MmcifFolderConfig,
and other packaged module configs.
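
A minimal sketch of the chaining pattern this commit describes, using a stand-in base class instead of the real BuilderConfig so it runs without the datasets library. `BaseConfig`, `FastaLikeConfig`, and the character check are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseConfig:
    """Stand-in for a builder config base class (hypothetical, simplified)."""

    name: str = "default"

    def __post_init__(self):
        # Simplified validation: reject characters unsafe in directory names.
        if any(ch in self.name for ch in '<>:/\\|?*'):
            raise ValueError(f"invalid config name: {self.name!r}")


@dataclass
class FastaLikeConfig(BaseConfig):
    """Format-specific config that keeps the base class validation active."""

    batch_size: Optional[int] = None
    max_batch_bytes: int = 256 * 1024 * 1024

    def __post_init__(self):
        # Chain to the parent so its checks run for this subclass too.
        super().__post_init__()
```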

Also includes minor style formatting from ruff.

lhoestq commented Jun 19, 2026

Member

As for the other PR, I think we can have biopython as an optional dependency for the parsing

Also I mentioned in #7851 that maybe it would be nice to have a dedicated type for this kind of data and decode the data as SeqRecord objects from biopython, which are nicer than simple python dictionaries for practitioners. wdyt ?

@apcamargo

Is there any advantage in using biopython for parsing? It tends to be much slower than Heng Li's native parser (on top of adding an extra dependency)

lhoestq commented Jun 19, 2026

Member

Mostly convenience for users, since they get python objects they are used to.
And since the biopython open source community seems to be growing it will also provide access to future developments / features.

But let me know if you think otherwise

@behroozazarkhalili

Author

Thanks! I agree with decoding to BioPython objects (SeqRecord here) rather than plain dicts — it gives practitioners something they can process/convert directly, and aligns with where the community already contributes.

Proposed approach for this PR (and the sibling FASTQ/PDB/mmCIF ones):

  • Add biopython as an optional dependency. Parsing/decoding returns SeqRecord objects when it's installed.
  • Keep a lightweight fallback (the existing zero-dependency parser) for when biopython isn't present, so the base install stays lean. This addresses @apcamargo's and @UriNeri's point on #7851 that BioPython is heavy/slow for plain FASTA reads.
  • On the dedicated feature type: I'd be glad to add one (decoding to SeqRecord) — should it live alongside the existing decodable features (like Audio/Image), or would you prefer to start with biopython optional + objects and add the dedicated type as a follow-up?

Also noted your suggestion to merge #7923 + #7924 since FASTA/FASTQ are so similar — happy to consolidate them into one PR if you prefer. Will hold the PDB/mmCIF "one row = one structure" restructuring for those PRs specifically.
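
To make the optional-dependency idea concrete, here is a hedged sketch of an availability flag plus a fallback path. `BIOPYTHON_AVAILABLE` and `decode_fasta` are hypothetical names, and the dict fallback is a simplification rather than the parser in this PR; `Bio.SeqIO.parse` is the only biopython call used.

```python
import importlib.util
from io import StringIO

# Availability flag: True only when the optional dependency can be imported.
BIOPYTHON_AVAILABLE = importlib.util.find_spec("Bio") is not None


def decode_fasta(text):
    """Return SeqRecord objects if biopython is installed, else plain dicts."""
    if BIOPYTHON_AVAILABLE:
        # Imported lazily so the base install never needs biopython.
        from Bio import SeqIO

        return list(SeqIO.parse(StringIO(text), "fasta"))

    # Zero-dependency fallback (simplified): one dict per record.
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            seq_id, _, description = line[1:].strip().partition(" ")
            records.append(
                {"id": seq_id, "description": description.strip(), "sequence": ""}
            )
        elif records:
            records[-1]["sequence"] += line.strip()
    return records
```

The return type differs between the two paths in this sketch, which is one reason a dedicated feature type with an explicit decode switch may be the cleaner public API.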

@apcamargo

Two additional suggestions:

@behroozazarkhalili

Author

Thanks @apcamargo, both good points.

Extensions: I'll add .afa. On dropping the allowlist entirely — the extension list also feeds the packaged-module auto-detection (how load_dataset picks the FASTA builder for a folder of files), so I can't remove it outright without breaking that inference. What I can do is relax it (broaden the set, and not hard-fail on unlisted extensions when the builder is selected explicitly). This also ties into the open question above re: decoding via BioPython — if we go that route, format detection leans on content/handle rather than the filename, which makes the extension list less load-bearing. I'll align the final behavior with whatever we land on there.

ZSTD: Agreed — it's increasingly common for FASTA. Python 3.14 added compression.zstd to the stdlib, so I can add it as a version-gated branch in _open_file (alongside the existing gzip/bz2/xz magic-byte detection) without a new dependency. Thanks for the gist reference.
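
A version-gated zstd branch could be sketched like this. It is an assumption about how such a branch might look, not the actual `_open_file` change; `open_zstd_aware` is a hypothetical helper, and on interpreters older than 3.14 it raises instead of decompressing.

```python
import sys

# Zstandard frames start with the magic number 0xFD2FB528 (little-endian).
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"

if sys.version_info >= (3, 14):
    from compression import zstd  # stdlib module added in Python 3.14
else:
    zstd = None


def open_zstd_aware(path):
    """Open `path` as text, decompressing zstd input when the stdlib allows it."""
    with open(path, "rb") as fh:
        is_zstd = fh.read(4) == ZSTD_MAGIC
    if not is_zstd:
        return open(path, "rt", encoding="utf-8")
    if zstd is None:
        raise RuntimeError("zstd-compressed input needs Python 3.14+ (compression.zstd)")
    return zstd.open(path, "rt", encoding="utf-8")
```

Raising a clear error on older interpreters avoids silently parsing compressed bytes as sequence text.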

lhoestq commented Jul 1, 2026

Member

Your proposed approach looks good !

Add biopython as an optional dependency. Parsing/decoding returns SeqRecord objects when it's installed.

yes sounds good

Keep a lightweight fallback (the existing zero-dependency parser) for when biopython isn't present, so the base install stays lean — this addresses @apcamargo's and @UriNeri's point on #7851 that BioPython is heavy/slow for plain FASTA reads.
On the dedicated feature type: I'd be glad to add one (decoding to SeqRecord) — should it live alongside the existing decodable features (like Audio/Image), or would you prefer to start with biopython optional + objects and add the dedicated type as a follow-up?

Users can disable decoding anyway, and in that case they get the raw file bytes that they can parse themselves. Let's use biopython and have the dedicated type

@behroozazarkhalili

Author

Great — I'll start implementing the biopython-optional + dedicated-type approach across the bio-format PRs. Three things I'd like your call on so I name the public API right the first time:

  1. Feature-type naming. I'm planning a decodable feature (like Audio/Image) that stores raw bytes and decodes to a biopython object when biopython is installed, with decode=False giving raw bytes. Since FASTA/FASTQ decode to Bio.SeqRecord and PDB/mmCIF decode to Bio.PDB.Structure, do you prefer (a) two types, e.g. BioSequence (SeqRecord) and Structure (Bio.PDB), or (b) one general Biopython/SeqRecord-style type parameterized by format? I'll follow whichever you prefer.

  2. Merge #7923 + #7924? You mentioned FASTA and FASTQ are similar enough to be one PR. I'm happy to consolidate them into #7923 and close #7924, or keep them separate. Your call.

  3. Where should the feature type live — added to core src/datasets/features/ now (registered in features/__init__.py), or start with biopython-optional decoding inside the packaged modules and add the dedicated feature type as a follow-up PR?

I'll wire the shared foundation (optional biopython dep + availability flag) meanwhile since that's name-independent.
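
For discussion, the decode switch from point 1 could be sketched as below. This is a toy model, not a proposal for the final API: `BioSequenceSketch` and its `decoder` field are hypothetical, and only the `decode` flag and the `encode_example`/`decode_example` method names mirror the existing Audio/Image features.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BioSequenceSketch:
    """Hypothetical feature: stores raw bytes and decodes them on access."""

    decode: bool = True
    # Pluggable decoder, e.g. a biopython-backed parser when it is installed.
    decoder: Optional[Callable[[bytes], object]] = None

    def encode_example(self, raw: bytes) -> dict:
        """Store the file content unchanged."""
        return {"bytes": raw}

    def decode_example(self, value: dict):
        """Return decoded objects, or the raw bytes when decoding is off."""
        if not self.decode or self.decoder is None:
            return value["bytes"]
        return self.decoder(value["bytes"])
```

With `decode=False` the caller gets the stored bytes back untouched, matching the behavior described above for users who prefer to parse the data themselves.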

Per review feedback (@apcamargo), add .afa — a common extension for multiple
sequence alignments in FASTA format. Wire it into the extension→builder
auto-detection map so .afa files load as FASTA, and add it to the builder's
EXTENSIONS list. Covered by a new test asserting both the mapping and that an
.afa file loads.
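
A simplified picture of the extension-to-builder lookup this commit touches. The real mapping lives in the datasets packaged-modules code and is not reproduced here; `EXTENSION_TO_MODULE`, `COMPRESSION_SUFFIXES`, and `infer_module` are hypothetical names.

```python
import os

FASTA_EXTENSIONS = (".fa", ".fasta", ".fna", ".ffn", ".faa", ".frn", ".afa")

# Hypothetical extension -> builder name map.
EXTENSION_TO_MODULE = {ext: "fasta" for ext in FASTA_EXTENSIONS}

COMPRESSION_SUFFIXES = (".gz", ".bz2", ".xz")


def infer_module(filename):
    """Return the builder name for `filename`, or None if it is unknown."""
    name = filename.lower()
    # Strip one compression suffix so "x.fa.gz" is matched on ".fa".
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return EXTENSION_TO_MODULE.get(os.path.splitext(name)[1])
```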