feat(fasta): add lightweight FASTA file format support #7923
behroozazarkhalili wants to merge 3 commits into
Add native support for loading FASTA biological sequence files with zero external dependencies. This addresses feedback from PR huggingface#7851.
Key features:
- Pure Python parser based on Heng Li's readfq.py (no BioPython dependency)
- Uses pa.large_string() for sequences to handle UTF-8 overflow with long genomes
- Adaptive byte-based batching (max_batch_bytes=256MB) prevents Parquet page size errors with very large sequences like complete viral genomes
- Supports gzip, bzip2, and xz compression via magic byte detection
- Column filtering: select subset of [id, description, sequence]
Supported extensions: .fa, .fasta, .fna, .ffn, .faa, .frn
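For readers unfamiliar with the approach, here is a minimal, FASTA-only sketch of the two ideas named in this commit message: choosing a decompressor from a file's leading bytes, and a small generator-based parser in the spirit of readfq.py. It is not the PR's actual code. The names `open_maybe_compressed`, `read_fasta`, and `_make_record` are hypothetical and exist only for illustration.

```python
import bz2
import gzip
import lzma

# Magic-byte prefixes for gzip, bzip2, and xz (hypothetical lookup table).
_MAGIC = {
    b"\x1f\x8b": gzip.open,
    b"BZh": bz2.open,
    b"\xfd7zXZ\x00": lzma.open,
}


def open_maybe_compressed(path):
    """Open a file as text, choosing a decompressor from its leading bytes."""
    with open(path, "rb") as fh:
        head = fh.read(6)
    for magic, opener in _MAGIC.items():
        if head.startswith(magic):
            return opener(path, "rt", encoding="utf-8")
    # No known magic bytes: treat the file as plain text.
    return open(path, "rt", encoding="utf-8")


def _make_record(header, chunks):
    # The id is the first whitespace-delimited token; the rest is the description.
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description, "".join(chunks)


def read_fasta(lines):
    """Yield (id, description, sequence) tuples from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence lines are concatenated; blank lines are skipped.
            chunks.append(line.strip())
    if header is not None:
        yield _make_record(header, chunks)
```

Because `read_fasta` accepts any iterable of lines, it can be fed the handle returned by `open_maybe_compressed` or a plain list of strings in a test.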
FastaConfig was missing the __post_init__ method that calls super().__post_init__(). This is required to inherit BuilderConfig's validation for:
- Invalid config name characters (InvalidConfigName)
- data_files type validation (ValueError)
This aligns with the pattern used in ArrowConfig, MmcifFolderConfig, and other packaged module configs. Also includes minor style formatting from ruff.
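The chaining pattern this commit refers to can be illustrated with plain dataclasses. This is only a sketch: `BaseConfig` and `FastaLikeConfig` are hypothetical stand-ins rather than the real `BuilderConfig` and `FastaConfig`, and the stand-in raises `ValueError` where the library is said to raise `InvalidConfigName`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BaseConfig:
    """Hypothetical stand-in for a builder config base class."""

    name: str = "default"

    def __post_init__(self):
        # Reject characters that would be unsafe in a config name.
        if any(ch in self.name for ch in '<>:/\\|?*'):
            raise ValueError(f"Invalid config name: {self.name!r}")


@dataclass
class FastaLikeConfig(BaseConfig):
    """Hypothetical subclass with its own checks plus the parent's."""

    columns: Optional[List[str]] = None
    max_batch_bytes: int = 256 * 1024 * 1024

    def __post_init__(self):
        # A subclass that defines its own __post_init__ must chain to the
        # parent explicitly, otherwise the base validation does not run.
        super().__post_init__()
        if self.max_batch_bytes <= 0:
            raise ValueError("max_batch_bytes must be positive")
```

With this arrangement, `FastaLikeConfig(name="bad/name")` fails in the base-class check, while `FastaLikeConfig(max_batch_bytes=0)` fails in the subclass check.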
570485f to 980fdec
|
As for the other PR, I think we can have biopython as an optional dependency for the parsing. Also, I mentioned in #7851 that maybe it would be nice to have a dedicated type for this kind of data and decode the data as |
|
Is there any advantage in using biopython for parsing? It tends to be much slower than Heng Li's native parser (on top of adding an extra dependency) |
|
Mostly convenience for users, since they get python objects they are used to. But let me know if you think otherwise |
|
Thanks! I agree with decoding to BioPython objects ( Proposed approach for this PR (and the sibling FASTQ/PDB/mmCIF ones):
Also noted your suggestion to merge #7923 + #7924 since FASTA/FASTQ are so similar — happy to consolidate them into one PR if you prefer. Will hold the PDB/mmCIF "one row = one structure" restructuring for those PRs specifically. |
|
Two additional suggestions:
|
|
Thanks @apcamargo, both good points. Extensions: I'll add ZSTD: Agreed — it's increasingly common for FASTA. Python 3.14 added |
|
Your proposed approach looks good!
Yes, sounds good.
Users can disable decoding anyway, and in that case they get the raw file bytes, which they can parse themselves. Let's use biopython and have the dedicated type |
|
Great — I'll start implementing the biopython-optional + dedicated-type approach across the bio-format PRs. Three things I'd like your call on so I name the public API right the first time:
I'll wire the shared foundation (optional |
Per review feedback (@apcamargo), add .afa — a common extension for multiple sequence alignments in FASTA format. Wire it into the extension→builder auto-detection map so .afa files load as FASTA, and add it to the builder's EXTENSIONS list. Covered by a new test asserting both the mapping and that an .afa file loads.
Summary
This PR adds support for loading FASTA files directly with load_dataset(), addressing feedback from #7851.
FASTA is a text-based format for representing nucleotide sequences (DNA/RNA) or peptide sequences (proteins), widely used in bioinformatics.
Key Features
- large_string Arrow type to handle viral genomes and long sequences (fixes UTF-8 overflow)
- max_batch_bytes parameter (default 256MB) prevents Parquet page size errors with very large sequences
Technical Decisions (Addressing #7851 Feedback)
- pa.large_string() for sequence column
- max_batch_bytes
Columns
- id (… >)
- description
- sequence
Supported Extensions
.fa, .fasta, .fna, .ffn, .faa, .frn (and compressed variants)
Usage
Test Plan
- … large_string type
- … (max_batch_bytes)
All 22 tests passing.
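The byte-based batching behind max_batch_bytes can be sketched in pure Python as follows. This is an illustrative approximation, not the PR's implementation: `batch_by_bytes` is a hypothetical helper, and it measures each record by the UTF-8 size of its fields rather than by Arrow buffer sizes.

```python
def batch_by_bytes(records, max_batch_bytes=256 * 1024 * 1024):
    """Group (id, description, sequence) records under a per-batch byte budget.

    A single record larger than the budget is emitted alone rather than dropped.
    """
    batch, batch_bytes = [], 0
    for record in records:
        # Approximate the record size as the UTF-8 length of all its fields.
        size = sum(len(field.encode("utf-8")) for field in record)
        # Flush the current batch if adding this record would exceed the budget.
        if batch and batch_bytes + size > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch
```

Batching by bytes rather than by row count means a file of short reads yields large batches, while a file of whole genomes yields batches of only a few rows each.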
cc: @georgia-hf