ProFASTA is a Python library for working with FASTA files containing protein records. It prioritizes simplicity while providing a practical set of features for proteomics-based mass spectrometry workflows.
Core functionality includes:
- Parsing and writing FASTA files via `profasta.io`
- Structured header parsing via a registry of built-in and user-defined parsers
- A protein database (`ProteinDatabase`) for managing entries loaded from one or more FASTA files
- Decoy database generation by sequence reversal
- Header validation for non-ASCII characters
ProFASTA is developed as part of the computational toolbox for the Mass Spectrometry Facility at the Max Perutz Labs (University of Vienna).
If ProFASTA doesn't meet your requirements, consider exploring these alternative Python packages with a focus on protein-containing FASTA files:
- fastapy is a lightweight package with no dependencies that offers FASTA reading functionality.
- protfasta is another library with no dependencies that provides reading functionality along with basic validation (e.g., duplicate headers, conversion of non-canonical amino acids). The library also allows writing FASTA files with the ability to specify the sequence line length.
- pyteomics is a feature-rich package that provides tools to handle various sorts of proteomics data. It provides functions for FASTA reading, automatic parsing of headers (in various formats defined at uniprot.org), writing, and generation of decoy entries. Note that pyteomics is a large package with many dependencies.
ProFASTA requires Python >= 3.11 and has no dependencies beyond the Python standard library.

Install from PyPI:

```shell
pip install profasta
```
The profasta.io.parse_fasta function reads a FASTA file and yields FastaRecord objects. Sequences are automatically normalized: letters are converted to uppercase, spaces are removed, and trailing * characters are stripped.
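The normalization rules above can be illustrated with a small standalone sketch. The helper `normalize_sequence` is hypothetical and not part of ProFASTA's API; it only mirrors the three documented steps, and it assumes that "spaces" covers any whitespace inside the sequence, including line breaks.

```python
def normalize_sequence(raw: str) -> str:
    """Hypothetical helper mirroring the documented normalization steps."""
    # Assumption: "spaces" is read broadly as any whitespace, including line breaks.
    compact = "".join(raw.split())
    # Convert to uppercase, then strip trailing "*" terminator characters only.
    return compact.upper().rstrip("*")


print(normalize_sequence("mkt ayia\nkqr*"))
```

Only trailing `*` characters are removed in this sketch; a `*` inside the sequence is left untouched. With the library itself, reading a file looks like this: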
```python
import profasta.io

with open("proteins.fasta", "r") as f:
    for record in profasta.io.parse_fasta(f):
        print(record.header, record.sequence)
```

ProFASTA uses a registry system for header parsers and writers. Built-in parsers are registered under the following names:
| Name | Description |
|---|---|
| `"default"` | Splits on the first whitespace; never fails |
| `"uniprot"` | Strict UniProt format parser |
| `"uniprot_like"` | Tolerant UniProt-like format parser |
Built-in writers follow the same naming convention.
Note: A `DecoyWriter` is available but not registered by default. It prepends `"rev_"` to the header string. You can register it manually or create a version with a custom tag using `DecoyWriter.with_tag("your_tag")`.
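As a rough picture of what a reversed decoy entry looks like, the sketch below builds one from a header and a sequence. `make_decoy` is a hypothetical helper written for illustration, not a ProFASTA function, and the header and sequence are made up. It assumes the tag is simply prepended to the full header string, as described for `DecoyWriter`.

```python
def make_decoy(header: str, sequence: str, tag: str = "rev_") -> tuple[str, str]:
    """Hypothetical helper: prefix the header and reverse the sequence."""
    return tag + header, sequence[::-1]


decoy_header, decoy_sequence = make_decoy("sp|P00000|TEST_HUMAN Example protein", "MKTAYIAK")
print(decoy_header)
print(decoy_sequence)
```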
Custom parsers and writers can be registered via:
```python
profasta.parser.register_parser("my_parser", MyParser)
profasta.parser.register_writer("my_writer", MyWriter)
```

A parser must implement a `parse(header: str) -> ParsedHeader` classmethod, and a writer must implement a `write(parsed_header: ParsedHeader) -> str` classmethod.
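To make the parser and writer contract more concrete, here is a minimal sketch of a custom pair. It is self-contained, so `ParsedHeader` is redefined as a stand-in dataclass whose fields (`identifier`, `header`, `header_fields`) are assumptions about the real class. `PipeParser`, `PipeWriter`, and the `identifier|description` header layout are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedHeader:
    """Stand-in for ProFASTA's ParsedHeader; the field names are assumptions."""

    identifier: str
    header: str
    header_fields: dict[str, str] = field(default_factory=dict)


class PipeParser:
    """Hypothetical parser for headers of the form 'identifier|description'."""

    @classmethod
    def parse(cls, header: str) -> ParsedHeader:
        identifier, _, description = header.partition("|")
        if not identifier:
            raise ValueError(f"Header has no identifier: {header!r}")
        return ParsedHeader(identifier, header, {"description": description})


class PipeWriter:
    """Hypothetical writer that rebuilds the 'identifier|description' header."""

    @classmethod
    def write(cls, parsed_header: ParsedHeader) -> str:
        description = parsed_header.header_fields.get("description", "")
        return f"{parsed_header.identifier}|{description}"


parsed = PipeParser.parse("P00000|Example protein")
print(parsed.identifier, parsed.header_fields)
print(PipeWriter.write(parsed))
```

In real use, the classes would build on ProFASTA's own `ParsedHeader` rather than the stand-in, and would be registered under a name of your choice with the registration calls shown earlier.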
The following header_fields keys are available after parsing with UniprotParser or UniprotLikeParser. Guaranteed fields are always present; optional fields are only populated if the corresponding tag was found in the header.
Note: All fields are stored as strings, including `organism_identifier`.
| Field | UniprotParser | UniprotLikeParser | Notes |
|---|---|---|---|
| `db` | guaranteed | guaranteed | Source database identifier |
| `identifier` | guaranteed | guaranteed | Accession or unique ID |
| `entry_name` | guaranteed | guaranteed | Entry name |
| `protein_name` | guaranteed | optional | Text before the first tag field |
| `organism_name` | optional | optional | OS field |
| `organism_identifier` | optional | optional | OX field (NCBI taxonomy ID) |
| `gene_name` | optional | optional | GN field |
| `protein_existence` | optional | optional | PE field |
| `sequence_version` | optional | optional | SV field |
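To show how the fields in the table relate to a header string, the following sketch extracts them with regular expressions. It is an illustration only and not ProFASTA's parser: `split_uniprot_header` is a hypothetical function, the header is a made-up example in UniProt style, and the sketch assumes tags appear as `XX=value` after the free-text protein name.

```python
import re

# Assumed layout: db|identifier|entry_name, then free text, then XX=value tags.
HEADER_PATTERN = re.compile(
    r"^(?P<db>\w+)\|(?P<identifier>[^|]+)\|(?P<entry_name>\S+)\s*(?P<rest>.*)$"
)
TAG_PATTERN = re.compile(r"\b(OS|OX|GN|PE|SV)=")
TAG_NAMES = {
    "OS": "organism_name",
    "OX": "organism_identifier",
    "GN": "gene_name",
    "PE": "protein_existence",
    "SV": "sequence_version",
}


def split_uniprot_header(header: str) -> dict[str, str]:
    """Hypothetical helper returning all header fields as strings."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        raise ValueError(f"Not a UniProt-style header: {header!r}")
    fields = {key: match[key] for key in ("db", "identifier", "entry_name")}
    # re.split with one capturing group yields [text, tag, value, tag, value, ...].
    parts = TAG_PATTERN.split(match["rest"])
    protein_name = parts[0].strip()
    if protein_name:
        fields["protein_name"] = protein_name
    for tag, value in zip(parts[1::2], parts[2::2]):
        fields[TAG_NAMES[tag]] = value.strip()
    return fields


example = "sp|P00000|TEST_HUMAN Example protein OS=Homo sapiens OX=9606 GN=TEST PE=1 SV=1"
for name, value in split_uniprot_header(example).items():
    print(f"{name}: {value}")
```

Optional fields are only added to the result when their tag is present, which is why code that reads them is safer using `dict.get` than direct indexing.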
The ProteinDatabase class provides a dict-like interface for managing protein entries loaded from FASTA files:
```python
import profasta

db = profasta.ProteinDatabase()
db.add_fasta("proteins.fasta", header_parser="uniprot")
entry = db["O75385"]
print(entry.header_fields["gene_name"])  # ULK1
```

Multiple FASTA files can be added to the same database. Entries with unparseable headers can be skipped using `skip_invalid=True`.
A ProteinDatabase can also be created directly from one or more FASTA files using the from_fasta convenience constructor:
```python
fasta_paths = ["proteome1.fasta", "proteome2.fasta"]
db = profasta.ProteinDatabase.from_fasta(*fasta_paths, header_parser="uniprot")
```

Entries can be filtered by a condition using the `filter` method, which returns a new `ProteinDatabase`:
```python
human_db = db.filter(lambda e: e.header_fields.get("organism_identifier") == "9606")
```

The `profasta.validation` module provides a function for checking FASTA records for non-ASCII characters in their headers, which can cause issues in downstream processing:
```python
import profasta.io
import profasta.validation

with open("proteins.fasta", "r") as f:
    records = list(profasta.io.parse_fasta(f))

issues = profasta.validation.find_header_ascii_issues(records)
for issue in issues:
    print(issue.header, issue.non_ascii_characters)
```

Example: loading a UniProt FASTA file into a database and looking up an entry by its identifier:

```python
import profasta

db = profasta.ProteinDatabase()
db.add_fasta("./examples/uniprot_hsapiens_10entries.fasta", header_parser="uniprot")
entry = db["O75385"]
print(entry.header_fields["gene_name"])  # ULK1
```

A common proteomics workflow is to combine one or more FASTA files and append reversed decoy sequences. Use `profasta.write_decoy_fasta` to write decoy entries directly to a FASTA file:
```python
import profasta

# Load one or more forward databases
db = profasta.ProteinDatabase()
db.add_fasta("proteome.fasta", header_parser="uniprot")
db.add_fasta("additional.fasta", header_parser="uniprot")

# Write the forward entries, then append decoy entries with reversed sequences
output_path = "combined_with_decoys.fasta"
db.write_fasta(output_path, header_writer="default")
profasta.write_decoy_fasta(db, output_path, append=True)
```

Decoy headers are automatically prefixed with `rev_`. A custom prefix can be set via the `decoy_tag` argument:
```python
profasta.write_decoy_fasta(db, output_path, append=True, decoy_tag="decoy_")
```

- Juraj Ahel - @xeniorn