ProFASTA

Introduction

ProFASTA is a Python library for working with FASTA files containing protein records. It prioritizes simplicity while providing a practical set of features for mass spectrometry-based proteomics workflows.

Core functionality includes:

Parsing and writing FASTA files via profasta.io
Structured header parsing via a registry of built-in and user-defined parsers
A protein database (ProteinDatabase) for managing entries loaded from one or more FASTA files
Decoy database generation by sequence reversal
Header validation for non-ASCII characters

ProFASTA is developed as part of the computational toolbox for the Mass Spectrometry Facility at the Max Perutz Labs (University of Vienna).

Similar projects

If ProFASTA doesn't meet your requirements, consider one of these alternative Python packages for working with protein FASTA files:

fastapy is a lightweight package with no dependencies that offers FASTA reading functionality.
protfasta is another library with no dependencies that provides reading functionality along with basic validation (e.g., duplicate headers, conversion of non-canonical amino acids). The library also allows writing FASTA files with the ability to specify the sequence line length.
pyteomics is a feature-rich package that provides tools to handle various sorts of proteomics data. It provides functions for FASTA reading, automatic parsing of headers (in various formats defined at uniprot.org), writing, and generation of decoy entries. Note that pyteomics is a large package with many dependencies.

Requirements

Python >= 3.11

ProFASTA has no dependencies beyond the Python standard library.

Installation

Install from PyPI:

pip install profasta

Key concepts

FASTA parsing

The profasta.io.parse_fasta function reads a FASTA file and yields FastaRecord objects. Sequences are automatically normalized: letters are converted to uppercase, spaces are removed, and trailing * characters are stripped.

import profasta.io

with open("proteins.fasta", "r") as f:
    for record in profasta.io.parse_fasta(f):
        print(record.header, record.sequence)
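
The normalization rules described above can be illustrated with a small standalone sketch. The helper name `normalize_sequence` is hypothetical and not part of ProFASTA; it only mirrors the three documented rules and may differ from the library's actual implementation in edge cases.

```python
def normalize_sequence(raw: str) -> str:
    """Illustrative stand-in for the documented rules (not ProFASTA's code)."""
    # Remove spaces, drop trailing '*' terminators, then convert to uppercase.
    return raw.replace(" ", "").rstrip("*").upper()


print(normalize_sequence("mkt ayia kqr*"))  # MKTAYIAKQR
```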

Header parsers and the registry

ProFASTA uses a registry system for header parsers and writers. Built-in parsers are registered under the following names:

| Name | Description |
| --- | --- |
| `"default"` | Splits on the first whitespace; never fails |
| `"uniprot"` | Strict UniProt format parser |
| `"uniprot_like"` | Tolerant UniProt-like format parser |

Built-in writers follow the same naming convention.

Note: A DecoyWriter is available but not registered by default. It prepends "rev_" to the header string. You can manually register it or create a version with a custom tag using DecoyWriter.with_tag("your_tag").

Custom parsers and writers can be registered via:

profasta.parser.register_parser("my_parser", MyParser)
profasta.parser.register_writer("my_writer", MyWriter)

A parser must implement a parse(header: str) -> ParsedHeader classmethod, and a writer must implement a write(parsed_header: ParsedHeader) -> str classmethod.
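
As a rough illustration of that interface, the sketch below defines a parser and a writer as classes with classmethods. The `ParsedHeader` dataclass here is a simplified stand-in, defined only so that the example runs on its own; its fields (`identifier`, `header`, `header_fields`) are assumptions, so check the real class in `profasta.parser` before adapting this. `PipeParser` and `PipeWriter` are hypothetical names, and the `identifier|gene_name` header layout is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedHeader:
    """Stand-in for the library's ParsedHeader; the field names are assumptions."""
    identifier: str
    header: str
    header_fields: dict[str, str] = field(default_factory=dict)


class PipeParser:
    """Hypothetical parser for headers of the form 'identifier|gene_name'."""

    @classmethod
    def parse(cls, header: str) -> ParsedHeader:
        identifier, _, gene_name = header.partition("|")
        if not identifier or not gene_name:
            raise ValueError(f"Header does not match 'identifier|gene_name': {header!r}")
        return ParsedHeader(identifier, header, {"gene_name": gene_name})


class PipeWriter:
    """Hypothetical writer that rebuilds the header string from the parsed fields."""

    @classmethod
    def write(cls, parsed_header: ParsedHeader) -> str:
        return f"{parsed_header.identifier}|{parsed_header.header_fields['gene_name']}"


parsed = PipeParser.parse("O75385|ULK1")
print(parsed.identifier, parsed.header_fields)  # O75385 {'gene_name': 'ULK1'}
print(PipeWriter.write(parsed))  # O75385|ULK1
```

Raising an exception for headers that do not match lets the caller decide whether to stop or skip the record.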

Parsed UniProt header fields

The following header_fields keys are available after parsing with UniprotParser or UniprotLikeParser. Guaranteed fields are always present; optional fields are only populated if the corresponding tag was found in the header.

Note: All fields are stored as strings, including organism_identifier.

| Field | UniprotParser | UniprotLikeParser | Notes |
| --- | --- | --- | --- |
| `db` | guaranteed | guaranteed | Source database identifier |
| `identifier` | guaranteed | guaranteed | Accession or unique ID |
| `entry_name` | guaranteed | guaranteed | Entry name |
| `protein_name` | guaranteed | optional | Text before the first tag field |
| `organism_name` | optional | optional | OS field |
| `organism_identifier` | optional | optional | OX field (NCBI taxonomy ID) |
| `gene_name` | optional | optional | GN field |
| `protein_existence` | optional | optional | PE field |
| `sequence_version` | optional | optional | SV field |
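
To make the mapping between header text and field names concrete, here is a minimal standard-library sketch that splits a header of the form `db|identifier|entry_name protein name OS=... OX=... GN=... PE=... SV=...` into the keys listed above. It is not ProFASTA's parser: the function name `parse_uniprot_header` is hypothetical, the header string is illustrative, and the real parsers may handle edge cases differently.

```python
import re

# Maps UniProt tag names to the field names used in the table above.
TAG_TO_FIELD = {
    "OS": "organism_name",
    "OX": "organism_identifier",
    "GN": "gene_name",
    "PE": "protein_existence",
    "SV": "sequence_version",
}
TAG_PATTERN = re.compile(r"(?:^|\s)(OS|OX|GN|PE|SV)=")


def parse_uniprot_header(header: str) -> dict[str, str]:
    """Split a UniProt-style header into string fields (illustrative only)."""
    prefix, _, description = header.partition(" ")
    db, identifier, entry_name = prefix.split("|")  # raises ValueError if malformed
    fields = {"db": db, "identifier": identifier, "entry_name": entry_name}

    matches = list(TAG_PATTERN.finditer(description))
    # The protein name is the text before the first tag field.
    name_end = matches[0].start() if matches else len(description)
    protein_name = description[:name_end].strip()
    if protein_name:
        fields["protein_name"] = protein_name

    # Each tag's value runs until the next tag or the end of the header.
    for i, match in enumerate(matches):
        value_end = matches[i + 1].start() if i + 1 < len(matches) else len(description)
        fields[TAG_TO_FIELD[match.group(1)]] = description[match.end():value_end].strip()
    return fields


fields = parse_uniprot_header(
    "sp|O75385|ULK1_HUMAN Serine/threonine-protein kinase ULK1 "
    "OS=Homo sapiens OX=9606 GN=ULK1 PE=1 SV=2"
)
print(fields["gene_name"])            # ULK1
print(fields["organism_identifier"])  # 9606
```

Every value stays a string, so `organism_identifier` compares equal to `"9606"` rather than to the integer `9606`.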

ProteinDatabase

The ProteinDatabase class provides a dict-like interface for managing protein entries loaded from FASTA files:

import profasta

db = profasta.ProteinDatabase()
db.add_fasta("proteins.fasta", header_parser="uniprot")

entry = db["O75385"]
print(entry.header_fields["gene_name"])  # ULK1

Multiple FASTA files can be added to the same database. Entries with unparseable headers can be skipped using skip_invalid=True.

A ProteinDatabase can also be created directly from one or more FASTA files using the from_fasta convenience constructor:

fasta_paths = ["proteome1.fasta", "proteome2.fasta"]
db = profasta.ProteinDatabase.from_fasta(*fasta_paths, header_parser="uniprot")

Entries can be filtered by a condition using the filter method, which returns a new ProteinDatabase:

human_db = db.filter(lambda e: e.header_fields.get("organism_identifier") == "9606")
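
The filter condition above uses `.get` and compares against the string `"9606"`. The sketch below shows why, using a plain dict and a stand-in `Entry` dataclass instead of the real `ProteinDatabase`; `Entry`, `filter_entries`, and the sample entries are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Minimal stand-in for a database entry; only the fields used here."""
    identifier: str
    header_fields: dict[str, str] = field(default_factory=dict)


entries = {
    "O75385": Entry("O75385", {"organism_identifier": "9606"}),
    "P12345": Entry("P12345", {"organism_identifier": "9986"}),
    "custom_1": Entry("custom_1"),  # no OX tag, so the key is absent
}


def filter_entries(db: dict[str, Entry], condition) -> dict[str, Entry]:
    """Return a new mapping with the entries for which condition(entry) is true."""
    return {key: entry for key, entry in db.items() if condition(entry)}


# .get returns None for entries without the optional field instead of raising KeyError.
human = filter_entries(entries, lambda e: e.header_fields.get("organism_identifier") == "9606")
print(list(human))  # ['O75385']

# Comparing against an int never matches because all header fields are strings.
nothing = filter_entries(entries, lambda e: e.header_fields.get("organism_identifier") == 9606)
print(list(nothing))  # []
```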

Header validation

The profasta.validation module provides a function for checking FASTA records for non-ASCII characters in their headers, which can cause issues in downstream processing:

import profasta.io
import profasta.validation

with open("proteins.fasta", "r") as f:
    records = list(profasta.io.parse_fasta(f))

issues = profasta.validation.find_header_ascii_issues(records)
for issue in issues:
    print(issue.header, issue.non_ascii_characters)
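
For intuition about what such a check involves, the following standalone sketch collects the non-ASCII characters of a single header string. The function name `find_non_ascii` and the sample header are hypothetical; this is not the code behind `find_header_ascii_issues`, whose return type may carry more information.

```python
def find_non_ascii(header: str) -> list[str]:
    """Return the distinct non-ASCII characters in a header, in order of appearance."""
    seen: list[str] = []
    for char in header:
        if not char.isascii() and char not in seen:
            seen.append(char)
    return seen


print(find_non_ascii("tr|A0A000|TEST Protéine β-like"))  # ['é', 'β']
print(find_non_ascii("sp|O75385|ULK1_HUMAN"))            # []
```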

Usage examples

Load a UniProt FASTA file and access a protein entry

import profasta

db = profasta.ProteinDatabase()
db.add_fasta("./examples/uniprot_hsapiens_10entries.fasta", header_parser="uniprot")

entry = db["O75385"]
print(entry.header_fields["gene_name"])  # ULK1

Combine multiple FASTA files and add decoy entries

A common proteomics workflow is to combine one or more FASTA files and append reversed decoy sequences. Use profasta.write_decoy_fasta to write decoy entries directly to a FASTA file:

import profasta

# Load one or more forward databases
db = profasta.ProteinDatabase()
db.add_fasta("proteome.fasta", header_parser="uniprot")
db.add_fasta("additional.fasta", header_parser="uniprot")

# Write the forward entries, then append decoy entries with reversed sequences
output_path = "combined_with_decoys.fasta"
db.write_fasta(output_path, header_writer="default")
profasta.write_decoy_fasta(db, output_path, append=True)

Decoy headers are automatically prefixed with rev_. A custom prefix can be set via the decoy_tag argument:

profasta.write_decoy_fasta(db, output_path, append=True, decoy_tag="decoy_")
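
Conceptually, each decoy entry pairs a tagged header with the reversed sequence. The sketch below shows that transformation on a single record using only built-in string operations; `make_decoy` is a hypothetical helper, the short sequence is illustrative, and the code is not taken from ProFASTA.

```python
def make_decoy(header: str, sequence: str, decoy_tag: str = "rev_") -> tuple[str, str]:
    """Return a decoy (header, sequence) pair: tagged header, reversed sequence."""
    return decoy_tag + header, sequence[::-1]


header, sequence = make_decoy("sp|O75385|ULK1_HUMAN", "MEPGRGG")
print(header)    # rev_sp|O75385|ULK1_HUMAN
print(sequence)  # GGRGPEM
```

Reversal preserves the length and amino acid composition of each protein, which is why it is a common choice for building decoy databases.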

Contributors

Juraj Ahel - @xeniorn
