hollenstein/profasta

ProFASTA

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Introduction

ProFASTA is a Python library for working with FASTA files containing protein records. It prioritizes simplicity while providing a practical set of features for proteomics-based mass spectrometry workflows.

Core functionality includes:

  • Parsing and writing FASTA files via profasta.io
  • Structured header parsing via a registry of built-in and user-defined parsers
  • A protein database (ProteinDatabase) for managing entries loaded from one or more FASTA files
  • Decoy database generation by sequence reversal
  • Header validation for non-ASCII characters

ProFASTA is developed as part of the computational toolbox for the Mass Spectrometry Facility at the Max Perutz Labs (University of Vienna).

Similar projects

If ProFASTA doesn't meet your requirements, consider these alternative Python packages that focus on protein FASTA files:

  • fastapy is a lightweight package with no dependencies that offers FASTA reading functionality.
  • protfasta is another library with no dependencies that provides reading functionality along with basic validation (e.g., duplicate headers, conversion of non-canonical amino acids). The library also allows writing FASTA files with the ability to specify the sequence line length.
  • pyteomics is a feature-rich package that provides tools to handle various sorts of proteomics data. It provides functions for FASTA reading, automatic parsing of headers (in various formats defined at uniprot.org), writing, and generation of decoy entries. Note that pyteomics is a large package with many dependencies.

Requirements

Python >= 3.11

ProFASTA has no dependencies beyond the Python standard library.

Installation

Install from PyPI:

pip install profasta

Key concepts

FASTA parsing

The profasta.io.parse_fasta function reads a FASTA file and yields FastaRecord objects. Sequences are automatically normalized: letters are converted to uppercase, spaces are removed, and trailing * characters are stripped.

import profasta.io

with open("proteins.fasta", "r") as f:
    for record in profasta.io.parse_fasta(f):
        print(record.header, record.sequence)
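The normalization rules described above can be sketched in a few lines of plain Python. This is only an illustration of the stated behavior (uppercase, spaces removed, trailing * stripped), not the library's implementation; the names normalize_sequence and iter_fasta are hypothetical and do not exist in profasta.

```python
import io
from typing import Iterator, TextIO


def normalize_sequence(raw: str) -> str:
    """Remove spaces, convert to uppercase, and strip trailing '*' characters."""
    return raw.replace(" ", "").upper().rstrip("*")


def iter_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (header, normalized sequence) pairs from a FASTA text stream.

    Hypothetical helper for illustration; lines before the first '>' are ignored.
    """
    header = None
    chunks: list[str] = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                yield header, normalize_sequence("".join(chunks))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, normalize_sequence("".join(chunks))


example = io.StringIO(">first record\nmkt lv\nAAk*\n>second\nGG\n")
records = list(iter_fasta(example))
```

For real files, profasta.io.parse_fasta as shown above is the intended entry point; the sketch only makes the normalization steps concrete.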

Header parsers and the registry

ProFASTA uses a registry system for header parsers and writers. Built-in parsers are registered under the following names:

| Name | Description |
| --- | --- |
| "default" | Splits on the first whitespace; never fails |
| "uniprot" | Strict UniProt format parser |
| "uniprot_like" | Tolerant UniProt-like format parser |

Built-in writers follow the same naming convention.

Note: A DecoyWriter is available but not registered by default. It prepends "rev_" to the header string. You can manually register it or create a version with a custom tag using DecoyWriter.with_tag("your_tag").

Custom parsers and writers can be registered via:

profasta.parser.register_parser("my_parser", MyParser)
profasta.parser.register_writer("my_writer", MyWriter)

A parser must implement a parse(header: str) -> ParsedHeader classmethod, and a writer must implement a write(parsed_header: ParsedHeader) -> str classmethod.
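As a rough sketch of that interface, the following defines a parser and writer pair for a made-up header layout of the form ID|gene|description. The ParsedHeader dataclass here is a stand-in so that the example runs on its own; its field names (identifier, header, header_fields) are assumptions, and real code should use the ParsedHeader type provided by the library. PipeParser and PipeWriter are hypothetical names, and raising ValueError on malformed input is this sketch's choice.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedHeader:
    """Stand-in for the library's ParsedHeader; the field names are assumptions."""

    identifier: str
    header: str
    header_fields: dict[str, str] = field(default_factory=dict)


class PipeParser:
    """Hypothetical parser for headers shaped like 'ID|gene|description'."""

    @classmethod
    def parse(cls, header: str) -> ParsedHeader:
        parts = header.split("|", maxsplit=2)
        if len(parts) != 3 or not parts[0]:
            raise ValueError(f"Header does not match 'ID|gene|description': {header!r}")
        identifier, gene_name, description = parts
        fields = {"gene_name": gene_name, "description": description}
        return ParsedHeader(identifier=identifier, header=header, header_fields=fields)


class PipeWriter:
    """Hypothetical writer that reverses PipeParser.parse."""

    @classmethod
    def write(cls, parsed_header: ParsedHeader) -> str:
        fields = parsed_header.header_fields
        return "|".join([parsed_header.identifier, fields["gene_name"], fields["description"]])


parsed = PipeParser.parse("P12345|ABC1|Example protein")
round_trip = PipeWriter.write(parsed)
```

Classes of this shape would then be registered with the register_parser and register_writer calls shown above, under whatever names you choose.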

Parsed UniProt header fields

The following header_fields keys are available after parsing with UniprotParser or UniprotLikeParser. Guaranteed fields are always present; optional fields are only populated if the corresponding tag was found in the header.

Note: All fields are stored as strings, including organism_identifier.

| Field | UniprotParser | UniprotLikeParser | Notes |
| --- | --- | --- | --- |
| db | guaranteed | guaranteed | Source database identifier |
| identifier | guaranteed | guaranteed | Accession or unique ID |
| entry_name | guaranteed | guaranteed | Entry name |
| protein_name | guaranteed | optional | Text before the first tag field |
| organism_name | optional | optional | OS field |
| organism_identifier | optional | optional | OX field (NCBI taxonomy ID) |
| gene_name | optional | optional | GN field |
| protein_existence | optional | optional | PE field |
| sequence_version | optional | optional | SV field |
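To make the mapping between header tags and field names concrete, here is a minimal regex-based sketch that splits a UniProt-style header into the fields listed above. It is not the implementation used by UniprotParser; the function name split_uniprot_header is hypothetical, the header string is illustrative, and edge cases (such as a protein name that itself contains text like "OS=") are not handled.

```python
import re

# Tag-to-field mapping taken from the table above.
TAG_TO_FIELD = {
    "OS": "organism_name",
    "OX": "organism_identifier",
    "GN": "gene_name",
    "PE": "protein_existence",
    "SV": "sequence_version",
}
TAG_PATTERN = re.compile(r"\b(OS|OX|GN|PE|SV)=")


def split_uniprot_header(header: str) -> dict[str, str]:
    """Split 'db|identifier|entry_name protein name OS=... OX=...' into fields.

    Raises ValueError if the first token does not have exactly three
    '|'-separated parts. All values are kept as strings.
    """
    prefix, _, description = header.partition(" ")
    db, identifier, entry_name = prefix.split("|")
    fields = {"db": db, "identifier": identifier, "entry_name": entry_name}

    matches = list(TAG_PATTERN.finditer(description))
    # The protein name is the text before the first tag field.
    name_end = matches[0].start() if matches else len(description)
    fields["protein_name"] = description[:name_end].strip()

    # Each tag value runs until the next tag or the end of the header.
    for i, match in enumerate(matches):
        value_end = matches[i + 1].start() if i + 1 < len(matches) else len(description)
        fields[TAG_TO_FIELD[match.group(1)]] = description[match.end():value_end].strip()
    return fields


header = "sp|O75385|ULK1_HUMAN Serine/threonine-protein kinase ULK1 OS=Homo sapiens OX=9606 GN=ULK1 PE=1 SV=2"
fields = split_uniprot_header(header)
```

Because optional fields may be absent, reading them with fields.get("gene_name") rather than fields["gene_name"] avoids a KeyError for headers that lack the tag.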

ProteinDatabase

The ProteinDatabase class provides a dict-like interface for managing protein entries loaded from FASTA files:

import profasta

db = profasta.ProteinDatabase()
db.add_fasta("proteins.fasta", header_parser="uniprot")

entry = db["O75385"]
print(entry.header_fields["gene_name"])  # ULK1

Multiple FASTA files can be added to the same database. Entries with unparseable headers can be skipped using skip_invalid=True.

A ProteinDatabase can also be created directly from one or more FASTA files using the from_fasta convenience constructor:

fasta_paths = ["proteome1.fasta", "proteome2.fasta"]
db = profasta.ProteinDatabase.from_fasta(*fasta_paths, header_parser="uniprot")

Entries can be filtered by a condition using the filter method, which returns a new ProteinDatabase:

human_db = db.filter(lambda e: e.header_fields.get("organism_identifier") == "9606")
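When the same kind of condition is needed repeatedly, a small predicate factory can replace ad hoc lambdas. The sketch below is an assumption-laden illustration: field_equals is a hypothetical helper, and the entries are SimpleNamespace stand-ins that only mimic the header_fields attribute used in the example above.

```python
from types import SimpleNamespace
from typing import Any, Callable


def field_equals(field_name: str, expected: str) -> Callable[[Any], bool]:
    """Build a predicate that is True when an entry's header field equals `expected`.

    Entries lacking the field simply do not match, because dict.get returns None.
    """

    def predicate(entry: Any) -> bool:
        return entry.header_fields.get(field_name) == expected

    return predicate


# Stand-in entries; real entries would come from a ProteinDatabase.
entries = [
    SimpleNamespace(header_fields={"organism_identifier": "9606"}),
    SimpleNamespace(header_fields={"organism_identifier": "10090"}),
    SimpleNamespace(header_fields={}),
]
is_human = field_equals("organism_identifier", "9606")
human_entries = [entry for entry in entries if is_human(entry)]
```

A predicate built this way has the same shape as the lambda passed to db.filter above, so it could be supplied there in its place. Note that the comparison is against the string "9606", since header fields are stored as strings.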

Header validation

The profasta.validation module provides a function for checking FASTA records for non-ASCII characters in their headers, which can cause issues in downstream processing:

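import profasta.io
import profasta.validation

# Read all records first so they can be passed to the validation function
with open("proteins.fasta", "r") as f:
    records = list(profasta.io.parse_fasta(f))

# Report each header that contains non-ASCII characters
issues = profasta.validation.find_header_ascii_issues(records)
for issue in issues:
    print(issue.header, issue.non_ascii_characters)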
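For intuition, the underlying check can be approximated with the standard library alone. The sketch below is not the code behind find_header_ascii_issues; non_ascii_characters and find_issues are hypothetical helpers, and the headers are made-up examples.

```python
def non_ascii_characters(header: str) -> list[str]:
    """Return the non-ASCII characters of a header, in order of appearance."""
    return [char for char in header if not char.isascii()]


def find_issues(headers: list[str]) -> dict[str, list[str]]:
    """Map each problematic header to its non-ASCII characters; clean headers are omitted."""
    issues = {}
    for header in headers:
        offending = non_ascii_characters(header)
        if offending:
            issues[header] = offending
    return issues


headers = [
    "sp|P00001|TEST_HUMAN Protein kinase alpha",
    "sp|P00002|TEST_HUMAN Protein kinase α subunit",
]
issues = find_issues(headers)
```

Such characters typically enter a FASTA file through manual edits or copy and paste (for example Greek letters or typographic dashes).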

Usage examples

Load a UniProt FASTA file and access a protein entry

import profasta

db = profasta.ProteinDatabase()
db.add_fasta("./examples/uniprot_hsapiens_10entries.fasta", header_parser="uniprot")

entry = db["O75385"]
print(entry.header_fields["gene_name"])  # ULK1

Combine multiple FASTA files and add decoy entries

A common proteomics workflow is to combine one or more FASTA files and append reversed decoy sequences. Use profasta.write_decoy_fasta to write decoy entries directly to a FASTA file:

import profasta

# Load one or more forward databases
db = profasta.ProteinDatabase()
db.add_fasta("proteome.fasta", header_parser="uniprot")
db.add_fasta("additional.fasta", header_parser="uniprot")

# Write the forward entries, then append decoy entries with reversed sequences
output_path = "combined_with_decoys.fasta"
db.write_fasta(output_path, header_writer="default")
profasta.write_decoy_fasta(db, output_path, append=True)

Decoy headers are automatically prefixed with rev_. A custom prefix can be set via the decoy_tag argument:

profasta.write_decoy_fasta(db, output_path, append=True, decoy_tag="decoy_")
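Conceptually, each decoy entry is the forward entry with its sequence reversed and its header prefixed with the decoy tag. A minimal sketch of that transformation, assuming plain strings for header and sequence (make_decoy is a hypothetical helper, not part of profasta, and the sequence is a made-up example):

```python
def make_decoy(header: str, sequence: str, decoy_tag: str = "rev_") -> tuple[str, str]:
    """Return (decoy header, reversed sequence) for one forward entry."""
    return decoy_tag + header, sequence[::-1]


decoy_header, decoy_sequence = make_decoy("sp|O75385|ULK1_HUMAN", "MEPGRGK")
custom_header, _ = make_decoy("sp|O75385|ULK1_HUMAN", "MEPGRGK", decoy_tag="decoy_")
```

Reversal preserves the length and amino acid composition of each protein, which is one reason reversed sequences are a common choice for decoys.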
