tecap

3' terminal exon capture diagnostics for long-read single-cell RNA-seq.

tecap classifies long-read alignments by where their 3' end lands relative to the terminal exon (TE), its UTR, and a polyA site atlas. It sorts reads into nine mechanism buckets: successful capture, truncation at a real polyA site, internal priming in the UTR, internal priming in the CDS, 3' ends in non-coding terminal exons, alternative polyadenylation, upstream-exon mispriming, intronic mispriming, and downstream readthrough. It also measures reference base composition downstream of each cleavage site to distinguish classic A-tract internal priming from the moderate-A priming characteristic of saturating-local-concentration oligo-dT chemistries (10x GEM droplets, BD Rhapsody capture beads).

Designed for PacBio Iso-Seq / Kinnex and Oxford Nanopore cDNA BAMs. Direct-RNA sequencing is explicitly unsupported (no RT, no priming artifact to diagnose).

Mechanisms

Every classified read lands in exactly one of nine buckets, defined by where its 3' end falls relative to the terminal exon (TE), the TE's UTR / CDS, and the nearest annotated PolyASite cluster.

| Bucket | What it means | Why it matters |
| --- | --- | --- |
| Captured | 3' end in the TE; read covers >=50% of it. | Successful full-length capture of the mRNA 3' end; the goal of any 3'-end protocol. |
| MechA-correct | 3' end in the TE 3' UTR within ±25 bp of an annotated polyA cluster, but read covers <50% of TE. | Truncated transcript that nonetheless terminates at a real polyA site; common with degraded input or short-fragment library prep. |
| MechA-internalUTR | 3' end in the TE 3' UTR but not at any annotated polyA cluster. | Internal oligo-dT priming on an A-rich stretch in the UTR; classic mispriming signature. |
| IP-TE-CDS | 3' end inside the terminal exon's CDS portion. | Internal priming on the coding portion of the TE; strong mispriming signal. |
| MechA-noCDS | 3' end inside the TE of a non-coding gene. | Reported separately so the coding-gene buckets stay clean. |
| MechB-APA | 3' end upstream of the TE at an annotated polyA cluster on an upstream exon. | Alternative polyadenylation isoform; biological, not a mispriming artifact. |
| MechB-exon | 3' end on an upstream exon, no nearby polyA cluster. | Internal priming on an upstream exon. |
| MechB-aspecific | 3' end upstream of the TE in an intron or gene flank. | Pre-mRNA priming or off-target alignment. |
| MechC | 3' end downstream of the TE end. | Read-through, unannotated 3' UTR extension, or alignment artifact. |
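
The bucket definitions above can be read as a small decision tree. The sketch below is illustrative only and is not tecap's implementation: the `ReadEnd` fields and region labels are hypothetical, and it assumes the >=50% coverage check takes precedence for every read whose 3' end lands inside the TE, which the table does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional

POLYA_WINDOW_BP = 25    # from the table: within +/-25 bp of an annotated cluster
MIN_TE_COVERAGE = 0.50  # from the table: read covers >=50% of the TE

TE_REGIONS = {"te_utr", "te_cds", "te_noncoding"}
OTHER_REGIONS = {"upstream_exon", "upstream_other", "downstream"}


@dataclass
class ReadEnd:
    """Hypothetical per-read summary; field names are not tecap's."""
    region: str                          # where the 3' end lands (see sets above)
    te_coverage: float = 0.0             # fraction of the TE covered by the read
    dist_to_polya: Optional[int] = None  # bp to nearest annotated cluster, None if none


def near_polya(read: ReadEnd) -> bool:
    """True if the 3' end is within the polyA window of an annotated cluster."""
    return read.dist_to_polya is not None and abs(read.dist_to_polya) <= POLYA_WINDOW_BP


def classify(read: ReadEnd) -> str:
    """Assign exactly one bucket name to a read."""
    if read.region not in TE_REGIONS | OTHER_REGIONS:
        raise ValueError(f"unknown region: {read.region!r}")
    if read.region == "downstream":
        return "MechC"
    if read.region == "upstream_exon":
        return "MechB-APA" if near_polya(read) else "MechB-exon"
    if read.region == "upstream_other":  # intron or gene flank
        return "MechB-aspecific"
    # From here on the 3' end is inside the terminal exon.
    if read.te_coverage >= MIN_TE_COVERAGE:  # assumption: coverage wins first
        return "Captured"
    if read.region == "te_noncoding":
        return "MechA-noCDS"
    if read.region == "te_cds":
        return "IP-TE-CDS"
    return "MechA-correct" if near_polya(read) else "MechA-internalUTR"
```

For example, `classify(ReadEnd("te_utr", te_coverage=0.2, dist_to_polya=-10))` returns `"MechA-correct"`, while the same read with no cluster nearby falls into `"MechA-internalUTR"`.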

The basecomp subcommand also splits Captured / MechA / MechB-APA reads by whether their cluster carries a canonical AAUAAA-like hexamer (PAS+/-).

Run tecap explain to print these definitions on the terminal, or tecap explain --mechanism MechA-correct --format json for a single entry.

Reading the plots

  • {sample}_terminal_exon.png — three panels: bucket fractions, read-length density (Captured vs MechA-correct), and rates by 3' UTR length bin. Mispriming bias concentrates in the long-UTR bins.
  • {sample}_mecha_scatter.png — read length vs TE coverage for MechA-correct reads only; reads above the dashed coverage threshold get promoted to Captured.
  • {sample}_basecomp.png — eight panels, one per bucket, showing %A in the reference window downstream of cleavage. Grey band (30-50% A): moderate-A priming. Dashed line (>=60% A): classical A-tract priming. Mispriming buckets enriched in the grey band but not past the dashed line are characteristic of saturating-local-concentration oligo-dT chemistries (10x GEM droplets, BD Rhapsody capture beads); free oligo-dT at standard concentrations (bulk Iso-Seq) mis-primes preferentially past the dashed line on classical A-tracts.
  • comparison_*.png — same panels, multiple samples grouped on the same axes. Generated by tecap compare or tecap report (multi-sample mode).
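
To make the %A axis of the basecomp plot concrete, here is a minimal sketch of how a downstream-window %A value and its band could be computed. It is not tecap's code: the function names, the 0-based cleavage coordinate (taken as the last transcribed base), and the strand handling are assumptions. Only the 20 nt window, the 30-50% band, and the >=60% threshold come from this README.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def downstream_window(chrom_seq: str, cleavage_pos: int, strand: str, size: int = 20) -> str:
    """Reference bases just downstream of the cleavage site, in transcript orientation.

    Assumes cleavage_pos is the 0-based index of the last transcribed base.
    """
    if strand == "+":
        window = chrom_seq[cleavage_pos + 1 : cleavage_pos + 1 + size]
    elif strand == "-":
        start = max(0, cleavage_pos - size)
        window = reverse_complement(chrom_seq[start:cleavage_pos])
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return window.upper()


def percent_a(window: str) -> float:
    """Percentage of A bases in the window (0.0 for an empty window)."""
    return 100.0 * window.count("A") / len(window) if window else 0.0


def priming_class(pct_a: float) -> str:
    """Map %A onto the two bands drawn on the basecomp plot."""
    if pct_a >= 60.0:
        return "classical A-tract"  # past the dashed line
    if 30.0 <= pct_a <= 50.0:
        return "moderate-A"         # inside the grey band
    return "other"
```

Under these assumptions, a minus-strand site is handled by taking the bases at lower genomic coordinates and reverse-complementing them, so a genomic T-run upstream of a minus-strand cleavage site counts as an A-tract.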

Install

pip install git+https://github.com/FullLengthFanatic/tecap@v0.3.0

Development install:

git clone https://github.com/FullLengthFanatic/tecap
cd tecap
pip install -e .[dev]
pytest

Quick start

# Classify reads. References are auto-fetched on first run and cached
# under ~/.cache/tecap/GRCh38/.
tecap classify \
    --bam sample.bam \
    --genome GRCh38 \
    --gtf-version 45 \
    --sample S1 \
    --out-dir results/ \
    --threads 8 \
    --platform cdna-pacbio \
    --verbose

# Or pass references explicitly (no auto-download):
tecap classify \
    --bam sample.bam \
    --gtf gencode.v45.annotation.gtf.gz \
    --polya-sites atlas.clusters.3.0.GRCh38.GENCODE_42.bed.gz \
    --sample S1 --out-dir results/ --threads 8

# Measure base composition in the 20 nt window downstream of each cleavage site
tecap basecomp \
    --bam sample.bam \
    --genome GRCh38 \
    --gtf-version 45 \
    --fasta GRCh38.primary_assembly.genome.fa.gz \
    --sample S1 \
    --out-dir results/ \
    --threads 8 \
    --verbose

# Render a self-contained HTML report (per-sample)
tecap report \
    --classify-json results/S1_terminal_exon.json \
    --basecomp-json results/S1_basecomp.json \
    --out-html results/S1_report.html

# Cross-sample HTML report (space-separated paths)
tecap report \
    --classify-json results/A_terminal_exon.json results/B_terminal_exon.json \
    --basecomp-json results/A_basecomp.json results/B_basecomp.json \
    --out-html results/compare.html

# Print the mechanism glossary
tecap explain
tecap explain --mechanism MechA-correct --format json

# Cross-sample comparison plots only (no HTML)
tecap compare \
    --mode classify \
    --inputs results/A_terminal_exon.json,results/B_terminal_exon.json \
    --out-dir results/

# Fetch references explicitly (otherwise --genome handles this)
tecap download-atlas \
    --genome GRCh38 \
    --gtf-version 45 \
    --out-dir ref/
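
For more than a handful of samples, the commands above can be generated from Python. The sketch below only assembles and optionally runs the command lines shown in this section; `build_commands` and `run_all` are hypothetical helper names, and the flags and output file names are the ones used in the examples above.

```python
import shlex
import subprocess
from typing import Dict, List


def build_commands(bams: Dict[str, str], out_dir: str, fasta: str,
                   genome: str = "GRCh38", gtf_version: str = "45",
                   threads: int = 8) -> List[List[str]]:
    """Return classify, basecomp and report argv lists for each sample."""
    commands = []
    for sample, bam in bams.items():
        shared = ["--bam", bam, "--genome", genome, "--gtf-version", gtf_version,
                  "--sample", sample, "--out-dir", out_dir, "--threads", str(threads)]
        commands.append(["tecap", "classify", *shared])
        commands.append(["tecap", "basecomp", *shared, "--fasta", fasta])
        commands.append([
            "tecap", "report",
            "--classify-json", f"{out_dir}/{sample}_terminal_exon.json",
            "--basecomp-json", f"{out_dir}/{sample}_basecomp.json",
            "--out-html", f"{out_dir}/{sample}_report.html",
        ])
    return commands


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step
```

Calling `run_all(build_commands({"S1": "sample.bam"}, "results", "GRCh38.primary_assembly.genome.fa.gz"))` prints the three commands without running them; pass `dry_run=False` to execute.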

Outputs

Per sample (classify):

  • {sample}_terminal_exon.json — bucket counts, fractions, PAS split, UTR-length stratification, orientation sanity check, read-length medians.
  • {sample}_terminal_exon.png — 3-panel summary plot.
  • {sample}_mecha_scatter.png — read length vs TE coverage for MechA-correct reads.
  • {sample}_tecap_mqc.json — MultiQC custom-content table (auto-detected by the _mqc.json suffix).
  • {sample}_per_gene.tsv (optional, with --per-gene-table) — per-gene bucket counts.
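
As a rough illustration of what "bucket fractions" and "UTR-length stratification" summarize, the sketch below derives both from per-read labels. The bin edges, the input layout, and the choice of denominator (all classified reads in the bin) are assumptions made for the example, not tecap's actual definitions or JSON schema.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical 3' UTR length bins in bp; tecap's real bin edges may differ.
UTR_BIN_EDGES = [500, 1000, 2000]
UTR_BIN_LABELS = ["<500", "500-999", "1000-1999", ">=2000"]


def bucket_fractions(labels: Iterable[str]) -> Dict[str, float]:
    """Fraction of classified reads falling in each bucket."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()} if total else {}


def rate_by_utr_bin(reads: Iterable[Tuple[str, int]],
                    bucket: str = "MechA-correct") -> Dict[str, float]:
    """Per-bin rate of `bucket`; `reads` yields (bucket_label, utr_length_bp)."""
    hits, totals = Counter(), Counter()
    for label, utr_len in reads:
        bin_label = UTR_BIN_LABELS[bisect_right(UTR_BIN_EDGES, utr_len)]
        totals[bin_label] += 1
        hits[bin_label] += label == bucket
    return {b: hits[b] / totals[b] for b in UTR_BIN_LABELS if totals[b]}
```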

Per sample (basecomp):

  • {sample}_basecomp.json — %A histograms per bucket, medians, >=60% and 30-50% fractions.
  • {sample}_basecomp.png — 8-panel histogram grid.

Cross-sample:

  • comparison_terminal_exon.png — grouped bars across samples.
  • comparison_basecomp.png — per-bucket histogram overlays.

Example plots

Outputs of tecap compare on a four-sample run: 10x Kinnex (10x_FL_v02_full), BD Rhapsody Kinnex (BD46_FS_SEQ), PacBio Kinnex bulk cerebellum, and PacBio Kinnex bulk heart. All four are human GRCh38, sequenced as FL Kinnex / MAS-ISO / PacBio HiFi.

Terminal-exon bucket fractions and UTR-bin MechA-correct rates across the four samples.

%A in the 20 nt reference window downstream of each cleavage site, per bucket, for the four samples. Grey band: 30-50% A (moderate-A priming). Dashed line: 60% A (classical A-tract priming). MechB-aspecific shows the chemistry split: 10x and BD Rhapsody are enriched in the grey band, bulk Iso-Seq past the dashed line.

HTML report (tecap report):

  • Self-contained .html per sample (and per comparison) with embedded PNGs, executive summary tiles, mechanism legend, per-bucket tables, PAS split, and UTR-length stratification. Single file, no JS.

Citation

If you use tecap, please cite the GitHub release DOI (see CITATION.cff).

License

MIT