Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
113 changes: 81 additions & 32 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,25 +1,24 @@
<h1>
<picture>
<source media="(prefers-color-scheme: dark)" srcset="docs/images/nf-core-nottocode_logo_dark.png">
<img alt="nf-core/nottocode" src="docs/images/nf-core-nottocode_logo_light.png">
<source media="(prefers-color-scheme: dark)" srcset="docs/images/nf-core-not2code_logo_dark.png">
<img alt="nf-core/not2code" src="docs/images/nf-core-not2code_logo_light.png">
</picture>
</h1>

[![GitHub Actions CI Status](https://github.com/nf-core/nottocode/actions/workflows/ci.yml/badge.svg)](https://github.com/nf-core/nottocode/actions/workflows/ci.yml)
[![GitHub Actions Linting Status](https://github.com/nf-core/nottocode/actions/workflows/linting.yml/badge.svg)](https://github.com/nf-core/nottocode/actions/workflows/linting.yml)[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/nottocode/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/not2code/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/nf-core/nottocode)
[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/nf-core/not2code)

[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23nottocode-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/nottocode)[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core)[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core)[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core)
[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23not2code-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/not2code)[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core)[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core)[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core)

## Introduction

**nf-core/nottocode** is a bioinformatics designed to annotate long non-coding RNAs (lncRNAs) from transcriptome assemblies. It works by filtering, comparing, and characterizing transcripts from GTF files, which can be generated by tools such as stringtie via nf-core/rnaseq.
**not2code** is a bioinformatics pipeline designed to annotate long non-coding RNAs (lncRNAs) from transcriptome assemblies from small bulk RNA-seq data. It works by filtering, comparing, and characterizing transcripts from GTF files, which can be generated by tools such as StringTie via nf-core/rnaseq.

The pipeline performs the following key steps:

Expand Down Expand Up @@ -87,56 +86,106 @@ graph TD
> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.

The `not2code` pipeline requires assembled transcriptomes (GTF files) as input. A typical workflow involves running `nf-core/rnaseq` first to align reads and assemble transcripts, followed by `dalmolingroup/not2code` to identify lncRNAs.

First, prepare a samplesheet with your input data that looks as follows:
### 1. Run nf-core/rnaseq

`samplesheet.csv`:
First, run `nf-core/rnaseq` with an aligner like HISAT2. You should also ensure StringTie is properly parameterized to preserve transcripts that are candidates for lncRNAs (e.g. modifying minimum junction coverage, minimum read coverage, and minimum length).

Here is an example of creating a custom configuration for StringTie and running the `rnaseq` pipeline:

```bash
# Create custom StringTie configuration
echo "process {
withName: 'NFCORE_RNASEQ:RNASEQ:STRINGTIE_STRINGTIE' {
ext.args = {
[
'-j 3',
'-c 3',
'-m 200',
params.stringtie_extra_args ?: ''
].join(' ').trim()
}
}
}" > stringtie.config

# Run nf-core/rnaseq
nextflow run nf-core/rnaseq \
-profile docker \
-r 3.19.0 \
--input rnaseq_samplesheet.csv \
--outdir results_rnaseq \
--fasta /path/to/genome.fna \
--gtf /path/to/genome.gtf \
--trimmer fastp \
--aligner hisat2 \
--skip_pseudo_alignment \
-c stringtie.config
```

### 2. Prepare the input samplesheet

After the `rnaseq` pipeline finishes, you need to prepare a samplesheet for `not2code`. The samplesheet requires two columns: `sample` and `gtf`. You can create this automatically from the `rnaseq` StringTie outputs using the following bash script:

```bash
# Output samplesheet file
output="gtf_samplesheet.csv"

# Write header
echo "sample,gtf" > "$output"

# List .transcripts.gtf files and process each one
for path in results_rnaseq/hisat2/stringtie/*transcripts.gtf; do
sample=$(basename "$path" .transcripts.gtf)
echo "$sample,$(realpath $path)" >> "$output"
done
```

This will create a `gtf_samplesheet.csv` that looks like this:

```csv
sample,gtf
CONTROL_REP1,AEG588A1.gtf
CONTROL_REP1,/path/to/results_rnaseq/hisat2/stringtie/CONTROL_REP1.transcripts.gtf
```

Each row represents a gtf file.

Now, you can run the pipeline using:
### 3. Run nf-core/not2code

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
Now, you can run the `not2code` pipeline using the generated samplesheet:

```bash
nextflow run nf-core/nottocode \
--input /path/to/samplesheet.csv \
--reference_gtf /path/to/genome.gff \
nextflow run dalmolingroup/not2code \
-profile docker \
--input gtf_samplesheet.csv \
--reference_gff /path/to/genome.gff \
--reference_gtf /path/to/genome.gtf \
--reference_genome /path/to/genome.fna \
--outdir results \
--pfam_db /path/to/Pfam-A.hmm \
--outdir results_not2code
```


## Pipeline output

To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/nottocode/results) tab on the nf-core website pipeline page.
For more details about the output files and reports, please refer to the
[output documentation](https://nf-co.re/nottocode/output).
To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/not2code/results) tab on the nf-core website pipeline page.

## Credits
The pipeline will create a simplified output structure under the designated `--outdir` parameter (e.g. `results/`):

nf-core/nottocode was developed by @gleisonm and @rafaellaferraz.
* `lncRNA_candidates/`: Contains the final GTF files for identified lncRNAs.
* `coding_potential/`: Contains the evaluation of coding/non-coding capability from tools like CPC2, PLEK, and HMMER scans.
* `transcriptome_assembly/`: Contains intermediate merged transcriptomes, and GTF files from gffcompare and TPM filtering.
* `multiqc/`: General aggregate quality control report.
* `pipeline_info/`: Generic logs and reports relating to the execution of the Nextflow pipeline.

We thank all contributors to the nf-core community for support and feedback.

## Contributions and Support
For more details about the output files and reports, please refer to the
[output documentation](https://nf-co.re/not2code/output).

If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
## Credits

For further information or help, don't hesitate to get in touch on the [Slack `#nottocode` channel](https://nfcore.slack.com/channels/nottocode) (you can join with [this invite](https://nf-co.re/join/slack)).
not2code was developed by @gleisonm and @rafaellaferraz.

## Citations

<!-- TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. -->
<!-- If you use nf-core/nottocode for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->

<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
If you use dalmolingroup/not2code for your analysis, please cite it using its Digital Object Identifier (DOI).

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

Expand Down
74 changes: 74 additions & 0 deletions assets/generate_test_hmm.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
#!/usr/bin/env python3
"""
Script para gerar um HMM proteico válido para testes do pipeline not2code.
Requer hmmbuild (pacote HMMER) instalado no ambiente.

Uso:
micromamba activate nextflow
python3 assets/generate_test_hmm.py

Ou diretamente com singularity:
singularity exec \
/home/gmdazevedo/scratch/singularity_images/depot.galaxyproject.org-singularity-hmmer-3.4--hdbdd923_1.img \
hmmbuild --amino assets/test_protein.hmm /dev/stdin << 'EOF'
# STOCKHOLM 1.0
seq1 MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL
//
EOF
"""

import subprocess
import sys
import os
import textwrap

# Sequence of MSTRG.5.1.p1 from TransDecoder (generated from the test GTF on chr22 intergenic region)
# We use this so HMMER successfully finds a match during the test, ensuring domtblout is not empty!
TEST_SEQUENCE = """\
SLRLLGREHPINYPGPIYARGLSHSCSTALLSHWHRELLGAPGFQIGSRESPLFSDPQTLLWILIVLSP\
ISLLGTHWEVSRIPVRDSTTISLILTLTFFLFLPS
""".replace('\n', '')

STOCKHOLM_CONTENT = f"""# STOCKHOLM 1.0
#=GF ID test_protein_hmm
#=GF AC TEST0001.1
#=GF DE Test protein HMM for not2code pipeline - amino acid profile
test_seq {TEST_SEQUENCE}
//
"""

def main():
# Caminho para o arquivo de saída
script_dir = os.path.dirname(os.path.abspath(__file__))
output_hmm = os.path.join(script_dir, "test_protein.hmm")
sto_file = os.path.join(script_dir, "test_protein.sto")

print(f"Gerando HMM proteico de teste em: {output_hmm}")

# Escrever o alinhamento Stockholm
with open(sto_file, 'w') as f:
f.write(STOCKHOLM_CONTENT)
print(f"Alinhamento Stockholm criado: {sto_file}")

print("\nFile sto created. Please run hmmbuild manually:")
print(f"hmmbuild --amino {output_hmm} {sto_file}")

# Verificar o arquivo gerado
if os.path.exists(output_hmm):
size = os.path.getsize(output_hmm)
print(f"Tamanho do arquivo: {size} bytes")

# Mostrar o cabeçalho do HMM
with open(output_hmm) as f:
lines = f.readlines()[:8]
print("Primeiras linhas do HMM:")
for line in lines:
print(f" {line}", end='')

# Limpar arquivo temporário
os.remove(sto_file)
print("\nConcluído! Use este arquivo em test.config como pfam_db.")
print(f'pfam_db = "{output_hmm}"')

if __name__ == "__main__":
main()
41 changes: 24 additions & 17 deletions assets/lncselect.R
Original file line number Diff line number Diff line change
Expand Up @@ -47,12 +47,17 @@ pfam <- readr::read_table(pfam_domtblout, comment = "#", col_names = FALSE)
# "acc", "description"
# )


transcripts_with_domain <- as.data.frame(pfam) %>%
dplyr::filter(X7 <= pfam_evalue) %>%
mutate(transcript_id = str_remove(X1, "\\.p[0-9]+$")) %>% # Remove .p1, .p2 etc
distinct(transcript_id) %>%
mutate(domain_present = "yes")
# Handle empty pfam file (no Pfam domain hits - common with test data)
if (nrow(pfam) == 0 || !("X7" %in% colnames(pfam))) {
transcripts_with_domain <- data.frame(transcript_id = character(0),
domain_present = character(0))
} else {
transcripts_with_domain <- as.data.frame(pfam) %>%
dplyr::filter(X7 <= pfam_evalue) %>%
mutate(transcript_id = str_remove(X1, "\\.p[0-9]+$")) %>% # Remove .p1, .p2 etc
distinct(transcript_id) %>%
mutate(domain_present = "yes")
}

gtf_complete <- merge(dup_reference_annot, transcripts_with_domain, by = "transcript_id", all.x = T)
gtf_complete <- merge(gtf_complete, PLEK[,c("transcript_id", "PLEK")], by = "transcript_id", all.x = T)
Expand All @@ -70,8 +75,8 @@ gtf_lncRNA <- gtf_complete %>% filter(transcript_id %in% lncRNA) %>%
mutate(transcript_biotype = ifelse(type == "transcript", "lncRNA", ""))


rtracklayer::export(gtf_lncRNA, "gtf_lncRNA.gtf", format = gtf)
rtracklayer::export(gtf_complete, "gtf_complete.gtf", format = gtf)
rtracklayer::export(gtf_lncRNA, "gtf_lncRNA.gtf", format = "GTF")
rtracklayer::export(gtf_complete, "gtf_complete.gtf", format = "GTF")

rm(dup_reference_annot, gtf_complete, lncRNA)

Expand Down Expand Up @@ -122,7 +127,7 @@ combined_stringtie_ncbi <- dplyr::bind_rows(reference_filter, gtf_lncRNA)
sum(gtf_lncRNA$transcript_id %in% reference$transcript_id) #0 is expected

# Exportar como novo GTF
rtracklayer::export(combined_stringtie_ncbi, "combined_stringtie_ncbi.gtf", format = gtf)
rtracklayer::export(combined_stringtie_ncbi, "combined_stringtie_ncbi.gtf", format = "GTF")

#rm(combined_stringtie_ncbi, gtf_lncRNA, reference, reference_filter)

Expand Down Expand Up @@ -154,18 +159,20 @@ combined_stringtie_ncbi_new_names <- combined_stringtie_ncbi_new_names %>%
dplyr::rename(gene_id_stringtie = gene_id) %>%
dplyr::rename(transcript_id = transcript_id_new) %>%
dplyr::rename(gene_id = gene_id_new) %>%
dplyr::select(seqnames,start, end,width,strand,
source, type,score,phase,gene_id,
transcript_id,db_xref,gbkey,locus_tag,partial,
orig_protein_id,orig_transcript_id,product, transcript_biotype,exon_number,
pseudo,PLEK,transcript_length,peptide_length,CPC2,
xloc,class_code,tss_id,gene_name,cmp_ref,
transcript_id_stringtie,gene_id_stringtie)
dplyr::select(any_of(c(seqnames="seqnames", start="start", end="end", width="width", strand="strand",
source="source", type="type", score="score", phase="phase", gene_id="gene_id",
transcript_id="transcript_id", db_xref="db_xref", gbkey="gbkey", locus_tag="locus_tag", partial="partial",
orig_protein_id="orig_protein_id", orig_transcript_id="orig_transcript_id", product="product",
transcript_biotype="transcript_biotype", exon_number="exon_number",
pseudo="pseudo", PLEK="PLEK", transcript_length="transcript_length",
peptide_length="peptide_length", CPC2="CPC2",
xloc="xloc", class_code="class_code", tss_id="tss_id", gene_name="gene_name", cmp_ref="cmp_ref",
transcript_id_stringtie="transcript_id_stringtie", gene_id_stringtie="gene_id_stringtie")))

combined_stringtie_ncbi_new_names %>% dplyr::select(gene_id,gene_id_stringtie,
transcript_id, transcript_id_stringtie, class_code)

rtracklayer::export(combined_stringtie_ncbi_new_names, "combined_stringtie_ncbi_new_names.gtf", format = gtf)
rtracklayer::export(combined_stringtie_ncbi_new_names, "combined_stringtie_ncbi_new_names.gtf", format = "GTF")

cat('"', Sys.getenv("TASK_PROCESS", "SELECT_LNCRNAS"), '":\n', file="versions.yml", sep="")
cat(' r-base: "', R.version.string, '"\n', file="versions.yml", append=TRUE, sep="")
Expand Down
Loading
Loading