dalmolingroup · gleisonm · Apr 23, 2026 · Apr 14, 2026 · Apr 15, 2026 · Apr 15, 2026
diff --git a/README.md b/README.md
@@ -1,25 +1,24 @@
 <h1>
   <picture>
-    <source media="(prefers-color-scheme: dark)" srcset="docs/images/nf-core-nottocode_logo_dark.png">
-    <img alt="nf-core/nottocode" src="docs/images/nf-core-nottocode_logo_light.png">
+    <source media="(prefers-color-scheme: dark)" srcset="docs/images/nf-core-not2code_logo_dark.png">
+    <img alt="nf-core/not2code" src="docs/images/nf-core-not2code_logo_light.png">
   </picture>
 </h1>
 
-[![GitHub Actions CI Status](https://github.com/nf-core/nottocode/actions/workflows/ci.yml/badge.svg)](https://github.com/nf-core/nottocode/actions/workflows/ci.yml)
-[![GitHub Actions Linting Status](https://github.com/nf-core/nottocode/actions/workflows/linting.yml/badge.svg)](https://github.com/nf-core/nottocode/actions/workflows/linting.yml)[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/nottocode/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
+[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/not2code/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
 [![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)
 
 [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/)
 [![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
 [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
 [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
-[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/nf-core/nottocode)
+[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/nf-core/not2code)
 
-[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23nottocode-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/nottocode)[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core)[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core)[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core)
+[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23not2code-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/not2code)[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core)[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core)[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core)
 
 ## Introduction
 
-**nf-core/nottocode** is a bioinformatics designed to annotate long non-coding RNAs (lncRNAs) from transcriptome assemblies. It works by filtering, comparing, and characterizing transcripts from GTF files, which can be generated by tools such as stringtie via nf-core/rnaseq.
+**not2code** is a bioinformatics pipeline designed to annotate long non-coding RNAs (lncRNAs) from transcriptome assemblies from small bulk RNA-seq data. It works by filtering, comparing, and characterizing transcripts from GTF files, which can be generated by tools such as StringTie via nf-core/rnaseq.
 
 The pipeline performs the following key steps:
 
@@ -87,56 +86,106 @@ graph TD
 > [!NOTE]
 > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
 
+The `not2code` pipeline requires assembled transcriptomes (GTF files) as input. A typical workflow involves running `nf-core/rnaseq` first to align reads and assemble transcripts, followed by `dalmolingroup/not2code` to identify lncRNAs.
 
-First, prepare a samplesheet with your input data that looks as follows:
+### 1. Run nf-core/rnaseq
 
-`samplesheet.csv`:
+First, run `nf-core/rnaseq` with an aligner like HISAT2. You should also ensure StringTie is properly parameterized to preserve transcripts that are candidates for lncRNAs (e.g. modifying minimum junction coverage, minimum read coverage, and minimum length). 
+
+Here is an example of creating a custom configuration for StringTie and running the `rnaseq` pipeline:
+
+```bash
+# Create custom StringTie configuration
+echo "process {
+    withName: 'NFCORE_RNASEQ:RNASEQ:STRINGTIE_STRINGTIE' {
+        ext.args = { 
+            [
+                '-j 3',
+                '-c 3',
+                '-m 200',
+                params.stringtie_extra_args ?: ''
+            ].join(' ').trim() 
+        }
+    }
+}" > stringtie.config
+
+# Run nf-core/rnaseq
+nextflow run nf-core/rnaseq \
+  -profile docker \
+  -r 3.19.0 \
+  --input rnaseq_samplesheet.csv \
+  --outdir results_rnaseq \
+  --fasta /path/to/genome.fna \
+  --gtf /path/to/genome.gtf \
+  --trimmer fastp \
+  --aligner hisat2 \
+  --skip_pseudo_alignment \
+  -c stringtie.config
+```
+
+### 2. Prepare the input samplesheet
+
+After the `rnaseq` pipeline finishes, you need to prepare a samplesheet for `not2code`. The samplesheet requires two columns: `sample` and `gtf`. You can create this automatically from the `rnaseq` StringTie outputs using the following bash script:
+
+```bash
+# Output samplesheet file
+output="gtf_samplesheet.csv"
+
+# Write header
+echo "sample,gtf" > "$output"
+
+# List .transcripts.gtf files and process each one
+for path in results_rnaseq/hisat2/stringtie/*transcripts.gtf; do
+    sample=$(basename "$path" .transcripts.gtf)
+    echo "$sample,$(realpath $path)" >> "$output"
+done
+```
+
+This will create a `gtf_samplesheet.csv` that looks like this:
 
 ```csv
 sample,gtf
-CONTROL_REP1,AEG588A1.gtf
+CONTROL_REP1,/path/to/results_rnaseq/hisat2/stringtie/CONTROL_REP1.transcripts.gtf
 ```
 
-Each row represents a gtf file.
-
-Now, you can run the pipeline using:
+### 3. Run nf-core/not2code
 
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
+Now, you can run the `not2code` pipeline using the generated samplesheet:
 
 ```bash
-nextflow run nf-core/nottocode \
-  --input /path/to/samplesheet.csv \
-  --reference_gtf /path/to/genome.gff \
+nextflow run dalmolingroup/not2code \
+  -profile docker \
+  --input gtf_samplesheet.csv \
+  --reference_gff /path/to/genome.gff \
+  --reference_gtf /path/to/genome.gtf \
   --reference_genome /path/to/genome.fna \
-  --outdir results \
   --pfam_db /path/to/Pfam-A.hmm \
+  --outdir results_not2code
 ```
 
 
 ## Pipeline output
 
-To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/nottocode/results) tab on the nf-core website pipeline page.
-For more details about the output files and reports, please refer to the
-[output documentation](https://nf-co.re/nottocode/output).
+To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/not2code/results) tab on the nf-core website pipeline page.
 
-## Credits
+The pipeline will create a simplified output structure under the designated `--outdir` parameter (e.g. `results/`):
 
-nf-core/nottocode was developed by @gleisonm and @rafaellaferraz.
+* `lncRNA_candidates/`: Contains the final GTF files for identified lncRNAs.
+* `coding_potential/`: Contains the evaluation of coding/non-coding capability from tools like CPC2, PLEK, and HMMER scans.
+* `transcriptome_assembly/`: Contains intermediate merged transcriptomes, and GTF files from gffcompare and TPM filtering.
+* `multiqc/`: General aggregate quality control report.
+* `pipeline_info/`: Generic logs and reports relating to the execution of the Nextflow pipeline.
 
-We thank all contributors to the nf-core community for support and feedback.
-
-## Contributions and Support
+For more details about the output files and reports, please refer to the
+[output documentation](https://nf-co.re/not2code/output).
 
-If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
+## Credits
 
-For further information or help, don't hesitate to get in touch on the [Slack `#nottocode` channel](https://nfcore.slack.com/channels/nottocode) (you can join with [this invite](https://nf-co.re/join/slack)).
+not2code was developed by @gleisonm and @rafaellaferraz.
 
 ## Citations
 
-<!-- TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. -->
-<!-- If you use nf-core/nottocode for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->
-
-<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
+If you use dalmolingroup/not2code for your analysis, please cite it using its Digital Object Identifier (DOI).
 
 An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
 

diff --git a/assets/generate_test_hmm.py b/assets/generate_test_hmm.py
@@ -0,0 +1,74 @@
+#!/usr/bin/env python3
+"""
+Script para gerar um HMM proteico válido para testes do pipeline not2code.
+Requer hmmbuild (pacote HMMER) instalado no ambiente.
+
+Uso:
+    micromamba activate nextflow
+    python3 assets/generate_test_hmm.py
+
+Ou diretamente com singularity:
+    singularity exec \
+        /home/gmdazevedo/scratch/singularity_images/depot.galaxyproject.org-singularity-hmmer-3.4--hdbdd923_1.img \
+        hmmbuild --amino assets/test_protein.hmm /dev/stdin << 'EOF'
+# STOCKHOLM 1.0
+seq1  MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL
+//
+EOF
+"""
+
+import subprocess
+import sys
+import os
+import textwrap
+
+# Sequence of MSTRG.5.1.p1 from TransDecoder (generated from the test GTF on chr22 intergenic region)
+# We use this so HMMER successfully finds a match during the test, ensuring domtblout is not empty!
+TEST_SEQUENCE = """\
+SLRLLGREHPINYPGPIYARGLSHSCSTALLSHWHRELLGAPGFQIGSRESPLFSDPQTLLWILIVLSP\
+ISLLGTHWEVSRIPVRDSTTISLILTLTFFLFLPS
+""".replace('\n', '')
+
+STOCKHOLM_CONTENT = f"""# STOCKHOLM 1.0
+#=GF ID   test_protein_hmm
+#=GF AC   TEST0001.1
+#=GF DE   Test protein HMM for not2code pipeline - amino acid profile
+test_seq  {TEST_SEQUENCE}
+//
+"""
+
+def main():
+    # Caminho para o arquivo de saída
+    script_dir = os.path.dirname(os.path.abspath(__file__))
+    output_hmm = os.path.join(script_dir, "test_protein.hmm")
+    sto_file = os.path.join(script_dir, "test_protein.sto")
+
+    print(f"Gerando HMM proteico de teste em: {output_hmm}")
+
+    # Escrever o alinhamento Stockholm
+    with open(sto_file, 'w') as f:
+        f.write(STOCKHOLM_CONTENT)
+    print(f"Alinhamento Stockholm criado: {sto_file}")
+
+    print("\nFile sto created. Please run hmmbuild manually:")
+    print(f"hmmbuild --amino {output_hmm} {sto_file}")
+
+    # Verificar o arquivo gerado
+    if os.path.exists(output_hmm):
+        size = os.path.getsize(output_hmm)
+        print(f"Tamanho do arquivo: {size} bytes")
+
+        # Mostrar o cabeçalho do HMM
+        with open(output_hmm) as f:
+            lines = f.readlines()[:8]
+            print("Primeiras linhas do HMM:")
+            for line in lines:
+                print(f"  {line}", end='')
+
+    # Limpar arquivo temporário
+    os.remove(sto_file)
+    print("\nConcluído! Use este arquivo em test.config como pfam_db.")
+    print(f'pfam_db = "{output_hmm}"')
+
+if __name__ == "__main__":
+    main()
diff --git a/assets/lncselect.R b/assets/lncselect.R
@@ -47,12 +47,17 @@ pfam <- readr::read_table(pfam_domtblout, comment = "#", col_names = FALSE)
 #  "acc", "description"
 # )
 
-
-transcripts_with_domain <- as.data.frame(pfam) %>%
-  dplyr::filter(X7 <= pfam_evalue) %>%
-  mutate(transcript_id = str_remove(X1, "\\.p[0-9]+$")) %>% # Remove .p1, .p2 etc
-  distinct(transcript_id) %>% 
-  mutate(domain_present = "yes")
+# Handle empty pfam file (no Pfam domain hits - common with test data)
+if (nrow(pfam) == 0 || !("X7" %in% colnames(pfam))) {
+  transcripts_with_domain <- data.frame(transcript_id = character(0),
+                                        domain_present = character(0))
+} else {
+  transcripts_with_domain <- as.data.frame(pfam) %>%
+    dplyr::filter(X7 <= pfam_evalue) %>%
+    mutate(transcript_id = str_remove(X1, "\\.p[0-9]+$")) %>% # Remove .p1, .p2 etc
+    distinct(transcript_id) %>%
+    mutate(domain_present = "yes")
+}
 
 gtf_complete <- merge(dup_reference_annot, transcripts_with_domain, by = "transcript_id", all.x = T)
 gtf_complete <- merge(gtf_complete, PLEK[,c("transcript_id", "PLEK")], by = "transcript_id", all.x = T)
@@ -70,8 +75,8 @@ gtf_lncRNA <- gtf_complete %>% filter(transcript_id %in% lncRNA) %>%
   mutate(transcript_biotype = ifelse(type == "transcript", "lncRNA", ""))
 
 
-rtracklayer::export(gtf_lncRNA, "gtf_lncRNA.gtf", format = gtf)
-rtracklayer::export(gtf_complete, "gtf_complete.gtf", format = gtf)
+rtracklayer::export(gtf_lncRNA, "gtf_lncRNA.gtf", format = "GTF")
+rtracklayer::export(gtf_complete, "gtf_complete.gtf", format = "GTF")
 
 rm(dup_reference_annot, gtf_complete, lncRNA)
 
@@ -122,7 +127,7 @@ combined_stringtie_ncbi <- dplyr::bind_rows(reference_filter, gtf_lncRNA)
 sum(gtf_lncRNA$transcript_id %in% reference$transcript_id) #0 is expected
 
 # Exportar como novo GTF
-rtracklayer::export(combined_stringtie_ncbi, "combined_stringtie_ncbi.gtf", format = gtf)
+rtracklayer::export(combined_stringtie_ncbi, "combined_stringtie_ncbi.gtf", format = "GTF")
 
 #rm(combined_stringtie_ncbi, gtf_lncRNA, reference, reference_filter)
 
@@ -154,18 +159,20 @@ combined_stringtie_ncbi_new_names <- combined_stringtie_ncbi_new_names %>%
   dplyr::rename(gene_id_stringtie = gene_id) %>% 
   dplyr::rename(transcript_id = transcript_id_new) %>%
   dplyr::rename(gene_id = gene_id_new) %>% 
-  dplyr::select(seqnames,start, end,width,strand,
-                source, type,score,phase,gene_id,
-                transcript_id,db_xref,gbkey,locus_tag,partial,
-                orig_protein_id,orig_transcript_id,product, transcript_biotype,exon_number,
-                pseudo,PLEK,transcript_length,peptide_length,CPC2,
-                xloc,class_code,tss_id,gene_name,cmp_ref,
-                transcript_id_stringtie,gene_id_stringtie)
+  dplyr::select(any_of(c(seqnames="seqnames", start="start", end="end", width="width", strand="strand",
+                source="source", type="type", score="score", phase="phase", gene_id="gene_id",
+                transcript_id="transcript_id", db_xref="db_xref", gbkey="gbkey", locus_tag="locus_tag", partial="partial",
+                orig_protein_id="orig_protein_id", orig_transcript_id="orig_transcript_id", product="product",
+                transcript_biotype="transcript_biotype", exon_number="exon_number",
+                pseudo="pseudo", PLEK="PLEK", transcript_length="transcript_length",
+                peptide_length="peptide_length", CPC2="CPC2",
+                xloc="xloc", class_code="class_code", tss_id="tss_id", gene_name="gene_name", cmp_ref="cmp_ref",
+                transcript_id_stringtie="transcript_id_stringtie", gene_id_stringtie="gene_id_stringtie")))
 
 combined_stringtie_ncbi_new_names %>% dplyr::select(gene_id,gene_id_stringtie,
                                                     transcript_id, transcript_id_stringtie, class_code)
 
-rtracklayer::export(combined_stringtie_ncbi_new_names, "combined_stringtie_ncbi_new_names.gtf", format = gtf)
+rtracklayer::export(combined_stringtie_ncbi_new_names, "combined_stringtie_ncbi_new_names.gtf", format = "GTF")
 
     cat('"', Sys.getenv("TASK_PROCESS", "SELECT_LNCRNAS"), '":\n', file="versions.yml", sep="")
     cat('    r-base: "', R.version.string, '"\n', file="versions.yml", append=TRUE, sep="")