Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion build.gradle.kts
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ plugins {
}

group = "net.maizegenetics"
version = "0.2.9"
version = "0.2.10"

repositories {
mavenCentral()
Expand Down
86 changes: 62 additions & 24 deletions docs/commands.md
Original file line number Diff line number Diff line change
Expand Up @@ -110,7 +110,11 @@ seq_sim setup-environment -w my_workdir

## align-assemblies (Step 01)

Aligns multiple query assemblies to a reference genome using AnchorWave and minimap2.
Aligns multiple query assemblies to a reference genome via the PHGv2
[`align-assemblies`](https://phg.maizegenetics.net/build_and_load/#align-assemblies-parameters)
command, which itself drives AnchorWave + minimap2 under the hood. This wrapper
keeps seq_sim's CLI surface (`--ref-gff`, `--ref-fasta`, `--query-fasta`, ...)
and the `maf_file_paths.txt` output contract that downstream steps depend on.

**Usage:**
```bash
Expand All @@ -119,31 +123,47 @@ seq_sim align-assemblies [OPTIONS]

**Options:**
- `--work-dir`, `-w`: Working directory (default: `seq_sim_work`)
- `--ref-gff`, `-g`: Reference GFF file (required)
- `--ref-fasta`, `-r`: Reference FASTA file (required)
- `--query-fasta`, `-q`: Query input (required) - can be a single FASTA (`.fa`, `.fasta`, `.fna`), a directory of FASTAs, or a text file listing one path per line
- `--threads`, `-t`: Number of threads to use (default: 1)
- `--ref-gff`, `-g`: Reference GFF file (required, forwarded as PHGv2 `--gff`)
- `--ref-fasta`, `-r`: Reference FASTA file (required, forwarded as PHGv2 `--reference-file`). For best results this should be the output of `phg prepare-assemblies`.
- `--query-fasta`, `-q`: Query input (required) - can be a single FASTA (`.fa`, `.fasta`, `.fna`), a directory of FASTAs, or a text file listing one path per line. Translated to a PHGv2 `--assembly-file-list` internally.
- `--threads`, `-t`: Total number of threads available to PHGv2 (`--total-threads`, default: 1)
- `--in-parallel`: How many alignments to run in parallel (PHGv2 `--in-parallel`). If omitted, PHGv2 picks a value from system memory and thread count.
- `--ref-max-align-cov`: Maximum reference genome alignment coverage for AnchorWave `proali` (PHGv2 `--ref-max-align-cov`, default: 1)
- `--query-max-align-cov`: Maximum query genome alignment coverage for AnchorWave `proali` (PHGv2 `--query-max-align-cov`, default: 1)
- `--conda-env-prefix`: Path to a Conda env containing PHGv2's runtime deps (anchorwave, minimap2, samtools, ...). Defaults to the `phgv2-conda` env in its standard location.
- `--just-ref-prep`: Only run PHGv2's reference-prep phase and stop. Useful for SLURM array workflows; no per-query MAFs and no `maf_file_paths.txt` are produced.
- `--output-dir`, `-o`: Custom output directory (default: `<work-dir>/output/01_anchorwave_results`)

**What it does:**
1. Extracts CDS sequences from reference GFF using `anchorwave gff2seq`
2. Aligns reference to CDS with `minimap2` (once for all queries)
3. For each query, runs `minimap2` and `anchorwave proali` to produce alignments
4. Generates `maf_file_paths.txt` listing all produced MAF files
1. Collects the query FASTA list from `--query-fasta` and writes it as
`<output-dir>/assemblies_list.txt` (the PHGv2 `--assembly-file-list`).
2. Invokes `phg align-assemblies` from `<work-dir>/src/phg_v2/bin/phg`. PHGv2
then runs `anchorwave gff2seq`, `minimap2`, and `anchorwave proali`
internally.
3. Collects the resulting `.maf` files PHGv2 wrote to the output directory and
produces `maf_file_paths.txt` so downstream steps (`maf-to-gvcf`,
`create-chain-files`) continue to work unchanged.

**Output:**
- `<work-dir>/output/01_anchorwave_results/{refBase}_cds.fa`
- `<work-dir>/output/01_anchorwave_results/{refBase}.sam`
- `<work-dir>/output/01_anchorwave_results/{queryName}/` containing `{queryName}.sam`, `*.anchors`, `*.maf`, `*.f.maf`
- `<work-dir>/output/01_anchorwave_results/assemblies_list.txt` (PHGv2 assembly-file-list, generated by this wrapper)
- `<work-dir>/output/01_anchorwave_results/{queryName}.maf` (per-query alignment, one file each)
- `<work-dir>/output/01_anchorwave_results/{queryName}.sam`
- `<work-dir>/output/01_anchorwave_results/{queryName}_{refBase}.anchorspro`
- `<work-dir>/output/01_anchorwave_results/{queryName}.svg` (dot plot)
- `<work-dir>/output/01_anchorwave_results/ref.cds.fasta`, `{refBase}.sam` (reference-prep outputs)
- `<work-dir>/output/01_anchorwave_results/maf_file_paths.txt`
- `<work-dir>/logs/01_align_assemblies.log`

**Examples:**
```bash
# Directory of queries
# Directory of queries, 8 threads
seq_sim align-assemblies -g ref.gff -r ref.fa -q queries/ -t 8

# Text list of query paths
seq_sim align-assemblies -g ref.gff -r ref.fa -q queries.txt -t 4
# Text list of query paths, 4 threads, run 2 alignments in parallel
seq_sim align-assemblies -g ref.gff -r ref.fa -q queries.txt -t 4 --in-parallel 2

# Reference-prep only (for SLURM array workflows)
seq_sim align-assemblies -g ref.gff -r ref.fa -q queries.txt --just-ref-prep
```

---
Expand Down Expand Up @@ -409,7 +429,12 @@ seq_sim format-recombined-fastas \
## align-mutated-assemblies (Step 10)

Realigns the formatted recombined (or otherwise mutated) FASTA files back to
the reference genome. This is the first step of the PS4G creation workflow.
the reference genome via the PHGv2
[`align-assemblies`](https://phg.maizegenetics.net/build_and_load/#align-assemblies-parameters)
command, which itself drives AnchorWave + minimap2 under the hood. This is the
first step of the PS4G creation workflow. The wrapper keeps seq_sim's existing
CLI surface and the `maf_file_paths.txt` output contract that step 11
(`mutated-maf-to-gvcf`) depends on.

**Usage:**
```bash
Expand All @@ -418,23 +443,36 @@ seq_sim align-mutated-assemblies [OPTIONS]

**Options:**
- `--work-dir`, `-w`: Working directory (default: `seq_sim_work`)
- `--ref-gff`, `-g`: Reference GFF file (required)
- `--ref-fasta`, `-r`: Reference FASTA file (required)
- `--fasta-input`, `-f`: FASTA input (required) - single file, directory, or text list
- `--threads`, `-t`: Number of threads to use (default: 1)
- `--output-dir`, `-o`: Custom output directory (default: `work_dir/output/10_mutated_alignment_results`)
- `--ref-gff`, `-g`: Reference GFF file (required, forwarded as PHGv2 `--gff`)
- `--ref-fasta`, `-r`: Reference FASTA file (required, forwarded as PHGv2 `--reference-file`). For best results this should be the output of `phg prepare-assemblies`.
- `--fasta-input`, `-f`: FASTA input (required) - single file, directory, or text list. Translated to a PHGv2 `--assembly-file-list` internally.
- `--threads`, `-t`: Total number of threads available to PHGv2 (`--total-threads`, default: 1)
- `--in-parallel`: How many alignments to run in parallel (PHGv2 `--in-parallel`). If omitted, PHGv2 picks a value from system memory and thread count.
- `--ref-max-align-cov`: Maximum reference genome alignment coverage for AnchorWave `proali` (PHGv2 `--ref-max-align-cov`, default: 1)
- `--query-max-align-cov`: Maximum query genome alignment coverage for AnchorWave `proali` (PHGv2 `--query-max-align-cov`, default: 1)
- `--conda-env-prefix`: Path to a Conda env containing PHGv2's runtime deps. Defaults to the `phgv2-conda` env in its standard location.
- `--just-ref-prep`: Only run PHGv2's reference-prep phase and stop. No per-query MAFs and no `maf_file_paths.txt` are produced.
- `--output-dir`, `-o`: Custom output directory (default: `<work-dir>/output/10_mutated_alignment_results`)

**Output:**
- `<work-dir>/output/10_mutated_alignment_results/{refBase}_cds.fa`
- `<work-dir>/output/10_mutated_alignment_results/{refBase}.sam`
- `<work-dir>/output/10_mutated_alignment_results/{fastaName}/` containing alignments
- `<work-dir>/output/10_mutated_alignment_results/assemblies_list.txt` (PHGv2 assembly-file-list, generated by this wrapper)
- `<work-dir>/output/10_mutated_alignment_results/{fastaName}.maf` (per-FASTA alignment, one file each)
- `<work-dir>/output/10_mutated_alignment_results/{fastaName}.sam`
- `<work-dir>/output/10_mutated_alignment_results/{fastaName}_{refBase}.anchorspro`
- `<work-dir>/output/10_mutated_alignment_results/{fastaName}.svg` (dot plot)
- `<work-dir>/output/10_mutated_alignment_results/ref.cds.fasta`, `{refBase}.sam` (reference-prep outputs)
- `<work-dir>/output/10_mutated_alignment_results/maf_file_paths.txt`
- `<work-dir>/logs/10_align_mutated_assemblies.log`

**Example:**
```bash
seq_sim align-mutated-assemblies \
-g ref.gff -r ref.fa -f seq_sim_work/output/09_formatted_fastas/ -t 8

# Run 2 alignments in parallel with 4 total threads
seq_sim align-mutated-assemblies \
-g ref.gff -r ref.fa -f seq_sim_work/output/09_formatted_fastas/ \
-t 4 --in-parallel 2
```

---
Expand Down
14 changes: 12 additions & 2 deletions pipeline_config.example.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -74,7 +74,12 @@ align_assemblies:
ref_gff: "path/to/reference.gff" # Required: Reference GFF annotation file
ref_fasta: "path/to/reference.fa" # Required: Reference FASTA file
query_fasta: "path/to/queries.txt" # Required: Single query file, directory, or text list
threads: 1 # Optional: Number of threads (default: 1)
threads: 1 # Optional: PHGv2 --total-threads (default: 1)
# in_parallel: 2 # Optional: PHGv2 --in-parallel (omit for auto-tuning)
# ref_max_align_cov: 1 # Optional: PHGv2 --ref-max-align-cov (proali -R, default 1)
# query_max_align_cov: 1 # Optional: PHGv2 --query-max-align-cov (proali -Q, default 1)
# conda_env_prefix: "/path/to/env" # Optional: PHGv2 --conda-env-prefix (overrides phgv2-conda)
# just_ref_prep: false # Optional: PHGv2 --just-ref-prep (ref-prep only, no MAFs)
# output: "/custom/output/path/" # Optional: Custom output directory

# Step 2: MAF to GVCF Conversion
Expand Down Expand Up @@ -191,7 +196,12 @@ align_mutated_assemblies:
# Uses align_assemblies.ref_fasta if not specified
# fasta_input: "/custom/fasta/input/" # Optional: Query FASTA input (file, directory, or text list)
# Uses format_recombined_fastas output if not specified
threads: 1 # Optional: Number of threads (default: 1)
threads: 1 # Optional: PHGv2 --total-threads (default: 1)
# in_parallel: 2 # Optional: PHGv2 --in-parallel (omit for auto-tuning)
# ref_max_align_cov: 1 # Optional: PHGv2 --ref-max-align-cov (proali -R, default 1)
# query_max_align_cov: 1 # Optional: PHGv2 --query-max-align-cov (proali -Q, default 1)
# conda_env_prefix: "/path/to/env" # Optional: PHGv2 --conda-env-prefix (overrides phgv2-conda)
# just_ref_prep: false # Optional: PHGv2 --just-ref-prep (ref-prep only, no MAFs)
# output: "/custom/alignment/output/" # Optional: Custom output directory

# Step 11: Mutated MAF to GVCF Conversion
Expand Down
Loading
Loading