Use this new version → https://nf-co.re/pairgenomealign/
In 2024 we re-implemented this pipeline as a proper nf-core pipeline. Use it unless you need old releases for reproducibility.
This pipeline uses addParams
to set module parameters, and it was deprecated
in Nextflow version 24.10.0
. Use an older version (for instance 24.04.4
)
if you want to suppress the deprecation warnings. Be warned that further
versions of Nextflow might fail to work with this pipeline.
We developed this pipeline was used to study genome scrambling in the marine zooplankton Oikopleura dioica.
Extreme genome scrambling in marine planktonic Oikopleura dioica cryptic species. Charles Plessy, Michael J. Mansfield, Aleksandra Bliznina, Aki Masunaga, Charlotte West, Yongkai Tan, Andrew W. Liu, Jan Grašič, María Sara del Río Pisula, Gaspar Sánchez-Serna, Marc Fabrega-Torrus, Alfonso Ferrández-Roldán, Vittoria Roncalli, Pavla Navratilova, Eric M. Thompson, Takeshi Onuma, Hiroki Nishida, Cristian Cañestro, Nicholas M. Luscombe. Genome Res. 2024. 34: 426-440; doi:10.1101/2023.05.09.539028. PubMed ID: 38621828 OIST research news article
And also please cite the LAST papers.
For each query genome, this pipeline will align it to the target genome, post-process the alignments and produce dot plots visualisations at different steps of the workflow. Each file contains a name suffix that indicates in which order they were created.
00.train
is the alignment parameters computed bylast-train
(optional)01.m2m_aln
is the many-to-many alignment between target and query genomes. (optional through the--m2m
option)02.m2m_plot
(optional)03.m2o_aln
is the many-to-one alignment regions of the target genome are matched at most once by the query genome.04.m2o_plot
(optional)05.o2o_aln
is the one-to-one alignment between the target and query genomes.06.o2o_plot
(optional)05b.o2m_aln
is the one-to-many alignment between the target and query genomes (optional).06b.o2m_plot
(optional)07.o2o_postmasked_aln
is a filtered one-to-one alignment where low-confidence matches made mostly of masked regions are removed. (optional)08.o2o_postmasked_plot
(optional)
The poly-N regions longer than 9 bases in each genome sequence are marked in
pale red in the dot-plots. These often indicate contig boundaries in
scaffolds. This is done with seqtk cutN
and its output is provided in the
seqtk
directory.
-
--target
: path or URL to one genome file in FASTA format. It will be indexed. -
--input
: path to a sample sheet in tab-separated format with one header lineid file
, and one row per genome (ID and path or URL to FASTA file).— or —
--query
: path or URL to one genome file in FASTA format.
-
--with_windowmasker
optionally soft-masks the genome for interspersed repeats with lowercase characters using thewindowmasker
tool of the BLAST suite (https://pubmed.ncbi.nlm.nih.gov/16287941/). The original soft-masking is erased, to match the behaviour of the pipeline when this option is not selected. -
--keepOriginalMask
keeps the original soft-masked characters and prevents LAST from adding extra masks with TANTAN. -
--seed
selects the name of the LAST seed The default (YASS
) searches for “long-and-weak similarities” that “allow for mismatches but not gaps”. Among alternatives, there areNEAR
for “short-and-strong (near-identical) similarities … with many gaps (insertions and deletions)”,MAM8
to find “weak similarities with high sensitivity, but low speed and high memory usage” orRY128
that “reduces run time and memory use, by only seeking seeds at ~1/128 of positions in each sequence”, which is useful when the purpose of running this pipeline is only to generate whole-genome dotplots, or when sensitivity for tiny fragments may be unnecessary or undesirable. Setting the seed toPSEUDO
triggers protein-to-DNA alignment mode (experimental). -
--lastal_args
defaults to-C2
and is applied to both the calls tolast-train
andlastal
, like in the LAST cookbook and the last-genome-alignments tutorial. -
--lastal_extr_args
(default:-D1e9
) is only passed tolastal
and can be used for arguments that are not recognised bylast-train
. -
--lastal_params
: path to a file containing alignment parameters computed bylast-train
or a scoring matrix. If this option is not used, the pipeline will runlast-train
for each query. -
--m2m
: (default: false) Compute and output the many-to-many alignment. This adds time and can comsume considerable amount of space; use only if you need that data. -
--o2m
: (default: false) Also compute the one-to-many alignments and dotplots. This is sometimes useful when troubleshooting the preparation of diploid assemblies. -
--one_to_one_only
: do not copy the other alignments to the results folder, thus saving disk space. -
By default,
last-split
runs with-m1e-5
to omit alignments with mismap probability > 10−5, but this can be overriden with the--last_split_mismap
option. -
--last_split_args
defaults to empty value and is not very useful at the moment, but is kept for backwards compatibility. It can be used to pass options tolast-split
. Note that if you used--m2m false
(which is the default), the split parameters have to be passed in--lastal_extra_args
and have different names (see split options in the lastal documentation). -
The dotplots can be modified by overriding defaults and passing new arguments via the
--dotplot_options
argument. Defaults and available options can be seen on the manual page of thelast-dotplot
program. By default in this pipeline, the sequences of the query genome are sorted and oriented by their alignment to the target genome (--sort2=3 --strands2=1
). For readability, their names are written horizontally (--rot2=h
). -
Use
--skip_dotplot_m2m
,--skip_dotplot_m2o
,--skip_dotplot_o2o
--skip_dotplot_o2m
to skip the production of the dot plots that can be computationally expensive and visually uninformative on large genomes with shared repeats. File suffixes (see above) will not change. -
By default the LAST index is named
target
and the ouput files are named from the query IDs. Use the--targetName
option to provide a name that will be used for the LAST index and that will be prefixed to the query IDs with a___
separator. -
Use
--postmask
to filter out the one-to-one alignments that contain a significant fraction of soft-masked (lowercased) sequences, using thelast-postmask
tool. This is not necessary iflastdb
was run with the-c
option, which is the default since version7.0.0
.
Fixed arguments (taken from the LAST cookbook and the LAST tuning manual)
-
The
lastdb
step soft-masks simple repeats by default, (-c -R01
). It indexes both strands (-S2
), which increases speed at the expense of memory usage. -
The
last-train
commands runs with--revsym
as the DNA strands play equivalent roles in the studied genomes, unless the--read_align
option is selected. -
last-split
runs with-fMAF+
to make it show per-base mismap probabilities, except in read alignment mode (see below).
The --read_align
option can be used to align sequencing reads to a genome.
The output will be a single alignment file with a many-to-one relationship
between the target genome and the query reads. The alignment process is
similar with the --skip_m2m
mode, with the difference that the scoring matrix
computed by last-train
is allowed to be asymmetric. FASTA and FASTQ
formats are allowed, and by default the quality values are ignored. This can
be changed by passing keep
, sanger
, solexa
, or illumina
as an argument
to --read_align
as described in the lastal
documentation. The default
seeding scheme is used but it may be a good idea to use RY128
instead to speed
up the alignment.
nextflow run oist/plessy_pairwiseGenomeComparison -r main \
--input samplesheet.tsv \
--target sequencefile.fa \
[-profile yourInstitution]
This pipeline can use the institutional profiles defined in nf-core (https://github.com/nf-core/configs#documentation)
Note that your tests may fail if you do not set the -profile
option to a
configuration suitable for your system. See https://nf-co.re/configs for
common ones. You also need to ensure that your work directory is writable by
your compute nodes, by setting the -work-dir
option appropriately.
nextflow run oist/plessy_pairwiseGenomeComparison -r main \
--target https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta \
--query https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/illumina/fasta/contigs.fasta
nextflow run ./main.nf \
--input testInput.tsv \
--target https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta
nextflow run oist/plessy_pairwiseGenomeComparison -r main \
--target https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta \
--query https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/nanopore/fastq/test_2.fastq.gz \
--read_align
nextflow run oist/plessy_pairwiseGenomeComparison -r main \
--target https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/genome/proteome.fasta.gz \
--query https://github.com/nf-core/test-datasets/raw/modules/data/genomics/sarscov2/genome/genome.fasta \
--seed PSEUDO
The results directory contains three reports generated by Nextflow:
report.html
informs on the pipeline, its version, some metrics about execution time.timeline.html
displays the execution times like a Gantt chart.trace.tsv
provides the raw data and can be displayed with thecolumn -ts$'\t'
command.
Computation resources allocated to the processe are set with standard nf-core
labels in the nextflow.config
file of the pipeline. To
override their value, create a configuration file in your local directory and
add it to the run's configuration with the -c
option.
For instance, with file called overrideLabels.nf
containing the following:
process {
withLabel:process_high {
time = 3.d
}
}
The command nextflow -c overrideLabels.nf run …
would set the execution time
limit for the training and alignment (whose module declare the process_high
label) to 3 days instead of the 1 hour default.
The pipeline aligns two large lungfish genomes, Lepidosiren paradoxa (80 Gbp) and Protopterus annectens (40 Gbp) using at most 200 GB memory.
lastdb 83.8 GB
last-train 46.4 GB
lastal --split 174.7 GB
last-split 63.9 GB
last-dotplot 5.7 GB
I apply semantic versioning to this pipeline:
-
Major increment when the interface changes in a way that is backwards-incompatible, in the sense that a run with the same command and the same data would produce a different result (except for non-deterministic computations).
-
Minor increment for any other change of the interface, such as additions of new functionalities.
-
Patch increment for changes that do not modify the interface (bug fixes, minor software and module updates, documentation changes, etc.)