Trycycler
WARNING: Trycycler is not an assembler. Reading the Trycycler wiki before attempting this workflow is highly recommended.
Trycycler divides the input fastq file into subsets, runs each subset through a variety of assemblers, and then attempts to reconcile differences between the supposedly similar sequences. For a species of bacteria like Salmonella enterica or Escherichia coli, with a genome of roughly 5 million bases, 'trycycler subsample' generally requires over 10,000 reads. If the subsample step fails, the fastq file is instead divided into subsets randomly with rasusa.
nextflow run UPHL-BioNGS/Donut_Falls -profile singularity --assembler trycycler --remove remove.txt -resume
Trycycler will produce a beautiful genome sequence, but often needs manual input during reconcile. There are many ways to adjust the discrepancies observed, but the only one supported by Donut Falls is to remove problematic sequences from comparison. Please do not ignore errors on this step!
Lastly, -resume is your friend!
There is an optional-until-required comma-delimited file, specified by 'params.remove', in which each line gives the sample, the cluster, and the name of the sequence, separated by commas.
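As a rough illustration of that three-field format, the sketch below checks lines before they go into the file. The names `RemoveEntry`, `parse_remove_line`, and `parse_remove_text` are made up for this example and are not part of Donut Falls; trimming whitespace around fields is also just a convenience of the sketch.

```python
from typing import List, NamedTuple


class RemoveEntry(NamedTuple):
    sample: str
    cluster: str
    sequence: str


def parse_remove_line(line: str) -> RemoveEntry:
    """Split one 'sample,cluster,sequence' line into its three fields."""
    fields = [field.strip() for field in line.strip().split(",")]
    if len(fields) != 3 or not all(fields):
        raise ValueError(f"expected 'sample,cluster,sequence', got: {line!r}")
    return RemoveEntry(*fields)


def parse_remove_text(text: str) -> List[RemoveEntry]:
    """Parse the whole file content, skipping blank lines."""
    return [parse_remove_line(line) for line in text.splitlines() if line.strip()]


entries = parse_remove_text("1326933-2,cluster_002,L_1326933-2_flye_02_edge_3\n")
print(entries[0].cluster)  # cluster_002
```

A line with a missing field raises a ValueError here, which is easier to spot than a removal that silently does nothing.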
Below are some sample errors and the values that would go in the 'remove' file, which is named 'remove.txt' in these examples. Trycycler colors its command-line warnings for clarity, but those font changes are harder to read when a terminal does not support them or in nextflow tower error reports (which is where these were copied and pasted from).
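In the pasted logs, those color codes show up as fragments like [2m, [31m, and [0m. If they get in the way, something like the following sketch could strip them before reading a table. The helper name `strip_color_codes` is hypothetical, and the second half of the pattern assumes the escape byte was lost during copy and paste, so it could in principle also remove genuine text that happens to look like "[31m".

```python
import re

# Matches SGR color codes such as ESC[31m, plus the ESC-less remnants
# (e.g. "[31m") left behind when the escape byte is lost in copy and paste.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m|\[[0-9;]+m")


def strip_color_codes(text: str) -> str:
    """Return text with terminal color codes removed."""
    return ANSI_SGR.sub("", text)


line = "A_2: [2m1.000[0m 1.000 [31m1.419[0m"
print(strip_color_codes(line))  # A_2: 1.000 1.000 1.419
```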
All nextflow errors follow a similar format:
- Cause summary
- The command that was used
- Exit status
- Standard out (stdout)
- Standard error (stderr) ***
- The work directory (work dir)
- Tip to use -resume
*** For the reconcile step, the most important section is standard error, because it holds what Trycycler printed to the screen.
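To jump straight to the stderr section of a long report, the layout listed above can be split on its section headers. This is only a sketch: `split_report` is a made-up helper, the header strings are taken from the example reports on this page, and the shortened report and its work dir path are placeholders.

```python
SECTION_HEADERS = (
    "Caused by:",
    "Command executed:",
    "Command exit status:",
    "Command output:",
    "Command error:",
    "Work dir:",
)


def split_report(report: str) -> dict:
    """Group the lines of a Nextflow error report under their section header."""
    sections = {"summary": []}
    current = "summary"
    for line in report.splitlines():
        stripped = line.strip()
        if stripped in SECTION_HEADERS:
            current = stripped.rstrip(":")
            sections[current] = []
        elif stripped.startswith("Tip:"):
            current = "Tip"
            sections[current] = [stripped[len("Tip:"):].strip()]
        else:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


# Shortened, illustrative report; the work dir is a placeholder path.
report = """Error executing process > 'trycycler:reconcile (1326935_cluster_001)'
Caused by:
  Process terminated with an error exit status (1)
Command exit status:
  1
Command error:
  Error: some pairwise indels are greater than the maximum allowed value of 1000.
Work dir:
  /path/to/work/dir
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
"""
sections = split_report(report)
print(sections["Command error"])
```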
In this example, the indels are too large when each sequence is compared with every other sequence.
The nextflow error
Error executing process > 'trycycler:reconcile (1326935_cluster_001)'
Caused by:
Process `trycycler:reconcile (1326935_cluster_001)` terminated with an error exit status (1)
Command executed:
trycycler --version
if [ -f "remove.txt" ]
then
while read line
do
cluster=$(echo $line | cut -f 2 -d ,)
file=$(echo $line | cut -f 3 -d ,)
if [ -f "$cluster/1_contigs/$file.fasta" ] ; then mv $cluster/1_contigs/$file.fasta $cluster/1_contigs/$file.fasta_remove ; fi
done < <(grep ^1326935, remove.txt)
fi
num_fasta=$(ls cluster_001/1_contigs/*.fasta | wc -l)
echo "There are $num_fasta in cluster_001 for 1326935"
if [ "$num_fasta" -ge "4" ]
then
trycycler reconcile --reads 1326935_filtered.fastq.gz --cluster_dir cluster_001 --threads 12
ls
ls cluster_001/2_all_seqs.fasta
else
echo "1326935 cluster cluster_001 only had $num_fasta fastas"
mv cluster_001 cluster_001_cluster_too_small
fi
Command exit status:
1
Command output:
Trycycler v0.5.3
There are 12 in cluster_001 for 1326935
1326935_filtered.fastq.gz
cluster_001
remove.txt
Command error:
G_1326935_miniasm_07_utg000001c vs K_1326935_flye_10_edge_1... 99.98% identity, max indel = 115
G_1326935_miniasm_07_utg000001c vs L_1326935_raven_04_Utg602... 99.95% identity, max indel = 258
H_1326935_flye_02_edge_1 vs I_1326935_miniasm_11_utg000001c... 99.98% identity, max indel = 175
H_1326935_flye_02_edge_1 vs J_1326935_flye_06_edge_1... 99.99% identity, max indel = 2
H_1326935_flye_02_edge_1 vs K_1326935_flye_10_edge_1... 100.00% identity, max indel = 1
H_1326935_flye_02_edge_1 vs L_1326935_raven_04_Utg602... 99.95% identity, max indel = 258
I_1326935_miniasm_11_utg000001c vs J_1326935_flye_06_edge_1... 99.98% identity, max indel = 258
I_1326935_miniasm_11_utg000001c vs K_1326935_flye_10_edge_1... 99.98% identity, max indel = 258
I_1326935_miniasm_11_utg000001c vs L_1326935_raven_04_Utg602... 99.96% identity, max indel = 194
J_1326935_flye_06_edge_1 vs K_1326935_flye_10_edge_1... 99.99% identity, max indel = 2
J_1326935_flye_06_edge_1 vs L_1326935_raven_04_Utg602... 99.95% identity, max indel = 258
K_1326935_flye_10_edge_1 vs L_1326935_raven_04_Utg602... 99.95% identity, max indel = 258
Pairwise identities:
A_1: [2m100.00%[0m 99.93% 99.97% 99.93% 99.92% 99.97% 99.97% 99.97% 99.96% 99.97% 99.97% 99.93%
B_1: 99.93% [2m100.00%[0m 99.95% 99.91% 99.90% 99.95% 99.95% 99.95% 99.96% 99.95% 99.95% 99.93%
C_1: 99.97% 99.95% [2m100.00%[0m 99.94% 99.94% 99.99% 99.98% 99.99% 99.98% 99.99% 99.99% 99.95%
D_1326935_raven_08_Utg658: 99.93% 99.91% 99.94% [2m100.00%[0m 99.90% 99.95% 99.94% 99.95% 99.94% 99.95% 99.95% 99.92%
E_1326935_raven_12_Utg616: 99.92% 99.90% 99.94% 99.90% [2m100.00%[0m 99.94% 99.93% 99.94% 99.93% 99.94% 99.94% 99.91%
F_1326935_miniasm_03_utg000001c: 99.97% 99.95% 99.99% 99.95% 99.94% [2m100.00%[0m 99.99% 99.99% 99.98% 99.99% 99.99% 99.95%
G_1326935_miniasm_07_utg000001c: 99.97% 99.95% 99.98% 99.94% 99.93% 99.99% [2m100.00%[0m 99.99% 99.98% 99.99% 99.98% 99.95%
H_1326935_flye_02_edge_1: 99.97% 99.95% 99.99% 99.95% 99.94% 99.99% 99.99% [2m100.00%[0m 99.98% 99.99% 100.00% 99.95%
I_1326935_miniasm_11_utg000001c: 99.96% 99.96% 99.98% 99.94% 99.93% 99.98% 99.98% 99.98% [2m100.00%[0m 99.98% 99.98% 99.96%
J_1326935_flye_06_edge_1: 99.97% 99.95% 99.99% 99.95% 99.94% 99.99% 99.99% 99.99% 99.98% [2m100.00%[0m 99.99% 99.95%
K_1326935_flye_10_edge_1: 99.97% 99.95% 99.99% 99.95% 99.94% 99.99% 99.98% 100.00% 99.98% 99.99% [2m100.00%[0m 99.95%
L_1326935_raven_04_Utg602: 99.93% 99.93% 99.95% 99.92% 99.91% 99.95% 99.95% 99.95% 99.96% 99.95% 99.95% [2m100.00%[0m
Maximum insertion/deletion sizes:
A_1: [2m 0[0m 375 375 375 375 375 375 375 375 375 375 375
B_1: 375 [2m 0[0m 258 258 972 258 258 258 256 258 258 256
C_1: 375 258 [2m 0[0m 121 972 121 121 121 175 121 121 258
D_1326935_raven_08_Utg658: 375 258 121 [2m 0[0m 972 [31m1145[0m [31m1150[0m [31m1145[0m 883 [31m1139[0m [31m1145[0m 969
E_1326935_raven_12_Utg616: 375 972 972 972 [2m 0[0m 260 260 260 260 260 260 260
F_1326935_miniasm_03_utg000001c: 375 258 121 [31m1145[0m 260 [2m 0[0m 115 4 175 4 4 258
G_1326935_miniasm_07_utg000001c: 375 258 121 [31m1150[0m 260 115 [2m 0[0m 115 175 115 115 258
H_1326935_flye_02_edge_1: 375 258 121 [31m1145[0m 260 4 115 [2m 0[0m 175 2 1 258
I_1326935_miniasm_11_utg000001c: 375 256 175 883 260 175 175 175 [2m 0[0m 258 258 194
J_1326935_flye_06_edge_1: 375 258 121 [31m1139[0m 260 4 115 2 258 [2m 0[0m 2 258
K_1326935_flye_10_edge_1: 375 258 121 [31m1145[0m 260 4 115 1 258 2 [2m 0[0m 258
L_1326935_raven_04_Utg602: 375 256 258 969 260 258 258 258 194 258 258 [2m 0[0m
Error: some pairwise indels are greater than the maximum allowed value of 1000.
Please remove offending sequences or raise the --max_indel_size threshold and
try again.
1326935_filtered.fastq.gz
cluster_001
remove.txt
ls: cluster_001/2_all_seqs.fasta: No such file or directory
Work dir:
/Volumes/IDGenomics_NAS/testing_DF/create_test_files/work/2c/15b0239767a1ba8c7776fb78138641
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
The sample with the large indel size is 1326935, the cluster is cluster_001, and the sequence associated with the error is D_1326935_raven_08_Utg658. To remove this sequence, a line consisting of '1326935,cluster_001,D_1326935_raven_08_Utg658' would be added to 'remove.txt'. There is also the option to increase the allowed indel size by adjusting params.trycycler_reconcile_options in a config file with something like params.trycycler_reconcile_options = '--max_indel_size 1200'.
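One way to see why D_1326935_raven_08_Utg658 is the sequence to blame is to count, for each row of the "Maximum insertion/deletion sizes" table, how many values exceed the limit named in the error (1000). The sketch below does that for three rows copied from the log above. `parse_matrix` and `count_offences` are hypothetical helpers, not Trycycler functions.

```python
import re

# Color-code remnants as they appear in the pasted log (with or without ESC).
COLOR = re.compile(r"\x1b\[[0-9;]*m|\[[0-9;]+m")


def parse_matrix(text):
    """Turn 'name: v1 v2 ...' rows into {name: [int, ...]}."""
    rows = {}
    for line in text.splitlines():
        name, sep, values = COLOR.sub(" ", line).partition(":")
        if sep and values.split():
            rows[name.strip()] = [int(value) for value in values.split()]
    return rows


def count_offences(rows, max_indel=1000):
    """Count the values above max_indel in every row."""
    return {name: sum(v > max_indel for v in values) for name, values in rows.items()}


table = """
D_1326935_raven_08_Utg658: 375 258 121 [2m 0[0m 972 [31m1145[0m [31m1150[0m [31m1145[0m 883 [31m1139[0m [31m1145[0m 969
F_1326935_miniasm_03_utg000001c: 375 258 121 [31m1145[0m 260 [2m 0[0m 115 4 175 4 4 258
G_1326935_miniasm_07_utg000001c: 375 258 121 [31m1150[0m 260 115 [2m 0[0m 115 175 115 115 258
"""
counts = count_offences(parse_matrix(table))
worst = max(counts, key=counts.get)
print(worst, counts[worst])  # D_1326935_raven_08_Utg658 5
```

The other two rows each have a single value over the limit, and that value is their comparison with D, so removing D clears every offending pair in this excerpt.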
In this example, the start and end of one sequence were found in multiple places in the other sequences.
The nextflow error
Error executing process > 'trycycler:reconcile (1326933-2_cluster_001)'
Caused by:
Process `trycycler:reconcile (1326933-2_cluster_001)` terminated with an error exit status (1)
Command executed:
trycycler --version
if [ -f "remove.txt" ]
then
while read line
do
cluster=$(echo $line | cut -f 2 -d ,)
file=$(echo $line | cut -f 3 -d ,)
if [ -f "$cluster/1_contigs/$file.fasta" ] ; then mv $cluster/1_contigs/$file.fasta $cluster/1_contigs/$file.fasta_remove ; fi
done < <(grep ^1326933-2, remove.txt)
fi
num_fasta=$(ls cluster_001/1_contigs/*.fasta | wc -l)
echo "There are $num_fasta in cluster_001 for 1326933-2"
if [ "$num_fasta" -ge "4" ]
then
trycycler reconcile --reads 1326933-2_filtered.fastq.gz --cluster_dir cluster_001 --threads 12
ls
ls cluster_001/2_all_seqs.fasta
else
echo "1326933-2 cluster cluster_001 only had $num_fasta fastas"
mv cluster_001 cluster_001_cluster_too_small
fi
Command exit status:
1
Command output:
Trycycler v0.5.3
There are 12 in cluster_001 for 1326933-2
1326933-2_filtered.fastq.gz
cluster_001
remove.txt
Command error:
no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
using G_1326933-2_raven_12_Utg596:
no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
using H_1326933-2_miniasm_07_utg000001c:
no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
using I_1326933-2_miniasm_11_utg000001c:
no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
using J_1326933-2_flye_06_edge_1:
no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
using K_1326933-2_flye_10_edge_1:
no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
using L_1326933-2_flye_02_edge_2:
no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
circularisation complete (3,965,323 bp)
Circularising E_1326933-2_miniasm_03_utg000001l:
using A_1:
unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in A_1
using B_1:
unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in B_1
using C_1:
unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in C_1
using D_1326933-2_raven_08_Utg600:
unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in D_1326933-2_raven_08_Utg600
using F_1326933-2_raven_04_Utg570:
unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in F_1326933-2_raven_04_Utg570
using G_1326933-2_raven_12_Utg596:
unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in G_1326933-2_raven_12_Utg596
using H_1326933-2_miniasm_07_utg000001c:
unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in H_1326933-2_miniasm_07_utg000001c
using I_1326933-2_miniasm_11_utg000001c:
unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in I_1326933-2_miniasm_11_utg000001c
using J_1326933-2_flye_06_edge_1:
unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in J_1326933-2_flye_06_edge_1
using K_1326933-2_flye_10_edge_1:
unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in K_1326933-2_flye_10_edge_1
using L_1326933-2_flye_02_edge_2:
unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in L_1326933-2_flye_02_edge_2
Error: failed to circularise sequence E_1326933-2_miniasm_03_utg000001l because
its start/end sequences were found in multiple ambiguous places in other
sequences. This is likely because E_1326933-2_miniasm_03_utg000001l starts/ends
in a repetitive region. You can either manually repair its circularisation (and
ensure it does not start/end in a repetitive region) or exclude the sequence
altogether and try again.
1326933-2_filtered.fastq.gz
cluster_001
remove.txt
ls: cluster_001/2_all_seqs.fasta: No such file or directory
Work dir:
/Volumes/IDGenomics_NAS/testing_DF/create_test_files/work/8d/32a34210b544ea4abb4b703a567c2b
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
In this example, the sample is '1326933-2', the cluster is 'cluster_001', and the problematic sequence is 'E_1326933-2_miniasm_03_utg000001l'. To remove this sequence, a line consisting of '1326933-2,cluster_001,E_1326933-2_miniasm_03_utg000001l' would be added to 'remove.txt' and the workflow would be started again with -resume.
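For this kind of failure, the sequence name can be read straight from the final "Error: failed to circularise sequence ..." message. The sketch below pulls it out and builds the matching 'remove.txt' line. `remove_line_for` is a made-up helper, and the pattern is based only on the wording of the message shown above, which may differ between Trycycler versions.

```python
import re

# Wording taken from the error message in the example above.
FAILED = re.compile(r"Error: failed to circularise sequence (\S+)")


def remove_line_for(sample, cluster, stderr_text):
    """Build a 'sample,cluster,sequence' line, or None if no failure is reported."""
    match = FAILED.search(stderr_text)
    if match is None:
        return None
    return f"{sample},{cluster},{match.group(1)}"


stderr_text = (
    "Error: failed to circularise sequence E_1326933-2_miniasm_03_utg000001l because\n"
    "its start/end sequences were found in multiple ambiguous places in other\n"
    "sequences."
)
print(remove_line_for("1326933-2", "cluster_001", stderr_text))
# 1326933-2,cluster_001,E_1326933-2_miniasm_03_utg000001l
```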
In this example, the lengths of the sequences are not similar enough.
The nextflow error
Error executing process > 'trycycler:reconcile (1326933-2_cluster_002)'
Caused by:
Process `trycycler:reconcile (1326933-2_cluster_002)` terminated with an error exit status (1)
Command executed:
trycycler --version
if [ -f "remove.txt" ]
then
while read line
do
cluster=$(echo $line | cut -f 2 -d ,)
file=$(echo $line | cut -f 3 -d ,)
if [ -f "$cluster/1_contigs/$file.fasta" ] ; then mv $cluster/1_contigs/$file.fasta $cluster/1_contigs/$file.fasta_remove ; fi
done < <(grep ^1326933-2, remove.txt)
fi
num_fasta=$(ls cluster_002/1_contigs/*.fasta | wc -l)
echo "There are $num_fasta in cluster_002 for 1326933-2"
if [ "$num_fasta" -ge "4" ]
then
trycycler reconcile --reads 1326933-2_filtered.fastq.gz --cluster_dir cluster_002 --threads 12
ls
ls cluster_002/2_all_seqs.fasta
else
echo "1326933-2 cluster cluster_002 only had $num_fasta fastas"
mv cluster_002 cluster_002_cluster_too_small
fi
Command exit status:
1
Command output:
Trycycler v0.5.3
There are 9 in cluster_002 for 1326933-2
1326933-2_filtered.fastq.gz
cluster_002
remove.txt
Command error:
[93m[1m[4mStarting Trycycler reconcile[0m [2m(2023-03-10 12:59:51)[0m
[2m Trycycler reconcile is a tool for reconciling multiple alternative contigs[0m
[2mwith each other.[0m
Input reads: 1326933-2_filtered.fastq.gz
size = 354,044,517 bytes
Input contigs:
cluster_002/1_contigs/A_2.fasta (19,660 bp)
cluster_002/1_contigs/B_2.fasta (19,662 bp)
cluster_002/1_contigs/E_1326933-2_miniasm_03_utg000002c.fasta (19,662 bp)
cluster_002/1_contigs/F_1326933-2_raven_04_Utg572.fasta (19,661 bp)
cluster_002/1_contigs/G_1326933-2_raven_12_Utg598.fasta (19,663 bp)
cluster_002/1_contigs/H_1326933-2_miniasm_07_utg000002c.fasta (19,662 bp)
cluster_002/1_contigs/I_1326933-2_miniasm_11_utg000002c.fasta (19,664 bp)
cluster_002/1_contigs/K_1326933-2_flye_10_edge_2.fasta (19,660 bp)
cluster_002/1_contigs/L_1326933-2_flye_02_edge_3.fasta (13,852 bp)
Checking required software:
minimap2: v2.23-r1111
[93m[1m[4mInitial check of contigs[0m [2m(2023-03-10 12:59:51)[0m
[2m Before proceeding, Trycycler ensures that the input contigs appear[0m
[2msufficiently close to each other to make a consensus. If not, the program will[0m
[2mquit and the user must fix the input contigs (make them more similar to each[0m
[2mother) or exclude some before trying again.[0m
Relative sequence lengths:
A_2: [2m1.000[0m 1.000 1.000 1.000 1.000 1.000 1.000 1.000 [31m1.419[0m
B_2: 1.000 [2m1.000[0m 1.000 1.000 1.000 1.000 1.000 1.000 [31m1.419[0m
E_1326933-2_miniasm_03_utg000002c: 1.000 1.000 [2m1.000[0m 1.000 1.000 1.000 1.000 1.000 [31m1.419[0m
F_1326933-2_raven_04_Utg572: 1.000 1.000 1.000 [2m1.000[0m 1.000 1.000 1.000 1.000 [31m1.419[0m
G_1326933-2_raven_12_Utg598: 1.000 1.000 1.000 1.000 [2m1.000[0m 1.000 1.000 1.000 [31m1.420[0m
H_1326933-2_miniasm_07_utg000002c: 1.000 1.000 1.000 1.000 1.000 [2m1.000[0m 1.000 1.000 [31m1.419[0m
I_1326933-2_miniasm_11_utg000002c: 1.000 1.000 1.000 1.000 1.000 1.000 [2m1.000[0m 1.000 [31m1.420[0m
K_1326933-2_flye_10_edge_2: 1.000 1.000 1.000 1.000 1.000 1.000 1.000 [2m1.000[0m [31m1.419[0m
L_1326933-2_flye_02_edge_3: [31m0.705[0m [31m0.705[0m [31m0.705[0m [31m0.705[0m [31m0.704[0m [31m0.705[0m [31m0.704[0m [31m0.705[0m [2m1.000[0m
Error: there is too much length difference between contigs. You must either
exclude or repair the offending contig sequences and then try running trycycler
reconcile again. If one of the sequences is too long, it could be due to
excessive circularisation overlap, and trimming that overlap may allow
trycycler reconcile to continue.
1326933-2_filtered.fastq.gz
cluster_002
remove.txt
ls: cluster_002/2_all_seqs.fasta: No such file or directory
Work dir:
/Volumes/IDGenomics_NAS/testing_DF/create_test_files/work/b8/12ea5639ea2fa7fb1a055483b51ae0
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
In this example, the sample is '1326933-2', the cluster is 'cluster_002', and the problematic sequence is 'L_1326933-2_flye_02_edge_3' because it is much smaller. To remove this sequence, a line consisting of '1326933-2,cluster_002,L_1326933-2_flye_02_edge_3' would be added to 'remove.txt' and the workflow would be started again with -resume.
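The "Relative sequence lengths" table can be reproduced from the contig lengths listed in the log: each entry is the length of the row contig divided by the length of the column contig. The sketch below recomputes those ratios and flags any contig that is out of line with most of the others. The function names are hypothetical, and the cut-off of 1.1 is an assumption for illustration; check `trycycler reconcile --help` for the limit your version actually applies.

```python
def relative_lengths(lengths):
    """Return {row: {col: row_length / col_length}} for every pair of contigs."""
    return {
        row: {col: row_len / col_len for col, col_len in lengths.items()}
        for row, row_len in lengths.items()
    }


def length_outliers(lengths, max_ratio=1.1):
    """Names whose length is off by more than max_ratio against most other contigs."""
    outliers = []
    for name, row in relative_lengths(lengths).items():
        bad = sum(
            1 for other, ratio in row.items()
            if other != name and max(ratio, 1 / ratio) > max_ratio
        )
        if bad > (len(lengths) - 1) / 2:
            outliers.append(name)
    return outliers


# Contig lengths (bp) as reported under "Input contigs" in the log above.
lengths = {
    "A_2": 19660,
    "B_2": 19662,
    "E_1326933-2_miniasm_03_utg000002c": 19662,
    "F_1326933-2_raven_04_Utg572": 19661,
    "G_1326933-2_raven_12_Utg598": 19663,
    "H_1326933-2_miniasm_07_utg000002c": 19662,
    "I_1326933-2_miniasm_11_utg000002c": 19664,
    "K_1326933-2_flye_10_edge_2": 19660,
    "L_1326933-2_flye_02_edge_3": 13852,
}
ratios = relative_lengths(lengths)
print(f"{ratios['A_2']['L_1326933-2_flye_02_edge_3']:.3f}")  # 1.419
print(length_outliers(lengths))  # ['L_1326933-2_flye_02_edge_3']
```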
If the only problems were those in the examples presented above, remove.txt would contain the lines
1326935,cluster_001,D_1326935_raven_08_Utg658
1326933-2,cluster_001,E_1326933-2_miniasm_03_utg000001l
1326933-2,cluster_002,L_1326933-2_flye_02_edge_3
The corresponding command line would then be something like the following (do not forget to use '-resume'!)
nextflow run UPHL-BioNGS/Donut_Falls -profile singularity --assembler trycycler --remove remove.txt -resume
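One thing to keep in mind when adding lines to 'remove.txt': the reconcile commands shown in the errors above only run Trycycler when at least 4 fasta files remain in the cluster, and otherwise rename the cluster directory with a '_cluster_too_small' suffix. The sketch below mimics that logic in Python to preview the effect of a remove file. It is an approximation of the shell snippet, not the workflow's own code, and both function names are made up.

```python
def remaining_contigs(contigs, remove_lines, sample, cluster):
    """Contig names left in one cluster after applying the remove file."""
    removed = set()
    for line in remove_lines:
        # Same filter as `grep ^sample,` in the process script.
        if not line.startswith(sample + ","):
            continue
        fields = line.strip().split(",")
        if len(fields) >= 3 and fields[1] == cluster:
            removed.add(fields[2])
    return [name for name in contigs if name not in removed]


def enough_for_reconcile(contigs, minimum=4):
    """Mirror of the `-ge 4` check in the process script."""
    return len(contigs) >= minimum


remove_lines = [
    "1326933-2,cluster_001,E_1326933-2_miniasm_03_utg000001l",
    "1326933-2,cluster_002,L_1326933-2_flye_02_edge_3",
]
cluster_002 = [
    "A_2", "B_2",
    "E_1326933-2_miniasm_03_utg000002c",
    "F_1326933-2_raven_04_Utg572",
    "G_1326933-2_raven_12_Utg598",
    "H_1326933-2_miniasm_07_utg000002c",
    "I_1326933-2_miniasm_11_utg000002c",
    "K_1326933-2_flye_10_edge_2",
    "L_1326933-2_flye_02_edge_3",
]
left = remaining_contigs(cluster_002, remove_lines, "1326933-2", "cluster_002")
print(len(left), enough_for_reconcile(left))  # 8 True
```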
---
Trycycler
---
flowchart LR
A[filtered fastq] --> B[subsample]
B --> C[assemble with flye]
B --> D[assemble with unicycler]
B --> E[assemble with miniasm and minipolish]
B --> F[assemble with raven]
C --> G[cluster]
D --> G[cluster]
E --> G[cluster]
F --> G[cluster]
G --> H[reconcile]
H -- remove sequences --> H
H --> I[msa]
I --> J[partition]
A --> J
J --> K[consensus]
K --> L[combine fasta]
L --> M[polish]
params.rasusa_options = '--frac 80'
params.trycycler_subsample_options = ''
params.trycycler_cluster_options = ''
params.trycycler_consensus_options = ''
params.trycycler_dotplot_options = ''
params.trycycler_msa_options = ''
params.trycycler_partition_options = ''
params.trycycler_reconcile_options = ''