
Trycycler

Young edited this page Mar 14, 2023 · 8 revisions

Running Trycycler

WARNING: Trycycler is not an assembler. Reading the Trycycler wiki before attempting this workflow is highly recommended.

Trycycler divides the input fastq file into subsets, runs each subset through a variety of assemblers, and then attempts to reconcile the differences between the resulting, supposedly similar, sequences. For a bacterial species such as Salmonella enterica or Escherichia coli, with a genome of roughly 5 million bases, 'trycycler subsample' generally requires more than 10,000 reads. If the subsample step fails, the fastq file is instead divided randomly into subsets with rasusa.
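To check whether a sample is likely to have enough reads before starting the workflow, the reads can be counted. The following is a rough sketch, assuming a gzipped fastq file with the usual four lines per read; 'count_reads' is a hypothetical helper name and is not part of Donut Falls or Trycycler.

```shell
# Count reads in a gzipped fastq file.
# Assumption: every record is exactly four lines (no wrapped sequences).
count_reads() {
  gzip -cd "$1" | awk 'END { print int(NR / 4) }'
}

# Example call (file name borrowed from the logs on this page):
# count_reads 1326935_filtered.fastq.gz
```

A count well under 10,000 suggests the subsample step may fail and the rasusa fallback will be used.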

Usage

nextflow run UPHL-BioNGS/Donut_Falls -profile singularity --assembler trycycler --remove remove.txt -resume

Trycycler will produce a beautiful genome sequence, but often needs manual input during reconcile. There are many ways to adjust the discrepancies observed, but the only one supported by Donut Falls is to remove problematic sequences from comparison. Please do not ignore errors on this step!

Lastly, -resume is your friend!

Removing problem sequences

There is an optional-until-required file, specified by 'params.remove', in which each line gives the sample, the cluster, and the name of the sequence, separated by commas.
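Because a malformed line in this file is easy to miss, it can help to check the format before resuming. This is a minimal sketch, assuming the three-field layout described above; 'check_remove_file' is a hypothetical helper and is not part of Donut Falls.

```shell
# Report lines that do not have exactly three non-empty, comma-separated
# fields (sample,cluster,sequence name). Blank lines are ignored.
# Exits non-zero if any malformed line is found.
check_remove_file() {
  awk -F, '/^$/ { next }
           NF != 3 || $1 == "" || $2 == "" || $3 == "" {
             printf "line %d looks malformed: %s\n", NR, $0
             bad = 1
           }
           END { exit bad + 0 }' "$1"
}

# Example call:
# check_remove_file remove.txt && echo "remove.txt looks fine"
```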

Below are some sample errors and the values that would go in the 'remove' file, which is named 'remove.txt' in these examples. Trycycler colors its command-line warnings for clarity, but those color codes are harder to read when a terminal does not support them or in nextflow tower error reports (which is where these examples were copied and pasted from).
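If the color codes make a saved log hard to read, they can be stripped. This is a small sketch, assuming the log still contains the raw escape byte in front of sequences such as '[2m' and '[0m' (the copies pasted on this page have lost that byte); 'strip_ansi' is a hypothetical helper name.

```shell
# Remove ANSI color/style sequences (ESC [ ... m) from stdin or from files.
strip_ansi() {
  esc=$(printf '\033')
  sed "s/${esc}\[[0-9;]*m//g" "$@"
}

# Example call on a task's saved standard error:
# strip_ansi .command.err | less
```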

All nextflow errors follow a similar format:

  • Cause summary
  • The command that was used
  • Exit status
  • Standard out (stdout)
  • Standard error (stderr) ***
  • The work directory (work dir)
  • Tip to use -resume

*** For the reconcile step, standard error is the most important section, because it contains what Trycycler printed to the screen.
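The same standard error text can also be read from the task's work directory. This is a minimal sketch, assuming the work directory reported in the error still exists and holds Nextflow's usual '.command.err' file; 'show_task_stderr' is a hypothetical helper name.

```shell
# Print the last lines of a failed task's standard error.
# $1 = work directory from the "Work dir:" section of the error
# $2 = number of lines to show (defaults to 40)
show_task_stderr() {
  workdir=$1
  lines=${2:-40}
  tail -n "$lines" "$workdir/.command.err"
}

# Example call, using a placeholder for the work directory:
# show_task_stderr /path/to/work/dir 60
```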

Sample error: pairwise indels

In this example, the indels are too large when comparing each sequence to every other sequence.

The nextflow error
Error executing process > 'trycycler:reconcile (1326935_cluster_001)'

Caused by:
  Process `trycycler:reconcile (1326935_cluster_001)` terminated with an error exit status (1)

Command executed:

  trycycler --version 
  
  if [ -f "remove.txt" ]
  then
    while read line
    do
      cluster=$(echo $line | cut -f 2 -d ,)
      file=$(echo $line | cut -f 3 -d ,)
      if [ -f "$cluster/1_contigs/$file.fasta" ] ; then mv $cluster/1_contigs/$file.fasta $cluster/1_contigs/$file.fasta_remove ; fi
    done < <(grep ^1326935, remove.txt)
  fi
  
  num_fasta=$(ls cluster_001/1_contigs/*.fasta | wc -l)
  echo "There are $num_fasta in cluster_001 for 1326935"
  if [ "$num_fasta" -ge "4" ]
  then
    trycycler reconcile          --reads 1326935_filtered.fastq.gz         --cluster_dir cluster_001         --threads 12
  
      ls
  
      ls cluster_001/2_all_seqs.fasta
  else
    echo "1326935 cluster cluster_001 only had $num_fasta fastas"
    mv cluster_001 cluster_001_cluster_too_small
  fi

Command exit status:
  1

Command output:
  Trycycler v0.5.3
  There are 12 in cluster_001 for 1326935
  1326935_filtered.fastq.gz
  cluster_001
  remove.txt

Command error:
  G_1326935_miniasm_07_utg000001c vs K_1326935_flye_10_edge_1...        99.98% identity, max indel = 115
  G_1326935_miniasm_07_utg000001c vs L_1326935_raven_04_Utg602...       99.95% identity, max indel = 258
         H_1326935_flye_02_edge_1 vs I_1326935_miniasm_11_utg000001c... 99.98% identity, max indel = 175
         H_1326935_flye_02_edge_1 vs J_1326935_flye_06_edge_1...        99.99% identity, max indel = 2
         H_1326935_flye_02_edge_1 vs K_1326935_flye_10_edge_1...        100.00% identity, max indel = 1
         H_1326935_flye_02_edge_1 vs L_1326935_raven_04_Utg602...       99.95% identity, max indel = 258
  I_1326935_miniasm_11_utg000001c vs J_1326935_flye_06_edge_1...        99.98% identity, max indel = 258
  I_1326935_miniasm_11_utg000001c vs K_1326935_flye_10_edge_1...        99.98% identity, max indel = 258
  I_1326935_miniasm_11_utg000001c vs L_1326935_raven_04_Utg602...       99.96% identity, max indel = 194
         J_1326935_flye_06_edge_1 vs K_1326935_flye_10_edge_1...        99.99% identity, max indel = 2
         J_1326935_flye_06_edge_1 vs L_1326935_raven_04_Utg602...       99.95% identity, max indel = 258
         K_1326935_flye_10_edge_1 vs L_1326935_raven_04_Utg602...       99.95% identity, max indel = 258
  
  Pairwise identities:
    A_1:                             [2m100.00%[0m   99.93%   99.97%   99.93%   99.92%   99.97%   99.97%   99.97%   99.96%   99.97%   99.97%   99.93%
    B_1:                              99.93%  [2m100.00%[0m   99.95%   99.91%   99.90%   99.95%   99.95%   99.95%   99.96%   99.95%   99.95%   99.93%
    C_1:                              99.97%   99.95%  [2m100.00%[0m   99.94%   99.94%   99.99%   99.98%   99.99%   99.98%   99.99%   99.99%   99.95%
    D_1326935_raven_08_Utg658:        99.93%   99.91%   99.94%  [2m100.00%[0m   99.90%   99.95%   99.94%   99.95%   99.94%   99.95%   99.95%   99.92%
    E_1326935_raven_12_Utg616:        99.92%   99.90%   99.94%   99.90%  [2m100.00%[0m   99.94%   99.93%   99.94%   99.93%   99.94%   99.94%   99.91%
    F_1326935_miniasm_03_utg000001c:  99.97%   99.95%   99.99%   99.95%   99.94%  [2m100.00%[0m   99.99%   99.99%   99.98%   99.99%   99.99%   99.95%
    G_1326935_miniasm_07_utg000001c:  99.97%   99.95%   99.98%   99.94%   99.93%   99.99%  [2m100.00%[0m   99.99%   99.98%   99.99%   99.98%   99.95%
    H_1326935_flye_02_edge_1:         99.97%   99.95%   99.99%   99.95%   99.94%   99.99%   99.99%  [2m100.00%[0m   99.98%   99.99%  100.00%   99.95%
    I_1326935_miniasm_11_utg000001c:  99.96%   99.96%   99.98%   99.94%   99.93%   99.98%   99.98%   99.98%  [2m100.00%[0m   99.98%   99.98%   99.96%
    J_1326935_flye_06_edge_1:         99.97%   99.95%   99.99%   99.95%   99.94%   99.99%   99.99%   99.99%   99.98%  [2m100.00%[0m   99.99%   99.95%
    K_1326935_flye_10_edge_1:         99.97%   99.95%   99.99%   99.95%   99.94%   99.99%   99.98%  100.00%   99.98%   99.99%  [2m100.00%[0m   99.95%
    L_1326935_raven_04_Utg602:        99.93%   99.93%   99.95%   99.92%   99.91%   99.95%   99.95%   99.95%   99.96%   99.95%   99.95%  [2m100.00%[0m
  
  Maximum insertion/deletion sizes:
    A_1:                             [2m   0[0m   375   375   375   375   375   375   375   375   375   375   375
    B_1:                              375  [2m   0[0m   258   258   972   258   258   258   256   258   258   256
    C_1:                              375   258  [2m   0[0m   121   972   121   121   121   175   121   121   258
    D_1326935_raven_08_Utg658:        375   258   121  [2m   0[0m   972  [31m1145[0m  [31m1150[0m  [31m1145[0m   883  [31m1139[0m  [31m1145[0m   969
    E_1326935_raven_12_Utg616:        375   972   972   972  [2m   0[0m   260   260   260   260   260   260   260
    F_1326935_miniasm_03_utg000001c:  375   258   121  [31m1145[0m   260  [2m   0[0m   115     4   175     4     4   258
    G_1326935_miniasm_07_utg000001c:  375   258   121  [31m1150[0m   260   115  [2m   0[0m   115   175   115   115   258
    H_1326935_flye_02_edge_1:         375   258   121  [31m1145[0m   260     4   115  [2m   0[0m   175     2     1   258
    I_1326935_miniasm_11_utg000001c:  375   256   175   883   260   175   175   175  [2m   0[0m   258   258   194
    J_1326935_flye_06_edge_1:         375   258   121  [31m1139[0m   260     4   115     2   258  [2m   0[0m     2   258
    K_1326935_flye_10_edge_1:         375   258   121  [31m1145[0m   260     4   115     1   258     2  [2m   0[0m   258
    L_1326935_raven_04_Utg602:        375   256   258   969   260   258   258   258   194   258   258  [2m   0[0m
  
  
  Error: some pairwise indels are greater than the maximum allowed value of 1000.
  Please remove offending sequences or raise the --max_indel_size threshold and
  try again.
  
  1326935_filtered.fastq.gz
  cluster_001
  remove.txt
  ls: cluster_001/2_all_seqs.fasta: No such file or directory

Work dir:
  /Volumes/IDGenomics_NAS/testing_DF/create_test_files/work/2c/15b0239767a1ba8c7776fb78138641

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

The sample with the large indel size is 1326935, the cluster is cluster_001, and the sequence associated with the error is D_1326935_raven_08_Utg658. To remove this sequence, a line consisting of '1326935,cluster_001,D_1326935_raven_08_Utg658' would be added to 'remove.txt'. There is also the option to increase the allowed indel size by adjusting params.trycycler_reconcile_options in a config file with something like params.trycycler_reconcile_options = '--max_indel_size 1200'.
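The config-file route can be sketched as follows. It assumes the parameter is named 'params.trycycler_reconcile_options', as in the list of defaults at the end of this page, and that the file is passed with Nextflow's '-c' option; the file name 'reconcile.config' and the helper 'write_reconcile_config' are made up for illustration.

```shell
# Write a small Nextflow config that raises the indel limit for reconcile.
# $1 = path of the config file to create
write_reconcile_config() {
  cat > "$1" <<'EOF'
// Assumed override: extra options handed to 'trycycler reconcile'.
params.trycycler_reconcile_options = '--max_indel_size 1200'
EOF
}

# Example usage:
# write_reconcile_config reconcile.config
# nextflow run UPHL-BioNGS/Donut_Falls -profile singularity --assembler trycycler -c reconcile.config -resume
```

The value 1200 is only large enough for this example, where the largest indel reported is 1150.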

Sample error: start and end were found in multiple places

In this example, the start and end of one sequence were found in multiple places in the other sequences.

The nextflow error
Error executing process > 'trycycler:reconcile (1326933-2_cluster_001)'

Caused by:
  Process `trycycler:reconcile (1326933-2_cluster_001)` terminated with an error exit status (1)

Command executed:

  trycycler --version 
  
  if [ -f "remove.txt" ]
  then
    while read line
    do
      cluster=$(echo $line | cut -f 2 -d ,)
      file=$(echo $line | cut -f 3 -d ,)
      if [ -f "$cluster/1_contigs/$file.fasta" ] ; then mv $cluster/1_contigs/$file.fasta $cluster/1_contigs/$file.fasta_remove ; fi
    done < <(grep ^1326933-2, remove.txt)
  fi
  
  num_fasta=$(ls cluster_001/1_contigs/*.fasta | wc -l)
  echo "There are $num_fasta in cluster_001 for 1326933-2"
  if [ "$num_fasta" -ge "4" ]
  then
    trycycler reconcile          --reads 1326933-2_filtered.fastq.gz         --cluster_dir cluster_001         --threads 12
  
      ls
  
      ls cluster_001/2_all_seqs.fasta
  else
    echo "1326933-2 cluster cluster_001 only had $num_fasta fastas"
    mv cluster_001 cluster_001_cluster_too_small
  fi

Command exit status:
  1

Command output:
  Trycycler v0.5.3
  There are 12 in cluster_001 for 1326933-2
  1326933-2_filtered.fastq.gz
  cluster_001
  remove.txt

Command error:
      no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
    using G_1326933-2_raven_12_Utg596:
      no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
    using H_1326933-2_miniasm_07_utg000001c:
      no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
    using I_1326933-2_miniasm_11_utg000001c:
      no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
    using J_1326933-2_flye_06_edge_1:
      no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
    using K_1326933-2_flye_10_edge_1:
      no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
    using L_1326933-2_flye_02_edge_2:
      no adjustment needed (D_1326933-2_raven_08_Utg600 is already circular)
    circularisation complete (3,965,323 bp)
  
  Circularising E_1326933-2_miniasm_03_utg000001l:
    using A_1:
      unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in A_1
    using B_1:
      unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in B_1
    using C_1:
      unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in C_1
    using D_1326933-2_raven_08_Utg600:
      unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in D_1326933-2_raven_08_Utg600
    using F_1326933-2_raven_04_Utg570:
      unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in F_1326933-2_raven_04_Utg570
    using G_1326933-2_raven_12_Utg596:
      unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in G_1326933-2_raven_12_Utg596
    using H_1326933-2_miniasm_07_utg000001c:
      unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in H_1326933-2_miniasm_07_utg000001c
    using I_1326933-2_miniasm_11_utg000001c:
      unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in I_1326933-2_miniasm_11_utg000001c
    using J_1326933-2_flye_06_edge_1:
      unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in J_1326933-2_flye_06_edge_1
    using K_1326933-2_flye_10_edge_1:
      unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in K_1326933-2_flye_10_edge_1
    using L_1326933-2_flye_02_edge_2:
      unable to circularise: E_1326933-2_miniasm_03_utg000001l's start and end were found in multiple places in L_1326933-2_flye_02_edge_2
  
  Error: failed to circularise sequence E_1326933-2_miniasm_03_utg000001l because
  its start/end sequences were found in multiple ambiguous places in other
  sequences. This is likely because E_1326933-2_miniasm_03_utg000001l starts/ends
  in a repetitive region. You can either manually repair its circularisation (and
  ensure it does not start/end in a repetitive region) or exclude the sequence
  altogether and try again.
  
  1326933-2_filtered.fastq.gz
  cluster_001
  remove.txt
  ls: cluster_001/2_all_seqs.fasta: No such file or directory

Work dir:
  /Volumes/IDGenomics_NAS/testing_DF/create_test_files/work/8d/32a34210b544ea4abb4b703a567c2b

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

In this example, the sample is '1326933-2', the cluster is 'cluster_001', and the problematic sequence is 'E_1326933-2_miniasm_03_utg000001l'. To remove this sequence, a line consisting of '1326933-2,cluster_001,E_1326933-2_miniasm_03_utg000001l' would be added to 'remove.txt' and the workflow would be started again with -resume.
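Adding the line can be scripted so that the same entry is never written twice. This is a minimal sketch; 'add_removal' is a hypothetical helper and is not part of Donut Falls.

```shell
# Append "sample,cluster,sequence" to a remove file unless it is already there.
# $1 = remove file, $2 = sample, $3 = cluster, $4 = sequence name
add_removal() {
  entry="$2,$3,$4"
  touch "$1"
  if ! grep -qxF -- "$entry" "$1"; then
    printf '%s\n' "$entry" >> "$1"
  fi
}

# Example call for the error above:
# add_removal remove.txt 1326933-2 cluster_001 E_1326933-2_miniasm_03_utg000001l
```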

Sample error: length difference

In this example, the lengths of the sequences are not similar enough.

The nextflow error

Error executing process > 'trycycler:reconcile (1326933-2_cluster_002)'

Caused by:
  Process `trycycler:reconcile (1326933-2_cluster_002)` terminated with an error exit status (1)

Command executed:

  trycycler --version 
  
  if [ -f "remove.txt" ]
  then
    while read line
    do
      cluster=$(echo $line | cut -f 2 -d ,)
      file=$(echo $line | cut -f 3 -d ,)
      if [ -f "$cluster/1_contigs/$file.fasta" ] ; then mv $cluster/1_contigs/$file.fasta $cluster/1_contigs/$file.fasta_remove ; fi
    done < <(grep ^1326933-2, remove.txt)
  fi
  
  num_fasta=$(ls cluster_002/1_contigs/*.fasta | wc -l)
  echo "There are $num_fasta in cluster_002 for 1326933-2"
  if [ "$num_fasta" -ge "4" ]
  then
    trycycler reconcile          --reads 1326933-2_filtered.fastq.gz         --cluster_dir cluster_002         --threads 12
  
      ls
  
      ls cluster_002/2_all_seqs.fasta
  else
    echo "1326933-2 cluster cluster_002 only had $num_fasta fastas"
    mv cluster_002 cluster_002_cluster_too_small
  fi

Command exit status:
  1

Command output:
  Trycycler v0.5.3
  There are 9 in cluster_002 for 1326933-2
  1326933-2_filtered.fastq.gz
  cluster_002
  remove.txt

Command error:
  [93m[1m[4mStarting Trycycler reconcile[0m [2m(2023-03-10 12:59:51)[0m
  [2m    Trycycler reconcile is a tool for reconciling multiple alternative contigs[0m
  [2mwith each other.[0m
  
  Input reads: 1326933-2_filtered.fastq.gz
    size = 354,044,517 bytes
  
  Input contigs:
    cluster_002/1_contigs/A_2.fasta (19,660 bp)
    cluster_002/1_contigs/B_2.fasta (19,662 bp)
    cluster_002/1_contigs/E_1326933-2_miniasm_03_utg000002c.fasta (19,662 bp)
    cluster_002/1_contigs/F_1326933-2_raven_04_Utg572.fasta (19,661 bp)
    cluster_002/1_contigs/G_1326933-2_raven_12_Utg598.fasta (19,663 bp)
    cluster_002/1_contigs/H_1326933-2_miniasm_07_utg000002c.fasta (19,662 bp)
    cluster_002/1_contigs/I_1326933-2_miniasm_11_utg000002c.fasta (19,664 bp)
    cluster_002/1_contigs/K_1326933-2_flye_10_edge_2.fasta (19,660 bp)
    cluster_002/1_contigs/L_1326933-2_flye_02_edge_3.fasta (13,852 bp)
  
  Checking required software:
    minimap2: v2.23-r1111
  
  
  [93m[1m[4mInitial check of contigs[0m [2m(2023-03-10 12:59:51)[0m
  [2m    Before proceeding, Trycycler ensures that the input contigs appear[0m
  [2msufficiently close to each other to make a consensus. If not, the program will[0m
  [2mquit and the user must fix the input contigs (make them more similar to each[0m
  [2mother) or exclude some before trying again.[0m
  
  Relative sequence lengths:
    A_2:                               [2m1.000[0m  1.000  1.000  1.000  1.000  1.000  1.000  1.000  [31m1.419[0m
    B_2:                               1.000  [2m1.000[0m  1.000  1.000  1.000  1.000  1.000  1.000  [31m1.419[0m
    E_1326933-2_miniasm_03_utg000002c: 1.000  1.000  [2m1.000[0m  1.000  1.000  1.000  1.000  1.000  [31m1.419[0m
    F_1326933-2_raven_04_Utg572:       1.000  1.000  1.000  [2m1.000[0m  1.000  1.000  1.000  1.000  [31m1.419[0m
    G_1326933-2_raven_12_Utg598:       1.000  1.000  1.000  1.000  [2m1.000[0m  1.000  1.000  1.000  [31m1.420[0m
    H_1326933-2_miniasm_07_utg000002c: 1.000  1.000  1.000  1.000  1.000  [2m1.000[0m  1.000  1.000  [31m1.419[0m
    I_1326933-2_miniasm_11_utg000002c: 1.000  1.000  1.000  1.000  1.000  1.000  [2m1.000[0m  1.000  [31m1.420[0m
    K_1326933-2_flye_10_edge_2:        1.000  1.000  1.000  1.000  1.000  1.000  1.000  [2m1.000[0m  [31m1.419[0m
    L_1326933-2_flye_02_edge_3:        [31m0.705[0m  [31m0.705[0m  [31m0.705[0m  [31m0.705[0m  [31m0.704[0m  [31m0.705[0m  [31m0.704[0m  [31m0.705[0m  [2m1.000[0m
  
  
  Error: there is too much length difference between contigs. You must either
  exclude or repair the offending contig sequences and then try running trycycler
  reconcile again. If one of the sequences is too long, it could be due to
  excessive circularisation overlap, and trimming that overlap may allow
  trycycler reconcile to continue.
  
  1326933-2_filtered.fastq.gz
  cluster_002
  remove.txt
  ls: cluster_002/2_all_seqs.fasta: No such file or directory

Work dir:
  /Volumes/IDGenomics_NAS/testing_DF/create_test_files/work/b8/12ea5639ea2fa7fb1a055483b51ae0

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

In this example, the sample is '1326933-2', the cluster is 'cluster_002', and the problematic sequence is 'L_1326933-2_flye_02_edge_3' because it is much smaller. To remove this sequence, a line consisting of '1326933-2,cluster_002,L_1326933-2_flye_02_edge_3' would be added to 'remove.txt' and the workflow would be started again with -resume.
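The values in the "Relative sequence lengths" table can be reproduced from the contig lengths listed under "Input contigs": each entry appears to be the row contig's length divided by the column contig's length. This is a small arithmetic sketch; 'relative_length' is a hypothetical helper, and the threshold Trycycler applies is not shown in this log.

```shell
# Ratio of two contig lengths, rounded to three decimals as in the log.
relative_length() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a / b }'
}

relative_length 19660 13852   # A_2 vs L_1326933-2_flye_02_edge_3, prints 1.419
relative_length 13852 19660   # L_1326933-2_flye_02_edge_3 vs A_2, prints 0.705
```

Every other pair of contigs in this cluster differs by at most 4 bp, which is why only the row and column for L_1326933-2_flye_02_edge_3 are flagged.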

remove file

If the only problems were the ones in the examples above, remove.txt would contain the following lines:

1326935,cluster_001,D_1326935_raven_08_Utg658
1326933-2,cluster_001,E_1326933-2_miniasm_03_utg000001l
1326933-2,cluster_002,L_1326933-2_flye_02_edge_3

The corresponding command line would then be something like the following (DO NOT FORGET TO use '-resume'!)

nextflow run UPHL-BioNGS/Donut_Falls -profile singularity --assembler trycycler --remove remove.txt -resume
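Before resuming, it can be useful to preview which contig files a remove file would set aside for a given sample. The sketch below mirrors the way the reconcile script shown in the errors above selects lines (matching '^sample,' and then taking the cluster and sequence name); 'preview_removals' is a hypothetical helper, and it only prints paths without renaming anything.

```shell
# List the contig files that would be set aside for one sample.
# $1 = remove file, $2 = sample name
preview_removals() {
  grep "^$2," "$1" | while IFS=, read -r sample cluster name
  do
    echo "$cluster/1_contigs/$name.fasta"
  done
}

# Example call:
# preview_removals remove.txt 1326933-2
```

Because the match includes the trailing comma, an entry for sample 1326933-2 is not picked up when previewing a sample named 1326933.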

The Trycycler workflow

---
Trycycler
---
flowchart LR

A[filtered fastq] --> B[subsample]
B --> C[assemble with flye]
B --> D[assemble with unicycler]
B --> E[assemble with miniasm and minipolish]
B --> F[assemble with raven]
C --> G[cluster]
D --> G[cluster]
E --> G[cluster]
F --> G[cluster]
G --> H[reconcile]
H -- remove sequences --> H
H --> I[msa]
I --> J[partition]
A --> J
J --> K[consensus]
K --> L[combine fasta]
L --> M[polish]

Relevant parameters (params) and their default values

params.rasusa_options              = '--frac 80'
params.trycycler_subsample_options = ''
params.trycycler_cluster_options   = ''
params.trycycler_consensus_options = ''
params.trycycler_dotplot_options   = ''
params.trycycler_msa_options       = ''
params.trycycler_partition_options = ''
params.trycycler_reconcile_options = ''