Report incorrectly reporting compressed read length rather than raw #1906

javibio-git · 2021-02-25T20:46:41Z

Dear developers,

I have HiFi reads of a Drosophila genome with an average read length of ~16 kb and ran HiCanu -pacbio-hifi to assemble the genome. But I noticed that the average median size of my reads was ~11 kb after trimming.

Is this striking reduction on length typical? or is there a way to tweak the filtering parameters to make canu more permissive?

Thank you!

Javier

The text was updated successfully, but these errors were encountered:

skoren · 2021-02-25T22:36:48Z

I'm not sure what version of Canu you're running but the 2.1.1 release does not do any trimming for -pacbio-hifi data by default. I suggest updating to that if you have not. If you have, what is the full command you're using?

javibio-git · 2021-02-25T22:57:39Z

The version I used is this one:
canu snapshot v2.2-development +82 changes (r10191 af771ef2c17d7d8e639d9c9d4897a7f609b5eb55)

And the command I used is:
canu -p pref -d assembly genomeSize=150m -pacbio-hifi reads.fasta.gz

Javier

skoren · 2021-02-25T23:01:20Z

That won't do trimming, what part of the report you are referring to as trimming reads? I expect you're looking at reads in homopolymer compressed space, not trimmed reads.

javibio-git · 2021-02-25T23:16:06Z

Yes, I was referring the the homopolymer compressed space, my mistake. As far as I understand this is basically reducing (or compressing) all homopolymers, right? But now the question is, how this is affecting the assembly process? and/or if this compression is critical so it controls for sequencing errors but sacrificing what may be actual genome sequence. I may be misunderstanding this but if you can clarify, it would greatly appreciated.

This is the section of the .report file I am looking at.

[TRIMMING/READS]
--
-- In sequence store './dper111_35.seqStore':
--   Found 454225 reads.
--   Found 7500009589 bases (50 times coverage).
--    Histogram of corrected-compressed reads:
--
--    G=5376204280                       sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        13192     39346    537626564  ||       1015-1530           24|-
--    00020        12706     80923   1075243992  ||       1531-2046           10|-
--    00030        12350    123867   1612868765  ||       2047-2562           12|-
--    00040        12056    167939   2150492329  ||       2563-3078            5|-
--    00050        11801    213019   2688103902  ||       3079-3594            5|-
--    00060        11568    259036   3225724752  ||       3595-4110           11|-
--    00070        11346    305963   3763344991  ||       4111-4626           13|-
--    00080        11111    353838   4300970177  ||       4627-5142           10|-
--    00090        10821    402827   4838584596  ||       5143-5658           24|-
--    00100         1015    454224   5376204280  ||       5659-6174           47|-
--    001.000x              454225   5376204280  ||       6175-6690           56|-
--                                               ||       6691-7206           71|-
--                                               ||       7207-7722           77|-
--                                               ||       7723-8238          106|-
--                                               ||       8239-8754          159|-
--                                               ||       8755-9270          347|-
--                                               ||       9271-9786         1243|-
--                                               ||       9787-10302        7980|-----
--                                               ||      10303-10818       40831|-------------------------
--                                               ||      10819-11334       94871|----------------------------------------------------------
--                                               ||      11335-11850      104514|---------------------------------------------------------------
--                                               ||      11851-12366       82245|--------------------------------------------------
--                                               ||      12367-12882       58034|-----------------------------------
--                                               ||      12883-13398       36340|----------------------
--                                               ||      13399-13914       18563|------------
--                                               ||      13915-14430        6557|----
--                                               ||      14431-14946        1550|-
--                                               ||      14947-15462         410|-
--                                               ||      15463-15978          87|-
--                                               ||      15979-16494          16|-
--                                               ||      16495-17010           0|
--                                               ||      17011-17526           0|
--                                               ||      17527-18042           0|
--                                               ||      18043-18558           0|
--                                               ||      18559-19074           0|
--                                               ||      19075-19590           0|
--                                               ||      19591-20106           0|
--                                               ||      20107-20622           0|
--                                               ||      20623-21138           0|
--                                               ||      21139-21654           0|
--                                               ||      21655-22170           1|-
--                                               ||      22171-22686           2|-
--                                               ||      22687-23202           1|-
--                                               ||      23203-23718           1|-
--                                               ||      23719-24234           0|
--                                               ||      24235-24750           0|
--                                               ||      24751-25266           1|-
--                                               ||      25267-25782           0|
--                                               ||      25783-26298           0|
--                                               ||      26299-26814           1|-
--

Thanks

Javier

skoren · 2021-02-25T23:29:24Z

Ah yes, that is just the report of the reads being loaded. It should report something like:

-- Correction skipped; not enabled.
--
-- Trimming skipped; not enabled.
--

so you know the reads haven't been trimmed. It looks like it is reporting compressed reads which are about 70% of the length of the original reads. This looks like a bug in the tip version, not present in the 2.1.1 release. It shouldn't show the compressed lengths in the report as this is confusing to users. This shouldn't affect the assembly, the reads will get uncompressed after consensus. However, unless you encounter a specific bug in 2.1.1 I'd recommend using the verified release version for a production assembly. I'll leave this open though so we fix the incorrect report.

javibio-git · 2021-02-25T23:47:41Z

Thank you for the clarification! I'll use the verified release then.

Best

brianwalenz · 2021-02-26T01:04:23Z

Fixed! To confirm, it was just a reporting issue.

skoren changed the title ~~Question on the overlap-based error correction / overlap filtering for HiFi reads.~~ Report incorrectly reporting compressed read length rather than raw Feb 25, 2021

skoren added the bug label Feb 25, 2021

skoren assigned brianwalenz Feb 25, 2021

brianwalenz added a commit that referenced this issue Feb 26, 2021

Request uncompressed lengths for read length histograms. Issue #1906.

f1eadb3

brianwalenz closed this as completed Feb 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Report incorrectly reporting compressed read length rather than raw #1906

Report incorrectly reporting compressed read length rather than raw #1906

javibio-git commented Feb 25, 2021

skoren commented Feb 25, 2021

javibio-git commented Feb 25, 2021

skoren commented Feb 25, 2021

javibio-git commented Feb 25, 2021

skoren commented Feb 25, 2021 •

edited

Loading

javibio-git commented Feb 25, 2021

brianwalenz commented Feb 26, 2021

Report incorrectly reporting compressed read length rather than raw #1906

Report incorrectly reporting compressed read length rather than raw #1906

Comments

javibio-git commented Feb 25, 2021

skoren commented Feb 25, 2021

javibio-git commented Feb 25, 2021

skoren commented Feb 25, 2021

javibio-git commented Feb 25, 2021

skoren commented Feb 25, 2021 • edited Loading

javibio-git commented Feb 25, 2021

brianwalenz commented Feb 26, 2021

skoren commented Feb 25, 2021 •

edited

Loading