Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report incorrectly reporting compressed read length rather than raw #1906

Closed
javibio-git opened this issue Feb 25, 2021 · 7 comments
Closed
Assignees
Labels

Comments

@javibio-git
Copy link

Dear developers,

I have HiFi reads of a Drosophila genome with an average read length of ~16 kb and ran HiCanu -pacbio-hifi to assemble the genome. But I noticed that the average median size of my reads was ~11 kb after trimming.

Is this striking reduction on length typical? or is there a way to tweak the filtering parameters to make canu more permissive?

Thank you!

Javier

@skoren
Copy link
Member

skoren commented Feb 25, 2021

I'm not sure what version of Canu you're running but the 2.1.1 release does not do any trimming for -pacbio-hifi data by default. I suggest updating to that if you have not. If you have, what is the full command you're using?

@javibio-git
Copy link
Author

The version I used is this one:
canu snapshot v2.2-development +82 changes (r10191 af771ef2c17d7d8e639d9c9d4897a7f609b5eb55)

And the command I used is:
canu -p pref -d assembly genomeSize=150m -pacbio-hifi reads.fasta.gz

Javier

@skoren
Copy link
Member

skoren commented Feb 25, 2021

That won't do trimming, what part of the report you are referring to as trimming reads? I expect you're looking at reads in homopolymer compressed space, not trimmed reads.

@javibio-git
Copy link
Author

Yes, I was referring the the homopolymer compressed space, my mistake. As far as I understand this is basically reducing (or compressing) all homopolymers, right? But now the question is, how this is affecting the assembly process? and/or if this compression is critical so it controls for sequencing errors but sacrificing what may be actual genome sequence. I may be misunderstanding this but if you can clarify, it would greatly appreciated.

This is the section of the .report file I am looking at.

[TRIMMING/READS]
--
-- In sequence store './dper111_35.seqStore':
--   Found 454225 reads.
--   Found 7500009589 bases (50 times coverage).
--    Histogram of corrected-compressed reads:
--
--    G=5376204280                       sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        13192     39346    537626564  ||       1015-1530           24|-
--    00020        12706     80923   1075243992  ||       1531-2046           10|-
--    00030        12350    123867   1612868765  ||       2047-2562           12|-
--    00040        12056    167939   2150492329  ||       2563-3078            5|-
--    00050        11801    213019   2688103902  ||       3079-3594            5|-
--    00060        11568    259036   3225724752  ||       3595-4110           11|-
--    00070        11346    305963   3763344991  ||       4111-4626           13|-
--    00080        11111    353838   4300970177  ||       4627-5142           10|-
--    00090        10821    402827   4838584596  ||       5143-5658           24|-
--    00100         1015    454224   5376204280  ||       5659-6174           47|-
--    001.000x              454225   5376204280  ||       6175-6690           56|-
--                                               ||       6691-7206           71|-
--                                               ||       7207-7722           77|-
--                                               ||       7723-8238          106|-
--                                               ||       8239-8754          159|-
--                                               ||       8755-9270          347|-
--                                               ||       9271-9786         1243|-
--                                               ||       9787-10302        7980|-----
--                                               ||      10303-10818       40831|-------------------------
--                                               ||      10819-11334       94871|----------------------------------------------------------
--                                               ||      11335-11850      104514|---------------------------------------------------------------
--                                               ||      11851-12366       82245|--------------------------------------------------
--                                               ||      12367-12882       58034|-----------------------------------
--                                               ||      12883-13398       36340|----------------------
--                                               ||      13399-13914       18563|------------
--                                               ||      13915-14430        6557|----
--                                               ||      14431-14946        1550|-
--                                               ||      14947-15462         410|-
--                                               ||      15463-15978          87|-
--                                               ||      15979-16494          16|-
--                                               ||      16495-17010           0|
--                                               ||      17011-17526           0|
--                                               ||      17527-18042           0|
--                                               ||      18043-18558           0|
--                                               ||      18559-19074           0|
--                                               ||      19075-19590           0|
--                                               ||      19591-20106           0|
--                                               ||      20107-20622           0|
--                                               ||      20623-21138           0|
--                                               ||      21139-21654           0|
--                                               ||      21655-22170           1|-
--                                               ||      22171-22686           2|-
--                                               ||      22687-23202           1|-
--                                               ||      23203-23718           1|-
--                                               ||      23719-24234           0|
--                                               ||      24235-24750           0|
--                                               ||      24751-25266           1|-
--                                               ||      25267-25782           0|
--                                               ||      25783-26298           0|
--                                               ||      26299-26814           1|-
--

Thanks

Javier

@skoren
Copy link
Member

skoren commented Feb 25, 2021

Ah yes, that is just the report of the reads being loaded. It should report something like:

-- Correction skipped; not enabled.
--
-- Trimming skipped; not enabled.
--

so you know the reads haven't been trimmed. It looks like it is reporting compressed reads which are about 70% of the length of the original reads. This looks like a bug in the tip version, not present in the 2.1.1 release. It shouldn't show the compressed lengths in the report as this is confusing to users. This shouldn't affect the assembly, the reads will get uncompressed after consensus. However, unless you encounter a specific bug in 2.1.1 I'd recommend using the verified release version for a production assembly. I'll leave this open though so we fix the incorrect report.

@skoren skoren changed the title Question on the overlap-based error correction / overlap filtering for HiFi reads. Report incorrectly reporting compressed read length rather than raw Feb 25, 2021
@skoren skoren added the bug label Feb 25, 2021
@javibio-git
Copy link
Author

Thank you for the clarification! I'll use the verified release then.

Best

@brianwalenz
Copy link
Member

Fixed! To confirm, it was just a reporting issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants