Negative Value of "Input Read Pairs" and "Both Surviving" in log file #44

Open
lxwgcool opened this issue Dec 1, 2022 · 2 comments

@lxwgcool

lxwgcool commented Dec 1, 2022

Hi,

We are using Trimmomatic as a benchmark tool in our pipeline for the trimming of Illumina reads.

We recently found that the "Input Read Pairs" and "Both Surviving" metrics reported in the Trimmomatic log file are negative for one of our flowcells (the others are fine).

Please check the screenshot below (line 9):
[image: screenshot of the Trimmomatic log file]

The reads we used are human whole-genome sequencing reads. Each reads file is around 160 GB.

Based on our previous work, these two metrics should always be positive. I have several questions:
(1) Why do we get negative values for these two metrics?
(2) If it is not a bug, how should we interpret these negative values?
(3) How do we get the real numbers for "Input Read Pairs" and "Both Surviving"? Should we simply flip the sign from negative to positive?

Thanks so much for your help.
Best regards
Xin

@TonyBolger
Collaborator

This looks like a 32-bit integer wraparound. I guess you have over 2bn read pairs? The real number is the value shown there plus 2^32 (so not a simple sign flip).

Input: (2^32) + -1953136673 => 2341830623
Both Surviving: (2^32) + -2013735491 => 2281231805
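
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same correction. The function names are made up for illustration and are not part of Trimmomatic. `to_signed_32` only simulates what happens when a count is stored in a signed 32-bit integer, and `recover_count` assumes the counter wrapped at most once, i.e. the true count is below 2^32.

```python
INT32_MODULUS = 2 ** 32  # number of distinct values a 32-bit integer can hold


def to_signed_32(count: int) -> int:
    """Simulate storing a non-negative count in a signed 32-bit integer."""
    wrapped = count % INT32_MODULUS
    # Values at or above 2^31 are reinterpreted as negative numbers.
    return wrapped - INT32_MODULUS if wrapped >= 2 ** 31 else wrapped


def recover_count(logged: int) -> int:
    """Undo a single wraparound (assumes the true count is below 2^32)."""
    # Python's % with a positive modulus always returns a non-negative result,
    # so negative inputs get 2^32 added and positive inputs are unchanged.
    return logged % INT32_MODULUS


print(recover_count(-1953136673))  # 2341830623 (Input Read Pairs)
print(recover_count(-2013735491))  # 2281231805 (Both Surviving)
print(to_signed_32(2341830623))    # -1953136673, the value seen in the log
```

Note that this cannot detect a count that has wrapped all the way past 2^32: such a value would be logged as a small positive number and pass through unchanged.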

I hadn't really considered the possibility of >2bn reads in a dataset 10 years ago :)

@lxwgcool
Author

lxwgcool commented Dec 2, 2022

Thank you so much for your prompt reply, Tony. That makes sense.

We are currently using the Illumina HiSeq platform and have generated a lot of sequencing data for different WGS projects. I believe that, as the technology develops and with strong financial support, we may generate more datasets that contain more than 2 billion read pairs for a single subject.

Thanks for your solution. I will keep using the rule (2^32 + "the negative number") to convert the negative values to the real number of read pairs in our pipeline. However, as more datasets of this size are generated, I think it would be more convenient for us if you could update the Trimmomatic source code to use 64-bit integers for these counters and release a new version to solve this issue.
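
In case it helps others with the same problem, this is roughly how the rule could be applied in a pipeline. It is only a sketch: it assumes the summary line contains the labels "Input Read Pairs" and "Both Surviving", each followed by a colon and an integer, and the example line below is hand-built from the two numbers in this thread rather than copied from a real log. The function name `corrected_counts` is hypothetical.

```python
import re

# Labels taken from the metric names discussed in this thread.
SUMMARY_FIELDS = ("Input Read Pairs", "Both Surviving")


def corrected_counts(summary_line: str) -> dict:
    """Extract the two counters and undo a single 32-bit wraparound.

    Assumes each label is followed by ':' and a (possibly negative) integer,
    and that the true counts are below 2^32.
    """
    counts = {}
    for field in SUMMARY_FIELDS:
        match = re.search(re.escape(field) + r":\s*(-?\d+)", summary_line)
        if match:
            # Negative values get 2^32 added; positive values are unchanged.
            counts[field] = int(match.group(1)) % 2 ** 32
    return counts


# Hypothetical line, built only from the two values reported above.
line = "Input Read Pairs: -1953136673 Both Surviving: -2013735491"
print(corrected_counts(line))
```

The rule only holds while the true count stays below 2^32 (about 4.29 billion); beyond that, the logged value alone is not enough to recover the real number.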

Thank you so much for your help
Best
Xin
