Fix to ensure dataframe indices still match after adjustments #551
This PR will hopefully solve, or at least further the conversation on how to solve, the bug reported in #436 and possibly #547. For reference, my command, run within the `etal/cnvkit:0.9.7` Docker container, was `/usr/bin/python3 /code/cnvkit.py batch /inputs/1455918947/tumor.bam --normal /inputs/29426665/normal.bam --targets /inputs/876494580/hla_and_brca_genes_bait.interval_list --method hybrid`. I am happy to provide these input files if you'd like. The resulting error trace:

The issue begins when a mask is applied to drop poor-quality bins in `cnvlib/fix.py`. In my case, examination of the dataframes after masking showed that their indices were no longer a continuous series 0, 1, 2, ... x, but instead had gaps. Although there are gaps, at this point in the code both `cnarr` and `ref_matched` still have matching indices. However, a later section of the code may pass `cnarr` to the function `center_by_window` and set its value to the function's return value. Within `center_by_window`, the index of `cnarr` is reset. No such reset is applied to `ref_matched`.
This means that when the `log2` column of `ref_matched` is subtracted from the `log2` column of `cnarr`, there are "gaps" in `ref_matched` relative to `cnarr`, and the values are set to `NaN` in `cnarr`. Modified code with output to demonstrate:

Code
Logs
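
Separately from the modified cnvkit code referenced above, the index mismatch can be reproduced in isolation with plain pandas. This is a minimal sketch on made-up toy values: the variable names mirror `cnarr` and `ref_matched`, but the frames, the mask, and the numbers are assumptions for illustration only, and `reset_index(drop=True)` stands in for the index reset described for `center_by_window`.

```python
import pandas as pd

# Toy stand-ins for cnarr and ref_matched (values are illustrative only).
cnarr = pd.DataFrame({"log2": [1.0, 2.0, 3.0, 4.0, 5.0]})
ref_matched = pd.DataFrame({"log2": [0.5, 0.5, 1.0, 1.0, 1.5]})

# Drop "poor quality" rows from both frames with the same mask.
# Both frames now share the gapped index [0, 2, 3].
mask = pd.Series([True, False, True, True, False])
cnarr = cnarr[mask]
ref_matched = ref_matched[mask]

# Simulate the index reset applied to cnarr only: its index becomes [0, 1, 2].
cnarr_reset = cnarr.reset_index(drop=True)

# pandas aligns on index labels, not row position, so labels present in
# only one operand yield NaN, and label 2 pairs the wrong rows.
misaligned = cnarr_reset["log2"] - ref_matched["log2"]
print(misaligned)

# Resetting the reference index as well restores row-by-row pairing.
ref_reset = ref_matched.reset_index(drop=True)
aligned = cnarr_reset["log2"] - ref_reset["log2"]
print(aligned)
```

Note that the misaligned result contains more than missing values: any label that happens to exist in both indices is still subtracted, but between rows that no longer correspond to the same bin.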
Thus, `NaN` values are introduced to the `log2` column of `cnarr`, resulting in the Rscript error shown above. This PR adds a flag and conditional that resets the index of `ref_matched` if the index of `cnarr` has been reset by a call to `center_by_window`. This should be safe, since `center_by_window` preserves the original order of the rows, and I believe it is sufficient to solve the problem; however, there may be some things I missed, and I'd appreciate your thoughts.
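
The flag-and-conditional approach described above can be sketched roughly as follows. This is not the actual diff to `cnvlib/fix.py`: `center_stub`, `subtract_reference`, and the `do_center` parameter are hypothetical names, and the stub only imitates the two properties that matter here (row order is preserved, the index is reset).

```python
import pandas as pd


def center_stub(frame):
    """Hypothetical stand-in for center_by_window.

    Keeps the rows in their original order but returns a frame whose
    index has been reset to 0..n-1.
    """
    out = frame.reset_index(drop=True)
    out["log2"] = out["log2"] - out["log2"].median()
    return out


def subtract_reference(cnarr, ref_matched, do_center):
    """Subtract reference log2 from sample log2 while keeping indices in sync."""
    index_was_reset = False
    if do_center:
        cnarr = center_stub(cnarr)
        index_was_reset = True  # remember that cnarr's index changed

    # Only touch ref_matched when cnarr was actually re-indexed; this is
    # safe because the stub preserves row order.
    if index_was_reset:
        ref_matched = ref_matched.reset_index(drop=True)

    result = cnarr.copy()
    result["log2"] = cnarr["log2"] - ref_matched["log2"]
    return result


# Frames that share a gapped index, as they would after masking.
sample = pd.DataFrame({"log2": [1.0, 3.0, 4.0]}, index=[0, 2, 3])
reference = pd.DataFrame({"log2": [0.5, 1.0, 1.0]}, index=[0, 2, 3])

centered = subtract_reference(sample, reference, do_center=True)
uncentered = subtract_reference(sample, reference, do_center=False)
print(centered["log2"].tolist())
print(uncentered["log2"].tolist())
```

An alternative under the same assumption would be to subtract the underlying arrays positionally (for example via `.to_numpy()`), which sidesteps label alignment entirely, at the cost of losing the safety check that matching labels provide.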