-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: client-side region-filter on htsget streams #71
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot! Looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot! Looks very good and pragmatically solves the issue!
A corner case that is not yet covered is when the region to filter for only specifies the chromosome, but not both of start and end. For example:
chrom, start, end = parse_region("chr22")
returns chrom = chr22
, start = None
and end=None
.
In bytesio_to_iterator
this causes a TypeError:
(get_start(record) < start) or (get_start(record) > end)
TypeError: '<' not supported between instances of 'int' and 'NoneType'
This could be solved with some more conditionals, but will become very nested because either start
or end
or both
could be NoneType
. Do you have a better idea?
Nice catch, I fixed it with a simple solution, but it trades off performance for better readability as some unnecessary conditions are evaluated. (this is probably not the bottleneck, currently :p ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good to me - I would also definitely prioritize readability here 😄
This PR ensures records streamed from htsget match the query region by applying lazy filter on client side. This is because the htsget server only selects record at the bgzf block level, resulting in out-of-query records.
This is a quickfix and the streaming code will need refactor as it is not maintainable nor efficient.
Fixes #51