Add dump function to write counts to file #30
Allow Closes #24 Will revisit |
src/lib.rs
Outdated
/// By default, the records are sorted by count in ascending order. If two records have the same
/// count value, they are sorted by the hash value. If `sortkeys` is set to `True`, sorting is done
/// by the hash key instead.
#[pyo3(signature = (file=None, sortkeys=false))]
curious about needing to sort - for large k-mer collections, this is going to cost real memory. wouldn't a better default be to not sort, and then allow sorting?
My thinking was that it would be useful to default sort on counts then hashes:
- Gives the same dump file irrespective of the order in which kmers were consumed (same data, same output)
- Easy to do a quick visual sanity check on the output with head / tail
Opt 1: Add a sortcounts opt (+ logic to deal with both sortkeys and sortcounts being set)
Opt 2: Keep default sortcounts, add opt nosort
Leaning toward Opt 1, can just use sortcounts in the example docs.
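To illustrate the "same data, same output" point, here is a minimal sketch in plain Python. A `Counter` stands in for the k-mer table, the hash values are made up, and `sorted_records` is a hypothetical helper, not part of this PR. The real table's iteration order may behave differently from a Python dict, but the idea is the same: an unsorted view depends on internal ordering, while a count-then-hash sort does not.

```python
from collections import Counter


def sorted_records(counts):
    """Return (hash, count) pairs ordered by count, then by hash."""
    return sorted(counts.items(), key=lambda rec: (rec[1], rec[0]))


# Hypothetical hash values: the same multiset consumed in two different orders.
first = Counter([30, 10, 20, 10, 30, 30])
second = Counter([10, 30, 20, 30, 10, 30])

# The unsorted views follow insertion order, so they differ.
print(list(first.items()))
print(list(second.items()))

# The sorted views are identical regardless of consumption order.
print(sorted_records(first))
print(sorted_records(second))
```

The cost raised in the review still applies: sorting materializes every record in memory before anything is written.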
if you're up for adding sortcounts option now, that'd be great - but can also be added later. (since I tend to work in bursts, I'll often punt things like this to an issue to be tackled later. goal is to not forget, but not necessarily pressure ourselves to do things quickly :))
nosort is also fine, too!
- Rename func to `.dump()`.
- Add an option `kmer=True` to output kmers instead of hashes (rather than two separate funcs) later, when we have the hash:kmer table option.
- Default unsorted output.
- Add sortcounts.
- Error if both sortcounts and sortkeys are set.
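The option handling in the list above could look roughly like the following. This is a pure-Python sketch under assumptions, not the Rust implementation in `src/lib.rs`: a dict stands in for the table, and the function name, argument order, and error message are illustrative only.

```python
def dump(counts, sortcounts=False, sortkeys=False):
    """Return (hash, count) pairs from a dict standing in for the table.

    Hypothetical sketch: unsorted by default, `sortcounts` orders by count
    then hash, `sortkeys` orders by hash, and setting both is an error.
    """
    if sortcounts and sortkeys:
        raise ValueError("Cannot set both sortcounts and sortkeys.")
    records = list(counts.items())
    if sortcounts:
        # Ties on count are broken by the hash value.
        records.sort(key=lambda rec: (rec[1], rec[0]))
    elif sortkeys:
        records.sort(key=lambda rec: rec[0])
    return records


counts = {7: 3, 42: 1, 19: 3}  # made-up hash:count pairs
print(dump(counts))                   # stand-in's own order, no sorting cost
print(dump(counts, sortcounts=True))  # by count, then hash
print(dump(counts, sortkeys=True))    # by hash
```

Raising on the conflicting combination keeps the behavior explicit, rather than silently letting one flag win.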
Add `KmerCountTable` method `.dump_hashes()` to write sorted hash:count pairs to a tab-delimited output file. Can also return the records as a list of tuples for use in Python. Added tests.
Addresses issue #24, but does not add an option to output kmer:count pairs at this time.
Example data:
Default sort on counts then on keys:
Optional sort on keys:
Sorted hash:count pairs can be written to a tab-delimited text file by specifying an output target:
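As a rough picture of the tab-delimited layout (one `hash<TAB>count` record per line), here is a sketch using Python's standard `csv` module. The helper `write_records` and the sample values are hypothetical, and an in-memory buffer stands in for the output file. The method in this PR does the writing on the Rust side, so this only illustrates the format.

```python
import csv
import io


def write_records(records, handle):
    """Write (hash, count) pairs to an open text handle, tab-separated."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(records)


# Made-up records, already sorted by count and then by hash.
records = [(42, 1), (7, 3), (19, 3)]

buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
write_records(records, buffer)
print(buffer.getvalue(), end="")
```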
If no output file is specified, records are returned as a list of (hash, count) tuples (as above).
This list can be converted to a pandas DataFrame:
If table is empty, returns empty list: