Add dump function to write counts to file #30

Merged
merged 7 commits into from
Sep 20, 2024

Adamtaranto
Collaborator

Adds the KmerCountTable method .dump_hashes(), which writes sorted hash:count pairs to a tab-delimited output file. It can also return the pairs as a list of tuples for use in Python.

Added tests.

Addresses issue #24, but does not add an option to output kmer:count pairs at this time.

Example data:

import oxli

# Demo table
kct = oxli.KmerCountTable(ksize=4)
kct.count("AAAA")  # Count 'AAAA'
kct.count("TTTT")  # Count revcomp of 'AAAA'
kct.count("AATT")  # Count 'AATT'
kct.count("GGGG")  # Count 'GGGG'
kct.count("GGGG")  # Count again.

# Hashes
# 17832910516274425539 = AAAA/TTTT
# 382727017318141683 = AATT
# 73459868045630124 = GGGG

Default sort on counts then on keys:

kct.dump_hashes()
>>> [(382727017318141683, 1), (73459868045630124, 2), (17832910516274425539, 2)]

Optional sort on keys:

kct.dump_hashes(sortkeys=True)
>>> [(73459868045630124, 2), (382727017318141683, 1), (17832910516274425539, 2)]
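
The two orderings above can be reproduced in plain Python, which may help when checking a dump by hand. This is only a sketch: it uses a dict holding the demo hash:count pairs as a stand-in for the table and does not call oxli.

```python
# Hash:count pairs copied from the demo table above (a plain dict, not a KmerCountTable).
counts = {
    17832910516274425539: 2,  # AAAA/TTTT
    382727017318141683: 1,    # AATT
    73459868045630124: 2,     # GGGG
}

# Default order: sort on count, breaking ties on the hash value.
by_count = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))

# sortkeys=True order: sort on the hash value only.
by_key = sorted(counts.items())

print(by_count)
print(by_key)
```

The printed lists should match the two outputs shown above.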

Sorted hash:count pairs can be written to a tab-delimited text file by specifying an output target:

# Write tab-delimited records to kct.dump
kct.dump_hashes(file="kct.dump")
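
A dump file could be read back with the standard library along these lines. The sketch assumes one record per line in the form hash, tab, count, with no header row; that layout is an assumption, not something confirmed in this PR. The helper name `read_dump` is hypothetical, and the file is written by hand here so the example runs without oxli.

```python
import csv
import os
import tempfile


def read_dump(path):
    """Read tab-delimited hash/count records into a list of (int, int) tuples."""
    with open(path, newline="") as handle:
        return [(int(h), int(c)) for h, c in csv.reader(handle, delimiter="\t")]


# Stand-in for a file written by kct.dump_hashes(file=...); the layout is assumed.
records = [
    (382727017318141683, 1),
    (73459868045630124, 2),
    (17832910516274425539, 2),
]
path = os.path.join(tempfile.mkdtemp(), "kct.dump")
with open(path, "w", newline="") as handle:
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(records)

loaded = read_dump(path)
print(loaded == records)
```

Python integers are arbitrary precision, so the largest hash round-trips without overflow.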

If no output file is specified, records are returned as a list of (hash, count) tuples (as above).
This list can be converted to a pandas DataFrame:

import pandas as pd
table_dump = kct.dump_hashes()
df = pd.DataFrame(table_dump, columns=['Hash', 'Count'])
print(df)
>>>
  '''
                     Hash  Count
  0    382727017318141683      1
  1     73459868045630124      2
  2  17832910516274425539      2
  '''

If the table is empty, an empty list is returned:

empty_kct = oxli.KmerCountTable(ksize=4)

empty_kct.dump_hashes()
>>> []

@Adamtaranto Adamtaranto added the enhancement New feature or request label Sep 16, 2024
@Adamtaranto Adamtaranto requested a review from ctb September 16, 2024 03:35
@Adamtaranto
Collaborator Author

Allow Closes #24

Will revisit .dump() for writing canonical kmers in place of hashes when we have a solution for storing hash:kmer pairs in the KmerCountTable.

@Adamtaranto Adamtaranto linked an issue Sep 18, 2024 that may be closed by this pull request
src/lib.rs Outdated
/// By default, the records are sorted by count in ascending order. If two records have the same
/// count value, they are sorted by the hash value. If `sortkeys` is set to `True`, sorting is done
/// by the hash key instead.
#[pyo3(signature = (file=None, sortkeys=false))]
Contributor

curious about needing to sort - for large k-mer collections, this is going to cost real memory. wouldn't a better default be to not sort, and then allow sorting?

Collaborator Author

My thinking was that it would be useful to default sort on counts then hashes:

  • Gives the same dump file irrespective of the order in which kmers were consumed (same data, same output)
  • Easy to do a quick visual sanity check on the output with head / tail
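
The "same data, same output" point can be illustrated with a small sketch. It uses collections.Counter keyed on kmer strings as a stand-in for the table (no hashing or reverse-complement handling), and `sorted_dump` is a hypothetical helper, not part of oxli.

```python
from collections import Counter


def sorted_dump(counter):
    """Return (key, count) pairs sorted on count, then on key."""
    return sorted(counter.items(), key=lambda kv: (kv[1], kv[0]))


kmers = ["AAAA", "AATT", "GGGG", "GGGG"]
forward = Counter(kmers)             # consumed in the original order
backward = Counter(reversed(kmers))  # same kmers, consumed in reverse

# Unsorted records follow insertion order, so the two listings differ.
same_unsorted = list(forward.items()) == list(backward.items())

# Sorted records are identical regardless of consumption order.
same_sorted = sorted_dump(forward) == sorted_dump(backward)

print(same_unsorted, same_sorted)
```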

Opt 1: Add a sortcounts opt (+ logic to deal with both sortkeys and sortcounts being set)
Opt 2: Keep default sortcounts, add opt nosort

Leaning toward Opt 1, can just use sortcounts in the example docs.

Contributor

if you're up for adding sortcounts option now, that'd be great - but can also be added later. (since I tend to work in bursts, I'll often punt things like this to an issue to be tackled later. goal is to not forget, but not necessarily pressure ourselves to do things quickly :))

Contributor

nosort is also fine, too!

Collaborator Author

  • rename func to .dump()
  • Add an option kmer=True to output kmers instead of hashes (rather than two separate funcs) later when we have hash:kmer table option.
  • Default unsorted output.
  • Add sortcounts.
  • Error if both sortcounts and sortkeys are set
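
A rough pure-Python model of the behaviour listed above might look like the following. It is a sketch under assumptions, not the Rust implementation: the function operates on a plain dict, the choice of ValueError and its message are guesses, and the kmer=True option is left out because it depends on the future hash:kmer table.

```python
def dump(counts, file=None, sortcounts=False, sortkeys=False):
    """Hypothetical model of the planned .dump(): unsorted unless a sort flag is set."""
    if sortcounts and sortkeys:
        # Assumed error type; the plan only says this combination should error.
        raise ValueError("Set only one of sortcounts or sortkeys.")

    records = counts.items()  # a view, so the unsorted path makes no copy
    if sortcounts:
        records = sorted(records, key=lambda kv: (kv[1], kv[0]))
    elif sortkeys:
        records = sorted(records)

    if file is None:
        return list(records)

    # Write one tab-delimited record per line (assumed layout).
    with open(file, "w") as handle:
        for hashval, count in records:
            handle.write(f"{hashval}\t{count}\n")
    return None


demo = {17832910516274425539: 2, 382727017318141683: 1, 73459868045630124: 2}
print(dump(demo))                   # insertion order, no sorting cost
print(dump(demo, sortcounts=True))  # count, then hash
```

Keeping the unsorted path as a plain iteration reflects the review concern that sorting a large k-mer collection costs real memory.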

@Adamtaranto Adamtaranto merged commit a578d97 into main Sep 20, 2024
15 checks passed
@Adamtaranto Adamtaranto deleted the dev_output branch September 20, 2024 09:13
@ctb ctb mentioned this pull request Sep 23, 2024
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Add .dump() method
2 participants