Add dump function to write counts to file #30

Adamtaranto · 2024-09-16T03:35:54Z

Add a KmerCountTable method, .dump_hashes(), that writes sorted hash:count pairs to a tab-delimited output file. It can also return the records as a list of tuples for use in Python.

Added tests.

Addresses issue #24, but does not add an option to output kmer:count pairs at this time.

Example data:

import oxli

# Demo table
kct = oxli.KmerCountTable(ksize=4)
kct.count("AAAA")  # Count 'AAAA'
kct.count("TTTT")  # Count revcomp of 'AAAA'
kct.count("AATT")  # Count 'AATT'
kct.count("GGGG")  # Count 'GGGG'
kct.count("GGGG")  # Count again.

# Hashes
# 17832910516274425539 = AAAA/TTTT
#   382727017318141683 = AATT
#    73459868045630124 = GGGG
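
As a rough illustration of why 'AAAA' and 'TTTT' land on the same hash in the demo table: a k-mer and its reverse complement are treated as one record. The sketch below assumes the common convention of picking the lexicographically smaller of the two strands before hashing; that convention and the helper names `revcomp` and `canonical` are assumptions for illustration, not taken from the oxli source.

```python
# Sketch only: canonicalization by lexicographic minimum is an assumption,
# not confirmed against the oxli implementation.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(kmer: str) -> str:
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, revcomp(kmer))


print(canonical("AAAA"), canonical("TTTT"))  # both strands map to one k-mer
print(canonical("AATT"))  # AATT is its own reverse complement
```

Under this assumption, counting "AAAA" and then "TTTT" increments a single record, which matches the count of 2 shown for that hash in the outputs in this description.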

Default sort on counts then on keys:

kct.dump_hashes()
>>> [(382727017318141683, 1), (73459868045630124, 2), (17832910516274425539, 2)]

Optional sort on keys:

kct.dump_hashes(sortkeys=True)
>>> [(73459868045630124, 2), (382727017318141683, 1), (17832910516274425539, 2)]
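
The two orderings above can be reproduced in plain Python, which may help when checking a dump by hand. This is only a sketch of the ordering rules as described in this PR (count then hash by default, hash only with sortkeys=True); it does not call oxli, and the `records` list is copied from the example data.

```python
# Records from the demo table as (hash, count) tuples, in arbitrary order.
records = [
    (17832910516274425539, 2),
    (382727017318141683, 1),
    (73459868045630124, 2),
]

# Default order described in this PR: by count, ties broken by hash.
by_count = sorted(records, key=lambda rec: (rec[1], rec[0]))

# sortkeys=True order: by hash only.
by_hash = sorted(records, key=lambda rec: rec[0])

print(by_count)
print(by_hash)
```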

Sorted hash:count pairs can be written to a tab-delimited text file by specifying an output target:

# Write tab-delimited records to kct.dump
kct.dump_hashes(file="kct.dump")
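
A dump file can be read back with the standard library. The sketch below assumes one `hash<TAB>count` record per line and no header row, which is how the output is described here but is not verified against the written file; `read_dump` is a hypothetical helper, not part of oxli.

```python
import csv


def read_dump(path):
    """Parse a tab-delimited dump into a list of (hash, count) int tuples.

    Assumes one "hash<TAB>count" record per line and no header row.
    Blank lines are skipped.
    """
    with open(path, newline="") as handle:
        return [
            (int(row[0]), int(row[1]))
            for row in csv.reader(handle, delimiter="\t")
            if row
        ]
```

For example, `read_dump("kct.dump")` would return the records in the order they were written.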

If no output file is specified, records are returned as a list of (hash, count) tuples (as above).
This list can be converted to a pandas DataFrame:

import pandas as pd
table_dump = kct.dump_hashes()
df = pd.DataFrame(table_dump, columns=['Hash', 'Count'])
print(df)
>>>
  '''
                     Hash  Count
  0    382727017318141683      1
  1     73459868045630124      2
  2  17832910516274425539      2
  '''

If the table is empty, an empty list is returned:

empty_kct = oxli.KmerCountTable(ksize=4)

empty_kct.dump_hashes()
>>> []

Adamtaranto · 2024-09-18T02:32:40Z

Allow Closes #24

Will revisit .dump() for writing canonical kmers in place of hashes when we have a solution for storing hash:kmer pairs in the KmerCountTable.

ctb · 2024-09-18T14:54:11Z

src/lib.rs

+    /// By default, the records are sorted by count in ascending order. If two records have the same
+    /// count value, they are sorted by the hash value. If `sortkeys` is set to `True`, sorting is done
+    /// by the hash key instead.
+    #[pyo3(signature = (file=None, sortkeys=false))]


curious about needing to sort - for large k-mer collections, this is going to cost real memory. wouldn't a better default be to not sort, and then allow sorting?

My thinking was that it would be useful to default sort on counts then hashes:

Gives the same dump file irrespective of the order in which kmers were consumed (same data, same output)

Easy to do a quick visual sanity check on the output with head / tail

Opt 1: Add a sortcounts opt (+ logic to deal with both sortkeys and sortcounts being set)
Opt 2: Keep default sortcounts, add opt nosort

Leaning toward Opt 1, can just use sortcounts in the example docs.

if you're up for adding sortcounts option now, that'd be great - but can also be added later. (since I tend to work in bursts, I'll often punt things like this to an issue to be tackled later. goal is to not forget, but not necessarily pressure ourselves to do things quickly :))

nosort is also fine, too!

rename func to .dump()

Add an option kmer=True to output kmers instead of hashes (rather than two separate funcs) later when we have hash:kmer table option.

Default unsorted output.

Add sortcounts.

Error if both sortcounts and sortkeys are set
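
The option handling agreed in this thread (unsorted by default, optional sortcounts or sortkeys, error if both are set) could look roughly like the pure-Python stand-in below. It operates on a plain dict rather than a KmerCountTable; the function body and the choice of ValueError are assumptions for illustration, not the merged Rust implementation.

```python
def dump(counts, sortcounts=False, sortkeys=False):
    """Return (hash, count) records from a dict, unsorted unless asked.

    Hypothetical stand-in for discussion, not the oxli implementation.
    """
    if sortcounts and sortkeys:
        # Exception type is an assumption; the thread only says "error".
        raise ValueError("Set only one of sortcounts or sortkeys.")
    records = list(counts.items())
    if sortcounts:
        # By count, ties broken by hash: output no longer depends on insert order.
        records.sort(key=lambda rec: (rec[1], rec[0]))
    elif sortkeys:
        # Hashes are unique dict keys, so sorting tuples sorts by hash.
        records.sort()
    return records


counts = {17832910516274425539: 2, 382727017318141683: 1, 73459868045630124: 2}
print(dump(counts))                   # insertion order, no sort performed
print(dump(counts, sortcounts=True))  # deterministic order
```

Leaving the default unsorted avoids the extra sorting cost raised in review, while sortcounts keeps the "same data, same output" property available on request.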

Adamtaranto added 2 commits September 16, 2024 12:53

Add dump_hashes function to write sorted records to file or list.

7bd4a29

Add tests for dump_hashes()

b7f5805

Adamtaranto added the enhancement New feature or request label Sep 16, 2024

Adamtaranto requested a review from ctb September 16, 2024 03:35

Merge branch 'main' into dev_output

bcb573c

Adamtaranto linked an issue Sep 18, 2024 that may be closed by this pull request

Add .dump() method #24

Closed

ctb reviewed Sep 18, 2024

ctb approved these changes Sep 18, 2024

Adamtaranto and others added 4 commits September 20, 2024 18:58

update dump() tests

a490b33

convert dump_hashes to dump, make sort optional.

207272d

Style fixes by Ruff

08cd527

Merge branch 'main' into dev_output

203a350

Adamtaranto merged commit a578d97 into main Sep 20, 2024
15 checks passed

Adamtaranto deleted the dev_output branch September 20, 2024 09:13

ctb mentioned this pull request Sep 23, 2024

MRG: release version 0.3.0 #60

Merged

ctb Sep 18, 2024

Adamtaranto Sep 19, 2024

ctb Sep 19, 2024

ctb Sep 19, 2024

Adamtaranto Sep 20, 2024
