v0.7.8

Pre-release
@MatthewRalston released this 28 Mar 19:39
· 10 commits to master since this release
84a33ea

kmerdb graph introduced, producing a new file format, .kdbg, which is an edge list. The new format also has a new metadata schema. kmerdb view and kmerdb header are compatible with the new format.

The goal is to create a weighted graph. Support for assembly and graph visualizations is planned for the future.

After 0.7.6 the .kdb spec will be loosely deprecated. While the .kdb format may remain unchanged (don't know yet), the goal is to produce an adjacency list structure from only the k-mer counts and the 'neighbor' k-mer ids. After the format revision (mostly to the --all-metadata option), a new command, kmerdb graph, will be added to generate an on-disk representation of an adjacency list.
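To make the idea of 'neighbor' k-mer ids concrete, here is a minimal sketch of how left-side and right-side neighbors could be enumerated for one k-mer. The 2-bit encoding (A=0, C=1, G=2, T=3) and the helper names kmer_to_id and neighbor_ids are assumptions made for illustration; they are not taken from the kmerdb source.

```python
from typing import Dict, List

# Assumed base order for the 2-bit encoding (hypothetical, for illustration).
ALPHABET = "ACGT"


def kmer_to_id(kmer: str) -> int:
    """Encode a k-mer as an integer, treating it as a base-4 number."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + ALPHABET.index(base)
    return idx


def neighbor_ids(kmer: str) -> Dict[str, List[int]]:
    """Return ids of the k-mers that overlap this one by k-1 bases.

    Left neighbors prepend a base and drop the last one;
    right neighbors drop the first base and append one.
    """
    left = [kmer_to_id(base + kmer[:-1]) for base in ALPHABET]
    right = [kmer_to_id(kmer[1:] + base) for base in ALPHABET]
    return {"left": left, "right": right}


neighbors = neighbor_ids("ACG")
print(neighbors)
```

Each k-mer has at most four left and four right neighbors under this scheme, which is why a per-row neighbor field stays small.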

  • What does this mean?

At this point, the new feature is in the planning stage, and it is not known if backwards compatibility (< 0.7.7) will be supported. One goal is to create an adjacency list structure on disk from the --all-metadata augmented .kdb format. It is not clear yet if cycles will be permitted in the graph structure, or if a distinct "offset" flag will be used. An example follows.

  • 0.7.6 .kdb format
    col1 is row number, col2 is sort order, col3 is k-mer id, col4 is k-mer count. col5 (--all-metadata) featured a loosely specified, poorly implemented 'neighbor' JSON field: a dictionary with "A", "C", "T", "G", etc. keys. Basically, it provided the neighboring (left side and right side) k-mer ids.
1    1    1    123
  • 0.7.7+ .kdbg
    col1 is a unique row number, col2 is the k-mer id (may be repeated), and col3 is a comma-separated field of possible adjacent row-ids, corresponding to the neighbors in k-mer space of the k-mer id in col2. col4 represents a possible solution for a graph traversal that produces a Hamiltonian (whatever) walk through the graph. That walk would recapitulate either the exact (.fasta) assembly solution or a potential assembly solution from the available data. A feasible solution might come from networkx or from a custom graph traversal algorithm that minimizes the penalty of omitting rows/k-mers, guided by the shortest path that visits each k-mer once, while perhaps also maximizing the number of rows visited. I'm not sure yet how this will be specifically implemented, as the .kdbg format is the first step.
1    1234    2345,3456,...    3
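The .kdbg row layout described above can be read into an in-memory adjacency list with a few lines of Python. This is only a sketch under stated assumptions: it assumes whitespace-delimited rows with exactly four columns, treats col4 as a single integer, and uses hypothetical names (KdbgRow, parse_kdbg_row, build_adjacency) and made-up example rows. The actual format is still being planned and may differ.

```python
from typing import Dict, Iterable, List, NamedTuple


class KdbgRow(NamedTuple):
    row_id: int               # col1: unique row number
    kmer_id: int              # col2: k-mer id (may repeat across rows)
    neighbor_rows: List[int]  # col3: comma-separated adjacent row-ids
    walk_hint: int            # col4: assumed to be one integer


def parse_kdbg_row(line: str) -> KdbgRow:
    """Parse one whitespace-delimited row (assumed layout, see above)."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    row_id, kmer_id, neighbors, walk_hint = fields
    neighbor_rows = [int(n) for n in neighbors.split(",") if n]
    return KdbgRow(int(row_id), int(kmer_id), neighbor_rows, int(walk_hint))


def build_adjacency(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Map each row-id to the list of row-ids it is adjacent to."""
    adjacency: Dict[int, List[int]] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = parse_kdbg_row(line)
        adjacency[row.row_id] = row.neighbor_rows
    return adjacency


# Hypothetical rows, invented for illustration only.
example_rows = [
    "1    10    2,3    1",
    "2    11    3    2",
    "3    12    1    3",
]
adjacency = build_adjacency(example_rows)
print(adjacency)  # {1: [2, 3], 2: [3], 3: [1]}
```

A dictionary keyed by row-id keeps repeated k-mer ids distinct, which matches the note that col2 may be repeated while col1 is unique.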