10/5/24 | released from hospital (2 days) and working on refactoring kmer module
9/21/24 | gradually switching from gpt-4o-mini to claude sonnet
using sonnet to create llc documents
9/20/24 | strassen
9/5/24 | genome-size estimation
G = T / C_k
where:
T = total k-mer count
G = estimated genome size
C_k = k-mer coverage = C * (read_length - k + 1) / read_length, where C is the per-base read coverage (equivalently, C_k is the x-value of the second/best peak in the k-mer count histogram)
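A minimal sketch of this estimate, assuming counts is an integer NumPy array of per-k-mer counts from a profile (the peak-picking is deliberately crude, and the helper name is hypothetical):

import numpy as np

def estimate_genome_size(counts, k, read_length):
    T = counts.sum()                      # total k-mer count
    hist = np.bincount(counts)            # k-mer count histogram
    C_k = np.argmax(hist[3:]) + 3         # x of the best peak, skipping the low-coverage error peak
    C = C_k * read_length / (read_length - k + 1)  # per-base coverage, via the relation above
    return T / C_k, C                     # (estimated genome size G, per-base coverage)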
8/8/24 Taking Notes on Xuejiang Xiong Mouse model IBD study
SRA Accession id
SRA051354
What is the purpose of this study?
The goal of this study is to recreate a mouse model of Inflammatory Bowel Disease (IBD), using agents that induce inflammation and irritation to the point where the induced condition and human IBD are functionally similar.
The mice are NOD (non-obese diabetic) and susceptible to germs. They are colonized with 8 symbiotic bowel microbes, known as the Altered Schaedler flora (ASF).
Samples taken from the bowels of these mice reveal the effect of the irritant/inducer agent on the gut microflora as measurable by Illumina High-throughput sequencing (HTS). Specifically, transcriptional libraries are prepared following RiboMinus treatment, to enrich for mRNAs and other non-rRNAs.
The mRNA libraries were processed on a Genome Analyzer IIx in this study. The SRA accession id for the single-end fastq datasets, bulk RNA for metatranscriptomics and assembly, is SRA051354.
The study used
8/3/24 Kolmogorov complexity and Generalized Suffix Arrays
Suffix array
kmerdb should have a suffix-array structure with its own metadata structure, and the columnar info should hold references to the original data from the k-mer (or suffix-on-k-mer) structure.
Kolmogorov and Lempel-Ziv complexity definitions:
@article{zielezinski2017alignment,
title={Alignment-free sequence comparison: benefits, applications, and tools},
author={Zielezinski, Andrzej and Vinga, Susana and Almeida, Jonas and Karlowski, Wojciech M},
journal={Genome biology},
volume={18},
number={1},
pages={186},
year={2017},
publisher={Springer}
}
For example, the Kolmogorov complexity of a sequence can be measured by the length of its shortest description.
Accordingly, the sequence AAAAAAAAAA can be described in a few words (10 repetitions of A), whereas CGTGATGT presumably
has no simpler description than specification nucleotide by nucleotide (1 C, then 1 G and so on). Intuitively, longer
sequence descriptions indicate more complexity. However, Kolmogorov did not address the method to find the shortest
description of a given string of characters. Therefore, the complexity is most commonly approximated by general compression
algorithms (e.g., as implemented in zip or gzip programs) where the length of a compressed sequence gives an estimate of its
complexity (i.e., a more complex string will be less compressible) [60]. The calculation of a distance between sequences using
complexity (compression) is relatively straightforward (Fig. 2). This procedure takes the sequences being compared
(x = ATGTGTG and y = CATGTG) and concatenates them to create one longer sequence (xy = ATGTGTGCATGTG). If x and y are
exactly the same, then the complexity (compressed length) of xy will be very close to the complexity of the individual x or y.
However, if x and y are dissimilar, then the complexity of xy (length of compressed xy) will tend to the cumulative complexities
of x and y. Of course, there are as many different information-based distances as there are methods to calculate complexity.
For example, Lempel–Ziv complexity [61] is a popular measure that calculates the number of different subsequences encountered
when viewing the sequence from beginning to end (Fig. 2). Once the complexities of the sequences are calculated, a measure of
their differences (e.g., the normalized compression distance [62]) can be easily computed. Many DNA-specific compression
algorithms are currently being applied to new types of problems [63].
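As a concrete illustration of the compression-based distance just described, a minimal sketch of the normalized compression distance using gzip (note that for strings this short, gzip header overhead dominates, so real use wants longer sequences):

import gzip

def C(s):
    return len(gzip.compress(s))          # compressed length approximates complexity

def ncd(x, y):
    cx, cy, cxy = C(x), C(y), C(x + y)    # complexities of x, y, and the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"ATGTGTG", b"CATGTG"))         # smaller distance for more similar sequences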
Kolmogorov complexity comes in two flavors: prefix-free (K(x)) and simple complexity (C(x)) measures. The formal treatment of these metrics and their formulae or estimation techniques will be provided shortly.
8/1/24 Written Lit review, System Reconfigurations
Currently reconfiguring my system and redundancies.
Making copies of my installation and configuration/install routines. Trying Ubuntu 24.04 LTS rather than Arch, for better build/configure/make predictability.
Current [TODO]
NEXT Create kmerdb logo using GIMP
State “IN-PROGRESS” from “NEXT” [2024-08-01 Thu 19:04]
Finish logo export
Add logo to README
Add logo to website
7/28/24 [multiplication rule for Markov probability]
needs to be written in documentation (a sketch follows below)
currently written into appmap as command 11, but not fleshed out.
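For the documentation, a minimal sketch of the multiplication rule under a (k-1)th-order Markov model; kmer_freq and km1_freq are hypothetical dicts of k-mer and (k-1)-mer relative frequencies, e.g. derived from a .kdb profile:

import math

def log_markov_probability(s, k, kmer_freq, km1_freq):
    logp = math.log(kmer_freq[s[:k]])                      # P(first k-mer)
    for i in range(1, len(s) - k + 1):
        w = s[i:i+k]
        logp += math.log(kmer_freq[w] / km1_freq[w[:-1]])  # P(next base | preceding k-1 residues)
    return logp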
[TRIAGE] : vsearch align with kmerdb
Use k-mer frequencies to rank similarity to sequences in db.
Proceed from seed match/mismatch to full dynamic-programming Smith-Waterman with affine gap penalty
7/16/24 NEW metadata feature for graph subcommand
graph subcommand needs the node count explicitly (at most 4^k possible k-mers; the number of distinct nodes observed grows with fastq size in number of reads)
IN PROGRESS D2 metrics, markov sequence prob review
D_2 = \sum_w X_w Y_w (equivalently, the sum of match indicators I(A, B) over word-position pairs)
D_2^S = \sum_w \frac{ \tilde{X}_w \tilde{Y}_w }{ \sqrt{ \tilde{X}_w^2 + \tilde{Y}_w^2 } } (the numerator is the product of the centered X and Y counts; the denominator is the square root of the sum of their squares; the ratio is summed over words)
D_2^* = \sum_w \frac{ \tilde{X}_w \tilde{Y}_w }{ \hat{m} \hat{n} p_w^X p_w^Y } (w = word; \tilde{X}_w = X_w - \hat{m} p_w^X are the centered counts; \hat{m} = m - k + 1 is the adjusted/translated length, i.e. the number of word positions; p_w is the background word probability)
State “WAITING” from “DONE” [2024-08-01 Thu 18:49]
State “DONE” from “CANCELED” [2024-08-01 Thu 18:49]
State “CANCELED” from “DELEGATED” [2024-08-01 Thu 18:49]
Reinert G. et al. “Alignment-free sequence comparison (I): statistics and power” J. Comput. Biol. 2009 v16(12) (p1615-1634)
Bibtex format below:
@article{reinert2009alignment,
title={Alignment-free sequence comparison (I): statistics and power},
author={Reinert, Gesine and Chew, David and Sun, Fengzhu and Waterman, Michael S},
journal={Journal of Computational Biology},
volume={16},
number={12},
pages={1615–1634},
year={2009},
publisher={Mary Ann Liebert, Inc. 140 Huguenot Street, 3rd Floor New Rochelle, NY 10801 USA}
}
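A minimal NumPy sketch of the three statistics above, following the reconstructed formulas (X and Y are k-mer count vectors for sequences of lengths m and n; pX and pY are background word probabilities; consult Reinert et al. for the exact normalizations):

import numpy as np

def d2_statistics(X, Y, m, n, k, pX, pY):
    mhat, nhat = m - k + 1, n - k + 1     # numbers of word positions
    Xt = X - mhat * pX                    # centered counts
    Yt = Y - nhat * pY
    D2 = np.sum(X * Y)
    D2S = np.sum(Xt * Yt / np.sqrt(Xt**2 + Yt**2))
    D2star = np.sum(Xt * Yt / (mhat * nhat * pX * pY))
    return D2, D2S, D2star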
core species choices
chicken farm estuary system changes (algination, asphyxia, microbiological changes)
anti-human leaky-gut syndrome changes,
i.e. looking at human leaky-gut syndrome, but in reverse: what are the bioprotective species and niches that provide resilience to leaky-gut syndrome?
chemophore SMILES and gastrotoxic footprints
pathology of lupus or auto-immune skin condition microbiome/metagenomic changes.
vaginal microbiome changes
Perspective 1 from review on distance metrics
IN PROGRESS 7/10/24 - [IMPORTANT] Needs a choice [cython d2 x graph algorithm features ]:
cython d2 metrics including the delta distance: delta(A, B) = (1/16) * sum over dinucleotides xy of |rho_xy(A) - rho_xy(B)|, where rho_xy = f_xy / (f_x * f_y) (Karlin et al., tetra-, tri-, and dinucleotide relative abundances)
(describe Karlin delta, algorithm to calculate)
Karlin delta first requires the least ambiguous k-mer (4-mer) frequency, i.e. the frequency of the exact word itself
next required is the most ambiguous k-mer (4-mer) frequency: that with terminal residues fixed but internal residues as N, summing the frequencies of the recursively associated k-mers (4-mers)
next, all k-1 (trinucleotide) and k-2 (dinucleotide) frequencies are required, proceeding from the outside in, with internal residues again tending towards N, such that all combinations of residues are visited by the f(aNc) trinucleotide frequency: with a (adenine) and c (cytosine) fixed and the internal position specified as N, f(aNc) = f(atc) + f(aac) + f(acc) + f(agc).
this specifies the numerator of the tetranucleotide relative abundance (lowercase tau)
the denominator comprises only the most specific tetra- and 1-neighboring trinucleotide frequencies, and the mononucleotide frequencies, e.g. [ f(acc), f(accg), f(ccg), f(a), f(c), f(t), f(g) ]
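A sketch of the simplest member of this family, Karlin's dinucleotide relative-abundance delta (plain ACGT strings assumed; the tri/tetranucleotide versions extend the same pattern):

from itertools import product

def rho(seq):
    n = len(seq)
    f1 = {b: seq.count(b) / n for b in "ACGT"}                  # mononucleotide frequencies
    f2 = {a + b: 0.0 for a, b in product("ACGT", repeat=2)}
    for i in range(n - 1):
        f2[seq[i:i+2]] += 1 / (n - 1)                           # dinucleotide frequencies
    return {xy: f2[xy] / (f1[xy[0]] * f1[xy[1]]) for xy in f2}  # relative abundances

def karlin_delta(a, b):
    ra, rb = rho(a), rho(b)
    return sum(abs(ra[xy] - rb[xy]) for xy in ra) / 16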
new graph file format specification (a walk/path is a subclass of an unlabeled graph, where node labels can record visits, path order, and whether the walk is progressive or retrograde)
contig generator method, and contig boundary definition specification
6/28/24 - [ …whoops, forgot the date by 3 x 24hr blocks. ] okay, so the 0.8.4 release should have the graph labeling done.
graph node labeling and classification, and walk strategy
walk strategy, backtrace, and expansion step node labeling patterns need structure
assembler requires color graph feature unimplemented
index features need expanding
index as a .kdb.gi file?
new datatypes
new jsonschemas required:
product_result
the full product (nx3xm), the square product, the comprehension product
walk product (a label and node order specification)
node product (a node ordering and/or enumeration schema)
permutation (range(n)) => n! (n0, n1, n2, n3) (n0, n2, n1, n3) (n0, n3, n2, n1) … etc. for 24 total permutations of the 4 starting items.
combination (abcd x xyz) => ax ay az bx by … etc. (this pairing is a Cartesian product, |A|*|B| pairs; r-combinations of a single set number n!/(r!(n-r)!), and r-permutations n!/(n-r)!) — see the sketch below.
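A quick itertools check of the counts above (toy inputs, assumed for illustration):

from itertools import combinations, permutations, product
from math import factorial

items = ["n0", "n1", "n2", "n3"]
assert len(list(permutations(items))) == factorial(4)   # 24 orderings of 4 items
print(list(product("abcd", "xyz")))                     # ax ay az bx by ... (4*3 = 12 pairs)
assert len(list(combinations(items, 2))) == 6           # C(4,2) = 4!/(2!*2!)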
6/14/24 - Revise README.md from changes to profile subcommand for multi-K and generic ‘prefix’ outfile pattern.
6/11/24 - Index refactor, offset calculations, index table structure
D2*, D2S, and D2 statistics/distances
IN PROGRESS Refactor fileutil/index modules to produce valid index data on file-read
IN PROGRESS Refactor distance.pyx, ensure it compiles and computes successfully
Refactor profile command to accept minK and maxK arguments
Refactor profile command to have ‘prefix’ as a required --flag instead of a trailing positional argument
Default behavior, on single -k, is to create a file named $PREFIX.$K.kdb
-k is now optional
def profile in __init__.py must have logic to determine whether single- or multi-k mode is enabled
Alt behavior, when minK and maxK are given together, is to create files named $PREFIX.$K.kdb for each k until minK/maxK is satisfied
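A hypothetical sketch of that branch logic (flag and function names assumed, not the actual kmerdb CLI):

def profile(prefix, k=None, min_k=None, max_k=None):
    if k is not None:
        ks = [k]                             # single-k mode: one $PREFIX.$K.kdb
    elif min_k is not None and max_k is not None:
        ks = list(range(min_k, max_k + 1))   # multi-k mode: one file per k
    else:
        raise ValueError("need either -k or both minK and maxK")
    return ["{0}.{1}.kdb".format(prefix, k_) for k_ in ks]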
6/8/24 - Index + D2
Fix index subcommand, ensure it stores offsets
D2 statistic in Cython, distances.pyx
Presence absence AND exact k-mer profile match
4/25/24 - a small-RNA idea (and other RNA families)
k-mer compositions and mutational families in small-RNA rich species
k-mer compositions of riboswitches
k-mer compositions of introns, exons (in eukaryotic) and promoters, terminators, orfs, orf families, and operons.
k-mer distances benchmark
Cython pearson
scikit spearman and correlation distance
statsmodels statistics
import statsmodels.api as sm
x = sm.add_constant(x1)  # adds the b0 (intercept) column for the ordinary least squares fit
results = sm.OLS(y, x).fit()
results.summary()
associated graphics for inferences
pearson vs ols R2 from statsmodels
spearman vs pearson vs k on test dataset. Matrix representation in example_report2.
numpy or cython implementation of regression model.
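A minimal NumPy sketch of the statsmodels-free variant contemplated above (x1 and y are hypothetical toy arrays):

import numpy as np

x1 = np.arange(10, dtype=float)
y = 2.0 * x1 + 1.0 + np.random.normal(scale=0.1, size=10)
X = np.column_stack([np.ones_like(x1), x1])   # prepend the b0 (intercept) column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit: [b0, b1]
print(beta)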
4/13/24 - Assembly
Networkx
Assembly via Markov probability, Markov chains, contig definition, locality
Leads to better graphing. Can’t get to an exact solution; simplification requires heuristics and design.
>>> from kmerdb import graph, fileutil
>>> import numpy as np
>>> import networkx as nx
>>> import matplotlib.pyplot as plt
>>> kdb = fileutil.open("path/to/example.8.kdb", mode='r', slurp=True)
>>> kdbg = graph.open("path/to/example.8.kdbg", mode='r', slurp=True)
>>> kmer_ids = kdb.kmer_ids
>>> n1 = kdbg.n1
>>> n2 = kdbg.n2
>>> w = kdbg.w
>>> edge_list = list(zip(n1, n2))
>>> G = nx.Graph()
>>> G.add_nodes_from(kmer_ids)
>>> G.add_edges_from(edge_list)
>>> if not nx.is_planar(G):
...     raise ValueError("Need planar graph to continue")
>>> g = nx.generic_graph_view(G)
>>> # nx.is_tournament(g)  # should not be a tournament
>>> # nx.tournament.hamiltonian_path(g)
>>> # Utility function - # of walks
>>> # num_walks = number_of_walks(g, length=walk_length)
>>>
>>> final_g = nx.restricted_view(G, hidden_nodes, hidden_edges)  # hidden_nodes/hidden_edges defined elsewhere
>>> degree_sequence = sorted((d for n, d in G.degree()), reverse=True)
>>> dmax = max(degree_sequence)
>>> dmax
7
>>> fig = plt.figure("Degree of Cdiff k-mers for k=8 (Max neighbors = 8)")
>>> axgrid = fig.add_gridspec(5, 4)
>>> ax0 = fig.add_subplot(axgrid[0:3, :])   # connected-components panel
>>> ax1 = fig.add_subplot(axgrid[3:, :2])   # degree-rank panel
>>> Gcc = G.subgraph(sorted(nx.connected_components(G), key=len, reverse=True)[0])
>>> help(nx)
[ THIS NEEDS BOTH A PLAIN STDERR AND/OR .logging RELATED INTERFACE, AS WELL AS A ‘RICH’-STYLED OUTPUT. This way logs are ASCII and come from .logging; other stderr content may be printed, stylized by “rich”. ]
Example
[ x ] resume rich text logging to stderr
the reason for the ‘rich’ module would be to show the traceback alongside the relevant loggable line and call stack
Configure kmerdb logger to pass -n, --log-lines from the collected stderr array
Configure kmerdb to log to -l, --log-file as well as stderr/stdout
2. metadata schema
3. usage notes
[ metadata] | command Summary and Usage Reminder
Usage reminder (short form usage_notes text)
[ metadata ]
[ metadata description ]
[ x ] Error/exit note
exit_code
Error summary
traceback
call stack (processed from error text??)
loggable_line (also processed)
Relevant issue
[metadata]
key indices | key arrays/structures
traceback
text description of the process (these should be the sys.stderr messages with the carriage return \r texts)
index-of-error (of the loggable line)
index of error (in the data structure(s)) [ part of metadata ]
str( | loggable line | || | traceback | ) - passed to both ‘rich’ and the logging module (to file and stderr)
[ matched syntax in rich between modules and index of error ]
outputs_directory and output_file(s)
[ x ] end rich formatting (avoids double logging to stderr issue)
x why it’s totally optional at this point.
Logger subfooter
command
params
runtime
Logfile : path/to/logfile.log
“logger” header (logger type, metadata ‘state’ number : int, url of logging configuration README.md, which describes the logging and error blocks)
verbosity level
global/local variables state 1
global/local variables state 2
…etc.
“logger” header (file logger, syntax breakdown)
[ 2 ] Footer note - | ‘metadata’ or ‘data’ or available information at time of program exit. (see below)
assembled before program termination, and a collection of descriptor structures necessary for pinpointing “loggable line” i.e. the metadata structures
spacer
[ x ] end of rich text module preference throughout interfaces, captured in a series of logging variable addresses
access to stderr, rich, and other logging facilities
beginning of secondary logging variables (the structured log data) being used to stdout
this way, the most relevant logging variables are printed to stdout first, without the “usage note, static documentative content”
logging to stderr or the log file continues by virtue of the Python logging module (message assembly, addressing, and passage through the program branches), part of the nascent “logging fnx” feature merger with the appmap rom.
And the primary variable chain, “the outputs”, part of the data|metadata, is captured as the program proceeds task-wise: key variables and indices are printed in rich text after logging, to make stdout valuable, while logging proceeds to stderr by virtue of the internal logging library module, both to (1) the logfile and (2) rich-text-enabled (table support, emphasis) stderr.
And the logger_header
------------------------
[ 1 ] | Description of error capture progress (blame?)
internal_errors variable
sigterm/error capture
accumulated log array (.logging determined)
try: caught error
traceback
modules
usage note
[ 2 ] Footer | command Summary and Usage Reminder
Usage reminder (short form usage_notes text)
[ metadata ]
- metadata
- metadata property
[ x ] Error/exit note
exit_code
Error summary
traceback
call stack (processed from error text??)
loggable_line (also processed)
Relevant issue
[metadata] + usage note (short) on each variable, metadata property, array, custom type, and index value
key indices |
key arrays/structures,
python version (? + citation)
loaded modules (hardcoded from pyproject.toml)
-compiler-
traceback
subcommand usage note
text description of the process (these should be the sys.stderr messages with the carriage return \r texts)
index-of-error (of the loggable line)
index of error (in the data structure(s)) [ part of metadata ]
Today marks the beginning of the end… of the de Bruijn graph format pull-request from branch ‘graph_algo’
I’m doing a little bit better mentally. Learned today about non-stimulant ADHD meds
In hindsight, I’ve never been diagnosed with ADHD. I have reasonable hyper-focus, but I get derailed with alternate versions of … oops I literally forgot what the psychiatrist calls it when you change tasks and get unfocused. Wow.
I like my new therapist/counselor and her level of care seems nice. Let’s see how the next 3 months goes.
Okay, that’s enough about me™.
Project remarks
I’m very happy with the recent additions to the graph_algo branch. The feature ‘seems’ to be working quite well regarding neighboring/subsequent k-mers appended to the id array.
Specifically, I have a --quiet option that will silence most output delivered to the screen, in addition to the verbosity setting.
By DEFAULT I print an obnoxious amount of output to the STDERR stream, even without the verbosity setting changed from the default warning level (-v, -vv).
I believe this demonstrates to the user how adjacencies in the id array are considered, i.e. that they have the k-1 subsequence in common.
These assertions introduced in kmerdb.graph are essential for verifying the adjacency of subsequent k-mers; on failure they propagate an error, which is displayed to the user as a warning
describing the nature of the assertion failures and suggesting the likely reason.
More specifically: it should be added to the README.md that the number of assertion failures should roughly equal the number of reads in a .fq file, triggering the issue via k-mer ids from the end of one read and the beginning of the next.
NOTE: ADJACENCY ERRORS DETECTED: Found 24999 ‘improper’ k-mer pairs/adjacencies from the input file(s),
where subsequent k-mers in the k-mer id array (produced from the sliding window method over input seqs/reads) did not share k-1 residues in common.
These may be introduced in the array from the last k-mer of one seq/read (in the .fa/.fq) and the first k-mer of the next seq/read.
Okay, with this settled, I can now describe any plans for revision to the .kdbg format, as well as a description of a first-pass networkx based solution to graph traversal and stop criterion during contig generation.
With that said, I absolutely need a visualizer at this point to check my work.
Code cleanup
Documentation
Deprecations
strand_specific
all_metadata
IUPAC
README
kmerdb module
[X] kmer.py
[ ] verbose => quiet
[X] graph.py
[X] parse.py
[ ] __init__.py
README.md
[ ] README.md
[ ] Document the new IUPAC strategy for ‘kmerdb.kmer._shred_for_graph’
[ ] Provide
website - matthewralston.github.io/kmerdb
[/] Expanded documentation on subcommands.
[ ] profile
[ ] view
[ ] distance (SWAP ORDER)
[ ] matrix (SWAP ORDER)
[ ] NEW! graph
[ ] kmeans
[ ] hierarchical
[ ] probability
[ ] DONT DO YET graph/assembly page
[/] API
[ ] reading .kdbg or .kdb files
[ ] writing .kdbg or .kdb files
Assembly algorithm planning
CPU (NetworkX) implementation (overview)
Stop criterion
[ ] when the necessary traversals are completed
[ ] How do we rank these?
Lost comments
What the sort order of the residue encoding into bits does to the bit encoding of a single letter vs a string
Writing the goals down for the pearson’s r saturation behavior with depth
Implement a square on square matrix functionality on GPU with cupy in pyx?
Cupy
Literally failing to document hidden search/link-traversal features…
Remembering that it’s only supposed to be a k-mer count vector storage medium right now
Scoping scoping where does it end
Is my life’s work pointless?
Losing my best friend because of argument
Sent 1 basic sorry, got a minor acknowledgement.
Smoking habit down to 1 cig a day (just bored, less and less dynamism of focus).
Recalling the Cortizone™
Apply gently
Reminding myself I don’t believe in these human-type humans. Humans about other humans seems like a soft, subjective, and wishy-washy skill to develop and I don’t trust it.
assert suffix == ".kdb", "provided filename did not end in .kdb"
shutil.move(fasta, prefix + ".fa")
write kdb file (prefix + ".kdb")
Rmd report2
algorithm profiling
kdb profile k x time x cpu (z)
we need to choose a range of k that is meaningful and explain why.
the choice of k of 8-12 is convenient because it means
we don’t have to pay for extra memory. This will be manageable on any number of cores
with at least 32 GB of memory for about 20 samples.
According to the following graph, the uncompressed size of the sparse matrix (n x 4^k)
may reach the low double digits of gigabytes per profile,
but the value of these profiles grows with the increased cost as well.
So when we look at these genomes with this degree of sensitivity, which the literature has found substantial in the neighborhood of k=10-12,
more characterizations become possible, and this places more value on the expected scaling behavior of this program.
The goal is most likely not to reinvent the wheel. Since this is an academic package at this point, we feel that it is necessary and important to couple this with a graph database
We have selected the RDF format going forward and expect that long term use of Amazon Neptune might be an important source of understanding that we can get from users uploading their graphs, sparse or otherwise, to a giant Neptune repository.
It could be an entirely new sequence database format.
kdb distance correlation <fasta|fastq>
profile reads sam/bam
use pysam to iterate over reads, creating a profile in the process.
Likelihood of dataset given prior k-mer profiles
Calculate graph properties indicative of de Bruijn graph collapse
‘kmerdb random’ sequence simulator
given a certain length of sequence N, suggest a sequence that best solves the k-mer abundance graph
Connect this to meme suite
Hypotheses:
Suppose that k-mer spectra have a positive and negative saturation direction.
In this way, more specific signals and antisignals could be surmised from samples with enough resolution, temporal or otherwise resolved by covariates.
Think of what could happen if the signals and antisignals were resolved on the order of genes, you could detect gene expression levels with it.
kmerize
to use bed/gff features to select reads from bam/bai using pysam
and then create sparse profiles for each feature
to split a bam according to gff/bed features, putting the output in a directory (see the sketch below)
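A hedged sketch of the idea (paths and the shred helper are hypothetical; pysam’s AlignmentFile.fetch requires an indexed .bam/.bai):

import pysam
from collections import Counter

def shred(seq, k):
    return (seq[i:i+k] for i in range(len(seq) - k + 1))  # sliding-window k-mers

def profiles_per_feature(bam_path, bed_path, k=8):
    profiles = {}
    with pysam.AlignmentFile(bam_path, "rb") as bam, open(bed_path) as bed:
        for line in bed:
            chrom, start, end, *rest = line.rstrip().split("\t")
            name = rest[0] if rest else "{0}:{1}-{2}".format(chrom, start, end)
            counts = Counter()
            for read in bam.fetch(chrom, int(start), int(end)):
                if read.query_sequence:
                    counts.update(shred(read.query_sequence, k))
            profiles[name] = counts   # sparse per-feature k-mer profile
    return profiles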
Learn the RDF spec
Think of a specification for each node.
Manifold learning
Isomap (derived from multidimensional scaling (MDS) or Kernel PCA)
Lower-dimensional projection of the data preserving geodesic distances between all points
(Modified) Locally Linear Embedding
Lower dimensional projection of the data preserving local neighborhood distances
locally_linear_embedding or LocallyLinearEmbedding with method="modified"
t-SNE
While isomap, LLE, and variants are best suited to unfold a single continuous low-dimensional manifold
t-SNE will focus on the local structure of the data and will tend to extract clustered local groups of samples.
This ability to group samples based on the local structure might be beneficial to visually disentangle a dataset that comprises several manifolds at once.
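A sketch of the three embeddings discussed, applied to a hypothetical (samples x k-mers) count matrix with scikit-learn:

import numpy as np
from sklearn.manifold import Isomap, LocallyLinearEmbedding, TSNE

X = np.random.poisson(5, size=(20, 4**4)).astype(float)   # toy stand-in for a profile matrix

iso = Isomap(n_components=2).fit_transform(X)
mlle = LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                              method="modified").fit_transform(X)
tsne = TSNE(n_components=2, perplexity=5).fit_transform(X)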
these need to be calculated differently for efficiency/memory reasons
repetitive summation/multiplication and not direct to unit vector transformation
1. Pearson correlation coefficient of counts? of unit vector?
2. euclidean distance of unit vectors?
3. sort by count of vector/index and Spearman
jaccard
presence/absence (the k-mer is observed in both profiles? then it’s in the intersection)
similar count within a tolerance… vs Spearman?
MUMi distance
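A sketch of several of the candidate metrics above on two k-mer count vectors (toy Poisson counts assumed):

import numpy as np
from scipy.stats import pearsonr, spearmanr

a = np.random.poisson(5, 256)
b = np.random.poisson(5, 256)

r, _ = pearsonr(a, b)                          # 1. Pearson on raw counts
ua, ub = a / np.linalg.norm(a), b / np.linalg.norm(b)
euclid = np.linalg.norm(ua - ub)               # 2. Euclidean distance of unit vectors
rho, _ = spearmanr(a, b)                       # 3. rank-based comparison (Spearman)
jacc = np.sum((a > 0) & (b > 0)) / np.sum((a > 0) | (b > 0))  # presence/absence Jaccard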
jsonify
transform the de Bruijn graph into JSON
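One plausible route is networkx’s node-link representation (stand-in graph shown; the real input would be the .kdbg edge list):

import json
import networkx as nx

G = nx.Graph([(0, 1), (1, 2)])        # stand-in for the k-mer graph
payload = nx.node_link_data(G)        # dict of nodes, links, and graph metadata
print(json.dumps(payload, indent=2))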
Partitioning experiment
Use khmer to partition reads from an example dataset
Similarity metrics between partition fastas and whole profile
Annotate kdb metadata to include Markov probabilities of single sequences to partition
How do we describe or select subgraphs based on the partition information?
Presence of an Eulerian walk within a partition AND whether the Eulerian walk extends too far into other partitions
Key reads AND k-mers involved in complex graph structures near partition bridges
Suggestions for deeper sequencing or skew in partition compositions to make up for low depth
Number of partition bridges vs subsampling
Number of partition bridges vs unique k-mer count / partition
Other metrics besides unique k-mer count
Overlap k-mer count
unique k-mers per total k-mers
unique k-mers per partitioned reads
How do we describe subgraph features worth considering, given the partition
Node connectivity stats
kdb filtering ( retrieve only k-mers with partition, connectivity, Markov probability cutoffs, participant in Eulerian walk)
Other functions
Partitionizer (partition fasta and genomic fastas; completeness of each partition’s capture of the ideal composite)
How much more data do I need from each partition to minimize bridges, maximize genomic coverage, and maximize orthogonality to other partitions
Given a partition fasta and a genomic fasta
Could estimate the sequencing depth and complexity required to minimize most partition bridges
Could also estimate the size and partitioning required to maximize partition orthogonality
Sampleizer (one genome fasta; dial up/back efforts in improving this partition/sampling)
Does my sampling protocol for this partition only have enough uniqueness to cover the one major walk, or is most of the data getting in the way of the other species at the current composite compositions?
How much of the genomic profile is covered by the partition?
At a certain orthogonality metric per sampling from the genomic fasta, does the amount of uniqueness orthogonality recovered by additional depth tend to clarify the partition, or obfuscate other operations on leading partitions?
Profilizer (all genome fastas; snapshot/metrics, as composite is improved)
Construct a perfect profile from all genomes and integrate
Similarities between individual profiles and perfect composite (Ideal distance metrics for each profile addition to perfect the composite)
Similarities between imperfect composite and perfect composite (How much orthogonality and completeness is currently recovered)
Similarities between imperfect partitions and perfect composite (How much orthogonality is lost due to current imperfect partitioning)
Similarities between imperfect composite and imperfect partitions (How much orthogonality is lost due to current imperfect partitioning)
walker (calculate Eulerian walks, i.e. walks that traverse every edge exactly once, maximizing path length under constraints)
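A sketch of the walker’s core using networkx’s Euler utilities (stand-in graph assumed):

import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])   # stand-in k-mer graph
if nx.has_eulerian_path(G):
    walk = list(nx.eulerian_path(G))             # each edge traversed exactly once
    print(walk)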