10/5/24 | released from hospital (2 days) and working on refactoring kmer module
9/21/24 | gradually switching from gpt-4o-mini to claude sonnet
using sonnet to create llc documents
9/20/24 | strassen
9/5/24 | genome-size estimation
G = T / C_k
where:
T = total k-mer count
G = estimated genome size
C_k = k-mer coverage = C * (read_length - k + 1) / read_length, where C is the per-base read coverage (equivalently, C_k is the x-value of the second/best peak in the k-mer count histogram)
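A minimal sketch of this estimate, assuming counts is an integer NumPy array of per-k-mer counts from a profile (the peak-picking is deliberately crude, and the helper name is hypothetical):

import numpy as np

def estimate_genome_size(counts, k, read_length):
    T = counts.sum()                      # total k-mer count
    hist = np.bincount(counts)            # k-mer count histogram
    C_k = np.argmax(hist[3:]) + 3         # x of the best peak, skipping the low-coverage error peak
    C = C_k * read_length / (read_length - k + 1)  # per-base coverage, via the relation above
    return T / C_k, C                     # (estimated genome size G, per-base coverage)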
8/8/24 Taking Notes on Xuejiang Xiong Mouse model IBD study
SRA Accession id
SRA051354
What is the purpose of this study?
The goal of this study is to recreate a mouse model of Inflammatory Bowel Disease (IBD), using agents that induce inflammation and irritation to the point where the induced condition and human IBD are functionally similar.
The mice are NOD (non-obese diabetic) and susceptible to germs. They are colonized with 8 symbiotic bowel microbes, known as the Altered Schaedler flora (ASF).
Samples taken from the bowels of these mice reveal the effect of the irritant/inducer agent on the gut microflora as measurable by Illumina High-throughput sequencing (HTS). Specifically, transcriptional libraries are prepared following RiboMinus treatment, to enrich for mRNAs and other non-rRNAs.
The mRNA libraries were processed on a Genome Analyzer IIx in this study. The SRA accession id for the single-end fastq datasets, bulk RNA for metatranscriptomics and assembly, is SRA051354.
The study used
8/3/24 Kolmogorov complexity and Generalized Suffix Arrays
Suffix array
kmerdb should have a suffix-array structure with its own metadata structure, and the columnar info should hold references to the original data from the k-mer (or suffix-on-k-mer) structure.
Kolmogorov and Lempel-Ziv complexity definitions:
@article{zielezinski2017alignment,
title={Alignment-free sequence comparison: benefits, applications, and tools},
author={Zielezinski, Andrzej and Vinga, Susana and Almeida, Jonas and Karlowski, Wojciech M},
journal={Genome biology},
volume={18},
number={1},
pages={186},
year={2017},
publisher={Springer}
}
For example, the Kolmogorov complexity of a sequence can be measured by the length of its shortest description.
Accordingly, the sequence AAAAAAAAAA can be described in a few words (10 repetitions of A), whereas CGTGATGT presumably
has no simpler description than specification nucleotide by nucleotide (1 C, then 1 G and so on). Intuitively, longer
sequence descriptions indicate more complexity. However, Kolmogorov did not address the method to find the shortest
description of a given string of characters. Therefore, the complexity is most commonly approximated by general compression
algorithms (e.g., as implemented in zip or gzip programs) where the length of a compressed sequence gives an estimate of its
complexity (i.e., a more complex string will be less compressible) [60]. The calculation of a distance between sequences using
complexity (compression) is relatively straightforward (Fig. 2). This procedure takes the sequences being compared
(x = ATGTGTG and y = CATGTG) and concatenates them to create one longer sequence (xy = ATGTGTGCATGTG). If x and y are
exactly the same, then the complexity (compressed length) of xy will be very close to the complexity of the individual x or y.
However, if x and y are dissimilar, then the complexity of xy (length of compressed xy) will tend to the cumulative complexities
of x and y. Of course, there are as many different information-based distances as there are methods to calculate complexity.
For example, Lempel–Ziv complexity [61] is a popular measure that calculates the number of different subsequences encountered
when viewing the sequence from beginning to end (Fig. 2). Once the complexities of the sequences are calculated, a measure of
their differences (e.g., the normalized compression distance [62]) can be easily computed. Many DNA-specific compression
algorithms are currently being applied to new types of problems [63].
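As a concrete illustration of the compression-based distance just described, a minimal sketch of the normalized compression distance using gzip (note that for strings this short, gzip header overhead dominates, so real use wants longer sequences):

import gzip

def C(s):
    return len(gzip.compress(s))          # compressed length approximates complexity

def ncd(x, y):
    cx, cy, cxy = C(x), C(y), C(x + y)    # complexities of x, y, and the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"ATGTGTG", b"CATGTG"))         # smaller distance for more similar sequences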
Kolmogorov complexity comes in two flavors: prefix-free (K(x)) and simple complexity (C(x)) measures. The formal treatment of these metrics and their formulae or estimation techniques will be provided shortly.
8/1/24 Written Lit review, System Reconfigurations
Currently reconfiguring my system and redundancies.
Making copies of my installation and configuration/install routines. Trying Ubuntu 24.04 LTS rather than Arch, for better build/configure/make predictability.
Current [TODO]
NEXT Create kmerdb logo using GIMP
State “IN-PROGRESS” from “NEXT” [2024-08-01 Thu 19:04]
Finish logo export
Add logo to README
Add logo to website
7/28/24 [multiplication rule for Markov probability]
needs to be written in documentation (a sketch follows below)
currently written into appmap as command 11, but not fleshed out.
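For the documentation, a minimal sketch of the multiplication rule under a (k-1)th-order Markov model; kmer_freq and km1_freq are hypothetical dicts of k-mer and (k-1)-mer relative frequencies, e.g. derived from a .kdb profile:

import math

def log_markov_probability(s, k, kmer_freq, km1_freq):
    logp = math.log(kmer_freq[s[:k]])                      # P(first k-mer)
    for i in range(1, len(s) - k + 1):
        w = s[i:i+k]
        logp += math.log(kmer_freq[w] / km1_freq[w[:-1]])  # P(next base | preceding k-1 residues)
    return logp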
[TRIAGE] : vsearch align with kmerdb
Use k-mer frequencies to rank similarity to sequences in db.
Proceed from seed match/mismatch to full dynamic-programming Smith-Waterman with affine gap penalty
7/16/24 NEW metadata feature for graph subcommand
graph subcommand needs the node count explicitly (at most 4^k possible k-mers; the number of distinct nodes observed grows with fastq size in number of reads)
IN PROGRESS D2 metrics, markov sequence prob review
D_2 = \sum_w X_w Y_w (equivalently, the sum of match indicators I(A, B) over word-position pairs)
D_2^S = \sum_w \frac{ \tilde{X}_w \tilde{Y}_w }{ \sqrt{ \tilde{X}_w^2 + \tilde{Y}_w^2 } } (the numerator is the product of the centered X and Y counts; the denominator is the square root of the sum of their squares; the ratio is summed over words)
D_2^* = \sum_w \frac{ \tilde{X}_w \tilde{Y}_w }{ \hat{m} \hat{n} p_w^X p_w^Y } (w = word; \tilde{X}_w = X_w - \hat{m} p_w^X are the centered counts; \hat{m} = m - k + 1 is the adjusted/translated length, i.e. the number of word positions; p_w is the background word probability)
State “WAITING” from “DONE” [2024-08-01 Thu 18:49]
State “DONE” from “CANCELED” [2024-08-01 Thu 18:49]
State “CANCELED” from “DELEGATED” [2024-08-01 Thu 18:49]
Reinert G. et al. “Alignment-free sequence comparison (I): statistics and power” J. Comput. Biol. 2009 v16(12) (p1615-1634)
Bibtex format below:
@article{reinert2009alignment,
title={Alignment-free sequence comparison (I): statistics and power},
author={Reinert, Gesine and Chew, David and Sun, Fengzhu and Waterman, Michael S},
journal={Journal of Computational Biology},
volume={16},
number={12},
pages={1615–1634},
year={2009},
publisher={Mary Ann Liebert, Inc. 140 Huguenot Street, 3rd Floor New Rochelle, NY 10801 USA}
}
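A minimal NumPy sketch of the three statistics above, following the reconstructed formulas (X and Y are k-mer count vectors for sequences of lengths m and n; pX and pY are background word probabilities; consult Reinert et al. for the exact normalizations):

import numpy as np

def d2_statistics(X, Y, m, n, k, pX, pY):
    mhat, nhat = m - k + 1, n - k + 1     # numbers of word positions
    Xt = X - mhat * pX                    # centered counts
    Yt = Y - nhat * pY
    D2 = np.sum(X * Y)
    D2S = np.sum(Xt * Yt / np.sqrt(Xt**2 + Yt**2))
    D2star = np.sum(Xt * Yt / (mhat * nhat * pX * pY))
    return D2, D2S, D2star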
core species choices
chicken farm estuary system changes (algination, asphyxia, microbiological changes)
anti-human leaky-gut syndrome changes,
i.e. looking at human leaky-gut syndrome, but in reverse: what are the bioprotective species and niches that provide resilience to leaky-gut syndrome?
chemophore SMILES and gastrotoxic footprints
pathology of lupus or auto-immune skin condition microbiome/metagenomic changes.
vaginal microbiome changes
Perspective 1 from review on distance metrics
IN PROGRESS 7/10/24 - [IMPORTANT] Needs a choice [cython d2 x graph algorithm features ]:
cython d2 metrics including the delta distance: delta(A, B) = (1/16) * sum over dinucleotides xy of |rho_xy(A) - rho_xy(B)|, where rho_xy = f_xy / (f_x * f_y) (Karlin et al., tetra-, tri-, and dinucleotide relative abundances)
(describe Karlin delta, algorithm to calculate)
Karlin delta first requires the least ambiguous k-mer (4-mer) frequency, i.e. the frequency of the exact word itself
next required is the most ambiguous k-mer (4-mer) frequency: that with terminal residues fixed but internal residues as N, summing the frequencies of the recursively associated k-mers (4-mers)
next, all k-1 (trinucleotide) and k-2 (dinucleotide) frequencies are required, proceeding from the outside in, with internal residues again tending towards N, such that all combinations of residues are visited by the f(aNc) trinucleotide frequency: with a (adenine) and c (cytosine) fixed and the internal position specified as N, f(aNc) = f(atc) + f(aac) + f(acc) + f(agc).
this specifies the numerator of the tetranucleotide relative abundance (lowercase tau)
the denominator comprises only the most specific tetra- and 1-neighboring trinucleotide frequencies, and the mononucleotide frequencies, e.g. [ f(acc), f(accg), f(ccg), f(a), f(c), f(t), f(g) ]
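A sketch of the simplest member of this family, Karlin's dinucleotide relative-abundance delta (plain ACGT strings assumed; the tri/tetranucleotide versions extend the same pattern):

from itertools import product

def rho(seq):
    n = len(seq)
    f1 = {b: seq.count(b) / n for b in "ACGT"}                  # mononucleotide frequencies
    f2 = {a + b: 0.0 for a, b in product("ACGT", repeat=2)}
    for i in range(n - 1):
        f2[seq[i:i+2]] += 1 / (n - 1)                           # dinucleotide frequencies
    return {xy: f2[xy] / (f1[xy[0]] * f1[xy[1]]) for xy in f2}  # relative abundances

def karlin_delta(a, b):
    ra, rb = rho(a), rho(b)
    return sum(abs(ra[xy] - rb[xy]) for xy in ra) / 16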
new graph file format specification (a walk/path is a subclass of an unlabeled graph, where node labels can record visits, path order, and whether the walk is progressive or retrograde)
contig generator method, and contig boundary definition specification
6/28/24 - [ …whoops, forgot the date by 3 x 24hr blocks. ] okay, so the 0.8.4 release should have the graph labeling done.
graph node labeling and classification, and walk strategy
walk strategy, backtrace, and expansion step node labeling patterns need structure
assembler requires color graph feature unimplemented
index features need expanding
index as a .kdb.gi file?
new datatypes
new jsonschemas required:
product_result
the full product (nx3xm), the square product, the comprehension product
walk product (a label and node order specification)
node product (a node ordering and/or enumeration schema)
permutation (range(n)) => n! (n0, n1, n2, n3) (n0, n2, n1, n3) (n0, n3, n2, n1) … etc. for 24 total permutations of the 4 starting items.
combination (abcd x xyz) => ax ay az bx by … etc. (this pairing is a Cartesian product, |A|*|B| pairs; r-combinations of a single set number n!/(r!(n-r)!), and r-permutations n!/(n-r)!) — see the sketch below.
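A quick itertools check of the counts above (toy inputs, assumed for illustration):

from itertools import combinations, permutations, product
from math import factorial

items = ["n0", "n1", "n2", "n3"]
assert len(list(permutations(items))) == factorial(4)   # 24 orderings of 4 items
print(list(product("abcd", "xyz")))                     # ax ay az bx by ... (4*3 = 12 pairs)
assert len(list(combinations(items, 2))) == 6           # C(4,2) = 4!/(2!*2!)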
6/14/24 - Revise README.md from changes to profile subcommand for multi-K and generic ‘prefix’ outfile pattern.
6/11/24 - Index refactor, offset calculations, index table structure
D2*, D2S, and D2 statistics/distances
IN PROGRESS Refactor fileutil/index modules to produce valid index data on file-read
IN PROGRESS Refactor distance.pyx, ensure it compiles and computes successfully
Refactor profile command to accept minK and maxK arguments
Refactor profile command to have ‘prefix’ as a required --flag instead of a trailing positional argument
Default behavior, on single -k, is to create a file named $PREFIX.$K.kdb
-k is now optional
def profile in __init__.py must have logic to determine whether single- or multi-k mode is enabled
Alt behavior, when minK and maxK are given together, is to create files named $PREFIX.$K.kdb for each k until minK/maxK is satisfied
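A hypothetical sketch of that branch logic (flag and function names assumed, not the actual kmerdb CLI):

def profile(prefix, k=None, min_k=None, max_k=None):
    if k is not None:
        ks = [k]                             # single-k mode: one $PREFIX.$K.kdb
    elif min_k is not None and max_k is not None:
        ks = list(range(min_k, max_k + 1))   # multi-k mode: one file per k
    else:
        raise ValueError("need either -k or both minK and maxK")
    return ["{0}.{1}.kdb".format(prefix, k_) for k_ in ks]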
6/8/24 - Index + D2
Fix index subcommand, ensure it stores offsets
D2 statistic in Cython, distances.pyx
Presence absence AND exact k-mer profile match
4/25/24 - a small-RNA idea (and other RNA families)
k-mer compositions and mutational families in small-RNA rich species
k-mer compositions of riboswitches
k-mer compositions of introns, exons (in eukaryotic) and promoters, terminators, orfs, orf families, and operons.
k-mer distances benchmark
Cython pearson
scikit spearman and correlation distance
statsmodels statistics
import statsmodels.api as sm
x = sm.add_constant(x1)  # adds the b0 (intercept) column for the ordinary least squares fit
results = sm.OLS(y, x).fit()
results.summary()
associated graphics for inferences
pearson vs ols R2 from statsmodels
spearman vs pearson vs k on test dataset. Matrix representation in example_report2.
numpy or cython implementation of regression model.
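A minimal NumPy sketch of the statsmodels-free variant contemplated above (x1 and y are hypothetical toy arrays):

import numpy as np

x1 = np.arange(10, dtype=float)
y = 2.0 * x1 + 1.0 + np.random.normal(scale=0.1, size=10)
X = np.column_stack([np.ones_like(x1), x1])   # prepend the b0 (intercept) column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit: [b0, b1]
print(beta)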
4/13/24 - Assembly
Networkx
Assembly via Markov probability, Markov chains, contig definition, locality
Leads to better graphing. Can’t get to an exact solution; simplification requires heuristics and design.
>>> from kmerdb import graph, fileutil
>>> import numpy as np
>>> import networkx as nx
>>> import matplotlib.pyplot as plt
>>> kdb = fileutil.open("path/to/example.8.kdb", mode='r', slurp=True)
>>> kdbg = graph.open("path/to/example.8.kdbg", mode='r', slurp=True)
>>> kmer_ids = kdb.kmer_ids
>>> n1 = kdbg.n1
>>> n2 = kdbg.n2
>>> w = kdbg.w
>>> edge_list = list(zip(n1, n2))
>>> G = nx.Graph()
>>> G.add_nodes_from(kmer_ids)
>>> G.add_edges_from(edge_list)
>>> if not nx.is_planar(G):
...     raise ValueError("Need planar graph to continue")
>>> g = nx.generic_graph_view(G)
>>> # nx.is_tournament(g)  # should not be a tournament
>>> # nx.tournament.hamiltonian_path(g)
>>> # Utility function - # of walks
>>> # num_walks = number_of_walks(g, length=walk_length)
>>>
>>> final_g = nx.restricted_view(G, hidden_nodes, hidden_edges)  # hidden_nodes/hidden_edges defined elsewhere
>>> degree_sequence = sorted((d for n, d in G.degree()), reverse=True)
>>> dmax = max(degree_sequence)
>>> dmax
7
>>> fig = plt.figure("Degree of Cdiff k-mers for k=8 (Max neighbors = 8)")
>>> axgrid = fig.add_gridspec(5, 4)
>>> ax0 = fig.add_subplot(axgrid[0:3, :])   # connected-components panel
>>> ax1 = fig.add_subplot(axgrid[3:, :2])   # degree-rank panel
>>> Gcc = G.subgraph(sorted(nx.connected_components(G), key=len, reverse=True)[0])
>>> help(nx)
[ THIS NEEDS BOTH A PLAIN STDERR AND/OR .logging RELATED INTERFACE, AS WELL AS A ‘RICH’-STYLED OUTPUT. This way logs are ASCII and come from .logging; other stderr content may be printed, stylized by “rich”. ]
Example
[ x ] resume rich text logging to stderr
the reason for the ‘rich’ module would be to show the traceback alongside the relevant loggable line and call stack
Configure kmerdb logger to pass -n, --log-lines from the collected stderr array
Configure kmerdb to log to -l, --log-file as well as stderr/stdout
2. metadata schema
3. usage notes
[ metadata] | command Summary and Usage Reminder
Usage reminder (short form usage_notes text)
[ metadata ]
[ metadata description ]
[ x ] Error/exit note
exit_code
Error summary
traceback
call stack (processed from error text??)
loggable_line (also processed)
Relevant issue
[metadata]
key indices | key arrays/structures
traceback
text description of the process (these should be the sys.stderr messages with the carriage return \r texts)
index-of-error (of the loggable line)
index of error (in the data structure(s)) [ part of metadata ]
str( | loggable line | || | traceback | ) - passed to both ‘rich’ and the logging module (to file and stderr)
[ matched syntax in rich between modules and index of error ]
outputs_directory and output_file(s)
[ x ] end rich formatting (avoids double logging to stderr issue)
x why it’s totally optional at this point.
Logger subfooter
command
params
runtime
Logfile : path/to/logfile.log
“logger” header (logger type, metadata ‘state’ number : int, url of logging configuration README.md, which describes the logging and error blocks)
verbosity level
global/local variables state 1
global/local variables state 2
…etc.
“logger” header (file logger, syntax breakdown)
[ 2 ] Footer note - | ‘metadata’ or ‘data’ or available information at time of program exit. (see below)
assembled before program termination, and a collection of descriptor structures necessary for pinpointing “loggable line” i.e. the metadata structures
spacer
[ x ] end of rich text module preference throughout interfaces, captured in a series of logging variable addresses
access to stderr, rich, and other logging facilities
beginning of secondary logging variables (the structured log data) being used to stdout
this way, the most relevant logging variables are printed to stdout first, without the “usage note, static documentative content”
logging to stderr or the log file continues by virtue of the Python logging module (message assembly, addressing, and passage through the program branches), part of the nascent “logging fnx” feature merger with the appmap rom.
And the primary variable chain, “the outputs”, part of the data|metadata, is captured as the program proceeds task-wise: key variables and indices are printed in rich text after logging, to make stdout valuable, while logging proceeds to stderr by virtue of the internal logging library module, both to (1) the logfile and (2) rich-text-enabled (table support, emphasis) stderr.
And the logger_header
------------------------
[ 1 ] | Description of error capture progress (blame?)
internal_errors variable
sigterm/error capture
accumulated log array (.logging determined)
try: caught error
traceback
modules
usage note
[ 2 ] Footer | command Summary and Usage Reminder
Usage reminder (short form usage_notes text)
[ metadata ]
- metadata
- metadata property
[ x ] Error/exit note
exit_code
Error summary
traceback
call stack (processed from error text??)
loggable_line (also processed)
Relevant issue
[metadata] + usage note (short) on each variable, metadata property, array, custom type, and index value
key indices |
key arrays/structures,
python version (? + citation)
loaded modules (hardcoded from pyproject.toml)
-compiler-
traceback
subcommand usage note
text description of the process (these should be the sys.stderr messages with the carriage return \r texts)
index-of-error (of the loggable line)
index of error (in the data structure(s)) [ part of metadata ]
Today marks the beginning of the end… of the de Bruijn graph format pull-request from branch ‘graph_algo’
I’m doing a little bit better mentally. Learned today about non-stimulant ADHD meds
In hindsight, I’ve never been diagnosed with ADHD. I have reasonable hyper-focus, but I get derailed with alternate versions of … oops I literally forgot what the psychiatrist calls it when you change tasks and get unfocused. Wow.
I like my new therapist/counselor and her level of care seems nice. Let’s see how the next 3 months goes.
Okay, that’s enough about me™.
Project remarks
I’m very happy with the recent additions to the graph_algo branch. The feature ‘seems’ to be working quite well regarding neighboring/subsequent k-mers appended to the id array.
Specifically, I have a --quiet option that will silence most output delivered to the screen, in addition to the verbosity setting.
By DEFAULT I print an obnoxious amount of output to the STDERR stream, even without the verbosity setting changed from the default warning level (-v, -vv).
I believe this demonstrates to the user how adjacencies in the id array are considered, i.e. that they have the k-1 subsequence in common.
These assertions introduced in kmerdb.graph are essential for verifying the adjacency of subsequent k-mers; on failure they propagate an error, which is displayed to the user as a warning
describing the nature of the assertion failures and suggesting the likely reason.
More specifically: it should be added to the README.md that the number of assertion failures should roughly equal the number of reads in a .fq file, triggering the issue via k-mer ids from the end of one read and the beginning of the next.
NOTE: ADJACENCY ERRORS DETECTED: Found 24999 ‘improper’ k-mer pairs/adjacencies from the input file(s),
where subsequent k-mers in the k-mer id array (produced from the sliding window method over input seqs/reads) did not share k-1 residues in common.
These may be introduced in the array from the last k-mer of one seq/read (in the .fa/.fq) and the first k-mer of the next seq/read.
Okay, with this settled, I can now describe any plans for revision to the .kdbg format, as well as a description of a first-pass networkx based solution to graph traversal and stop criterion during contig generation.
With that said, I absolutely need a visualizer at this point to check my work.
Code cleanup
Documentation
Deprecations
strand_specific
all_metadata
IUPAC
README
kmerdb module
[X] kmer.py
[ ] verbose => quiet
[X] graph.py
[X] parse.py
[ ] __init__.py
README.md
[ ] README.md
[ ] Document the new IUPAC strategy for ‘kmerdb.kmer._shred_for_graph’
[ ] Provide
website - matthewralston.github.io/kmerdb
[/] Expanded documentation on subcommands.
[ ] profile
[ ] view
[ ] distance (SWAP ORDER)
[ ] matrix (SWAP ORDER)
[ ] NEW! graph
[ ] kmeans
[ ] hierarchical
[ ] probability
[ ] DONT DO YET graph/assembly page
[/] API
[ ] reading .kdbg or .kdb files
[ ] writing .kdbg or .kdb files
Assembly algorithm planning
CPU (NetworkX) implementation (overview)
Stop criterion
[ ] when the necessary traversals are completed
[ ] How do we rank these?
Lost comments
What the sort order of the residue encoding into bits does to the bit encoding of a single letter vs a string
Writing the goals down for the pearson’s r saturation behavior with depth
Implement a square on square matrix functionality on GPU with cupy in pyx?
Cupy
Literally failing to document hidden search/link-traversal features…
Remembering that it’s only supposed to be a k-mer count vector storage medium right now
Scoping scoping where does it end
Is my life’s work pointless?
Losing my best friend because of argument
Sent 1 basic sorry, got a minor acknowledgement.
Smoking habit down to 1 cig a day (just bored, less and less dynamism of focus).
Recalling the Cortizone™
Apply gently
Reminding myself I don’t believe in these human-type humans. Humans about other humans seems like a soft, subjective, and wishy-washy skill to develop and I don’t trust it.
assert suffix == ".kdb", "provided filename did not end in .kdb"
shutil.move(fasta, prefix + ".fa")
write kdb file (prefix + ".kdb")
Rmd report2
algorithm profiling
kdb profile k x time x cpu (z)
we need to choose a range of k that is meaningful and explain why.
the choice of k of 8-12 is convenient because it means
we don’t have to pay for extra memory. This will be manageable on any number of cores
with at least 32 GB of memory for about 20 samples.
According to the following graph, the uncompressed size of the sparse matrix (n x 4^k)
may reach the low double digits of gigabytes per profile,
but the value of these profiles grows with the increased cost as well.
So when we look at these genomes with this degree of sensitivity, which the literature has found substantial in the neighborhood of k=10-12,
more characterizations become possible, and this places more value on the expected scaling behavior of this program.
The goal is most likely not to reinvent the wheel. Since this is an academic package at this point, we feel that it is necessary and important to couple this with a graph database
We have selected the RDF format going forward and expect that long term use of Amazon Neptune might be an important source of understanding that we can get from users uploading their graphs, sparse or otherwise, to a giant Neptune repository.
It could be an entirely new sequence database format.
kdb distance correlation <fasta|fastq>
profile reads sam/bam
use pysam to iterate over reads, creating a profile in the process.
Likelihood of dataset given prior k-mer profiles
Calculate graph properties indicative of de Bruijn graph collapse
‘kmerdb random’ sequence simulator
given a certain length of sequence N, suggest a sequence that best solves the k-mer abundance graph
Connect this to meme suite
Hypotheses:
Suppose that k-mer spectra have a positive and negative saturation direction.
In this way, more specific signals and antisignals could be surmised from samples with enough resolution, temporal or otherwise resolved by covariates.
Think of what could happen if the signals and antisignals were resolved on the order of genes, you could detect gene expression levels with it.
kmerize
to use bed/gff features to select reads from bam/bai using pysam
and then create sparse profiles for each feature
to split a bam according to gff/bed features, putting the output in a directory (see the sketch below)
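A hedged sketch of the idea (paths and the shred helper are hypothetical; pysam’s AlignmentFile.fetch requires an indexed .bam/.bai):

import pysam
from collections import Counter

def shred(seq, k):
    return (seq[i:i+k] for i in range(len(seq) - k + 1))  # sliding-window k-mers

def profiles_per_feature(bam_path, bed_path, k=8):
    profiles = {}
    with pysam.AlignmentFile(bam_path, "rb") as bam, open(bed_path) as bed:
        for line in bed:
            chrom, start, end, *rest = line.rstrip().split("\t")
            name = rest[0] if rest else "{0}:{1}-{2}".format(chrom, start, end)
            counts = Counter()
            for read in bam.fetch(chrom, int(start), int(end)):
                if read.query_sequence:
                    counts.update(shred(read.query_sequence, k))
            profiles[name] = counts   # sparse per-feature k-mer profile
    return profiles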
Learn the RDF spec
Think of a specification for each node.
Manifold learning
Isomap (derived from multidimensional scaling (MDS) or Kernel PCA)
Lower-dimensional projection of the data preserving geodesic distances between all points
(Modified) Locally Linear Embedding
Lower dimensional projection of the data preserving local neighborhood distances
locally_linear_embedding or LocallyLinearEmbedding with method="modified"
t-SNE
While isomap, LLE, and variants are best suited to unfold a single continuous low-dimensional manifold
t-SNE will focus on the local structure of the data and will tend to extract clustered local groups of samples.
This ability to group samples based on the local structure might be beneficial to visually disentangle a dataset that comprises several manifolds at once.
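A sketch of the three embeddings discussed, applied to a hypothetical (samples x k-mers) count matrix with scikit-learn:

import numpy as np
from sklearn.manifold import Isomap, LocallyLinearEmbedding, TSNE

X = np.random.poisson(5, size=(20, 4**4)).astype(float)   # toy stand-in for a profile matrix

iso = Isomap(n_components=2).fit_transform(X)
mlle = LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                              method="modified").fit_transform(X)
tsne = TSNE(n_components=2, perplexity=5).fit_transform(X)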
these need to be calculated differently for efficiency/memory reasons
repetitive summation/multiplication and not direct to unit vector transformation
1. Pearson correlation coefficient of counts? of unit vector?
2. euclidean distance of unit vectors?
3. sort by count of vector/index and Spearman
jaccard
presence/absence (the k-mer is observed in both profiles? then it’s in the intersection)
similar count within a tolerance… vs Spearman?
MUMi distance
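A sketch of several of the candidate metrics above on two k-mer count vectors (toy Poisson counts assumed):

import numpy as np
from scipy.stats import pearsonr, spearmanr

a = np.random.poisson(5, 256)
b = np.random.poisson(5, 256)

r, _ = pearsonr(a, b)                          # 1. Pearson on raw counts
ua, ub = a / np.linalg.norm(a), b / np.linalg.norm(b)
euclid = np.linalg.norm(ua - ub)               # 2. Euclidean distance of unit vectors
rho, _ = spearmanr(a, b)                       # 3. rank-based comparison (Spearman)
jacc = np.sum((a > 0) & (b > 0)) / np.sum((a > 0) | (b > 0))  # presence/absence Jaccard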
jsonify
transform the de Bruijn graph into JSON
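One plausible route is networkx’s node-link representation (stand-in graph shown; the real input would be the .kdbg edge list):

import json
import networkx as nx

G = nx.Graph([(0, 1), (1, 2)])        # stand-in for the k-mer graph
payload = nx.node_link_data(G)        # dict of nodes, links, and graph metadata
print(json.dumps(payload, indent=2))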
Partitioning experiment
Use khmer to partition reads from an example dataset
Similarity metrics between partition fastas and whole profile
Annotate kdb metadata to include Markov probabilities of single sequences to partition
How do we describe or select subgraphs based on the partition information?
Presence of an Eulerian walk within a partition AND whether the Eulerian walk extends too far into other partitions
Key reads AND k-mers involved in complex graph structures near partition bridges
Suggestions for deeper sequencing or skew in partition compositions to make up for low depth
Number of partition bridges vs subsampling
Number of partition bridges vs unique k-mer count / partition
Other metrics besides unique k-mer count
Overlap k-mer count
unique k-mers per total k-mers
unique k-mers per partitioned reads
How do we describe subgraph features worth considering, given the partition
Node connectivity stats
kdb filtering ( retrieve only k-mers with partition, connectivity, Markov probability cutoffs, participant in Eulerian walk)
Other functions
Partitionizer (partition fasta and genomic fastas; completeness of each partition’s capture of the ideal composite)
How much more data do I need from each partition to minimize bridges, maximize genomic coverage, and maximize orthogonality to other partitions
Given a partition fasta and a genomic fasta
Could estimate the sequencing depth and complexity required to minimize most partition bridges
Could also estimate the size and partitioning required to maximize partition orthogonality
Sampleizer (one genome fasta; dial up/back efforts in improving this partition/sampling)
Does my sampling protocol for this partition only have enough uniqueness to cover the one major walk, or is most of the data getting in the way of the other species at the current composite compositions?
How much of the genomic profile is covered by the partition?
At a certain orthogonality metric per sampling from the genomic fasta, does the amount of uniqueness orthogonality recovered by additional depth tend to clarify the partition, or obfuscate other operations on leading partitions?
Profilizer (all genome fastas; snapshot/metrics, as composite is improved)
Construct a perfect profile from all genomes and integrate
Similarities between individual profiles and perfect composite (Ideal distance metrics for each profile addition to perfect the composite)
Similarities between imperfect composite and perfect composite (How much orthogonality and completeness is currently recovered)
Similarities between imperfect partitions and perfect composite (How much orthogonality is lost due to current imperfect partitioning)
Similarities between imperfect composite and imperfect partitions (How much orthogonality is lost due to current imperfect partitioning)
walker (calculate Eulerian walks, i.e. walks that traverse every edge exactly once, maximizing path length under constraints)
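A sketch of the walker’s core using networkx’s Euler utilities (stand-in graph assumed):

import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])   # stand-in k-mer graph
if nx.has_eulerian_path(G):
    walk = list(nx.eulerian_path(G))             # each edge traversed exactly once
    print(walk)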