KFF 1.1 ideas #8

Open
yoann-dufresne opened this issue Apr 6, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@yoann-dufresne
Collaborator

yoann-dufresne commented Apr 6, 2021

Upcoming ideas for v1.1. Do not hesitate to propose other features!
(list updated with new ideas)

  • i section: Index section. Registers the distance to some of the following sections (not necessarily all of them). This will allow parallel reading of a file.
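
To make the idea concrete, here is a minimal sketch of how an index of relative distances could enable independent section reads. The layout is purely hypothetical and is not the KFF specification: the function names, the length-prefixed sections, and the fixed-width integers are all assumptions made for illustration.

```python
import io
import struct


def write_sections(payloads):
    """Serialize payloads as length-prefixed sections preceded by a toy index.

    Hypothetical layout (NOT the KFF format):
      [n: uint64][n relative distances: int64 each][sections...]
    Each distance is measured from the end of the index to a section start.
    """
    body = io.BytesIO()
    distances = []
    for payload in payloads:
        distances.append(body.tell())
        body.write(struct.pack("<Q", len(payload)))
        body.write(payload)
    index = struct.pack("<Q", len(distances))
    index += b"".join(struct.pack("<q", d) for d in distances)
    return index + body.getvalue()


def read_index(buf):
    """Return the absolute start position of every indexed section."""
    (n,) = struct.unpack_from("<Q", buf, 0)
    distances = struct.unpack_from(f"<{n}q", buf, 8)
    index_end = 8 + 8 * n
    return [index_end + d for d in distances]


def read_section(buf, start):
    """Read one section directly from its start position."""
    (size,) = struct.unpack_from("<Q", buf, start)
    return bytes(buf[start + 8 : start + 8 + size])


data = write_sections([b"ACGT", b"", b"TTGCA"])
starts = read_index(data)
sections = [read_section(data, s) for s in starts]
```

Because each start position is known up front, the calls to `read_section` do not depend on one another and could be handed to separate workers.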

yoann-dufresne added the enhancement label Apr 6, 2021

@natir
Contributor

natir commented Apr 6, 2021

What about a specialization of the raw and minimizer sections for counts?

About the section index: is it a distance in the raw file? How would it work with a compressed version of a KFF file?

@yoann-dufresne
Collaborator Author

Data specialization is a good point to think about.
But I think that version 1 of KFF will focus on the sequence part only.
Maybe v2.0 will include such ideas, but we first have to publish the simplest version of the format.

@yoann-dufresne
Collaborator Author

For the index, it is a distance in the uncompressed file.
I do not know how it will work inside compressed files. Do you have any suggestions?

@natir
Contributor

natir commented Apr 6, 2021

I agree, specialization would be hard and/or inelegant to include without breaking compatibility.

I think that if we want to be able to do parallel or random access in a compressed KFF file, we have to apply the method chosen for BAM.

We don't compress the whole file; we compress blocks, and the index indicates the beginning of these blocks.
We could apply this for version 1.1, and we could imagine that blocks can be compressed or not. When reading the file, we use the magic number to know which decompression algorithm to use.
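
A rough sketch of that block scheme, under assumptions: gzip stands in for whichever compressor would be chosen (BAM's BGZF blocks are gzip-compatible), `RAW_MAGIC` is a made-up marker for uncompressed blocks, and the `(start, length)` index is hypothetical rather than anything defined by KFF.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # real gzip magic number
RAW_MAGIC = b"RW"         # hypothetical marker for an uncompressed block


def pack_blocks(blocks, compress_flags):
    """Store each block compressed or raw; return the data and a block index."""
    out = bytearray()
    index = []  # (start, length) of each stored block
    for block, compress in zip(blocks, compress_flags):
        stored = gzip.compress(block) if compress else RAW_MAGIC + block
        index.append((len(out), len(stored)))
        out += stored
    return bytes(out), index


def read_block(data, start, length):
    """Decode a single block, choosing the decoder from its magic number."""
    stored = data[start : start + length]
    if stored[:2] == GZIP_MAGIC:
        return gzip.decompress(stored)
    if stored[:2] == RAW_MAGIC:
        return stored[2:]
    raise ValueError("unknown block magic number")


data, index = pack_blocks([b"ACGT" * 100, b"TTGCA"], [True, False])
second = read_block(data, *index[1])  # random access: block 0 is never touched
```

Only the requested block is decoded, which is what makes random or parallel access possible even when some blocks are compressed.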

@yoann-dufresne
Collaborator Author

MAJOR UPDATE: The index section is moving to version 1.0

@lrobidou

Useful for at least my use case:

  • a flag indicating that a minimizer appears only in one minimizer section, and never anywhere else
  • a flag indicating that all k-mers in the block share the same count

Random idea:
A flag indicating that an index section is present and contains information about the data, e.g. the first k-mer above count x occurs in section u, and the last one below count x occurs in section v. This would allow a binary search on the data.
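
A small sketch of what such a search might look like. It assumes, beyond what is stated above, that sections are ordered by increasing count and that the index stores the largest count of each section; the function name and the `max_counts` list are hypothetical.

```python
from bisect import bisect_left


def first_section_with_count_at_least(max_counts, x):
    """Return the first section that can hold a k-mer with count >= x.

    max_counts[i] is the largest count in section i; the list is assumed
    to be sorted in non-decreasing order. Returns None if no section qualifies.
    """
    i = bisect_left(max_counts, x)
    return i if i < len(max_counts) else None


max_counts = [3, 10, 10, 57]  # hypothetical per-section maxima
u = first_section_with_count_at_least(max_counts, 4)
```

With these values, `u` is 1: section 0 tops out at count 3, so a reader can skip it without opening it.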

3 participants