FM index cursor (de-)serialization #2044

Closed
tloka opened this issue Aug 17, 2020 · 3 comments · Fixed by #2048
Labels
question a user question how to do certain things

Comments

@tloka
Contributor

tloka commented Aug 17, 2020

Platform

Question

I have a question regarding the new FM index implementation when trying to upgrade from SeqAn2 to SeqAn3.
When using the SeqAn FM index data structure with my own algorithm, it was previously possible to obtain the node position in the underlying suffix tree and use it later to access the same node. This is necessary, for example, when writing interim states of the algorithm to disk and reading them back later to continue the index search.

In SeqAn2, I used the following way to achieve this:

// Index config
typedef seqan::FastFMIndexConfig<void, uint64_t,2 ,1> FMIConfig;

// Index type
typedef seqan::Index<seqan::StringSet<seqan::DnaString>, seqan::FMIndex<void, FMIConfig> > FMIndex;

// Vertex descriptor
typedef seqan::Iter<FMIndex,seqan::VSTree<seqan::TopDown<seqan::Preorder>>>::TVertexDesc FMVertexDescriptor;

// [...]
// Build index and do some search
// [...]

// Now I can simply use the FMVertexDescriptor to store the current index position of the algorithm:
// vDesc is an instance of FMVertexDescriptor
std::vector<char> data(sizeof(FMVertexDescriptor)); // allocate the buffer before copying into it
char* d = data.data();
memcpy(d, &vDesc, sizeof(FMVertexDescriptor));

// [...]

// And do the same to create a new vertex descriptor and continue the algorithm
void deserialize(char * d)
{
  FMVertexDescriptor vDesc;
  memcpy(&vDesc, d, sizeof(FMVertexDescriptor));
  bytes += sizeof(FMVertexDescriptor);
  //[...]
}
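The memcpy round trip described above can be tried in isolation. What follows is a minimal sketch, not SeqAn2 code: `vertex_desc` and its fields are hypothetical stand-ins for the real vertex descriptor, and the approach assumes the type is trivially copyable, which the `static_assert` checks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a vertex descriptor; NOT the SeqAn2 type.
struct vertex_desc
{
    std::uint64_t range_begin;
    std::uint64_t range_end;
    std::uint32_t depth;
};

// A memcpy round trip is only well defined for trivially copyable types.
static_assert(std::is_trivially_copyable<vertex_desc>::value,
              "vertex_desc must be trivially copyable");

// Append the raw bytes of `v` to the end of `buffer`.
void serialize(vertex_desc const & v, std::vector<char> & buffer)
{
    std::size_t const offset = buffer.size();
    buffer.resize(offset + sizeof(vertex_desc)); // make room before copying
    std::memcpy(buffer.data() + offset, &v, sizeof(vertex_desc));
}

// Read one descriptor starting at `offset` and advance `offset` past it.
vertex_desc deserialize(std::vector<char> const & buffer, std::size_t & offset)
{
    assert(offset + sizeof(vertex_desc) <= buffer.size());
    vertex_desc v;
    std::memcpy(&v, buffer.data() + offset, sizeof(vertex_desc));
    offset += sizeof(vertex_desc);
    return v;
}
```

Calling `serialize` several times packs descriptors back to back, and repeated `deserialize` calls with a shared offset read them out in the same order. The bytes are only portable between builds with the same struct layout and endianness.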

When I tried to use the SeqAn3 FM index for the same thing, I noticed that in principle this should be possible using seqan3::fm_index_cursor, which contains the node used for searching the index. Like this (minimal example):

// Assuming index_t is the index type used
index_t index;

// [...]
// build index
// [...]

seqan3::fm_index_cursor<index_t> cursor(index);
cursor.extend_right('G'_dna5);
// works fine so far. 

// How could I now serialize the cursor, write it to a file, 
// and load it later again to create a new cursor and continue search?

My question: As far as I can see, there is no way to access the private members node, parent_lb, parent_rb and sigma of seqan3::fm_index_cursor<index_t> in order to store the cursor location. It also does not seem possible to create a cursor at a previously calculated position, e.g. through a constructor that takes the members mentioned above. Is there currently any way to serialize or obtain the underlying values of an FM index cursor position and create a new cursor from them later? Or is there any other way to perform one part of the search, store the current state, and continue the search later with a new cursor instance?

@tloka tloka added the question a user question how to do certain things label Aug 17, 2020
@eseiler
Member

eseiler commented Aug 18, 2020

Hey Tobias,

I opened #2048 to offer serialisation for the cursors. You can check out the code if you want to try it.

Here are the docs about serialisation: https://docs.seqan.de/seqan/3-master-user/howto_use_cereal.html

Your example could look like this:

index_t index;

// build index

seqan3::fm_index_cursor<index_t> cursor(index);
cursor.extend_right('G'_dna5);

// Create archive storing to some temporary file
std::ofstream os(tmp_file.get_path(), std::ios::binary);
cereal::BinaryOutputArchive archive(os); 
archive(cursor);

Later:

index_t index;

// build index

seqan3::fm_index_cursor<index_t> cursor(index);

// Read from file
std::ifstream is(tmp_file.get_path(), std::ios::binary);
cereal::BinaryInputArchive archive(is);
archive(cursor); // cursor now in same state as after calling `cursor.extend_right('G'_dna5);`
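The store-then-restore pattern above can also be tried without cereal, using only standard file streams. This is a rough sketch and not what #2048 implements: `cursor_state`, its fields, `save_state` and `load_state` are all hypothetical, and the real private members of seqan3::fm_index_cursor may look different.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical cursor state; NOT the actual layout of seqan3::fm_index_cursor.
struct cursor_state
{
    std::uint64_t lb;    // assumed: left bound of the current interval
    std::uint64_t rb;    // assumed: right bound of the current interval
    std::uint64_t depth; // assumed: length of the matched query so far
};

// Write the raw bytes of `s` to `path`; returns false if the write failed.
bool save_state(cursor_state const & s, std::string const & path)
{
    std::ofstream os(path, std::ios::binary);
    os.write(reinterpret_cast<char const *>(&s), sizeof(s));
    os.flush();
    return static_cast<bool>(os);
}

// Read the state back from `path`; returns false on a missing or short file.
bool load_state(cursor_state & s, std::string const & path)
{
    std::ifstream is(path, std::ios::binary);
    is.read(reinterpret_cast<char *>(&s), sizeof(s));
    return is.gcount() == static_cast<std::streamsize>(sizeof(s));
}
```

As in the cereal example, a restored state is presumably only meaningful together with the same index it was computed on, so the index has to be rebuilt or loaded identically before the search continues.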

@tloka
Contributor Author

tloka commented Aug 18, 2020

Hey Enrico,
thank you for the quick response! I will have a look in the next few days to see whether the cereal solution fits my needs (though I am optimistic it will).
Thanks!
Tobias

@tloka
Contributor Author

tloka commented Sep 2, 2020

Sorry for the delay, was pretty busy with other stuff the last few days.
Integrated the serialization into my code now, works like a charm. Thank you!

2 participants