
Store key dataset blocks in database for fast visiting in typical scenarios #986

Open
zaychenko-sergei opened this issue Dec 10, 2024 · 0 comments

Depends on #978

Key dataset blocks are all kinds of blocks except those that represent data nodes (AddData, ExecuteTransform).
Most of them, like SetDataSchema, Seed, and SetPollingSource, can only have one active version at any given moment.
Only AddPushSource may have several active versions, which are distinguishable by a source name.

Key dataset blocks are often requested in all parts of the system, including:

  • HTTP/GraphQL APIs (under transaction)
  • planning phases of long operations (under transaction)
  • execution phases of long operations (no transaction)

Implement a mechanism that allows satisfying most visitors with the database-cached versions of key events.

When a transaction is open, a visit could translate into a database query, given the event type and a sequence number or hash representing the upper node boundary. Without a transaction, the blocks need to be pre-fetched at the planning phase in some way and propagated to the execution phases.

A caching layer may be added for both cases (with and without a transaction), so that the same type of block is not re-queried multiple times within the same operation.

Although many scanning algorithms expect to see the "last data block", attempting to replicate its history in the database is senseless, as it would create a copy of the dataset metadata. It looks cheaper to access the chain directly, ensuring the scanning starts from the safe version of HEAD that was known at the beginning of the operation.

Note that some visiting patterns inherently expect to iterate over the entire dataset, e.g. transform planning, sync, querying. Here we can't really optimize accesses at all, but we must ensure that a parallel write does not confuse this scanning with events committed by a parallel transaction.

Key dataset blocks must be kept in sync transactionally:

  • implement an indexing solution for the initial fill of existing datasets that are scanned this way for the first time
  • implement an update procedure that follows a change of the HEAD reference, updating key dataset blocks in the same transaction as the reference
@zaychenko-sergei zaychenko-sergei added enhancement New feature or request rust Pull requests that update Rust code performance labels Dec 10, 2024
@zaychenko-sergei zaychenko-sergei self-assigned this Dec 10, 2024