feat(storage): add core enums, traits and functions for versioned storage #1172

Merged
merged 4 commits into from
Feb 21, 2024

Conversation

morph-dev
Collaborator

@morph-dev morph-dev commented Feb 15, 2024

What was wrong?

Our storage system needs an update. See #1157 for details.

How was it fixed?

Created core types for the new versioned storage system:

  • ContentType - the type of content that is stored
  • StoreVersion - the version of the store used
  • VersionedContentStore - the trait for the actual content store
    • there should be at most one VersionedContentStore for each StoreVersion (ideally exactly one, but it is fine to be missing some implementations until all existing ones are migrated)
  • MemoryContentStore - in-memory implementation of the VersionedContentStore
  • create_store function - creates an instance of the VersionedContentStore and migrates from the previous version (if needed and if the migration is implemented)
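
To make the relationship between these types concrete, here is a minimal, std-only sketch of how they might fit together. It is not the code from this PR: the method signatures, the `VersionRegistry` map (a hypothetical stand-in for the persisted version record), the error variant, and the choice of `IdIndexed` as the version of `MemoryContentStore` are all assumptions made for illustration.

```rust
use std::collections::HashMap;

#[derive(Copy, Clone, Debug, Eq, PartialEq, Hash)]
pub enum ContentType {
    Beacon,
    History,
    State,
}

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub enum StoreVersion {
    LegacyContentData,
    IdIndexed,
}

// Hypothetical error type; the real ContentStoreError has other variants.
#[derive(Debug, PartialEq)]
pub enum ContentStoreError {
    UnsupportedMigration { from: StoreVersion, to: StoreVersion },
}

/// Hypothetical stand-in for persisted version info: content type -> stored version.
pub type VersionRegistry = HashMap<ContentType, StoreVersion>;

pub trait VersionedContentStore: Sized {
    /// The version implemented by this store.
    fn version() -> StoreVersion;
    /// Migrates existing data from `old_version` to `Self::version()`.
    fn migrate_from(content_type: ContentType, old_version: StoreVersion)
        -> Result<(), ContentStoreError>;
    /// Creates the store. Meant to be called only through `create_store`.
    fn create(content_type: ContentType) -> Result<Self, ContentStoreError>;
}

pub struct MemoryContentStore {
    pub content_type: ContentType,
}

impl VersionedContentStore for MemoryContentStore {
    // Assumption: which version the in-memory store reports is arbitrary here.
    fn version() -> StoreVersion {
        StoreVersion::IdIndexed
    }

    // An in-memory store has nothing to migrate from, so every request fails.
    fn migrate_from(_content_type: ContentType, old_version: StoreVersion)
        -> Result<(), ContentStoreError> {
        Err(ContentStoreError::UnsupportedMigration { from: old_version, to: Self::version() })
    }

    fn create(content_type: ContentType) -> Result<Self, ContentStoreError> {
        Ok(Self { content_type })
    }
}

/// Looks up the stored version, migrates if it differs, records the new
/// version, and only then creates the store.
pub fn create_store<S: VersionedContentStore>(
    content_type: ContentType,
    registry: &mut VersionRegistry,
) -> Result<S, ContentStoreError> {
    if let Some(old_version) = registry.get(&content_type).copied() {
        if old_version != S::version() {
            S::migrate_from(content_type, old_version)?;
        }
    }
    registry.insert(content_type, S::version());
    S::create(content_type)
}
```

In this sketch a failed migration leaves the recorded version untouched, so a later run can retry or report the same error.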

It's worth noting that one network might want to use multiple VersionedContentStore instances (for example, this might be the case for beacon), in which case it will need some custom logic on top of them.

The new schema (different table schema and different prune logic) will be added in a follow-up PR.

To-Do

Collaborator Author

@morph-dev morph-dev left a comment

I added a few comments.
Let me know if you think I should add some comments in the code as well.

#[derive(Copy, Clone, Debug, Display, Eq, PartialEq, EnumString, AsRefStr, EnumIter)]
#[strum(serialize_all = "snake_case")]
pub enum ContentType {
Beacon,
Collaborator Author

These are mostly placeholders for now. Their purpose is to define the type of content that is stored in one store, and to let the store differentiate between types (e.g. a different column in a table, or a completely different table name).
@ogenev Maybe Beacon should have multiple types?

Member

We currently store four types for beacon:

  • LightClientBootstrap (in the content_data table, but we will need to purge any values older than the subjectivity period (a few months))
  • LightClientUpdate (in the lc_update table)
  • LightClientOptimisticUpdate/LightClientFinalityUpdate (those are stored in a cache, because we keep only the latest values).

Collaborator Author

So for the LightClientBootstrap, there is no concept of distance (in deciding whether we store it) or of a storage limit?

Meaning, we should store all fresh content and discard/purge all older content, right?

trin-storage/src/versioned/mod.rs (outdated, resolved)

/// Creates the instance of the store. This shouldn't be used directly. Store should be
/// created using `create_store` function.
fn create(
Collaborator Author

I don't like that anybody can just call this (as the comment says, the create_store function should be used), but I don't know how to enforce that.

I'm open to suggestions.
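
One possible way to restrict who can call `create` is a token type that only the module owning `create_store` can construct. The sketch below is an illustration of that idea, not something proposed or adopted in this PR; `CreateToken`, the `versioned` module layout, the simplified `create` signature, and `MemoryStore` are all hypothetical.

```rust
mod versioned {
    /// Token with a private field: it can be named anywhere, but only code
    /// inside this module can construct it.
    pub struct CreateToken(());

    pub trait VersionedContentStore: Sized {
        /// Requires a `CreateToken`, so callers outside this module cannot
        /// invoke `create` directly.
        fn create(name: &str, token: CreateToken) -> Self;
    }

    /// The only place where a `CreateToken` is minted.
    pub fn create_store<S: VersionedContentStore>(name: &str) -> S {
        // Version lookup and migration would happen here before creation.
        S::create(name, CreateToken(()))
    }
}

pub struct MemoryStore {
    pub name: String,
}

impl versioned::VersionedContentStore for MemoryStore {
    fn create(name: &str, _token: versioned::CreateToken) -> Self {
        MemoryStore { name: name.to_string() }
    }
}
```

Outside `versioned`, an expression such as `versioned::CreateToken(())` does not compile because the field is private, so `versioned::create_store::<MemoryStore>("history")` is the only way to obtain a store. An implementation could still leak the token it receives, so this raises the bar rather than making misuse impossible.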

) -> Result<S, ContentStoreError> {
let conn = config.sql_connection_pool.get()?;

let old_version = lookup_store_version(content_type, &conn)?;
Collaborator Author

We should probably have some logic to detect a legacy version (e.g. Legacy*).

It's not important until we have implementations.

Once this PR is merged, I will create an issue to track this.

trin-storage/src/lib.rs (outdated, resolved)
pub use store::VersionedContentStore;
pub use utils::create_store;

#[derive(Copy, Clone, Debug, Display, Eq, PartialEq, EnumString, AsRefStr, EnumIter)]
Collaborator

  • Are we actually using all of these derived traits? I would avoid adding a derived trait unless it's necessary
  • I think a simple impl ToString for ContentType would accomplish all you need here. So far, in this codebase, we've done this rather than use the strum crate. Imo, the priority here is to maintain consistency within the codebase. Now, the question as to whether strum is better than impl ToString is a valid question. But that should be addressed separately, and if we decide to use strum, then it's worth combing the codebase and making all the other changes.

Collaborator Author

  • I removed the derives that I'm not using.
  • I found only one impl ToString and two usages of strum. I assume you mean fmt::Display + FromStr, because we have many of those.

I personally prefer strum because it's easy to use and covers many options and use cases. If there is no strong objection, I would replace all current usages with it.
If the preference is not to use strum, I will change this one as well.
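
For comparison, this is roughly what the hand-written `fmt::Display` + `FromStr` alternative would look like for one of the enums. It is a sketch rather than code from the repository: `ParseStoreVersionError`, `ALL`, and `as_str` are hypothetical names, and the strings mirror what the `snake_case` serialization setting is expected to produce.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub enum StoreVersion {
    LegacyContentData,
    IdIndexed,
}

impl StoreVersion {
    /// Every variant, kept in one place so parsing cannot miss one silently.
    const ALL: [StoreVersion; 2] = [StoreVersion::LegacyContentData, StoreVersion::IdIndexed];

    /// The snake_case name used when the version is written out as text.
    fn as_str(&self) -> &'static str {
        match self {
            StoreVersion::LegacyContentData => "legacy_content_data",
            StoreVersion::IdIndexed => "id_indexed",
        }
    }
}

impl fmt::Display for StoreVersion {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

/// Hypothetical error type carrying the text that failed to parse.
#[derive(Debug, Eq, PartialEq)]
pub struct ParseStoreVersionError(pub String);

impl FromStr for StoreVersion {
    type Err = ParseStoreVersionError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Self::ALL
            .iter()
            .copied()
            .find(|version| version.as_str() == s)
            .ok_or_else(|| ParseStoreVersionError(s.to_string()))
    }
}
```

The trade-off is visible in the size: the derive-based version needs two attribute lines, while the manual one needs a new `match` arm and a new `ALL` entry for every variant added.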

trin-storage/src/versioned/utils.rs (outdated, resolved)
trin-storage/src/versioned/utils.rs (outdated, resolved)
#[strum(serialize_all = "snake_case")]
pub enum StoreVersion {
LegacyContentData,
IdIndexed,
Collaborator

I'm not sure I follow the vision here, if you can help me understand. For each new "version" we would add an entry to the StoreVersion enum? Is this better than just using numbers or semver (might be overkill)? If we just use these "strings" to identify each version...

  • it is helpful to understand the changes included in the update (though this can be accomplished with comments)
  • it's not intuitive to understand the order of the upgrades (though this can be accomplished with comments)
    I'm not sure, it just seems like using something like v0, v1 .. v100 is a bit more intuitive

Collaborator Author

The idea is that different content types can be stored in completely different schemas that are not compatible, with no transition possible between them (e.g. the current content_data and lc_update tables, or MasterAccumulator as another StoreVersion). So store versions don't have to be just for the portal network types.

So versioning using numbers or v1, v2, ... wouldn't work (because it doesn't tell you which type it is for), but something like IdIndexedV1, IdIndexedV2, ... , LightClientUpdatesV1, LightClientUpdatesV2, ... could.
I'm not set on the names and what they should be, but I would be happier if they described at least which content type they are for (and maybe how they differ from other versions).

Another idea is that for each StoreVersion, we should have exactly one implementation of the VersionedContentStore. The system should be flexible so that each implementation can do whatever it wants: be completely in memory, use some DB other than SQLite, use a simple file (e.g. what MasterAccumulator is doing), or anything else.

In the near future, we should have something like this:

pub enum ContentType {
    Beacon,
    History,
    State,
    LightClient, // maybe not the best name
}

pub enum StoreVersion {
    IdIndexedLegacy, // current `content_data` table
    IdIndexedV1, // upcoming new version
    LightClientUpdatesV1, // current `lc_update` table
}

with the following usages:

  • HistoryStorage
    • IdIndexedLegacy with ContentType::History (with transition to IdIndexedV0 when we think it's ready)
  • BeaconStorage - two content stores
    • IdIndexedLegacy with ContentType::Beacon (with transition to IdIndexedV0 when we think it's ready)
    • LightClientUpdatesV1 with ContentType::LightClient.
  • StateStorage
    • IdIndexedV0 with ContentType::State

CREATE TABLE IF NOT EXISTS store_info (
content_type TEXT PRIMARY KEY,
version TEXT NOT NULL
)";
Collaborator

Is it worth adding a timestamp here, so that we can identify when the db was last updated?

Collaborator Author

I can add it, but I'm not sure why that timestamp would be useful and how it will be used.

Collaborator

I don't have a great reason... Maybe just to check when the db was last migrated? But I'm not sure that's a strong reason, since the version to which it was last migrated is already available, and that's the important information. So feel free to disregard

Collaborator Author

@morph-dev morph-dev left a comment

I left a more detailed explanation of my idea in a comment on the trin-storage/src/versioned/mod.rs file.

I'm happy to have a video call to discuss things so we can move faster with this.

@ogenev
Member

ogenev commented Feb 19, 2024

I don't understand how this migration system would work in practice. Here is how I imagine this:

// `OptionalExtension` provides the `.optional()` call used below.
use rusqlite::{params, Connection, OptionalExtension, Result};

struct Migration {
    version: i64,
    up: &'static str,
}

const MIGRATIONS: &[Migration] = &[
    Migration {
        version: 1,
        up: "CREATE TABLE IF NOT EXISTS  content_store (version INTEGER PRIMARY KEY)",
    },
    Migration {
        version: 2,
        up: "CREATE TABLE IF NOT EXISTS table_name1 (...)",
    },
    Migration {
        version: 3,
        up: "modify content_store table (...)",
    },
    Migration {
        version: 4,
        up: "CREATE TABLE IF NOT EXISTS table_name2(...)",
    },
    // Add more migrations here as needed.
];

// `Connection::transaction` takes `&mut self`, so the connection must be mutable.
pub fn migrate(conn: &mut Connection) -> Result<()> {
    let tx = conn.transaction()?;

    tx.execute(
        "CREATE TABLE IF NOT EXISTS content_store (version INTEGER PRIMARY KEY)",
        params![],
    )?;

    let current_version: i64 = tx
        .query_row("SELECT version FROM content_store ORDER BY version DESC LIMIT 1", params![], |row| row.get(0))
        .optional()?
        .unwrap_or(0);

    for migration in MIGRATIONS {
        if migration.version > current_version {
            tx.execute(migration.up, params![])?;
            tx.execute(
                "INSERT INTO content_store (version) VALUES (?1)",
                params![migration.version],
            )?;
        }
    }

    tx.commit()
}

Maybe we are trying to accomplish a similar thing here but I'm not sure if I understand it.

@morph-dev
Collaborator Author

I don't think we are trying to accomplish similar things.

In my model, you have to manually write the transition from VersionA to VersionB (but you can reuse and chain transitions if that makes sense).

The way I envisioned it, migration from one version to another can be quite complicated and is not always possible via SQL alone. For example, to migrate the current content_data table to something like what is described in #1157, I would say we would have to:

  • create new table for the new version
  • read every entry from the old table
  • convert content-value from hex TEXT to BLOB
  • calculate distance from NodeId
  • write entry to the new database
  • delete old table
  • update version in the store_info table
    (in this particular case, we might be able to do everything via SQL but that might not always be the case).
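
The per-entry part of these steps can be sketched without a database. The code below is an illustration under stated assumptions, not the planned migration: plain `Vec`s stand in for the old and new tables, the row layouts (`LegacyRow`, `NewRow`) are hypothetical, content ids are assumed to be 32 bytes, and XOR is assumed as the distance metric. Updating the version record is left out.

```rust
/// Hypothetical row of the old table: the value is stored as hex TEXT.
pub struct LegacyRow {
    pub content_id: [u8; 32],
    pub content_value_hex: String,
}

/// Hypothetical row of the new table: raw bytes plus a precomputed distance.
#[derive(Debug, PartialEq)]
pub struct NewRow {
    pub content_id: [u8; 32],
    pub content_value: Vec<u8>,
    pub distance: [u8; 32],
}

/// Decodes a hex string (with or without a `0x` prefix) into bytes.
pub fn decode_hex(text: &str) -> Result<Vec<u8>, String> {
    let digits = text.strip_prefix("0x").unwrap_or(text);
    if digits.len() % 2 != 0 || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("invalid hex string: {}", text));
    }
    (0..digits.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&digits[i..i + 2], 16).map_err(|e| e.to_string()))
        .collect()
}

/// XOR distance between the node id and a content id (assumed metric).
pub fn xor_distance(node_id: &[u8; 32], content_id: &[u8; 32]) -> [u8; 32] {
    let mut distance = [0u8; 32];
    for (out, (a, b)) in distance.iter_mut().zip(node_id.iter().zip(content_id.iter())) {
        *out = a ^ b;
    }
    distance
}

/// Reads every old row, converts it, and returns the new table. The old table
/// is consumed, and on any error no new table is returned at all.
pub fn migrate_rows(old_table: Vec<LegacyRow>, node_id: &[u8; 32]) -> Result<Vec<NewRow>, String> {
    let mut new_table = Vec::with_capacity(old_table.len());
    for row in old_table {
        let content_value = decode_hex(&row.content_value_hex)?;
        let distance = xor_distance(node_id, &row.content_id);
        new_table.push(NewRow { content_id: row.content_id, content_value, distance });
    }
    Ok(new_table)
}
```

Because a single bad row aborts the whole conversion, this sketch never produces a half-filled new table.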

This process can take a long time, but I don't think that's an issue, as it would happen on start. The only problem is if the program is stopped mid-transition (but I think we can deal with that if we make sure the new table is empty when we start).

Also, as I said in the other comment, a version isn't tied to one type of data (regular content key/value data). Storing LC_Updates can/should be its own store version as well.
It also doesn't imply that it has to use SQLite (e.g. we could decide to go back to RocksDB, or use Redis, or even a simple file (i.e. MasterAccumulator)).

Member

@ogenev ogenev left a comment

:shipit:

match old_version {
Some(old_version) => {
// Migrate if version doesn't match
if S::version() != old_version {
Member

Do we want to migrate only from an older version to the new version? I.e.:

Suggested change
- if S::version() != old_version {
+ if S::version() > old_version {

Collaborator Author

The versions are not strictly ordered. In this context, old_version should be interpreted as "previous version".

If someone tries to migrate from a newer to an older version, the migrate_from function should fail.
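
A small sketch of what "not strictly ordered" could mean in code: the enum deliberately does not derive `Ord`, and each store lists the previous versions it knows how to migrate from, rejecting everything else. The variant names follow the earlier comment in this thread; `UnsupportedMigration` and `migrate_to_id_indexed_v1` are hypothetical names used only for this illustration.

```rust
// No `PartialOrd`/`Ord` derive: versions are identifiers, not points on a line.
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub enum StoreVersion {
    IdIndexedLegacy,
    IdIndexedV1,
    LightClientUpdatesV1,
}

/// Hypothetical error describing a transition nobody has implemented.
#[derive(Debug, Eq, PartialEq)]
pub struct UnsupportedMigration {
    pub from: StoreVersion,
    pub to: StoreVersion,
}

/// Sketch of `migrate_from` for a store whose own version is `IdIndexedV1`.
/// It is assumed to be called only when the previous version differs.
pub fn migrate_to_id_indexed_v1(previous: StoreVersion) -> Result<(), UnsupportedMigration> {
    match previous {
        // The one predecessor this store knows how to convert.
        StoreVersion::IdIndexedLegacy => {
            // Copying and converting the data would happen here.
            Ok(())
        }
        // Unrelated or newer versions are rejected rather than compared.
        other => Err(UnsupportedMigration { from: other, to: StoreVersion::IdIndexedV1 }),
    }
}
```

With this shape, adding support for a new predecessor means adding one `match` arm, and an unknown predecessor fails loudly instead of being silently treated as "older".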

Collaborator

@njgheorghita njgheorghita left a comment

Cool! Excited to see how this works out in state network-land

@morph-dev morph-dev merged commit 385264a into ethereum:master Feb 21, 2024
8 checks passed
@morph-dev morph-dev deleted the versioned_storage branch February 21, 2024 07:36
3 participants