-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Database hardening proposal #2918
Comments
Can you plz provide a description of the currently implemented KV-based schema, and a simple proposal draft for a new RDBMS-compatible schema, which includes the query use-cases for indexes? It will make the described schema easier to understand. |
Each bullet point is a separate key, so for example:
means that the key is a binary concatenation of layer_id and block_id. what else would you consider as a part of schema for KV store? |
@dshulyak thanks for the writeup! I generally agree with you that the current DB design makes the node less robust than it needs to be. The reason different databases were used originally is to keep caches local, i.e. to have a separate cache for each type of data. While I can see the benefit, I think this motivation is misguided. It sacrifices critical safety features for performance gains. While getting "free" caches is nice, they are out of our control and most likely suboptimal. I believe that we should be in control of our caches - different caching strategies make more sense for different parts of the node. So I agree about merging the databases. Not so sure about sqlite, though. I think that for our uses sqlite is overkill. This may be worth it if you want to take advantage of referential integrity guarantees, but I don't think we need that. If you think that you can quickly code up a POC and benchmark it, I don't mind looking at the results. However, performance is not the sole factor here - we should be careful to keep the developer experience as simple as possible, which I feel that will be easier with LevelDB. This is also something we can see in the POC. My main point is that if we merge the databases, regardless of tech, we have to give some thought to caches. This may make this project less trivial - but I wholeheartedly support this effort! |
i am also concerned with developer experience and how to make it in general less prone to bugs. but as mentioned in pros/cons i have the opposite opinion. if we take blocks data as an example, with leveldb - developer needs to craft 3 additional indexes manually (layer index, input vector, contextual validity). then remember to write all of them to one database with a batch, and not 3 separate key writes. with sql - developer will need to define only 3 additional fields in the blocks table and declare an index for each field, and use sql tx for atomicity. it seems that with sql it is very hard to mess it up.
i remember this point, it was raised in #2547. |
@moshababo adding sql table definitions, as discussed: CREATE TABLE blocks (
block_id CHAR(20) PRIMARY KEY,
layer_id UNSIGNED MEDIUMINT,
in_input_vector BOOL,
contextually_valid BOOL,
block BLOB
) WITHOUT ROWID;
CREATE INDEX blocks_by_layer_id ON blocks(layer_id);
CREATE INDEX blocks_by_in_input_vector ON blocks(in_input_vector,layer_id) WHERE in_input_vector = 1;
CREATE INDEX blocks_by_contextual_validity ON blocks(contextually_valid,layer_id) WHERE contextually_valid = 1;
---
CREATE TABLE layers (
layer_id UNSIGNED MEDIUMINT PRIMARY KEY,
/* PROCESSED, SYNCED, STATE */
label SMALLINT,
hash CHAR(32),
aggregated_hash CHAR(32)
) WITHOUT ROWID;
CREATE UNIQUE INDEX layer_id_by_label ON layers(label);
---
CREATE TABLE activations (
atx_id CHAR(32) PRIMARY KEY,
epoch_id UNSIGNED MEDIUMINT,
node_id VARCHAR,
timestamp UNSIGNED BIGINT,
is_top BOOL,
header BLOB,
body BLOB
) WITHOUT ROWID;
CREATE UNIQUE INDEX atx_by_epoch_node ON activations(epoch_id,node_id);
CREATE UNIQUE INDEX top_atx ON activations(is_top) WHERE is_top = 1;
---
CREATE TABLE poets (
poet_id VARCHAR PRIMARY KEY,
poet BLOB
) WITHOUT ROWID;
---
CREATE TABLE identieis (
node_key VARCHAR PRIMARY KEY,
vrf_key VARCHAR
) WITHOUT ROWID; I didn't check that the indexes will have an expected complexity, but i assume that they will. All existing queries should use them if implemented in the most straightforward way. Regarding leveldb, the schema remains as it was described in initial message, the only difference is that bucket will need to have unique prefix (e.g. activations - about implementation, preferably it should not be using database/sql driver and github.com/mattn/go-sqlite3. it is not really optimal for sqlite, and doesn't exactly makes sense. see https://crawshaw.io/blog/go-and-sqlite for discussion and https://turriate.com/articles/making-sqlite-faster-in-go. library that i would use is https://github.com/crawshaw/sqlite . |
Which complexity are you referring to? |
complexity of the query, this is about how optimal is the index. optimal index will allow doing less work for sql engine when performing queries. |
Is there any complexity involved besides properly designing them according to the existing set of queries? I thought it should be easy with this kind of schema. But it will also cause performance regression for inserts/updates/deletes, so we need to consider that as well, even though the read/write ratio seems to support having it. |
sorry for the confusion, by saying that essentially yes it boils down to designing them properly, if they won't be designed properly the complexity of the query will be different from optimal. |
I had some time to think about poc for blocks. It is not integrated with spacemesh codebase, but in general the changes for integration are minimal. |
## Motivation The existing approach lacks atomicity, causal durability (e.g. an atx may not be on disk when a ballot is saved), and the durability can't be enforced in general without running every leveldb operation in sync mode. All this problems will result in subtle bugs that are hard to diagnose. For those 3 requirements, we want to maintain all state in a single db. Moving state to a single leveldb will require us to enforce isolation ourselves (by maintaining separate namespace and manually concatenating keys like we do in some modules), beside that we have to create every single index manually while with sqlite we can just do `CREATE INDEX` and sqlite will do it for us and probably do a better job. Another significant benefit is that we can duplicate some state in sql table to avoid loading the whole structure into memory. For instance it will be relevant for atx, which is a large (10kb) and usually, after it was validated, we want to know only the associated smesher and weight of the atx. This would be problematic with leveldb, and would require adding custom index. related: #2918 ## Changes - general plumbing for core database stuff (db, transaction, migrations) using https://github.com/crawshaw/sqlite that is a relatively simple wrapper around C sqlite - tables and for layers, blocks and ballots - reworked ZeroLayer, it is relevant only for hare_output, which can be in 3 states - nil, empty, non-empty. SetZeroLayer update hare_output to empty state, at which tortoise will vote against all blocks within hdist. - updated to golang 1.16 for `embed` module. note that there is a bug with go mode tidy in 1.16 so i had to manually fix go.sum and disable go mod tidy on ci - golang/go#44129 ## Test Plan existing and new uts
## Motivation The existing approach lacks atomicity, causal durability (e.g. an atx may not be on disk when a ballot is saved), and the durability can't be enforced in general without running every leveldb operation in sync mode. All this problems will result in subtle bugs that are hard to diagnose. For those 3 requirements, we want to maintain all state in a single db. Moving state to a single leveldb will require us to enforce isolation ourselves (by maintaining separate namespace and manually concatenating keys like we do in some modules), beside that we have to create every single index manually while with sqlite we can just do `CREATE INDEX` and sqlite will do it for us and probably do a better job. Another significant benefit is that we can duplicate some state in sql table to avoid loading the whole structure into memory. For instance it will be relevant for atx, which is a large (10kb) and usually, after it was validated, we want to know only the associated smesher and weight of the atx. This would be problematic with leveldb, and would require adding custom index. related: #2918 ## Changes - general plumbing for core database stuff (db, transaction, migrations) using https://github.com/crawshaw/sqlite that is a relatively simple wrapper around C sqlite - tables and for layers, blocks and ballots - reworked ZeroLayer, it is relevant only for hare_output, which can be in 3 states - nil, empty, non-empty. SetZeroLayer update hare_output to empty state, at which tortoise will vote against all blocks within hdist. - updated to golang 1.16 for `embed` module. note that there is a bug with go mode tidy in 1.16 so i had to manually fix go.sum and disable go mod tidy on ci - golang/go#44129 ## Test Plan existing and new uts
## Motivation part of #2918 ## Changes - move layers hashes (blocks hare and aggregated hash) to layers table - reward table - sql function add_uint64 to add rewards in sqlite (sqlite doesn't support uint64 natively)
## Motivation part of #2918 ## Changes - move layers hashes (blocks hare and aggregated hash) to layers table - reward table - sql function add_uint64 to add rewards in sqlite (sqlite doesn't support uint64 natively)
## Motivation part of #2918 ## Changes - move layers hashes (blocks hare and aggregated hash) to layers table - reward table - sql function add_uint64 to add rewards in sqlite (sqlite doesn't support uint64 natively)
## Motivation part of: #2918 ## Changes - table for transactions to replace transactions and unappliedTxs databases - add API to filter transactions for multiple layers. instead of querying layer by layer - simplify logic around pending transactions (transactions now added as pending to the table, and either switch to applied or marked as deleted)
missing parts:
svm state should not be touched. most likely it will use different storage that works better with state trie access patterns. |
@dshulyak can you share info on sqlite vs. leveldb benchmarks here? Also, how does go-ethereum (and maybe bitcoin, too, for that matter) solve the ACID issues you describe here with leveldb? |
I used my benchmarks for tortoise to compare performance. It is not completely about ACID, something like isolation is less important for us. Problems that i described can be solved by using batched writes, single database, proper durability control and application level locking. One concern though was that db code in some modules wasn't well written (it was either inefficient or just broken). With sql it is really hard to mess it up. Besides sqlite itself is well known for its exceptional quality (see https://www.sqlite.org/testing.html) and is actively maintained. goleveldb might be not bad but it is far from sqlite in terms of quality. |
Those are benchmarks are from rerun, as it is probably the only db path that requires perf tuning.
|
## Motivation <!-- Please mention the issue fixed by this PR or detailed motivation --> Part of #2918 <!-- `Closes #XXXX, closes #XXXX, ...` links mentioned issues to this PR and automatically closes them when this it's merged --> ## Changes <!-- Please describe in detail the changes made --> - move beacons storage to SQLite - format SQL code - extract `*sql.Database` from `mesh` ## Test Plan <!-- Please specify how these changes were tested (e.g. unit tests, manual testing, etc.) --> UT, ST ## DevOps Notes <!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases --> - [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources) - [x] This PR does not affect public APIs - [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.) - [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
## Motivation <!-- Please mention the issue fixed by this PR or detailed motivation --> Part of #2918 <!-- `Closes #XXXX, closes #XXXX, ...` links mentioned issues to this PR and automatically closes them when this it's merged --> ## Changes <!-- Please describe in detail the changes made --> - move beacons storage to SQLite - format SQL code - extract `*sql.Database` from `mesh` ## Test Plan <!-- Please specify how these changes were tested (e.g. unit tests, manual testing, etc.) --> UT, ST ## DevOps Notes <!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases --> - [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources) - [x] This PR does not affect public APIs - [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.) - [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
## Motivation <!-- Please mention the issue fixed by this PR or detailed motivation --> Part of #2918 <!-- `Closes #XXXX, closes #XXXX, ...` links mentioned issues to this PR and automatically closes them when this it's merged --> ## Changes <!-- Please describe in detail the changes made --> - move beacons storage to SQLite - format SQL code - extract `*sql.Database` from `mesh` ## Test Plan <!-- Please specify how these changes were tested (e.g. unit tests, manual testing, etc.) --> UT, ST ## DevOps Notes <!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases --> - [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources) - [x] This PR does not affect public APIs - [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.) - [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
## Motivation <!-- Please mention the issue fixed by this PR or detailed motivation --> Part of #2918 <!-- `Closes #XXXX, closes #XXXX, ...` links mentioned issues to this PR and automatically closes them when this it's merged --> ## Changes <!-- Please describe in detail the changes made --> - move atxs storage to SQLite ## Test Plan <!-- Please specify how these changes were tested (e.g. unit tests, manual testing, etc.) --> unit and system tests ## DevOps Notes <!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases --> - [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources) - [x] This PR does not affect public APIs - [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.) - [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
## Motivation <!-- Please mention the issue fixed by this PR or detailed motivation --> Part of #2918 <!-- `Closes #XXXX, closes #XXXX, ...` links mentioned issues to this PR and automatically closes them when this it's merged --> ## Changes <!-- Please describe in detail the changes made --> - move atxs storage to SQLite ## Test Plan <!-- Please specify how these changes were tested (e.g. unit tests, manual testing, etc.) --> unit and system tests ## DevOps Notes <!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases --> - [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources) - [x] This PR does not affect public APIs - [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.) - [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
## Motivation <!-- Please mention the issue fixed by this PR or detailed motivation --> Part of #2918 <!-- `Closes #XXXX, closes #XXXX, ...` links mentioned issues to this PR and automatically closes them when this it's merged --> ## Changes <!-- Please describe in detail the changes made --> - move PoETs storage to SQLite ## Test Plan <!-- Please specify how these changes were tested (e.g. unit tests, manual testing, etc.) --> UT, ST ## DevOps Notes <!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases --> - [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources) - [x] This PR does not affect public APIs - [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.) - [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
## Motivation <!-- Please mention the issue fixed by this PR or detailed motivation --> Part of #2918 <!-- `Closes #XXXX, closes #XXXX, ...` links mentioned issues to this PR and automatically closes them when this it's merged --> ## Changes <!-- Please describe in detail the changes made --> - move proposals storage to SQLite ## Test Plan <!-- Please specify how these changes were tested (e.g. unit tests, manual testing, etc.) --> Unit and system tests ## DevOps Notes <!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases --> - [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources) - [x] This PR does not affect public APIs - [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.) - [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
## Motivation <!-- Please mention the issue fixed by this PR or detailed motivation --> Part of #2918 <!-- `Closes #XXXX, closes #XXXX, ...` links mentioned issues to this PR and automatically closes them when this it's merged --> ## Changes <!-- Please describe in detail the changes made --> - move ref ballots storage to SQLite ## Test Plan <!-- Please specify how these changes were tested (e.g. unit tests, manual testing, etc.) --> Unit and system tests ## DevOps Notes <!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases --> - [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources) - [x] This PR does not affect public APIs - [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.) - [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on) Co-authored-by: kimmy lin <[email protected]>
Looks like everything was integrated |
The proposal focuses on eliminating hard to debug bugs that will occur once spacemesh software will be running in non-managed environments, and besides bugs make database code more robust. The main issues with database code:
If OS or hardware crashes, due to the power loss, for example, the node may enter into a state that is not visibly corrupted but invalid. One example could be #2871, in case if ATX wasn't persisted but was broadcasted to another node - any future ATX that is produced by the crashed node is discarded by the network, but considered valid by the crashed node.
Examples of this are #2516 and #2547. Even though the software can be "designed" to handle non-atomic writes this is usually a bad idea and will lead to many unexpected bugs. In the case of ATXs and blocks none of them will be processed again if the main body of the object was written but the write for the secondary index failed. For example, the node will think that block is fully synced but we will not add it to the layer.
Because of this design decision, there will be certain problems that we will need to handle. One example: marking a block during rerun contextually valid before rewinding state in svm. If we crash before rewinding the state we will never discover that block again in tortoise and the state will never again be reapplied.
We won't be able to solve such issues just by writing the correct database code, and they will have to be handled in a special way.
Schema and requirements
we are also using databases in some other modules (tortoise, tortoise beacon), but they are not critical for correctness since we can always rebuild them from the main data.
mesh module
transactions and rewards are omitted, as i wasn't following what is getting moved to svm.
blocks
layers
separate database
activation module
activations
all in a single database, but no atomic and synced writes
poet proofs
identities
Changes and implementation
For safety and correctness we need to:
All writes that are executed as a part of a gossip handler should be atomic.
Writes that are made during and after tortoise execution should be atomic.
Not sure if syncer needs special handling, as it relies on the code that is used in gossip handlers.
some writes, such as that are made during ATX publishing, should be always durable as the error in that domain will lead to an invalid state. some other writes may not require durability. but to simplify our life we can always go for durable if it is detrimental for performance.
sometimes there are implicit dependencies, such as when we receive a block we validate that ATX is already stored on disk. if ATX's are stored in a separate database we can't know that this data will be recovered unless we always fsync that data first.
unlike two previous examples, we can't rely on DB atomicity for correctness. therefore we should persist data on the side of go-sm only after svm finished the write on their end. and svm must be ready to receive the same data multiple times in the event of crashes.
implementation
preferably we will use the same database for the whole application to guarantee consistency between cross-module writes. such as with blocks that rely on ATX's to be persisted.
the alternative is to use a separate database but make sure that the written data is always fsynced.
leveldb (or any key-value blob storage)
pros:
cons:
switch to sqlite
pros:
cons:
This would be my choice, spacemesh doesn't do many writes (thousands per minute is nothing for any database). But in some use cases, it is very read-heavy. I would recommend doing a POC using SQLite and comparing the performance of the tortoise rerun for example.
The text was updated successfully, but these errors were encountered: