Database hardening proposal #2918

dshulyak · 2021-11-05T07:42:37Z

This proposal focuses on eliminating hard-to-debug bugs that will occur once spacemesh software runs in non-managed environments and, beyond those bugs, on making the database code more robust. The main issues with the database code:

all writes are not synchronous

If the OS or hardware crashes, for example due to power loss, the node may enter a state that is not visibly corrupted but is invalid. One example is #2871: if an ATX wasn't persisted but was broadcast to another node, any future ATX produced by the crashed node is discarded by the network but considered valid by the crashed node.

in many places writes that must be atomic are not atomic

Examples of this are #2516 and #2547. Even though the software can be "designed" to handle non-atomic writes, this is usually a bad idea and leads to many unexpected bugs. For ATXs and blocks, an object will not be processed again if its main body was written but the write to the secondary index failed. For example, the node will think that a block is fully synced, but the block will never be added to the layer.

non-atomic state transitions between go-sm and svm

Because of this design decision, there will be certain problems that we will need to handle. One example is marking a block as contextually valid during a rerun before rewinding the state in svm. If we crash before rewinding the state, tortoise will never discover that block again and the state will never be reapplied.

We won't be able to solve such issues just by writing the correct database code, and they will have to be handled in a special way.

Schema and requirements

schema is simple, no complex queries
handle large binary blobs efficiently (an ATX is around 3kb, block size depends on the tortoise state and can vary between ~1-10kb)
database is mostly random reads, with low writes, so better to optimize for reads

we are also using databases in some other modules (tortoise, tortoise beacon), but they are not critical for correctness since we can always rebuild them from the main data.

mesh module

transactions and rewards are omitted, as i wasn't following what is getting moved to svm.

blocks

main blob storage, indexed by block id
layer index, indexed by layer_id || block_id
contextual validity index, indexed by block_id, separate database
input vector, indexed by layer id, separate database

layers

separate database

indexes for latest, processed, instate layers. using constant as an index
indexes for hash and aggregated hash, using layer_id in each bucket

activation module

activations

all in a single database, but writes are neither atomic nor synced

blob storage for headers and bodies
index from epoch_id || node_id to ATX
ATX timestamp, indexed by ATX id
constant index for first received ATX with largest publication layer

poet proofs

blob storage for poet proofs

identities

index from node key to node vrf key

Changes and implementation

For safety and correctness we need to:

make all related writes atomic

All writes that are executed as a part of a gossip handler should be atomic.
Writes that are made during and after tortoise execution should be atomic.
Not sure if syncer needs special handling, as it relies on the code that is used in gossip handlers.

make writes durable, preferably always

some writes, such as those made during ATX publishing, should always be durable, as an error in that domain will lead to an invalid state. some other writes may not require durability. but to simplify our life we can always go for durable writes, unless that turns out to be detrimental for performance.

sometimes there are implicit dependencies, such as when we receive a block we validate that ATX is already stored on disk. if ATX's are stored in a separate database we can't know that this data will be recovered unless we always fsync that data first.

get atomicity between go-sm and svm

unlike two previous examples, we can't rely on DB atomicity for correctness. therefore we should persist data on the side of go-sm only after svm finished the write on their end. and svm must be ready to receive the same data multiple times in the event of crashes.

implementation

preferably we will use the same database for the whole application to guarantee consistency between cross-module writes, such as blocks that rely on ATXs being persisted.

the alternative is to use a separate database but make sure that the written data is always fsynced.

leveldb (or any key-value blob storage)

atomic writes can be persisted using batches. leveldb provides an option to follow a write with fsync.
bucket per submodule (just 1-2 bytes of unique prefix)

pros:

less refactoring
faster writes for small values, due to the append-only nature of LSM trees

cons:

harder to work with, which was the case in mesh module, most of the indexes were bugged
more complex data model
slower reads
slower writes for large values

switch to sqlite

transactions for atomic writes. synchronicity mode is defined with pragma synchronous.
table per sub-module (blocks, layers, activations, poet, identity)

pros:

easier to use correctly, the programmer makes fewer choices
faster random reads, no merges such as in LSM trees
simpler data model, instead of writing code for custom indexes we will use SQL table and indexes for specific fields out of the box
faster writes for large values

cons:

more refactoring
slower writes

This would be my choice. spacemesh doesn't do many writes (thousands per minute is nothing for any database), but in some use cases it is very read-heavy. I would recommend doing a POC using SQLite and comparing performance, for example on the tortoise rerun.

moshababo · 2021-11-14T18:56:10Z

Can you plz provide a description of the currently implemented KV-based schema, and a simple proposal draft for a new RDBMS-compatible schema, which includes the query use-cases for indexes? It will make the described schema easier to understand.

dshulyak · 2021-11-15T03:01:19Z

Each bullet point is a separate key, so for example:

> layer index, indexed by layer_id || block_id

means that the key is a binary concatenation of layer_id and block_id. what else would you consider as a part of schema for KV store?

noamnelke · 2021-11-15T14:10:34Z

@dshulyak thanks for the writeup!

I generally agree with you that the current DB design makes the node less robust than it needs to be.

The reason different databases were used originally is to keep caches local, i.e. to have a separate cache for each type of data. While I can see the benefit, I think this motivation is misguided. It sacrifices critical safety features for performance gains. While getting "free" caches is nice, they are out of our control and most likely suboptimal. I believe that we should be in control of our caches - different caching strategies make more sense for different parts of the node.

So I agree about merging the databases. Not so sure about sqlite, though. I think that for our uses sqlite is overkill. This may be worth it if you want to take advantage of referential integrity guarantees, but I don't think we need that. If you think that you can quickly code up a POC and benchmark it, I don't mind looking at the results. However, performance is not the sole factor here - we should be careful to keep the developer experience as simple as possible, which I feel that will be easier with LevelDB. This is also something we can see in the POC.

My main point is that if we merge the databases, regardless of tech, we have to give some thought to caches. This may make this project less trivial - but I wholeheartedly support this effort!

dshulyak · 2021-11-15T14:35:15Z

> However, performance is not the sole factor here - we should be careful to keep the developer experience as simple as possible, which I feel that will be easier with LevelDB. This is also something we can see in the POC.

i am also concerned with developer experience and how to make it in general less prone to bugs. but as mentioned in pros/cons i have the opposite opinion.

if we take blocks data as an example: with leveldb the developer needs to craft 3 additional indexes manually (layer index, input vector, contextual validity), then remember to write all of them to one database with a batch, and not as 3 separate key writes. with sql the developer only needs to define 3 additional fields in the blocks table, declare an index for each field, and use a sql tx for atomicity. it seems that with sql it is very hard to mess it up.

> My main point is that if we merge the databases, regardless of tech, we have to give some thought to caches.

i remember this point, it was raised in #2547.
we have lru caches on top of leveldb for blocks and atx's (basically two most-read things). in general, i am not very concerned about caching, there are not so many places that are using db intensively. we definitely can profile every place that is executed during consensus before mainnet launch and add caching (or just optimize disk operations) with appropriate strategies to keep db code performant enough.

dshulyak · 2021-11-18T06:40:44Z

@moshababo adding sql table definitions, as discussed:

CREATE TABLE blocks ( 
	block_id CHAR(20) PRIMARY KEY,
	layer_id UNSIGNED MEDIUMINT,
	in_input_vector BOOL,
	contextually_valid BOOL,
	block    BLOB
) WITHOUT ROWID;

CREATE INDEX blocks_by_layer_id ON blocks(layer_id);
CREATE INDEX blocks_by_in_input_vector ON blocks(in_input_vector,layer_id) WHERE in_input_vector = 1;
CREATE INDEX blocks_by_contextual_validity ON blocks(contextually_valid,layer_id) WHERE contextually_valid = 1;

---

CREATE TABLE layers (
	layer_id UNSIGNED MEDIUMINT PRIMARY KEY,
	/* PROCESSED, SYNCED, STATE */
	label SMALLINT,
	hash CHAR(32),
	aggregated_hash CHAR(32)
) WITHOUT ROWID;

CREATE UNIQUE INDEX layer_id_by_label ON layers(label);

---

CREATE TABLE activations (
	atx_id CHAR(32) PRIMARY KEY,
	epoch_id UNSIGNED MEDIUMINT,
	node_id VARCHAR,
	timestamp UNSIGNED BIGINT, 
	is_top BOOL,
	header BLOB,
	body BLOB
) WITHOUT ROWID;

CREATE UNIQUE INDEX atx_by_epoch_node ON activations(epoch_id,node_id);
CREATE UNIQUE INDEX top_atx ON activations(is_top) WHERE is_top = 1;

---


CREATE TABLE poets (
	poet_id VARCHAR PRIMARY KEY,
	poet BLOB
) WITHOUT ROWID;

--- 

CREATE TABLE identities (
	node_key VARCHAR PRIMARY KEY,
	vrf_key VARCHAR
) WITHOUT ROWID;

I didn't check that the indexes will have the expected complexity, but i assume that they will. All existing queries should use them if implemented in the most straightforward way. Regarding leveldb, the schema remains as it was described in the initial message; the only difference is that each bucket will need a unique prefix (e.g. activations - a, blocks - b, and so on).

about implementation, preferably it should not be using the database/sql driver and github.com/mattn/go-sqlite3. it is not really optimal for sqlite, and doesn't exactly make sense. see https://crawshaw.io/blog/go-and-sqlite for discussion and https://turriate.com/articles/making-sqlite-faster-in-go. the library that i would use is https://github.com/crawshaw/sqlite .

moshababo · 2021-11-21T15:13:20Z

> I didn't check that the indexes will have an expected complexity, but i assume that they will.

Which complexity are you referring to?

dshulyak · 2021-11-21T15:33:57Z

> Which complexity are you referring to?

complexity of the query, i.e. how optimal the index is. an optimal index lets the sql engine do less work when performing queries.

moshababo · 2021-11-21T15:53:08Z

Is there any complexity involved besides properly designing them according to the existing set of queries? I thought it should be easy with this kind of schema. But it will also cause performance regression for inserts/updates/deletes, so we need to consider that as well, even though the read/write ratio seems to support having it.

dshulyak · 2021-11-22T05:08:42Z

sorry for the confusion. by saying "I didn't check that the indexes will have an expected complexity, but i assume that they will" i meant that i think i designed them properly, but i am not 100% sure because i didn't check the actual performance.

essentially yes, it boils down to designing them properly. if they aren't designed properly, the complexity of the query will be worse than optimal.

dshulyak · 2021-12-20T05:31:21Z

I had some time to think about poc for blocks. It is not integrated with spacemesh codebase, but in general the changes for integration are minimal.
https://github.com/dshulyak/smstate/

## Motivation

The existing approach lacks atomicity and causal durability (e.g. an atx may not be on disk when a ballot is saved), and durability can't be enforced in general without running every leveldb operation in sync mode. All these problems will result in subtle bugs that are hard to diagnose.

To meet those 3 requirements, we want to maintain all state in a single db. Moving state to a single leveldb would require us to enforce isolation ourselves (by maintaining separate namespaces and manually concatenating keys, like we do in some modules). Besides that, we would have to create every single index manually, while with sqlite we can just do `CREATE INDEX` and sqlite will do it for us, and probably do a better job.

Another significant benefit is that we can duplicate some state in a sql table to avoid loading the whole structure into memory. For instance, this is relevant for an atx, which is large (10kb); usually, after it has been validated, we only want to know the associated smesher and the weight of the atx. This would be problematic with leveldb and would require adding a custom index.

related: #2918

## Changes

- general plumbing for core database stuff (db, transaction, migrations) using https://github.com/crawshaw/sqlite, which is a relatively simple wrapper around C sqlite
- tables for layers, blocks and ballots
- reworked ZeroLayer: it is relevant only for hare_output, which can be in 3 states - nil, empty, non-empty. SetZeroLayer updates hare_output to the empty state, at which tortoise will vote against all blocks within hdist.
- updated to golang 1.16 for the `embed` module. note that there is a bug with go mod tidy in 1.16, so i had to manually fix go.sum and disable go mod tidy on ci - golang/go#44129

## Test Plan

existing and new uts

## Motivation

part of #2918

## Changes

- move layers hashes (blocks hare and aggregated hash) to layers table
- reward table
- sql function add_uint64 to add rewards in sqlite (sqlite doesn't support uint64 natively)

## Motivation

part of: #2918

## Changes

- table for transactions to replace transactions and unappliedTxs databases
- add API to filter transactions for multiple layers, instead of querying layer by layer
- simplify logic around pending transactions (transactions are now added as pending to the table, and either switch to applied or are marked as deleted)

dshulyak · 2022-01-24T07:06:35Z

missing parts:

svm state should not be touched. most likely it will use different storage that works better with state trie access patterns.

lrettig · 2022-01-26T23:00:29Z

@dshulyak can you share info on sqlite vs. leveldb benchmarks here? Also, how does go-ethereum (and maybe bitcoin, too, for that matter) solve the ACID issues you describe here with leveldb?

dshulyak · 2022-01-27T02:33:54Z

I used my benchmarks for tortoise to compare performance. It is not completely about ACID, something like isolation is less important for us. Problems that i described can be solved by using batched writes, single database, proper durability control and application level locking.

One concern though was that db code in some modules wasn't well written (it was either inefficient or just broken). With sql it is really hard to mess it up. Besides, sqlite itself is well known for its exceptional quality (see https://www.sqlite.org/testing.html) and is actively maintained. goleveldb may not be bad, but it is far from sqlite in terms of quality.

dshulyak · 2022-01-27T03:30:06Z

These benchmarks are from rerun, as it is probably the only db path that requires perf tuning.

benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                                  old ns/op       new ns/op       delta
BenchmarkRerun/Verifying/100-16            58922992        79576268        +35.05%
BenchmarkRerun/Verifying/1000-16           1004287667      1007788060      +0.35%
BenchmarkRerun/Verifying/10000-16          15027826254     11513656312     -23.38%
BenchmarkRerun/Full/100-16                 87056457        102140044       +17.33%
BenchmarkRerun/Full/100/Window/10-16       75063100        93536669        +24.61%
BenchmarkRerun/Full/1000/Window/100-16     1010487615      1002175876      -0.82%

## Motivation

Part of #2918

## Changes

- move beacons storage to SQLite
- format SQL code
- extract `*sql.Database` from `mesh`

## Test Plan

UT, ST

## DevOps Notes

- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)

## Motivation

Part of #2918

## Changes

- move atxs storage to SQLite

## Test Plan

unit and system tests

## DevOps Notes

- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)

## Motivation

Part of #2918

## Changes

- move PoETs storage to SQLite

## Test Plan

UT, ST

## DevOps Notes

- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)

## Motivation

Part of #2918

## Changes

- move proposals storage to SQLite

## Test Plan

Unit and system tests

## DevOps Notes

- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)

## Motivation

Part of #2918

## Changes

- move ref ballots storage to SQLite

## Test Plan

Unit and system tests

## DevOps Notes

- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)

Co-authored-by: kimmy lin

dshulyak · 2022-03-28T09:33:21Z

Looks like everything was integrated

moshababo mentioned this issue Nov 21, 2021

MeshDB.writeBlock is not safe in the event of the crash #2547

Closed

moshababo assigned dshulyak Dec 20, 2021

countvonzero mentioned this issue Dec 28, 2021

[Merged by Bors] - UCB: add standalone proposal db #3023

Closed

4 tasks

dshulyak mentioned this issue Jan 7, 2022

[Merged by Bors] - sql: store ballots, blocks and layers in sqlite #3047

Closed

dshulyak mentioned this issue Jan 17, 2022

[Merged by Bors] - sql: layers hashes, and rewards #3058

Closed

dshulyak mentioned this issue Jan 19, 2022

[Merged by Bors] - sql, mesh, api: refactor transactions #3064

Closed

lrettig mentioned this issue Jan 26, 2022

Database format spacemeshos/pm#111

Closed

moshababo assigned nkryuchkov Jan 27, 2022

nkryuchkov mentioned this issue Jan 31, 2022

[Merged by Bors] - sql: beacons #3106

Closed

4 tasks

nkryuchkov mentioned this issue Feb 6, 2022

[Merged by Bors] - sql: atxs #3114

Closed

4 tasks

nkryuchkov mentioned this issue Feb 14, 2022

[Merged by Bors] - sql: PoETs #3124

Closed

4 tasks

nkryuchkov mentioned this issue Feb 17, 2022

[Merged by Bors] - sql: proposals #3129

Closed

4 tasks

nkryuchkov mentioned this issue Feb 21, 2022

[Merged by Bors] - sql: ref ballots #3141

Closed

4 tasks

dshulyak closed this as completed Mar 28, 2022

huly-for-github bot unassigned nkryuchkov Aug 19, 2024
