Receipts, unlike transactions, are not refcounted. #3169

Closed
SkidanovAlex opened this issue Aug 14, 2020 · 1 comment · Fixed by #3215
SkidanovAlex commented Aug 14, 2020

The same receipt (and the same transaction) can be included in multiple chunks. Thus, when we garbage collect chunks, we must only delete a receipt or transaction once its last occurrence is collected. For txs we have a refcount in place, but for receipts we do not.

In clear_chunk_data:

    for receipt in chunk.receipts {
        // Deletes the receipt record unconditionally -- no refcount check.
        self.gc_col(ColReceiptIdToShardId, &receipt.receipt_id.into());
    }
    for transaction in chunk.transactions {
        // Decrements the transaction refcount and only deletes at zero.
        self.gc_col_transaction(transaction.get_hash())?;
    }

We need to implement the same refcounting for receipts.

UPD: When fixed, uncomment the debug assert in core/store/src/lib.rs (search for 3169 there)

Reproduced here: http://nayduck.eastus.cloudapp.azure.com:3000/#/test/12420
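For illustration, here is a minimal, self-contained sketch of the semantics we want for receipts, mirroring what transactions already get. The `RefcountedCol` type and its method names are hypothetical, not the actual Store API:

    use std::collections::HashMap;

    /// Hypothetical refcounted column: a record is only deleted once the
    /// last chunk that references it is garbage collected.
    struct RefcountedCol {
        data: HashMap<Vec<u8>, (Vec<u8>, u64)>, // key -> (value, refcount)
    }

    impl RefcountedCol {
        fn new() -> Self {
            Self { data: HashMap::new() }
        }

        /// Called once per chunk that contains the record.
        fn inc(&mut self, key: &[u8], value: &[u8]) {
            let entry = self
                .data
                .entry(key.to_vec())
                .or_insert_with(|| (value.to_vec(), 0));
            entry.1 += 1;
        }

        /// Called from gc; deletes only when the refcount drops to zero.
        fn dec(&mut self, key: &[u8]) {
            if let Some(entry) = self.data.get_mut(key) {
                entry.1 -= 1;
                if entry.1 == 0 {
                    self.data.remove(key);
                }
            }
        }
    }

    fn main() {
        let mut col = RefcountedCol::new();
        // The same receipt is included in two chunks...
        col.inc(b"receipt-1", b"shard-0");
        col.inc(b"receipt-1", b"shard-0");
        // ...so collecting the first chunk must not delete it.
        col.dec(b"receipt-1");
        assert!(col.data.contains_key(b"receipt-1".as_slice()));
        col.dec(b"receipt-1");
        assert!(!col.data.contains_key(b"receipt-1".as_slice()));
    }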

SkidanovAlex added a commit that referenced this issue Aug 14, 2020
Store validity checks are executed on each `get_status` and take up to 1.5 seconds, during which they block `Client`.
In stress.py specifically they quickly reach the full 1.5 seconds, so every `get_status` call the test makes blocks one of the nodes for 1.5s.

Removing the ongoing store validity checks from stress.py, and running them only once at the end.

Test plan:
----------
stress.py can now actually pass in nayduck:
http://nayduck.eastus.cloudapp.azure.com:3000/#/run/93

The failure is a new issue: #3169
SkidanovAlex added a commit that referenced this issue Aug 15, 2020
The nodes only learn about their peers' highest height via Block messages, so if a block message is lost and two nodes end up exactly one height apart, the one behind never learns it needs to catch up.
If such a 1-height difference splits the block producers into two roughly even halves, the system stalls.

Fixing it by re-broadcasting `head` if the network has made no progress for some reasonable time.

With this change `local_network` mode passes in nayduck.

Test plan:
---------
Before this change stress.py local_network fails consistently due to chain stalling.
After:
http://nayduck.eastus.cloudapp.azure.com:3000/#/run/104

Out of five failures the first one is new (and needs debugging), but the chain is not stalled.
The remaining four are #3169
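A minimal sketch of the re-broadcast idea (hypothetical names; the real client would hook this into its existing timers rather than a standalone type):

    use std::time::{Duration, Instant};

    /// Hypothetical stall detector: if the head height has not advanced
    /// for `timeout`, ask the caller to re-broadcast the current head.
    struct StallDetector {
        last_height: u64,
        last_progress: Instant,
        timeout: Duration,
    }

    impl StallDetector {
        fn new(timeout: Duration) -> Self {
            Self { last_height: 0, last_progress: Instant::now(), timeout }
        }

        /// Called periodically with the current head height.
        /// Returns true when the head should be re-broadcast.
        fn check(&mut self, head_height: u64) -> bool {
            if head_height > self.last_height {
                self.last_height = head_height;
                self.last_progress = Instant::now();
                return false;
            }
            self.last_progress.elapsed() >= self.timeout
        }
    }

    fn main() {
        let mut detector = StallDetector::new(Duration::from_millis(10));
        assert!(!detector.check(1)); // progress was made: no re-broadcast
        std::thread::sleep(Duration::from_millis(20));
        assert!(detector.check(1)); // stalled past the timeout: re-broadcast head
    }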
SkidanovAlex added a commit that referenced this issue Aug 15, 2020
* fix(pytest): use python proxy framework to isolate nodes instead of unix utilities

* fix: Properly handling stalls due to 1-height differences

Co-authored-by: Michael Birch <[email protected]>
SkidanovAlex added a commit that referenced this issue Aug 17, 2020
1. Removing the limit of 50 transactions per batch. It was a workaround for a bug that hung when a tx didn't exist, and is no longer needed;
2. Adding a new mode that drops a percentage of packets (fixes #3105);
3. Disabling the check for not deleting the same object within a transaction until #3169 is fixed. After (1) above it crashes stress.py 3 out of 4 times, preventing the test from getting to the (potential) real issues;
4. Increasing the epoch length to 25 blocks, so that the time it takes to send all the transactions and wait for the balances in `local_network` mode ((15+20) * 2 = 70 seconds, roughly 100 blocks) spans fewer than the five epochs after which transaction results are garbage collected (see the worked check below);
5. Enabling `local_network` in default nayduck runs. Also enabling a mode that does not shut down nodes or interfere with the network, in which more invariants are checked (e.g. the transaction-loss tolerance is lower)

Test plan:
---------
With (3) above the test becomes relatively stable (but still flaky). local_network and node_restart modes:
http://nayduck.eastus.cloudapp.azure.com:3000/#/run/122

Tests without any interference, and with packages_drop:
http://nayduck.eastus.cloudapp.azure.com:3000/#/run/128
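To make the arithmetic in (4) concrete, a quick worked check. The five-epoch GC window is taken from the message above; the block time is inferred from the "70 seconds ≈ 100 blocks" figure:

    fn main() {
        // ~70 seconds of sending txs and waiting for balances,
        // at roughly 0.7 s per block, is about 100 blocks.
        let blocks_during_test = 100u64;
        let epoch_length = 25u64; // blocks per epoch, after this change
        let gc_window = 5u64; // tx results are kept for five epochs
        let epochs_elapsed = blocks_during_test / epoch_length; // = 4
        assert!(epochs_elapsed < gc_window); // so tx results survive GC
    }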
SkidanovAlex (author) commented:

When fixed, uncomment the debug assert in core/store/src/lib.rs (search for 3169 there)

mikhailOK added a commit that referenced this issue Aug 21, 2020
- Move refcount logic for ColState from Trie to Store
- Change the ColState refcount from 4 bytes to 8 bytes
- Use the new logic for ColTransactions, deprecate ColTransactionRefCount
- Use it for ColReceiptIdToShardId

This requires a storage version upgrade.

Fixes #3169

Test plan
---------
nightly and db migration pass
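For intuition, one plausible way to encode such an 8-byte refcount is to append a little-endian count to the stored value and merge writes by summing counts (a negative delta decrements). This is only a sketch under that assumption, not a description of the actual Store encoding:

    use std::convert::TryInto;

    /// Encode a column value as: payload bytes || 8-byte LE refcount.
    fn encode(payload: &[u8], rc: i64) -> Vec<u8> {
        let mut out = payload.to_vec();
        out.extend_from_slice(&rc.to_le_bytes());
        out
    }

    /// Split an encoded value back into (payload, refcount).
    fn decode(encoded: &[u8]) -> (&[u8], i64) {
        let (payload, rc_bytes) = encoded.split_at(encoded.len() - 8);
        (payload, i64::from_le_bytes(rc_bytes.try_into().unwrap()))
    }

    /// Merge two encoded values by summing refcounts (payloads for the
    /// same key are identical). A non-positive total means "delete".
    fn merge(a: &[u8], b: &[u8]) -> Option<Vec<u8>> {
        let (payload, rc_a) = decode(a);
        let (_, rc_b) = decode(b);
        let rc = rc_a + rc_b;
        if rc <= 0 { None } else { Some(encode(payload, rc)) }
    }

    fn main() {
        // Two increments for the same key merge to a count of 2.
        let v = encode(b"shard-0", 1);
        let merged = merge(&v, &encode(b"shard-0", 1)).unwrap();
        assert_eq!(decode(&merged), (b"shard-0".as_slice(), 2));
        // Two decrements bring the count to zero and the value is dropped.
        let dec = encode(b"shard-0", -1);
        let after = merge(&merged, &dec).and_then(|m| merge(&m, &dec));
        assert!(after.is_none());
    }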