Receipts, unlike transactions, are not refcounted. #3169

Closed
SkidanovAlex opened this issue Aug 14, 2020 · 1 comment · Fixed by #3215
SkidanovAlex commented Aug 14, 2020

The same receipt (and the same transaction) can be included in multiple chunks. Thus, when we garbage collect chunks, we must only delete a receipt or transaction once its last occurrence is collected. For txs we have a refcount in place, but for receipts we do not.

In clear_chunk_data:

    for receipt in chunk.receipts {
        // Deletes the receipt record unconditionally -- no refcount check.
        self.gc_col(ColReceiptIdToShardId, &receipt.receipt_id.into());
    }
    for transaction in chunk.transactions {
        // Decrements the transaction refcount and only deletes at zero.
        self.gc_col_transaction(transaction.get_hash())?;
    }

We need to implement the same refcounting for receipts.

UPD: When fixed, uncomment the debug assert in core/store/src/lib.rs (search for 3169 there)

Reproduced here: http://nayduck.eastus.cloudapp.azure.com:3000/#/test/12420
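For illustration, here is a minimal, self-contained sketch of the semantics we want for receipts, mirroring what transactions already get. The `RefcountedCol` type and its method names are hypothetical, not the actual Store API:

    use std::collections::HashMap;

    /// Hypothetical refcounted column: a record is only deleted once the
    /// last chunk that references it is garbage collected.
    struct RefcountedCol {
        data: HashMap<Vec<u8>, (Vec<u8>, u64)>, // key -> (value, refcount)
    }

    impl RefcountedCol {
        fn new() -> Self {
            Self { data: HashMap::new() }
        }

        /// Called once per chunk that contains the record.
        fn inc(&mut self, key: &[u8], value: &[u8]) {
            let entry = self
                .data
                .entry(key.to_vec())
                .or_insert_with(|| (value.to_vec(), 0));
            entry.1 += 1;
        }

        /// Called from gc; deletes only when the refcount drops to zero.
        fn dec(&mut self, key: &[u8]) {
            if let Some(entry) = self.data.get_mut(key) {
                entry.1 -= 1;
                if entry.1 == 0 {
                    self.data.remove(key);
                }
            }
        }
    }

    fn main() {
        let mut col = RefcountedCol::new();
        // The same receipt is included in two chunks...
        col.inc(b"receipt-1", b"shard-0");
        col.inc(b"receipt-1", b"shard-0");
        // ...so collecting the first chunk must not delete it.
        col.dec(b"receipt-1");
        assert!(col.data.contains_key(b"receipt-1".as_slice()));
        col.dec(b"receipt-1");
        assert!(!col.data.contains_key(b"receipt-1".as_slice()));
    }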

SkidanovAlex added a commit that referenced this issue Aug 14, 2020
Store validity checks are executed on each `get_status` and take up to 1.5 seconds, during which they block `Client`.
In stress.py specifically they quickly reach the full 1.5 seconds, so every `get_status` call the test makes blocks one of the nodes for 1.5s.

Removing the ongoing store validity checks from stress.py, and running them only once at the end.

Test plan:
----------
stress.py can now actually pass in nayduck:
http://nayduck.eastus.cloudapp.azure.com:3000/#/run/93

The failure is a new issue: #3169
SkidanovAlex added a commit that referenced this issue Aug 15, 2020
The nodes only learn about their peers' highest height via Block messages, so if a block message is lost and two nodes end up exactly one height apart, the one behind never learns it needs to catch up.
If such a 1-height difference splits the block producers into two roughly even halves, the system stalls.

Fixing it by re-broadcasting `head` if the network has made no progress for some reasonable time.

With this change `local_network` mode passes in nayduck.

Test plan:
---------
Before this change stress.py local_network fails consistently due to chain stalling.
After:
http://nayduck.eastus.cloudapp.azure.com:3000/#/run/104

Out of five failures the first one is new (and needs debugging), but the chain is not stalled.
The remaining four are #3169
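A minimal sketch of the re-broadcast idea (hypothetical names; the real client would hook this into its existing timers rather than a standalone type):

    use std::time::{Duration, Instant};

    /// Hypothetical stall detector: if the head height has not advanced
    /// for `timeout`, ask the caller to re-broadcast the current head.
    struct StallDetector {
        last_height: u64,
        last_progress: Instant,
        timeout: Duration,
    }

    impl StallDetector {
        fn new(timeout: Duration) -> Self {
            Self { last_height: 0, last_progress: Instant::now(), timeout }
        }

        /// Called periodically with the current head height.
        /// Returns true when the head should be re-broadcast.
        fn check(&mut self, head_height: u64) -> bool {
            if head_height > self.last_height {
                self.last_height = head_height;
                self.last_progress = Instant::now();
                return false;
            }
            self.last_progress.elapsed() >= self.timeout
        }
    }

    fn main() {
        let mut detector = StallDetector::new(Duration::from_millis(10));
        assert!(!detector.check(1)); // progress was made: no re-broadcast
        std::thread::sleep(Duration::from_millis(20));
        assert!(detector.check(1)); // stalled past the timeout: re-broadcast head
    }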
SkidanovAlex added a commit that referenced this issue Aug 15, 2020
* fix(pytest): use python proxy framework to isolate nodes instead of unix utilities

* fix: Properly handling stalls due to 1-height differences

Co-authored-by: Michael Birch <[email protected]>
SkidanovAlex added a commit that referenced this issue Aug 17, 2020
1. Removing the limit of 50 transactions per batch. It was a workaround for a bug that hung when a tx didn't exist, and is no longer needed;
2. Adding a new mode that drops a percentage of packets (fixes #3105);
3. Disabling the check for not deleting the same object within a transaction until #3169 is fixed. After (1) above it crashes stress.py 3 out of 4 times, preventing the test from getting to the (potential) real issues;
4. Increasing the epoch length to 25 blocks, so that the time it takes to send all the transactions and wait for the balances in `local_network` mode ((15+20) * 2 = 70 seconds, roughly 100 blocks) spans fewer than the five epochs after which transaction results are garbage collected (see the worked check below);
5. Enabling `local_network` in default nayduck runs. Also enabling a mode that does not shut down nodes or interfere with the network, in which more invariants are checked (e.g. the transaction-loss tolerance is lower)

Test plan:
---------
With (3) above the test becomes relatively stable (but still flaky). local_network and node_restart modes:
http://nayduck.eastus.cloudapp.azure.com:3000/#/run/122

Tests without any interference, and with packages_drop:
http://nayduck.eastus.cloudapp.azure.com:3000/#/run/128
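To make the arithmetic in (4) concrete, a quick worked check. The five-epoch GC window is taken from the message above; the block time is inferred from the "70 seconds ≈ 100 blocks" figure:

    fn main() {
        // ~70 seconds of sending txs and waiting for balances,
        // at roughly 0.7 s per block, is about 100 blocks.
        let blocks_during_test = 100u64;
        let epoch_length = 25u64; // blocks per epoch, after this change
        let gc_window = 5u64; // tx results are kept for five epochs
        let epochs_elapsed = blocks_during_test / epoch_length; // = 4
        assert!(epochs_elapsed < gc_window); // so tx results survive GC
    }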
SkidanovAlex (author) commented:

When fixed, uncomment the debug assert in core/store/src/lib.rs (search for 3169 there)

mikhailOK added a commit that referenced this issue Aug 21, 2020
- Move refcount logic for ColState from Trie to Store
- Change the ColState refcount from 4 bytes to 8 bytes
- Use the new logic for ColTransactions, deprecate ColTransactionRefCount
- Use it for ColReceiptIdToShardId

This requires a storage version upgrade.

Fixes #3169

Test plan
---------
nightly and db migration pass
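For intuition, one plausible way to encode such an 8-byte refcount is to append a little-endian count to the stored value and merge writes by summing counts (a negative delta decrements). This is only a sketch under that assumption, not a description of the actual Store encoding:

    use std::convert::TryInto;

    /// Encode a column value as: payload bytes || 8-byte LE refcount.
    fn encode(payload: &[u8], rc: i64) -> Vec<u8> {
        let mut out = payload.to_vec();
        out.extend_from_slice(&rc.to_le_bytes());
        out
    }

    /// Split an encoded value back into (payload, refcount).
    fn decode(encoded: &[u8]) -> (&[u8], i64) {
        let (payload, rc_bytes) = encoded.split_at(encoded.len() - 8);
        (payload, i64::from_le_bytes(rc_bytes.try_into().unwrap()))
    }

    /// Merge two encoded values by summing refcounts (payloads for the
    /// same key are identical). A non-positive total means "delete".
    fn merge(a: &[u8], b: &[u8]) -> Option<Vec<u8>> {
        let (payload, rc_a) = decode(a);
        let (_, rc_b) = decode(b);
        let rc = rc_a + rc_b;
        if rc <= 0 { None } else { Some(encode(payload, rc)) }
    }

    fn main() {
        // Two increments for the same key merge to a count of 2.
        let v = encode(b"shard-0", 1);
        let merged = merge(&v, &encode(b"shard-0", 1)).unwrap();
        assert_eq!(decode(&merged), (b"shard-0".as_slice(), 2));
        // Two decrements bring the count to zero and the value is dropped.
        let dec = encode(b"shard-0", -1);
        let after = merge(&merged, &dec).and_then(|m| merge(&m, &dec));
        assert!(after.is_none());
    }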