Receipts, unlike transactions, are not refcounted. #3169
SkidanovAlex added a commit that referenced this issue on Aug 14, 2020
Store validity checks are executed on each `get_status` and take up to 1.5 seconds, during which they block `Client`. In stress.py specifically they quickly get to actually taking 1.5 seconds, so each time the test calls `get_status`, one of the nodes gets blocked for 1.5s. Removing the ongoing store validity checks from stress.py and introducing a single check at the end.

Test plan:
----------
stress.py can now actually pass in nayduck: http://nayduck.eastus.cloudapp.azure.com:3000/#/run/93
The failure is a new issue: #3169
SkidanovAlex added a commit that referenced this issue on Aug 15, 2020
The nodes only learn about their peers' highest height via Block messages, so if a Block message is lost and two nodes are exactly one height away from each other, the one behind will never learn that it needs to catch up. If such a 1-height difference is split across two roughly even halves of the block producers, the system will stall. Fixing it by re-broadcasting `head` if the network has made no progress for some reasonable time. With this change the `local_network` mode passes in nayduck.

Test plan:
----------
Before this change stress.py local_network fails consistently due to the chain stalling. After: http://nayduck.eastus.cloudapp.azure.com:3000/#/run/104
Out of five failures the first one is new (and needs debugging), but the chain is not stalled. The remaining four are #3169
SkidanovAlex added a commit that referenced this issue on Aug 15, 2020
* fix(pytest): use python proxy framework to isolate nodes instead of unix utilities
* fix: Properly handle stalls due to 1-height differences

The nodes only learn about their peers' highest height via Block messages, so if a Block message is lost and two nodes are exactly one height away from each other, the one behind will never learn that it needs to catch up. If such a 1-height difference is split across two roughly even halves of the block producers, the system will stall. Fixing it by re-broadcasting `head` if the network has made no progress for some reasonable time. With this change the `local_network` mode passes in nayduck.

Test plan:
---------
Before this change stress.py local_network fails consistently due to the chain stalling. After: http://nayduck.eastus.cloudapp.azure.com:3000/#/run/104
Out of five failures the first one is new (and needs debugging), but the chain is not stalled. The remaining four are #3169

Co-authored-by: Michael Birch <[email protected]>
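The fix described in the commits above is essentially a stall timer. A minimal sketch of the idea, with illustrative names (`HeadRebroadcast`, `broadcast_head`) and an assumed 10-second timeout rather than nearcore's actual types or configuration:

```rust
use std::time::{Duration, Instant};

// Assumed value for illustration; not nearcore's actual configuration.
const STALL_TIMEOUT: Duration = Duration::from_secs(10);

struct HeadRebroadcast {
    last_height: u64,
    last_progress: Instant,
}

impl HeadRebroadcast {
    fn new(current_height: u64) -> Self {
        Self { last_height: current_height, last_progress: Instant::now() }
    }

    /// Called periodically, e.g. on the client's timer tick.
    fn on_tick(&mut self, current_height: u64, broadcast_head: impl FnOnce()) {
        if current_height > self.last_height {
            // The chain made progress; reset the stall timer.
            self.last_height = current_height;
            self.last_progress = Instant::now();
        } else if self.last_progress.elapsed() > STALL_TIMEOUT {
            // No progress for a while: re-announce our head so a peer that is
            // exactly one height behind (and missed the Block message) learns
            // it needs to catch up.
            broadcast_head();
            self.last_progress = Instant::now();
        }
    }
}
```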
SkidanovAlex added a commit that referenced this issue on Aug 17, 2020
1. Removing the limit of 50 transactions per batch. It was needed when we had a bug that hung if the tx didn't exist, and is no longer needed;
2. Adding a new mode that drops a percentage of packets (fixes #3105);
3. Disabling the check for not deleting the same object within a transaction until #3169 is fixed. After (1) above it crashes stress.py 3 out of 4 times, preventing it from getting to the (potential) real issues;
4. Increasing the epoch to 25 blocks, so that in the time it takes to send all the transactions and wait for the balances in the `local_network` mode ((15+20) * 2 = 70 seconds, which is approximately 100 blocks) five epochs do not pass, and the transaction results are not garbage collected;
5. Enabling `local_network` in default nayduck runs. Also enabling a mode without shutting down nodes or interfering with the network, in which more invariants are checked (e.g. the transaction loss tolerance is lower).

Test plan:
---------
With (3) above the test becomes relatively stable (but still flaky). local_network and node_restart modes: http://nayduck.eastus.cloudapp.azure.com:3000/#/run/122
Tests without any interference, and with packages_drop: http://nayduck.eastus.cloudapp.azure.com:3000/#/run/128
When fixed, uncomment the debug assert in core/store/src/lib.rs (search for 3169 there).
mikhailOK added a commit that referenced this issue on Aug 19, 2020
- Move refcount logic for ColState from Trie to Store
- Change the refcount for ColState from 4 bytes to 8 bytes
- Use the new logic for ColTransactions, deprecate ColTransactionRefCount
- Use it for ColReceiptIdToShardId

This requires a storage version upgrade. Fixes #3169

Test plan
---------
TODO
mikhailOK added a commit that referenced this issue on Aug 21, 2020
- Move refcount logic for ColState from Trie to Store
- Change the refcount for ColState from 4 bytes to 8 bytes
- Use the new logic for ColTransactions, deprecate ColTransactionRefCount
- Use it for ColReceiptIdToShardId

This requires a storage version upgrade. Fixes #3169

Test plan
---------
nightly and db migration pass
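The migration described in these commits amounts to storing a reference count next to each value in the refcounted columns. A minimal sketch of one way such an encoding can work, assuming an 8-byte little-endian count appended to the payload and a merge that sums counts; the actual layout and merge rules in nearcore may differ:

```rust
use std::convert::TryInto;

/// Encode a value together with an 8-byte refcount appended to it
/// (assumed layout for illustration).
fn encode_refcounted(value: &[u8], rc: i64) -> Vec<u8> {
    let mut out = value.to_vec();
    out.extend_from_slice(&rc.to_le_bytes());
    out
}

/// Split an encoded entry back into (value, refcount).
fn split_refcounted(data: &[u8]) -> (&[u8], i64) {
    let (value, rc_bytes) = data.split_at(data.len() - 8);
    (value, i64::from_le_bytes(rc_bytes.try_into().unwrap()))
}

/// Merge two entries for the same key by summing refcounts; `None` means the
/// combined count reached zero and the key can be physically deleted.
fn merge_refcounted(existing: &[u8], update: &[u8]) -> Option<Vec<u8>> {
    let (value, rc_old) = split_refcounted(existing);
    let (_, rc_delta) = split_refcounted(update);
    let rc = rc_old + rc_delta;
    if rc > 0 {
        Some(encode_refcounted(value, rc))
    } else {
        None
    }
}
```

Under such a scheme an insert is written as (value, +1) and a delete as (value, -1), so the same transaction or receipt can be referenced from several chunks without earlier references being clobbered.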
The same receipt (and the same transaction) can be in multiple chunks. Thus, when we garbage collect chunks, we need to make sure it is the last occurrence of a particular receipt or transaction. For txs we have a refcount in place, but for receipts we do not.
In `clear_chunk_data`: we need to implement the same refcounting for receipts.
UPD: When fixed, uncomment the debug assert in core/store/src/lib.rs (search for 3169 there)
Reproduced here: http://nayduck.eastus.cloudapp.azure.com:3000/#/test/12420
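For illustration, a minimal sketch of the refcounting this issue asks for, using made-up names (`ReceiptStore`, `CryptoHash`) and in-memory maps instead of nearcore's actual store columns: every chunk that stores the receipt bumps the count, and chunk GC only deletes the receipt once the count drops to zero.

```rust
use std::collections::HashMap;

// Illustrative type only; nearcore keeps these in store columns, not in memory.
type CryptoHash = [u8; 32];

#[derive(Default)]
struct ReceiptStore {
    refcounts: HashMap<CryptoHash, u64>,
    receipts: HashMap<CryptoHash, Vec<u8>>,
}

impl ReceiptStore {
    /// Called for every chunk that contains the receipt: the same receipt may
    /// appear in multiple chunks, so each occurrence bumps the count.
    fn store_receipt(&mut self, id: CryptoHash, data: Vec<u8>) {
        *self.refcounts.entry(id).or_insert(0) += 1;
        self.receipts.entry(id).or_insert(data);
    }

    /// Called when a chunk is garbage collected: decrement, and only delete
    /// the receipt once the last chunk referencing it is gone.
    fn clear_chunk_receipt(&mut self, id: CryptoHash) {
        if let Some(rc) = self.refcounts.get_mut(&id) {
            *rc -= 1;
            if *rc == 0 {
                self.refcounts.remove(&id);
                self.receipts.remove(&id);
            }
        }
    }
}
```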