
triage: reorged chain needs to "restabilize" before running transactions #9613

Open
Tracked by #9261
just-mitch opened this issue Oct 31, 2024 · 0 comments
Assignees
Labels
S-needs-triage Status: This new issue/PR needs to be triaged.

Comments

@just-mitch
Collaborator

just-mitch commented Oct 31, 2024

See the `survives a reorg` test.

Why do we need to sleep 3 epochs and allow the transaction bot to "stabilize the chain" before we can attempt to create wallets and transfer tokens?

Further, setting the sleep to 2 epochs seems to consistently trigger a `Tx dropped by P2P node` error when the test attempts to deploy accounts.
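For context, a minimal sketch of the kind of stabilization wait in question. The epoch and slot durations below are illustrative placeholders, not the real network config:

```typescript
// Illustrative values only; the real deployment's epoch and slot
// durations come from the network config.
const SLOTS_PER_EPOCH = 32;
const SLOT_DURATION_MS = 12_000;

// Convert a number of epochs into a wall-clock duration in milliseconds.
function epochsToMs(epochs: number): number {
  return epochs * SLOTS_PER_EPOCH * SLOT_DURATION_MS;
}

// Sleeping only 2 epochs reliably hits `Tx dropped by P2P node`;
// 3 epochs is the observed minimum that lets transfers succeed.
async function awaitChainStable(epochs = 3): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, epochsToMs(epochs)));
}
```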

@just-mitch just-mitch added the S-needs-triage Status: This new issue/PR needs to be triaged. label Oct 31, 2024
@just-mitch just-mitch changed the title Chain needs to "restabilize" before rerunning transactions reorged chain needs to "restabilize" before running transactions Oct 31, 2024
just-mitch added a commit that referenced this issue Nov 1, 2024
Effectively runs the transfer test twice with a reorg in the middle.

To do so, I allow control of the host's k8s from within our e2e test
container during the jest test.

This allows us to do the following programmatically from jest:
- forward ports
- install helm charts (like the chaos mesh one used to kill the provers
for 2 epochs)
- kill pods
- wait for pods
- etc
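The steps above can be sketched by shelling out to `kubectl` from the test process. The command shapes here are illustrative assumptions, not the test's actual helpers:

```typescript
import { execSync } from 'child_process';

// Build a namespaced kubectl command string (pure, so it is easy to test).
function buildKubectlCmd(namespace: string, args: string): string {
  return `kubectl -n ${namespace} ${args}`;
}

// Run kubectl against the host cluster and return its stdout.
function kubectl(namespace: string, args: string): string {
  return execSync(buildKubectlCmd(namespace, args), { encoding: 'utf8' });
}

// e.g. kill the prover pods, then wait for them to come back
// (label selectors here are hypothetical):
//   kubectl('aztec', 'delete pod -l app=prover-node --wait=false');
//   kubectl('aztec', 'wait --for=condition=Ready pod -l app=prover-node --timeout=600s');
```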

Note: I expected to only need to wait 2 epochs for the reorg, and then
be able to run more transfers. Instead I needed to wait 3 epochs after
killing the provers for 2 epochs, otherwise I get `Tx dropped by P2P
node`. I filed #9613

Other changes in this PR:
- use p2p prover coordination in the prover nodes by default
- run the reorg test on merge to master against 16 validators _**which I
expect to fail**_
- run the reorg test nightly against 16 validators _**which I expect to
fail**_ (yes, this is the same as the above, but the nightly should move
to AWS after #9472)
- restart the transaction bot in k8s when it fails
- update the transaction bot to send 1 public and 0 private transactions
by default
- update the `install_local_k8s` script to install metrics and chaos
mesh by default
@just-mitch just-mitch changed the title reorged chain needs to "restabilize" before running transactions triage: reorged chain needs to "restabilize" before running transactions Nov 7, 2024
just-mitch added a commit that referenced this issue Nov 14, 2024
This test:
- updates the aztec network deployment, allowing validators to use each other as boot nodes
- applies the "network-requirements" network shaping
- permanently disables the boot node
- runs 3 epochs during which it:
  - kills 25% of the validators
  - asserts that we miss less than 50% of slots

Other work in this branch includes:
- add `ignoreDroppedReceiptsFor` TX wait options
  - this allows sending a TX to one node, and awaiting it on another since we need time for p2p propagation
  - we need this since we have shifted the PXE to point at the top-level validator service, which load balances across individuals
  - this may help with #9613
- scalable loki deployment for prod
- more visible logging for core sequencer operations
- better error handling during the setup of l2 contracts
- better error handling in the pxe
- rename the network shaping charts to "aztec-chaos-scenarios"
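A hedged sketch of how a wait option like `ignoreDroppedReceiptsFor` might behave: a "dropped" receipt is treated as transient until the grace period elapses, since the tx may still be propagating over p2p to the node being polled. The `Receipt` type, `getReceipt` callback, and status strings are illustrative, not the real client API:

```typescript
type Receipt = { status: 'pending' | 'dropped' | 'mined' };

interface WaitOpts {
  ignoreDroppedReceiptsFor: number; // ms to tolerate "dropped" as transient
  timeout: number;                  // ms before giving up entirely
  interval: number;                 // ms between polls
}

async function waitForReceipt(
  getReceipt: () => Promise<Receipt>,
  opts: WaitOpts,
): Promise<Receipt> {
  const start = Date.now();
  for (;;) {
    const receipt = await getReceipt();
    const elapsed = Date.now() - start;
    if (receipt.status === 'mined') return receipt;
    // Only treat "dropped" as fatal once the grace period has passed.
    if (receipt.status === 'dropped' && elapsed > opts.ignoreDroppedReceiptsFor) {
      throw new Error('Tx dropped by P2P node');
    }
    if (elapsed > opts.timeout) throw new Error('Timed out waiting for receipt');
    await new Promise((resolve) => setTimeout(resolve, opts.interval));
  }
}
```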
just-mitch added a commit that referenced this issue Nov 14, 2024
just-mitch added a commit that referenced this issue Nov 15, 2024
just-mitch added a commit that referenced this issue Nov 15, 2024
just-mitch added a commit that referenced this issue Nov 20, 2024
Despite what the branch name indicates, the main feature of this branch is `gating_passive.test.ts`.

This test:
- updates the aztec network deployment, allowing validators to use each
other as boot nodes
- applies the "network-requirements" network shaping
- permanently disables the boot node
- runs 3 epochs during which it:
  - kills 25% of the validators
  - asserts that we miss less than 50% of slots
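The slot-miss assertion above can be sketched as follows (hypothetical helper, not the test's actual code):

```typescript
// Given which slots produced a block over the observed epochs,
// compute the fraction of slots that were missed.
function missedSlotRatio(slotFilled: boolean[]): number {
  const missed = slotFilled.filter((filled) => !filled).length;
  return missed / slotFilled.length;
}

// The test would then assert: missedSlotRatio(observed) < 0.5
```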

Other work in this branch includes:
- add `ignoreDroppedReceiptsFor` TX wait options
  - this allows sending a TX to one node, and awaiting it on another since we need time for p2p propagation
  - we need this since we have shifted the PXE to point at the top-level validator service, which load balances across individuals
  - this may help with #9613
- scalable loki deployment for prod
- more visible logging for core sequencer operations
- better error handling during the setup of l2 contracts
- better error handling in the pxe
- rename the network shaping charts to "aztec-chaos-scenarios"

Fix #9713 
Fix #9883
@just-mitch just-mitch self-assigned this Dec 12, 2024