pause faulty kernel while allowing cosmic-swingset to continue? #4516

Open
warner opened this issue Feb 9, 2022 · 0 comments
Labels: cosmic-swingset, enhancement, SwingSet

warner commented Feb 9, 2022

What is the Problem Being Solved?

If the chain-side swingset kernel crashes, we obviously cannot safely proceed with kernel operations. Currently, that means controller.run() will throw, and then cosmic-swingset will halt. (I think the process doesn't terminate, but it doesn't process any more blocks either.)

Once we're in this state, the operator needs to restart their validator (we can imagine a supervisor process that notices and restarts, if the process actually terminated). But if the problem is not somehow transient, we expect controller.run() to throw again right away.

The only way to get out of this state is to actually fix the problem, which might mean modifying the kernel code. How do we execute that upgrade? And how should the validators collectively decide that a kernel change is the right thing to do? When the chain is running, this is what governance votes are for. But in this situation, the chain is not running.

One idea we've kicked around is to have some sort of cosmic-swingset flag that says "the kernel is offline right now", set when controller.run() throws. In this state, all swingset-dispatched txns are rejected (including mailbox messages). However, plain cosmos messages (including governance) continue to work. This might allow a governance vote to take place, whose passage would instruct cosmic-swingset to do something with the kernel that would get it un-stuck.
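
To make the flag idea concrete, here is a minimal JavaScript sketch. Everything in it is hypothetical: `makeKernelGate`, the `kind` values `'swingset'` and `'cosmos'`, and the `resume` hook are invented for illustration and are not part of cosmic-swingset. The `runKernel` callback merely stands in for a call such as `controller.run()`.

```javascript
// Hypothetical sketch: none of these names exist in cosmic-swingset.
// `runKernel` stands in for something like `() => controller.run()`.
function makeKernelGate(runKernel) {
  let kernelOffline = false;
  let lastError;

  // Run the kernel; if it throws, latch the offline flag instead of halting.
  async function runKernelSafely() {
    if (kernelOffline) {
      return false; // never touch a kernel that is known to be broken
    }
    try {
      await runKernel();
      return true;
    } catch (err) {
      kernelOffline = true;
      lastError = err;
      return false;
    }
  }

  // Decide whether an incoming txn may proceed.
  // kind 'swingset' needs the kernel; kind 'cosmos' (e.g. governance) does not.
  function admit(msg) {
    if (msg.kind === 'swingset' && kernelOffline) {
      return { accepted: false, reason: 'kernel offline' };
    }
    return { accepted: true };
  }

  // Assumed to be triggered by whatever a passed governance vote instructs.
  function resume() {
    kernelOffline = false;
    lastError = undefined;
  }

  return {
    runKernelSafely,
    admit,
    resume,
    isOffline: () => kernelOffline,
    getLastError: () => lastError,
  };
}
```

In this sketch the flag is latched: once set, the kernel is not run again until `resume()` is called, even if the underlying fault has gone away. Whether the real flag should behave that way is an open design question.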

This would be pretty tricky, though, because we have lots of places where cosmic-swingset (Go) code expects to be able to talk to the JS-side kernel code, or where it wouldn't be able to catch up afterwards. For example, if a balance transfer txn arrived while the kernel was paused, the Go-side ledger balance would be updated, but the "balance has changed" message normally sent into the kernel could not be delivered. Later, when the kernel is fixed and resumed, this message would probably be lost, and something on the JS side might never learn about the change.
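
The lost-notification hazard can be shown with a toy model. The sketch below is hypothetical throughout: `makeBridge`, the `bufferWhilePaused` option, and the `'balance-changed'` message shape are invented, and plain JavaScript objects stand in for both the Go-side ledger and the JS-side kernel. With buffering off, a transfer made during the pause updates the ledger but never reaches the kernel. With buffering on, a backlog is replayed in order on resume, which is one conceivable mitigation rather than a settled design.

```javascript
// Hypothetical sketch: a Map stands in for the Go-side ledger and an array
// stands in for "messages the JS-side kernel has received".
function makeBridge({ bufferWhilePaused }) {
  const ledger = new Map(); // address -> balance
  const kernelSeen = []; // notifications the kernel actually received
  const backlog = []; // notifications held while the kernel is paused
  let paused = false;

  function notifyKernel(note) {
    if (!paused) {
      kernelSeen.push(note);
    } else if (bufferWhilePaused) {
      backlog.push(note); // keep it for later, in arrival order
    }
    // Otherwise the notification is silently dropped.
  }

  return {
    pause() {
      paused = true;
    },
    resume() {
      paused = false;
      // Drain in original order so the kernel sees the same sequence.
      while (backlog.length > 0) {
        kernelSeen.push(backlog.shift());
      }
    },
    transfer(from, to, amount) {
      ledger.set(from, (ledger.get(from) || 0) - amount);
      ledger.set(to, (ledger.get(to) || 0) + amount);
      notifyKernel({ type: 'balance-changed', addresses: [from, to] });
    },
    balanceOf: addr => ledger.get(addr) || 0,
    kernelSeen: () => [...kernelSeen],
  };
}
```

On a real chain any such backlog would presumably have to live in consensus state, so that every validator replays the same messages in the same order.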

Description of the Design

Not sure. Something in the cosmic-swingset Swingset module.

Security Considerations

I suspect there's all sorts of mischief one could achieve while the kernel is paused.

Determinism Considerations

At the moment of kernel panic, the kernel will have some per-crank state in the crank buffer (which will probably not have been committed), and considerably more in the host's block buffer (changes from every successful delivery before the crashing one). If the "resume" step is going to start with a fresh copy of the kernel, then we need to discard some or all of this state. Any DeliverTx that caused run-queue entries to be added needs to survive or be re-submitted when the kernel is unpaused. We'll need to look very carefully at where the state goes and what gets kept or discarded, to make sure every operation happens exactly once in total.
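
The two layers of uncommitted state can be modelled with a small layered store. This is a simplified, hypothetical sketch: `makeLayeredStore` and its method names are invented, both buffers are kept in one object even though the text places the crank buffer in the kernel and the block buffer in the host, and deletions are not modelled. It only illustrates which writes disappear under each kind of discard.

```javascript
// Hypothetical sketch: crank writes -> block buffer -> committed store.
function makeLayeredStore() {
  const committed = new Map(); // state as of the last committed block
  const blockBuffer = new Map(); // writes from finished deliveries in this block
  const crankBuffer = new Map(); // writes from the delivery in progress

  // Reads see the newest layer that has the key.
  function get(key) {
    for (const layer of [crankBuffer, blockBuffer, committed]) {
      if (layer.has(key)) {
        return layer.get(key);
      }
    }
    return undefined;
  }

  return {
    get,
    set(key, value) {
      crankBuffer.set(key, value);
    },
    // A delivery finished: fold its writes into the block buffer.
    commitCrank() {
      for (const [k, v] of crankBuffer) {
        blockBuffer.set(k, v);
      }
      crankBuffer.clear();
    },
    // A delivery failed: drop only that delivery's writes.
    abortCrank() {
      crankBuffer.clear();
    },
    // The block finished: make everything durable.
    commitBlock() {
      for (const [k, v] of blockBuffer) {
        committed.set(k, v);
      }
      blockBuffer.clear();
    },
    // Discard everything since the last committed block.
    abortBlock() {
      crankBuffer.clear();
      blockBuffer.clear();
    },
  };
}
```

In this model, `abortCrank()` keeps the earlier successful deliveries of the block while `abortBlock()` throws them away too. Any transaction whose effects are thrown away would then need to be re-submitted exactly once after the kernel is unpaused.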

Test Plan
