pause faulty kernel while allowing cosmic-swingset to continue? #4516
Labels: cosmic-swingset (package: cosmic-swingset), enhancement (New feature or request), SwingSet (package: SwingSet)
What is the Problem Being Solved?
If the chain-side swingset kernel crashes, we obviously cannot safely proceed with kernel operations. Currently, that means `controller.run()` will throw, and then cosmic-swingset will halt. (I think the process doesn't terminate, but it doesn't process any blocks either.)
Once we're in this state, the operator needs to restart their validator (we can imagine a supervisor process that notices and restarts, if the process actually terminated). But if the problem is not somehow transient, we expect `controller.run()` to throw again right away.
The only way to get out of this state is to actually fix the problem, which might mean modifying the kernel code. How do we execute that upgrade? And how should the validators collectively decide that a kernel change is the right thing to do? When the chain is running, this is what governance votes are for. But in this situation, the chain is not running.
One idea we've kicked around is to have some sort of cosmic-swingset flag that says "the kernel is offline right now", set when `controller.run()` throws. In this state, all swingset-dispatched txns are rejected (including mailbox messages). However, plain cosmos messages (including governance) continue to work. This might allow a governance vote to take place, whose passage would instruct cosmic-swingset to do something with the kernel that would get it un-stuck.
This would be pretty tricky, though, because we have lots of places where cosmic-swingset (Go) code expects to be able to talk to the JS-side kernel code, or where the kernel wouldn't be able to catch up afterwards. For example, if a balance transfer txn arrived while the kernel was paused, the Go-side ledger balance would be updated, but the "balance has changed" message normally sent into the kernel could not be delivered. Later, when the kernel is fixed and resumed, this message would probably be lost, and something on the JS side might never learn about the change.
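To make the flag idea concrete, here is a minimal sketch of the gating logic on the JS side. Everything in it is hypothetical: `makeKernelGate`, `admit`, `resume`, and the `msg.route` field are illustrative names, not existing cosmic-swingset APIs. The only thing taken from the real system is that `controller.run()` throws when the kernel is broken.

```javascript
// Hypothetical sketch: none of these names exist in cosmic-swingset.
function makeKernelGate(controller) {
  let kernelOffline = false;
  let lastError;

  return {
    isOffline: () => kernelOffline,
    lastError: () => lastError,

    // Run the kernel once. If it throws, mark the kernel offline instead
    // of letting the error halt block processing. While offline, the
    // kernel is not invoked at all.
    async runKernel() {
      if (kernelOffline) {
        return false;
      }
      try {
        await controller.run();
        return true;
      } catch (err) {
        kernelOffline = true;
        lastError = err;
        return false;
      }
    },

    // Decide whether an incoming message may proceed.
    // `msg.route` is an assumed field: 'swingset' for anything that must
    // reach the kernel (including mailbox messages), 'cosmos' otherwise.
    admit(msg) {
      if (msg.route === 'swingset' && kernelOffline) {
        return { accepted: false, reason: 'kernel offline' };
      }
      return { accepted: true };
    },

    // Called once a repair has been approved, e.g. by a governance vote.
    resume(newController) {
      controller = newController;
      kernelOffline = false;
      lastError = undefined;
    },
  };
}
```

This only covers the easy half: rejecting swingset-bound txns at the door. It does nothing for the Go-side events (like the balance change above) that are accepted while paused and still need to reach the kernel later.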
Description of the Design
Not sure. Something in the cosmic-swingset `Swingset` module.
Security Considerations
I suspect there's all sorts of mischief one could achieve while the kernel is paused.
Determinism Considerations
At the moment of kernel panic, the kernel will have some per-crank state in the crank buffer (which will probably not have been committed), and considerably more in the host's block buffer (changes from every successful delivery before the crashing one). If the "resume" step is going to start with a fresh copy of the kernel, then we need to discard some or all of this state. Any `DeliverTx` that caused run-queue entries to be added needs to survive or be re-submitted when the kernel is unpaused. We'll need to look very carefully at where the state goes and what gets kept or discarded, to make sure every operation happens exactly once in total.
Test Plan
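One way to picture the exactly-once requirement from the Determinism Considerations, and something a test could exercise, is a journal of inputs that arrived while the kernel was paused and are replayed on resume. This is a sketch under strong assumptions: `makePendingJournal`, `record`, and `replay` are hypothetical names, each input is assumed to carry a unique id, and the journal is kept in memory here, whereas a real one would have to live in consensus state and commit atomically with the kernel's own state.

```javascript
// Hypothetical sketch; names are illustrative, not from cosmic-swingset.
function makePendingJournal() {
  const pending = new Map(); // id -> input, iterated in insertion order
  const delivered = new Set(); // ids already handed to the kernel

  return {
    // Record an input (e.g. a balance-change notice) seen while paused.
    // Recording an id that is already pending or delivered is a no-op.
    record(id, input) {
      if (!delivered.has(id) && !pending.has(id)) {
        pending.set(id, input);
      }
    },

    // Hand every pending input to `deliver` in arrival order. An input is
    // marked delivered only after `deliver` returns, so a throw leaves it
    // (and everything after it) pending for the next attempt.
    replay(deliver) {
      for (const [id, input] of [...pending]) {
        deliver(input);
        delivered.add(id);
        pending.delete(id);
      }
    },

    pendingCount: () => pending.size,
  };
}
```

Note that this alone gives at-least-once behavior if `deliver` has side effects before it throws. Getting to exactly-once depends on the delivery's effects and the journal update being committed or discarded together, which is the part that needs the careful look described above.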