Suspected liveslots sensitivity to engine allocations #6784
I looked into a few of these; here are some preliminary notes.

The first one ("XS: fuzzilli 38", vat v24, tnum=1547) is a makeWantMintedInvitation to o+19/1, in which the "expected" behavior has syscalls which allocate a new InvitationHandle (an instance of durable Kind 12), but the "got" is a state fetch for o+19/1 (the target object). This indicates that liveslots had no Representative or state for the target of the message (I guess the GC variance caused the Representative to get collected early). @FUDCo and I considered this case during design/implementation, but it looks like we missed something. Our intention was that the Representative can be built without doing a vatstore read (the Kind ID is parsed out of the vref), so simply delivering a message to a vobj shouldn't trigger a syscall (or diverge depending on whether the Rep is or is not currently present in memory).

Hm, I wonder if the write-back cache is involved: suppose userspace does something like the following
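(a minimal sketch, assuming a virtual object's durable `state`; the `count` property is hypothetical):

```js
state.count = 1;       // 1: write, lands in the write-back (LRU) cache as dirty data
state.count = 2;       // 2: write, updates the cached entry, still not flushed to the vatstore
const n = state.count; // 3: read, must observe 2
```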
then the 3-read must use the dirty data from the cache, for correctness, but we thought we were doing an unconditional read from the vatstore, for determinism. I don't remember considering the conflict between those two, which means I'm not confident that it does the right thing. (@FUDCo and I have talked about removing that LRU cache entirely, or making it write-through, in #6693, partly because of how it confuses issues like this.)

The last one ("XS profile ID by function definition", vat v6, tnum=12258) is a BringOutYourDead, which (in the transcripts provided) starts with a pair of …. Basically the ….

I concur that we have more sensitivity to GC order than we first thought (i.e. #6360 and #6760 do not form an exhaustive list).
Not sure if it can help, but I have generated new vat transcripts replaying chain activity using an updated version of XS (with different allocation behavior). Here is the diff vs the vat transcripts for the pismoC version of XS (as if that fixed version had been used from the agd upgrade). I have the full transcripts if needed as well. transcript-fixed-vs-updated-diff.tgz

Generated using:

```sh
for i in {1..54}; do
  diff -U 5 \
    <(zcat replay-fixed/transcript-v$i.sst.gz | jq -r .) \
    <(cat replay-updated-xs-from-transcript/transcript-v$i.sst \
        | head -n $(zcat replay-fixed/transcript-v$i.sst.gz | wc -l) \
        | jq -r .) \
    > transcript-fixed-vs-updated-v$i.diff
done
```
We'll use this ticket to track the sensitivities we discover, but we may or may not be able to fix all of them in time for the bulldozer release.
Update: Moddable landed a patch for us which considers WeakRefs strong during organic GC and only empties them during forced GC (which happens during either BringOutYourDead or snapshot-making, both of which are in-consensus operations). That effectively hides organic GC from liveslots, and thus from the kernel. That patch seems to have removed all the anachrophobia reported above.

Testing methodology: new vat transcripts were generated using the chain-replay tool on pismoC with the WeakRef GC patch cherry-picked (see mhofman/6784-hide-organic-gc-pismo), then those transcripts were replayed on the latest XS, which includes the same WeakRef GC patch (see mhofman/6784-hide-organic-gc-updated-xs).

This should allow us to confidently update XS versions which contain allocation differences without triggering replay issues during node restart. Of course we should still fix the underlying issues in liveslots so we can disable this patch.
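A conceptual sketch of the observable difference under the patch (illustrative only, not the engine code; it assumes a global `gc()` that forces a full collection, as xsnap-style setups provide):

```js
// Sketch of the patched WeakRef semantics, not the actual Moddable patch.
let target = { id: 'example' }; // hypothetical object
const wr = new WeakRef(target);
target = null; // drop the only strong reference

// Organic (engine-initiated) GC may run here at any time; under the patch
// it treats WeakRef targets as strong, so the referent survives:
console.log(wr.deref() !== undefined); // true, even after organic GC

gc(); // forced full GC, as performed during BringOutYourDead or snapshot-making
console.log(wr.deref()); // undefined: only forced GC empties WeakRefs, deterministically
```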
Edit: that issue was happening in a node-based worker, which explains the full non-determinism.
Not sure if it's relevant, but leaving a pointer just in case: I found that liveslots …
Describe the bug
While validating an upgrade to Moddable SDK 3.6.0 from the fork we've been running, I attempted to verify that the upgraded XS version could still execute transcripts of mainnet vats.
Liveslots is supposed to hide any GC effect from user code, and to hide organic GC from the kernel (as captured by transcripts), only revealing the result of a full GC, in a deterministic manner, when explicitly requested by bringOutYourDead. As such, if the JavaScript behavior of the user program is the same, the transcript should be the same.
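For context, a conceptual sketch of that contract (illustrative only, not the actual liveslots code; liveslots does track collected vrefs via a FinalizationRegistry, but the names here are made up):

```js
// Illustrative sketch of deferred GC reporting, not the real liveslots implementation.
const deadSet = new Set(); // vrefs whose Presence/Representative was collected
const registry = new FinalizationRegistry(vref => deadSet.add(vref));

function notePresence(presence, vref) {
  registry.register(presence, vref); // organic GC may fire this at any time...
}

function bringOutYourDead() {
  // ...but nothing reaches the kernel until BOYD, which first forces a full
  // GC so the report doesn't depend on when (or whether) organic GC ran.
  globalThis.gc?.();
  const dead = [...deadSet].sort();
  deadSet.clear();
  return dead; // reported to the kernel via syscalls such as dropImports/retireImports
}
```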
While we're aware of, and are able to work around, a GC sensitivity in virtual collections, something else is causing additional execution differences.
I've narrowed down the triggers of those differences to 5 commits in the Moddable SDK, all of which have been confirmed to affect allocations.
To Reproduce
1. Check out `release-pismo` with the Moddable SDK updated to 3.6.0, and select reverts on top, to make all transcripts run as they initially did.
2. Revert one of the identified commits in the `moddable` submodule to trigger the corresponding error. Please note that `yarn build` in `packages/xsnap` automatically restores the moddable submodule commit unless it's added to the index. In `packages/xsnap/moddable`:
   ```sh
   git reset --hard mhofman/sync-sdk-keep-behavior
   git revert $commitsha
   (cd .. && git add moddable && yarn build)
   ```
3. Replay the mainnet vat transcripts:
   ```sh
   /path/to/agoric-sdk/packages/SwingSet/misc-tools/replay-transcript.js --no-forced-reload-from-snapshot transcript-v$i.sst
   ```
Observed behaviors