-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
divergent XS heap snapshots on gcc-9 (i.e. Ubuntu-20.04-LTS) #7829
Comments
Correction: we've already started the communcation about requiring a newer GCC. So instead, we'll be changing #ifndef __has_builtin
#error "xsnap requires __has_builtin, so GCC>=10, e.g. Ubuntu-22.04 or newer"
#endif This will require two PRs: the first to update https://github.com/agoric-labs/xsnap-pub to include the |
XS emits different heap snapshots when compiled under gcc-9 (missing __has_builtin) and gcc-10 or later (has __has_builtin). To avoid the consensus problems that result, require __has_builtin, and thus a newer compiler. refs Agoric/agoric-sdk#7829
This switches to a version of `xsnap-pub` that adds a compile-time assertion of the availability of `__has_builtin()`. This function was added to gcc-10, and was missing in gcc-9. XS can produce different heap snapshots depending upon whether this function is available or not (breaking consensus, which includes the heap snapshot hashes). By mandating its presence, we ensure that all instances will behave the same way, maintaining consensus. closes #7829
This switches to a version of `xsnap-pub` that adds a compile-time assertion of the availability of `__has_builtin()`. This function was added to gcc-10, and was missing in gcc-9. XS can produce different heap snapshots depending upon whether this function is available or not (breaking consensus, which includes the heap snapshot hashes). By mandating its presence, we ensure that all instances will behave the same way, maintaining consensus. closes #7829
This switches to a version of `xsnap-pub` that adds a compile-time assertion of the availability of `__has_builtin()`. This function was added to gcc-10, and was missing in gcc-9. XS can produce different heap snapshots depending upon whether this function is available or not (breaking consensus, which includes the heap snapshot hashes). By mandating its presence, we ensure that all instances will behave the same way, maintaining consensus. closes #7829
Both PRs have landed, so I'm closing this one out. We chose to have xsnap to require that With the approach we used, a compiler which has a @michaelfig and I developed a different approach, in which we edit the build process to use I think, in the vaults+1 timeframe, we should switch to the |
Today's Emerynet upgrade exposed a behavior divergence: some validators (call them Team Number) produced a different AppHash for the bootstrap block than the rest (call those ones Team Integer). All of the validators that we (@arirubinstein) ran were on Team Integer, however Team Number had a 2/3rds majority. The first block stalled a lot, as the two teams refused to vote for each other's blocks, and several rounds of voting (and timeouts) had to occur before a Team Number validator became the proposer. (Our debugging was hindered by the fact that we were using a dashboard hosted by a Team Integer node, which did not accept the ultimately-winning Team Number blocks).
We extracted the genesis export from members of both teams, and compared their contents, revealing differences in the transcript spans and XS heap snapshots for a particular vat (
v42
, the vault manager I think). We then extracted the two heap snapshots (made as delivery 4, one from each team) throughpackages/SwingSet/misc-tools/ingest-xs-snapshot.js
to parse them into a little DB, and compared theslots
table of each. We found exactly one divergence:on a Team Integer validator, vs:
on a Team Number validator.
That means in one cohort, one particular slot was of type
XS_NUMBER_KIND
(a JS Number, aka a float), but in the other cohort it was of typeXS_INTEGER_ID
. This is an optimization of Number that only holds 32-bit integers (note: not a BigInt, just something in the integer subset ofNumber
).One place that performs such an optimization (although probably not the one causing the problem) is in
xsJSON.c / fxParseJSONToken
where it's parsing numbers:With Moddable's help, we eventually concluded that it was most likely to be in
xsRun.c
, where basic XS bytecodes like "multiply two things" do:If that code is multiplying two INTEGERs, the result may or may not fit into a 32-bit integer. If it does, this will produce a slot with
XS_INTEGER_KIND
(which is faster), but only if it can use__builtin_add_overflow
to efficiently tell whether it overflows or not. If it does not have__builtin_add_overflow
, then it unconditionally producesXS_NUMBER_KIND
. And it needs__has_builtin()
to sense whether__builtin_add_overflow
is available or not.We eventually concluded that Team Integer validators compiled their
xsnap
with a compiler that offers__has_builtin()
(specifically gcc-10 or newer), whereas Team Number used an older compiler that did not (gcc-9). Ubuntu 20.04 LTS offers both, but gcc-9 is the default, and it is not trivial to use gcc-10 for thexsnap
compile. Ubuntu 22.04 LTS uses gcc-11 as the default.Note that gcc-9 provides
__builtin_mul_overflow
and the rest, it's only missing the nice portable way to inquire about their presence:__has_builtin()
.The difference between
XS_NUMBER_KIND
andXS_INTEGER_KIND
is not visible to JavaScript: both havetypeof x === 'number'
and they behave identically. However the heap snapshot records the slot Kind, so the heap snapshots were different. And because we now include heap snapshot hashes in consensus, Team Number and Team Integer did not approve of each other's blocks.We got lucky that the divergence occurred in the first four deliveries of some vat, because that allowed us to see the difference in the heap snapshots right away. If the divergence happened in delivery 5, we wouldn't have seen the next heap snapshot for another 2000 deliveries.
The Fix
We're going to require that the code be compiled with a compiler that offers all the builtins needed to use these optimizations: Team Integer is our preferred winner.
To avoid requiring validators to upgrade their Ubuntu 20.04 LTS platforms to one with gcc-10 (after all 20.04 LTS is still supported, and is only three years old), we're doing to do this by
#define
ing the__has_builtin()
into something that always returns true. @michaelfig came up with a patch that only affects agoric-sdk, and avoids the need to modify both our XS fork and patch agoric-sdk to use it:With this in place, XS will always attempt to use
__builtin_mul_overflow
and it'sadd
/sub
friends, which will work on gcc-9 (and maybe even older versions). If someone attempts to compile it on a really old platform that lacks them, they'll get a linker error. But all validators should be doing the same thing.We also intend to add a unit test (later) that performs a multiply of two small numbers and creates a heap snapshot, comparing it against a "golden" value created on a new platform (and manually verified to be Team Integer). This test would then fail if, for whatever reason, XS chose to use the
XS_NUMBER_KIND
variant.The text was updated successfully, but these errors were encountered: