Decide on mitigation of missed leadership checks due to ledger snapshots #868
Comments
A quick comment.
That may prove to be a useful workaround in the short term, but our separation-of-concerns goal has so far been to avoid this sort of design (anything that changes the node's behavior based on the upcoming leadership schedule).
It seems fair to assume the leadership check thread is starved of CPU while the snapshot is being taken. I think there are two a priori suspects that might be occupying the CPU during that time:

If the GC pauses are greater than a second, then it's entirely possible the leadership check doesn't get CPU time during the elected slot. If we incrementalize the snapshotting work, then perhaps the lower allocation rate will allow for smaller GC pauses, ideally sub-second (i.e. sub-slot).

Generally the RTS scheduler is fair, so I wouldn't expect the snapshotting threads (busy while taking the snapshot) to be able to starve the leadership check thread. But I list them anyway because maybe something about the computation is not playing nice with the RTS's preemption. (Examples include a long-running pure calculation that doesn't allocate, but I don't anticipate that in this particular case. Maybe some "FFI" stuff? I'm unsure how the file is actually written.)

The lack of fairness in STM shouldn't matter here. The leadership check thread only reads the current chain, which shouldn't be changing rapidly enough to incur contention among the STM transactions that depend on it.
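As a side note on that preemption caveat, here is a toy illustration (not node code, just my assumption of the failure mode) of how a pure, non-allocating loop can starve another thread: GHC's scheduler can only switch threads at allocation points, so with `-O` and a single capability such a loop leaves the RTS no safe point to preempt it. Compiling with `-fno-omit-yields` (or making the loop allocate) restores preemption.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever)

-- Tail-recursive countdown; with -O it runs on unboxed Ints and never
-- allocates, so it offers the RTS no safe point to switch threads.
spin :: Int -> Int
spin 0 = 0
spin n = spin (n - 1)

main :: IO ()
main = do
  -- Stand-in for the once-per-second leadership check.
  _ <- forkIO $ forever $ do
         putStrLn "heartbeat"
         threadDelay 1000000
  -- On a single capability (+RTS -N1), the heartbeat thread may not get to
  -- run at all until this several-second loop finishes.
  print (spin (2 * 10 ^ (9 :: Int)))
```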
There are a few possible culprits for the increase in CPU usage, memory usage, and GC time:
@lehins will look into 1 and 2, to see how much work they would require. As for 0, this is something we could do on the Consensus side. @coot suggested we could also use info-table profiling to extract more information.
Regarding 0, we are writing a lazy `ByteString`:

```haskell
writeSnapshot ::
     forall m blk. MonadThrow m
  => SomeHasFS m
  -> (ExtLedgerState blk -> Encoding)
  -> DiskSnapshot
  -> ExtLedgerState blk -> m ()
writeSnapshot (SomeHasFS hasFS) encLedger ss cs = do
    withFile hasFS (snapshotToPath ss) (WriteMode MustBeNew) $ \h ->
      void $ hPut hasFS h $ CBOR.toBuilder (encode cs)
  where
    encode :: ExtLedgerState blk -> Encoding
    encode = encodeSnapshot encLedger

-- | This function makes sure that the whole 'Builder' is written.
--
-- The chunk size of the resulting 'BL.ByteString' determines how much memory
-- will be used while writing to the handle.
hPut :: forall m h
      . (HasCallStack, Monad m)
     => HasFS m h
     -> Handle h
     -> Builder
     -> m Word64
hPut hasFS g = hPutAll hasFS g . BS.toLazyByteString
```

I tried using

The following screenshot shows the heap profile, which is started before taking a snapshot of the ledger state:

The first spike related to encoders appears on the 3rd page of the "Detailed" tab, and takes only 40 MB.
We also observe very low productivity during the run above:
after processing a fixed number of slots. This is an experiment to investigate #868
Processing 20K blocks and storing 3 snapshots corresponding to the last 3 blocks that were applied results in the following heap profile:

The "Detailed" tab seems to show

The branch used to obtain these results can be found here. The profile can be produced by running:

```
cabal run exe:db-analyser -- cardano --config $NODE_HOME/configuration/cardano/mainnet-config.json --db $NODE_HOME/mainnet/db --analyse-from 72316896 --only-immutable-db --store-ledger 72336896 +RTS -hi -s -l-agu
```
@dnadales could you run it once more, with the https://hackage.haskell.org/package/base-4.19.1.0/docs/GHC-Stats.html#v:getRTSStats dump before and after each of the ledger snapshots being written?
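For reference, a minimal sketch of what such a dump could look like (an illustration with a made-up helper name, not actual db-analyser code); it requires running the program with `+RTS -T` (or `-s`) so the stats are collected:

```haskell
import Control.Monad (unless)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Wrap an action (e.g. writing one ledger snapshot) and print the deltas of
-- the GC-related RTS statistics around it.
withRTSStatsDump :: String -> IO a -> IO a
withRTSStatsDump label act = do
    enabled <- getRTSStatsEnabled
    unless enabled $ fail "RTS stats not enabled; run with +RTS -T"
    before <- getRTSStats
    r      <- act
    after  <- getRTSStats
    putStrLn $ unlines
      [ label
      , "  GCs:              " ++ show (gcs after - gcs before)
      , "  major GCs:        " ++ show (major_gcs after - major_gcs before)
      , "  GC CPU time (ms): " ++ show ((gc_cpu_ns after - gc_cpu_ns before) `div` 1000000)
      , "  GC elapsed (ms):  " ++ show ((gc_elapsed_ns after - gc_elapsed_ns before) `div` 1000000)
      , "  allocated (MB):   " ++ show ((allocated_bytes after - allocated_bytes before) `div` 1000000)
      , "  max live (MB):    " ++ show (max_live_bytes after `div` 1000000)
      ]
    pure r
```

Each snapshot write would then be wrapped in something like `withRTSStatsDump "snapshot N" (...)`.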
A thought for a quick-fix partial mitigation to help out users until we eliminate the underlying performance bug: we could take snapshots much less often. The default interval is 72min. The nominal expectation for that duration is 216 new blocks arising. If we increased the interval to 240 minutes, then the expectation would be 720 blocks. The initializing node should still be able to re-process that many blocks in less than a few minutes (my recollection is that deserializing the ledger state snapshot file is far and away the dominant factor in startup time).

^^^ that all assumes we're considering a node that is already caught up. A syncing node would have processed many more blocks than 216 during a 72min interval. But the argument above does suggest it would be relatively harmless for a caught-up node to use an inter-snapshot duration of 240min. (Moreover, I don't immediately see why it couldn't even be 10 times that or more. But that seems like a big change, so I hesitate to bless it without giving it much more thought.)

(The interval is the first
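(For reference, the 216/720 figures follow from mainnet's 1 s slot length and active slot coefficient f = 0.05, i.e. one block expected every 20 s on average:)

$$
\frac{72 \cdot 60\ \mathrm{s}}{20\ \mathrm{s/block}} = 216 \text{ blocks}, \qquad
\frac{240 \cdot 60\ \mathrm{s}}{20\ \mathrm{s/block}} = 720 \text{ blocks}
$$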
Sure thing:
@TimSheard suggested we could try increasing the pulse size in this line (e.g. by doubling it), and re-running this experiment. @TimSheard would like to know what the protocol parameters are during this run. We could also re-run the experiment with a GHC build that has info tables built into the base libraries, or using different

Also, it might be useful to make sure we cross epoch boundaries.
FTR, Ledger experiment on avoiding forcing the pulser when serializing a ledger state: IntersectMBO/cardano-ledger#4196
Using PR #1245, I ran two

Every run started from slot 127600033, which happens to be the slot of the snapshot file I had on hand that was very near the end of an epoch. (It also happens to be immediately prior to the June 2024 attack, so I wouldn't advise using this same interval in the future --- those txs make the code unnecessarily slow.)

Regardless of whether it was an all-in-one execution or one execution per snapshot, each snapshot took ~65 to ~75 seconds to write out. There's no starkly obvious pattern in the distribution. Thoughts:
In the original report by John Lotoski above, it took ~2min, so it isn't off by that much.
There is also a chance that creation of a snapshot by a running node can actually trigger GC to run at the same time, which would have a significant impact on snapshot creation time. There is a good chance GC is not triggered in db-analyser, while it would be for a running node, due to lower memory overhead. Naturally, this is just speculation that is not backed by any real data. 😁
I've also embellished my run with stats similar to what Damian did in #868 (comment). My results seem to have a similar order of magnitude to his, but do differ --- perhaps it's due to the machine etc. (mine has 32G RAM, i7-1165G7 @ 2.8 GHz). These numbers are all observed via

(Each ledger snapshot file is ~2.7 gigabytes.)
I did some experimentation with an ad-hoc

Even with UTxO HD soon essentially resulting in the third list bullet above, that's still 14 seconds of GC, "all at once". (And the heap of a real node has a lot more contents, much more than 4.5 gigabytes, so the traversal times are probably noticeably worse than in this bare-bones

I'm starting to suspect we need to also "pulse" the snapshot creation :face-palm:, spreading the seemingly-unavoidable GC work over tens of minutes, in order to avoid starving the threads by (unnecessarily) creating that GC burst all at once. (I don't recall the details of the

Is this token allocation overhead just a known downside of
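As an aside on what "token count" means here: cborg can flatten an `Encoding` into its list of tokens, which is one way (not necessarily the one used for the numbers in this thread) to count how many tokens a given encoder emits:

```haskell
import Codec.CBOR.Encoding (Encoding, encodeListLen, encodeWord)
import Codec.CBOR.FlatTerm (toFlatTerm)

-- Count the CBOR tokens an Encoding produces.
tokenCount :: Encoding -> Int
tokenCount = length . toFlatTerm

example :: Int
example = tokenCount (encodeListLen 2 <> encodeWord 1 <> encodeWord 2)  -- == 3
```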
I had a call with @lehins. We each have an idea; both seem plausible and orthogonal to each other, and moreover orthogonal to UTxO HD.
What about separating the work based on the chunks of the resulting lazy ByteString?
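A minimal sketch of that idea, using plain `System.IO` instead of the node's `HasFS` abstraction (the function name, the delay knob, and the per-chunk pacing are all made up for illustration):

```haskell
import           Control.Concurrent (threadDelay)
import           Codec.CBOR.Encoding (Encoding)
import qualified Codec.CBOR.Write as CBOR (toLazyByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import           System.IO (Handle)

-- Write the encoding one lazy-ByteString chunk at a time, sleeping between
-- chunks so that the encoding (and the GC work it causes) is spread over time
-- rather than happening in one burst.
hPutPaced :: Int      -- ^ delay between chunks, in microseconds
          -> Handle
          -> Encoding
          -> IO ()
hPutPaced delayUs h enc = go (BL.toChunks (CBOR.toLazyByteString enc))
  where
    go []       = pure ()
    go (c : cs) = do
      BS.hPut h c          -- forces and writes one chunk
      threadDelay delayUs  -- give other threads (e.g. the leadership check)
                           -- a chance to run between chunks
      go cs
```

The total duration then scales with the number of chunks, which is exactly the sizing question raised in the next comment.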
Yeah, byte size/chunk count/etc. could replace token count. But the same problem remains: we don't a priori know the total byte size, the total chunk count, etc. If we get it wrong, it could take, e.g., 🤷 8 hours (days?) to write a bigger-than-expected snapshot. For a stable caught-up node, that's probably not a huge problem, but for a syncing node that's essentially all of its progress (although re-validation is cheaper than download and full validation...).
Initial results from @lehins's

Edit: that's ~25% allocation and ~50% token count.

Edit: I can confirm the other components' token counts did not change. I wasn't sure which components to isolate, but the ones I chose do amount to 98.7% of the non-

Edit: the size of the ledger snapshot increases from 2,584,897,053 to 2,722,857,842 bytes. Seems fine, though I was expecting the opposite.
Note that @lehins

We will try to reproduce this once 10.3 is released to see if the problem persists.
Problem description
John Lotoski informed us that currently on Cardano mainnet, adequately resourced nodes (well above minimum specs) are missing lots of leadership checks during ledger snapshots.
Concretely, during every ledger snapshot (performed every `2k seconds = 72min` by default), which takes about ~2min, the node misses ~30 leadership checks with 32GB RAM, and ~100 with 16GB RAM. This means that the node is missing ~0.7-2.3% of its leadership opportunities, and without mitigations, this number will likely grow as the size of the ledger state increases over time.

This problem is not new; it has existed since at least node 8.0.0 (and likely even before).
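(For reference, assuming mainnet parameters k = 2160 and a 1 s slot length, i.e. one leadership check per second, the figures above fit together as:)

$$
2k \cdot 1\ \mathrm{s} = 4320\ \mathrm{s} = 72\ \mathrm{min}, \qquad
\frac{30}{4320} \approx 0.7\%, \qquad \frac{100}{4320} \approx 2.3\%
$$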
Analysis
Various experiments (credit to John Lotoski) indicate that this problem is due to high GC load/long GC pauses while taking a ledger snapshot (the current mainnet ledger state is ~2.6GB serialized). The main reasons for this belief are:

- Using `--nonmoving-gc` fixes the problem for some time (see footnote 1).
- Judging from a 6h log excerpt, both GC time and missed slots increase greatly during a ledger snapshot (GC time comes from `gc_cpu_ns`).
- Changing other aspects of the machine running the node (compute, IOPS) has no effect.
Potential mitigations
Several orthogonal mitigation options have been raised:

- Maybe `--nonmoving-gc` on a more recent GHC is enough (see footnote 1).
- Writing the snapshot in (lazy `ByteString`) chunks?

Note that UTxO HD will also help, but it will likely not be used for some time by block producers (where this issue is actually important).
The goal of this ticket is to interact with other teams/stakeholders to identify the best way forward here.
Footnotes
1. Quoting from John Lotoski: