Decide on mitigation of missed leadership checks due to ledger snapshots #868
Comments
A quick comment.
That may prove to be a useful workaround in the short term, but our separation-of-concerns goal has so far been to avoid this sort of design (anything that changes the node's behavior based on the upcoming leadership schedule).
It seems fair to assume the leadership check thread is starved of CPU while the snapshot is being taken. I think there are two a priori suspects that might be occupying the CPU during that time:

If the GC pauses are greater than a second, then it's entirely possible the leadership check doesn't get CPU time during the elected slot. If we incrementalize the snapshotting work, then perhaps the lower allocation rate will allow for smaller GC pauses, ideally sub-second (i.e. sub-slot).

Generally the RTS scheduler is fair, so I wouldn't expect the snapshotting threads (busy while taking the snapshot) to be able to starve the leadership check thread. But I list them anyway because maybe something about the computation is not playing nice with the RTS's preemption. (Examples include a long-running pure calculation that doesn't allocate, but I don't anticipate that in this particular case. Maybe some "FFI" stuff? I'm unsure how the file is actually written.)

The lack of fairness in STM shouldn't matter here. The leadership check thread only reads the current chain, which shouldn't be changing rapidly enough to incur contention among the STM transactions that depend on it.
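As a side note on that preemption caveat, here is a toy illustration (not node code, just my assumption of the failure mode) of how a pure, non-allocating loop can starve another thread: GHC's scheduler can only switch threads at allocation points, so with `-O` and a single capability such a loop leaves the RTS no safe point to preempt it. Compiling with `-fno-omit-yields` (or making the loop allocate) restores preemption.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever)

-- Tail-recursive countdown; with -O it runs on unboxed Ints and never
-- allocates, so it offers the RTS no safe point to switch threads.
spin :: Int -> Int
spin 0 = 0
spin n = spin (n - 1)

main :: IO ()
main = do
  -- Stand-in for the once-per-second leadership check.
  _ <- forkIO $ forever $ do
         putStrLn "heartbeat"
         threadDelay 1000000
  -- On a single capability (+RTS -N1), the heartbeat thread may not get to
  -- run at all until this several-second loop finishes.
  print (spin (2 * 10 ^ (9 :: Int)))
```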
There are a few possible culprits for the increase in CPU usage, memory usage, and GC time:
@lehins will look into 1 and 2, to see how much work they would require. As for 0, this is something we could do on the Consensus side. @coot suggested we could also use info-table profiling to extract more information.
Regarding 0, we are writing a lazy `ByteString`:

```haskell
writeSnapshot ::
     forall m blk. MonadThrow m
  => SomeHasFS m
  -> (ExtLedgerState blk -> Encoding)
  -> DiskSnapshot
  -> ExtLedgerState blk -> m ()
writeSnapshot (SomeHasFS hasFS) encLedger ss cs = do
    withFile hasFS (snapshotToPath ss) (WriteMode MustBeNew) $ \h ->
      void $ hPut hasFS h $ CBOR.toBuilder (encode cs)
  where
    encode :: ExtLedgerState blk -> Encoding
    encode = encodeSnapshot encLedger

-- | This function makes sure that the whole 'Builder' is written.
--
-- The chunk size of the resulting 'BL.ByteString' determines how much memory
-- will be used while writing to the handle.
hPut :: forall m h
      . (HasCallStack, Monad m)
     => HasFS m h
     -> Handle h
     -> Builder
     -> m Word64
hPut hasFS g = hPutAll hasFS g . BS.toLazyByteString
```

I tried using

The following screenshot shows the heap profile, which is started before taking a snapshot of the ledger state:

The first spike related to encoders appears on the 3rd page of the "Detailed" tab, and takes only 40 MB.
We also observe very low productivity during the run above:
after processing a fixed number of slots. This is an experiment to investigate #868
Processing 20K blocks and storing 3 snapshots corresponding to the last 3 blocks that were applied results in the following heap profile:

The "Detailed" tab seems to show

The branch used to obtain these results can be found here. The profile can be produced by running:

```
cabal run exe:db-analyser -- cardano --config $NODE_HOME/configuration/cardano/mainnet-config.json --db $NODE_HOME/mainnet/db --analyse-from 72316896 --only-immutable-db --store-ledger 72336896 +RTS -hi -s -l-agu
```
@dnadales could you run it once more, with the https://hackage.haskell.org/package/base-4.19.1.0/docs/GHC-Stats.html#v:getRTSStats dump before and after each of the ledger snapshots being written?
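For reference, a minimal sketch of what such a dump could look like (an illustration with a made-up helper name, not actual db-analyser code); it requires running the program with `+RTS -T` (or `-s`) so the stats are collected:

```haskell
import Control.Monad (unless)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Wrap an action (e.g. writing one ledger snapshot) and print the deltas of
-- the GC-related RTS statistics around it.
withRTSStatsDump :: String -> IO a -> IO a
withRTSStatsDump label act = do
    enabled <- getRTSStatsEnabled
    unless enabled $ fail "RTS stats not enabled; run with +RTS -T"
    before <- getRTSStats
    r      <- act
    after  <- getRTSStats
    putStrLn $ unlines
      [ label
      , "  GCs:              " ++ show (gcs after - gcs before)
      , "  major GCs:        " ++ show (major_gcs after - major_gcs before)
      , "  GC CPU time (ms): " ++ show ((gc_cpu_ns after - gc_cpu_ns before) `div` 1000000)
      , "  GC elapsed (ms):  " ++ show ((gc_elapsed_ns after - gc_elapsed_ns before) `div` 1000000)
      , "  allocated (MB):   " ++ show ((allocated_bytes after - allocated_bytes before) `div` 1000000)
      , "  max live (MB):    " ++ show (max_live_bytes after `div` 1000000)
      ]
    pure r
```

Each snapshot write would then be wrapped in something like `withRTSStatsDump "snapshot N" (...)`.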
A thought for a quick-fix partial mitigation to help out users until we eliminate the underlying performance bug: we could take snapshots much less often. The default interval is 72min. The nominal expectation for that duration is 216 new blocks arising. If we increased the interval to 240 minutes, then the expectation would be 720 blocks. The initializing node should still be able to re-process that many blocks in less than a few minutes (my recollection is that deserializing the ledger state snapshot file is far and away the dominant factor in startup time).

^^^ that all assumes we're considering a node that is already caught up. A syncing node would have processed many more blocks than 216 during a 72min interval. But the argument above does suggest it would be relatively harmless for a caught-up node to use an inter-snapshot duration of 240min. (Moreover, I don't immediately see why it couldn't even be 10 times that or more. But that seems like a big change, so I hesitate to bless it without giving it much more thought.)

(The interval is the first
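(For reference, the 216/720 figures follow from mainnet's 1 s slot length and active slot coefficient f = 0.05, i.e. one block expected every 20 s on average:)

$$
\frac{72 \cdot 60\ \mathrm{s}}{20\ \mathrm{s/block}} = 216 \text{ blocks}, \qquad
\frac{240 \cdot 60\ \mathrm{s}}{20\ \mathrm{s/block}} = 720 \text{ blocks}
$$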
Sure thing:
@TimSheard suggested we could try increasing the pulse size in this line (e.g. by doubling it), and re-running this experiment. @TimSheard would like to know what the protocol parameters are during this run. We could also re-run the experiment with a GHC build that has info tables built into the base libraries, or using different

Also, it might be useful to make sure we cross epoch boundaries.
FTR, Ledger experiment on avoiding forcing the pulser when serializing a ledger state: IntersectMBO/cardano-ledger#4196
Using PR #1245, I ran two

Every run started from slot 127600033, which happens to be the slot of the snapshot file I had on hand that was very near the end of an epoch. (It also happens to be immediately prior to the June 2024 attack, so I wouldn't advise using this same interval in the future --- those txs make the code unnecessarily slow.)

Regardless of whether it was an all-in-one execution or one execution per snapshot, each snapshot took ~65 to ~75 seconds to write out. There's no starkly obvious pattern in the distribution. Thoughts:
In the original report by John Lotoski above, it took ~2min, so it isn't off by that much.
There is also a chance that creation of a snapshot by a running node can actually trigger GC to run at the same time, which would have a significant impact on snapshot creation time. There is a good chance GC is not triggered in db-analyser, while it would be for a running node, due to lower memory overhead. Naturally, this is just speculation that is not backed by any real data. 😁
I've also embellished my run with stats similar to what Damian did in #868 (comment). My results seem to have a similar order of magnitude to his, but do differ --- perhaps it's due to the machine etc. (mine has 32G RAM, i7-1165G7 @ 2.8 GHz). These numbers are all observed via

(Each ledger snapshot file is ~2.7 gigabytes.)
I did some experimentation with an ad-hoc

Even with UTxO HD soon essentially resulting in the third list bullet above, that's still 14 seconds of GC, "all at once". (And the heap of a real node has a lot more contents, much more than 4.5 gigabytes, so the traversal times are probably noticeably worse than in this bare-bones

I'm starting to suspect we need to also "pulse" the snapshot creation :face-palm:, spreading the seemingly-unavoidable GC work over tens of minutes, in order to avoid starving the threads by (unnecessarily) creating that GC burst all at once. (I don't recall the details of the

Is this token allocation overhead just a known downside of
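As an aside on what "token count" means here: cborg can flatten an `Encoding` into its list of tokens, which is one way (not necessarily the one used for the numbers in this thread) to count how many tokens a given encoder emits:

```haskell
import Codec.CBOR.Encoding (Encoding, encodeListLen, encodeWord)
import Codec.CBOR.FlatTerm (toFlatTerm)

-- Count the CBOR tokens an Encoding produces.
tokenCount :: Encoding -> Int
tokenCount = length . toFlatTerm

example :: Int
example = tokenCount (encodeListLen 2 <> encodeWord 1 <> encodeWord 2)  -- == 3
```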
I had a call with @lehins. We each have an idea; both seem plausible and orthogonal to each other, and moreover orthogonal to UTxO HD.
What about separating the work based on the chunks of the resulting lazy ByteString?
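A minimal sketch of that idea, using plain `System.IO` instead of the node's `HasFS` abstraction (the function name, the delay knob, and the per-chunk pacing are all made up for illustration):

```haskell
import           Control.Concurrent (threadDelay)
import           Codec.CBOR.Encoding (Encoding)
import qualified Codec.CBOR.Write as CBOR (toLazyByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import           System.IO (Handle)

-- Write the encoding one lazy-ByteString chunk at a time, sleeping between
-- chunks so that the encoding (and the GC work it causes) is spread over time
-- rather than happening in one burst.
hPutPaced :: Int      -- ^ delay between chunks, in microseconds
          -> Handle
          -> Encoding
          -> IO ()
hPutPaced delayUs h enc = go (BL.toChunks (CBOR.toLazyByteString enc))
  where
    go []       = pure ()
    go (c : cs) = do
      BS.hPut h c          -- forces and writes one chunk
      threadDelay delayUs  -- give other threads (e.g. the leadership check)
                           -- a chance to run between chunks
      go cs
```

The total duration then scales with the number of chunks, which is exactly the sizing question raised in the next comment.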
Yeah, byte size/chunk count/etc. could replace token count. But the same problem remains: we don't a priori know the total byte size, the total chunk count, etc. If we get it wrong, it could take, e.g., 🤷 8 hours (days?) to write a bigger-than-expected snapshot. For a stable caught-up node, that's probably not a huge problem, but for a syncing node that's essentially all of its progress (although re-validation is cheaper than download and full validation...).
Initial results from @lehins's

Edit: that's ~25% allocation and ~50% token count.

Edit: I can confirm the other components' token counts did not change. I wasn't sure which components to isolate, but the ones I chose do amount to 98.7% of the non-

Edit: the size of the ledger snapshot increases from 2,584,897,053 to 2,722,857,842 bytes. Seems fine, though I was expecting the opposite.
Note that @lehins

We will try to reproduce this once 10.3 is released to see if the problem persists.
Problem description
John Lotoski informed us that currently on Cardano mainnet, adequately resourced nodes (well above minimum specs) are missing lots of leadership checks during ledger snapshots.
Concretely, during every ledger snapshot (performed every `2k seconds = 72min` by default), which takes about ~2min, the node misses ~30 leadership checks with 32GB RAM, and ~100 with 16GB RAM. This means that the node is missing ~0.7-2.3% of its leadership opportunities, and without mitigations, this number will likely grow as the size of the ledger state increases over time.

This problem is not new; it has existed since at least node 8.0.0 (and likely even before).
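(For reference, assuming mainnet parameters k = 2160 and a 1 s slot length, i.e. one leadership check per second, the figures above fit together as:)

$$
2k \cdot 1\ \mathrm{s} = 4320\ \mathrm{s} = 72\ \mathrm{min}, \qquad
\frac{30}{4320} \approx 0.7\%, \qquad \frac{100}{4320} \approx 2.3\%
$$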
Analysis
Various experiments (credit to John Lotoski) indicate that this problem is due to high GC load/long GC pauses while taking a ledger snapshot (the current mainnet ledger state is ~2.6GB serialized). The main reasons for this belief are:

- Using `--nonmoving-gc` fixes the problem for some time (see footnote 1).
- Judging from a 6h log excerpt, both GC time and missed slots increase greatly during a ledger snapshot (GC time comes from `gc_cpu_ns`).
- Changing other aspects of the machine running the node (compute, IOPS) has no effect.
Potential mitigations
Several orthogonal mitigation options have been raised:

- Maybe `--nonmoving-gc` on a more recent GHC is enough (see footnote 1).
- Writing the snapshot in (lazy `ByteString`) chunks?

Note that UTxO HD will also help, but it will likely not be used for some time by block producers (where this issue is actually important).
The goal of this ticket is to interact with other teams/stakeholders to identify the best way forward here.
Footnotes
1. Quoting from John Lotoski: