Chain DB: addBlock queue ineffective #655
mrBliss referenced this issue in IntersectMBO/ouroboros-network on Nov 2, 2020:
Fixes #2487. Currently, the effective queue size when adding blocks to the ChainDB is 1 (for why, see #2487). In this commit, we let the BlockFetch client add blocks fully asynchronously to the ChainDB, which restores the effective queue size to the configured value again, e.g., 10.

The BlockFetch client will no longer wait until the block has been written to the VolatileDB (and thus also not until the block has been processed by chain selection). The BlockFetch client can just hand over the block and continue downloading with minimum delay. To make this possible, we change the behaviour of `getIsFetched` and `getMaxSlotNo` to account for the blocks in the queue, otherwise the BlockFetch client might try to redownload already-fetched blocks.

This is an alternative to #2489, which let the BlockFetch client write blocks to the VolatileDB synchronously. The problem with that approach is that multiple threads are writing to the VolatileDB, instead of a single background thread. We have relied on the latter to simplify the VolatileDB w.r.t. consistency after incomplete writes.
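A minimal sketch, with simplified made-up types, of what "accounting for the blocks in the queue" in `getIsFetched` and `getMaxSlotNo` could look like; the real ouroboros-consensus code differs, but the idea is that these queries consult both the blocks already on disk and the blocks still sitting in the addBlock queue:

```haskell
import           Control.Concurrent.STM (STM, TVar, readTVar)
import           Data.Map.Strict        (Map)
import qualified Data.Map.Strict        as Map

-- Illustrative stand-ins for the real block types.
newtype HeaderHash = HeaderHash Int deriving (Eq, Ord, Show)
newtype SlotNo     = SlotNo Int     deriving (Eq, Ord, Show)

-- Hypothetical, simplified ChainDB state.
data ChainDbEnv = ChainDbEnv
  { cdbVolatileDb  :: TVar (Map HeaderHash SlotNo)  -- ^ blocks already written to disk
  , cdbBlocksToAdd :: TVar (Map HeaderHash SlotNo)  -- ^ blocks handed over but still queued
  }

-- | A block counts as fetched if it is on disk /or/ still in the queue;
-- otherwise the BlockFetch client might try to redownload it.
getIsFetched :: ChainDbEnv -> STM (HeaderHash -> Bool)
getIsFetched env = do
  onDisk <- readTVar (cdbVolatileDb env)
  queued <- readTVar (cdbBlocksToAdd env)
  pure $ \hash -> Map.member hash onDisk || Map.member hash queued

-- | The maximum slot number must likewise take queued blocks into account.
getMaxSlotNo :: ChainDbEnv -> STM (Maybe SlotNo)
getMaxSlotNo env = do
  onDisk <- readTVar (cdbVolatileDb env)
  queued <- readTVar (cdbBlocksToAdd env)
  pure $ case Map.elems onDisk ++ Map.elems queued of
    []    -> Nothing
    slots -> Just (maximum slots)
```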
@karknu has discovered that the BlockFetch client is able to quickly add the first block of the Shelley era to the ChainDB, but the second one takes 11s!
One might think that the first block was quick to validate and the second slow. That would be counter-intuitive: the first block triggers the era transition, which requires an expensive translation of the ledger state, whereas validating the second block should be quick.
Note that when the BlockFetch client adds a block to the ChainDB, it should only block until the block has been written to disk, not until chain selection has been performed for that block.
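To make this intended behaviour concrete, here is a hedged sketch of the asynchronous addBlock interface; the names (`AddBlockPromise`, `blockWrittenToDisk`, `blockProcessed`, `addBlockAsync`) mirror the idea but are illustrative rather than the exact API:

```haskell
import Control.Concurrent.STM (STM, atomically, check)

-- | The caller gets back two "futures" and decides how long to wait.
data AddBlockPromise blk = AddBlockPromise
  { blockWrittenToDisk :: STM Bool  -- ^ becomes True once the block is in the VolatileDB
  , blockProcessed     :: STM Bool  -- ^ becomes True once chain selection has run for it
  }

-- | What the BlockFetch client is supposed to do: hand over the block,
-- wait only for the disk write, then continue downloading.
addFetchedBlock :: (blk -> IO (AddBlockPromise blk)) -> blk -> IO ()
addFetchedBlock addBlockAsync blk = do
  promise <- addBlockAsync blk
  atomically $ blockWrittenToDisk promise >>= check
```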
With some extra tracing, we see that it is actually chain selection for the first block that is taking long, not validation of the second: adding the second block is blocked because the first block is still being processed.
What's actually going on is the following: blocks added via addBlock go into a queue that is drained by a single background thread, which writes a block to the VolatileDB and then runs chain selection for it before picking up the next block. Since the BlockFetch client waits until its block has been written to disk, it cannot hand over the next block while chain selection for the previous one is still running. This means that the effective overlap or pipelining is limited to 1 block, not the configured 10.
To fix this, there could be a separate queue for each step, i.e., one for writing blocks to disk and one for doing chain selection for blocks.
However, the more queues, the more time is lost on synchronisation and other overhead. The shorter the actual steps (writing to disk, chain selection) take, the larger that overhead is in relative terms. So adding the extra queue is not guaranteed to speed things up in all cases: it likely would here, but for bulk chain sync of mostly empty Byron blocks it might slow things down.
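Purely to illustrate the trade-off (this is not the implemented solution), a "one queue per step" design could look like the following sketch: a write thread and a chain-selection thread connected by bounded queues, where every hand-over point is extra synchronisation.

```haskell
import Control.Concurrent     (forkIO)
import Control.Concurrent.STM (atomically, newTBQueueIO, readTBQueue, writeTBQueue)
import Control.Monad          (forever)

-- | Hypothetical two-stage pipeline: one queue feeding the disk-write thread,
-- one queue feeding the chain-selection thread.
twoStagePipeline
  :: Int               -- ^ queue capacity, e.g. 10
  -> (blk -> IO ())    -- ^ write a block to the VolatileDB
  -> (blk -> IO ())    -- ^ run chain selection for a block
  -> IO (blk -> IO ()) -- ^ the resulting non-blocking addBlock action
twoStagePipeline capacity writeToDisk chainSelect = do
  toWrite  <- newTBQueueIO (fromIntegral capacity)
  toSelect <- newTBQueueIO (fromIntegral capacity)
  _ <- forkIO $ forever $ do
    blk <- atomically (readTBQueue toWrite)
    writeToDisk blk
    atomically (writeTBQueue toSelect blk)
  _ <- forkIO $ forever $ do
    blk <- atomically (readTBQueue toSelect)
    chainSelect blk
  pure $ \blk -> atomically (writeTBQueue toWrite blk)
```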
My plan: in practice we always wait for the block to be written to disk; we never want to just add the block to the queue without any waiting at all. So we can synchronously add the block to the VolatileDB and only then put it on the queue of blocks awaiting chain selection. This would also allow reordering out-of-order blocks in that queue using, e.g., an `OrdPSQ`.