
Fee estimator using QEMU instrumentation: Read I/O bytes seem to be wrong #5678

Closed · jakmeier opened this issue Dec 6, 2021 · 3 comments

jakmeier commented Dec 6, 2021

When running the fee estimator, I always get 0 IO read bytes, even for StorageReadBase, which I would assume should incur read IO.

To Reproduce

cargo run --release -p runtime-params-estimator --features required -- --metrics-to-measure WasmInstruction,StorageReadBase --warmup-iters 1 --docker --full

Output then shows 0r, which is the counter for read IO bytes.

StorageReadBase                                     26_843_362_584 gas [            214665i 0r 1w] UNCERTAIN (computed in 127.336967362s)

Expected behavior
There should be a positive number for the read IO bytes counter.

Version
After merge of #5677

Additional context
Additional note: I found that without warming up for at least one iteration (--warmup-iters 1), the StorageReadBase estimation crashes: in the first round, the noop_function_call_cost in runtime/runtime-params-estimator/src/lib.rs:506 returns 5 bytes of read IO, which causes total_cost - base_cost on line 510 to underflow.

With the warm-up this is fine, because later calls have no read cost.
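
For illustration, here is a minimal Rust sketch of that underflow, under the simplified shape of the computation described above (the names `total_cost` and `base_cost` mirror the issue text and are not copied from the estimator source):

```rust
/// Minimal sketch, not the actual estimator code: the counters are
/// unsigned, so if the noop base measurement happens to include extra
/// read IO (e.g. the 5 bytes seen on a cold first call), the plain
/// subtraction `total_cost - base_cost` underflows.
fn io_read_delta(total_cost: u64, base_cost: u64) -> u64 {
    // `checked_sub` makes the failure explicit instead of panicking in
    // debug builds or wrapping in release builds; warming up for one
    // iteration avoids the situation entirely, because the cold-cache
    // read happens before measurement starts.
    total_cost
        .checked_sub(base_cost)
        .expect("base cost exceeds total cost; try --warmup-iters 1")
}
```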

jakmeier added the labels T-contract-runtime (Team: issues relevant to the contract runtime team) and A-params-estimator (Area: runtime params estimator) on Dec 6, 2021
jakmeier self-assigned this on Dec 6, 2021
@jakmeier

Ok, after deeper analysis, I think the counters are fine. It was my understanding of what we measure that was wrong.

TL;DR: Having zero IO in StorageReadBase makes sense. Where we should see IO going up is StorageReadKeyByte, and that is indeed the case, so all is good.

WasmInstruction                                            274_244 gas [        2.19i 0.00r   0.00w]           (computed in 6.88517744s)
StorageReadBase                                     25_364_624_002 gas [   202840.87i 0.00r   1.62w] UNCERTAIN (computed in 124.642912027s)
StorageReadKeyByte                                      49_161_581 gas [      126.08i 9.90r   0.00w] UNCERTAIN (computed in 429.069593867s)
StorageReadValueByte                                    36_910_768 gas [       58.13i 8.67r   0.07w] UNCERTAIN (computed in 237.759838672s)
StorageWriteBase                                    16_596_069_188 gas [   126531.64i 0.00r 132.70w] UNCERTAIN (computed in 101.649193485s)
StorageWriteKeyByte                                     44_034_187 gas [      217.33i 3.50r   0.86w] UNCERTAIN (computed in 289.089459917s)
StorageWriteValueByte                                   29_183_958 gas [      114.05i 2.93r   0.86w] UNCERTAIN (computed in 174.012487115s)

More detailed findings

Basically, my (wrong) assumption was that every byte read from storage should correspond to some read-IO bytes, and that measurements of the two should increase hand in hand. But the relationship is much more delicate than that: it depends heavily on the storage internals, i.e. RocksDB. (A good resource I found on these internals is the high-level architecture overview in the RocksDB wiki.)

At a high level, I believe that in our setup most requests to RocksDB are handled inside the memtable, which is an in-memory write buffer. Only once the memtable is flushed can we expect IO traffic. Interestingly, the flush involves a series of complex steps, such as data compression and the creation/merging of so-called SST files. This causes a burst of read and write IO, as well as significant CPU load, all at once.

Now, I think it is reasonable to see zero IO bytes in many situations, because the memtable is not flushed during short transactions. And that is probably okay for the estimator. The test for StorageReadBase writes and reads 1000 key-value pairs of 10 bytes each; if the memtable absorbs that without IO, that is just good to know, not fundamentally wrong as I first thought. A standalone sketch of this behaviour follows below.
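
As an illustration, using the `rocksdb` crate directly (not the estimator's storage layer; the path and key format are made up), a workload of this size is fully absorbed by the memtable until a flush happens:

```rust
use rocksdb::{Options, DB};

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/memtable-demo").unwrap();

    // Mirror the StorageReadBase setup: 1000 key-value pairs with
    // 10-byte values. This is far below the default memtable size,
    // so everything stays in the in-memory write buffer.
    for i in 0..1000u32 {
        db.put(format!("key-{:06}", i).as_bytes(), b"0123456789").unwrap();
    }

    // Reads are served straight from the memtable: no SST file is
    // touched, hence 0 read-IO bytes in the QEMU counters.
    let value = db.get(b"key-000042").unwrap();
    assert_eq!(value.as_deref(), Some(&b"0123456789"[..]));

    // Only a flush (explicit here, size-triggered in practice) turns
    // the buffered writes into SST files and produces bulk IO.
    db.flush().unwrap();
}
```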

But this raises the question: why did I observe write IO even without flushes? This is because of the write-ahead log (WAL), which keeps a record of all non-flushed DB transactions. On top of applying the changes inside the memtable, a copy of each write transaction is always written directly to the WAL, which lives in persistent storage. This allows recovery after, e.g., a power outage, where the contents of the memtable would be lost.
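
Again as a standalone sketch with the `rocksdb` crate (hypothetical paths and keys, not nearcore code), the WAL effect can be seen by toggling it per write:

```rust
use rocksdb::{Options, WriteOptions, DB};

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/wal-demo").unwrap();

    // Default write path: the change lands in the memtable AND is
    // appended to the on-disk WAL, which is where write-IO bytes come
    // from even when nothing is flushed.
    db.put(b"key-with-wal", b"value").unwrap();

    // With the WAL disabled, the write is memtable-only: no write IO,
    // but the value is lost if the process crashes before a flush.
    let mut write_opts = WriteOptions::default();
    write_opts.disable_wal(true);
    db.put_opt(b"key-without-wal", b"value", &write_opts).unwrap();
}
```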

OK, that explains the IO measurements in the base cases (StorageReadBase and StorageWriteBase). Now, if we look at the measurements for StorageReadKeyByte, we are dealing with enough data that the memtable is flushed and later read back from persistent storage. This allows us to estimate an average cost per extra byte read from storage, where part of the cost comes from IO reads (and writes).
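
Conceptually (a simplified sketch, not the estimator's actual fitting code), the per-byte parameter falls out of measuring the total cost at two key sizes and dividing the cost difference by the byte difference:

```rust
/// Simplified per-byte estimate: the fixed base cost cancels out in
/// the difference, leaving only the marginal cost of the extra bytes.
/// (Integer division for brevity; a real fit would use more points.)
fn per_byte_cost(cost_small: u64, bytes_small: u64, cost_large: u64, bytes_large: u64) -> u64 {
    (cost_large - cost_small) / (bytes_large - bytes_small)
}
```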

In conclusion, if I had thought a bit more about what exactly we are measuring, the numbers would have made more sense. At least it was a good exercise for me to learn how our DB works and how to debug the QEMU plugin.

matklad commented Dec 10, 2021

Excellent analysis! One thing I am not entirely sure about is this:

> The test for StorageReadBase writes and reads 1000 key-value pairs of 10 bytes each; if the memtable absorbs that without IO, that is just good to know, not fundamentally wrong as I first thought.

It might be possible to read something that's not in the memtable. Like, what if we spread out the writes in the benchmark some more?

@jakmeier

> It might be possible to read something that's not in the memtable. Like, what if we spread out the writes in the benchmark some more?

Indeed we can!

Simply increasing the setup write loop counter from 1000 to 100_000, while keeping the reads the same, gives me this result:

StorageReadBase    986_735_398_549 gas [1242321.91i 246345.52r 4.94w] UNCERTAIN (computed in 5715.233569764s)

The amount of read IO just exploded! Somewhat expectedly, it also massively increased the time needed to compute the results. I also tried 10_000 iterations, but that still showed 0 reads. I guess I can do a binary search to find the sweet spot. :-)
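
The binary search itself would be straightforward; here is a sketch, where the closure is a hypothetical stand-in for running one estimator round with `n` setup writes and returning the measured read-IO bytes:

```rust
/// Find the smallest setup-write count that triggers read IO, assuming
/// `measure(lo) == 0` and `measure(hi) > 0` (e.g. lo = 10_000 and
/// hi = 100_000 from the runs above) and roughly monotonic behaviour.
fn find_flush_threshold(mut lo: u64, mut hi: u64, measure: impl Fn(u64) -> u64) -> u64 {
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if measure(mid) > 0 {
            hi = mid; // read IO observed: threshold is at or below mid
        } else {
            lo = mid; // still memtable-only: threshold is above mid
        }
    }
    hi
}
```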

More importantly, the amount of read IO seems way too high. I am not sure how to interpret this huge increase; I suspect it could be related to the observations investigated in #4771. But I will have to look at the details after the weekend.

And then there is also a more general discussion to be had here. I am not sure if we want to force a read from disk for this specific parameter. On the one hand, security-wise it sounds very reasonable to estimate for the worst case, i.e. no data in memory whatsoever.
On the other hand, this would make the StorageReadBase parameter very high, and it has to be paid for every lookup. It would essentially include the cost of loading a full block of data from persistent storage into main memory, which means a lot of hard-to-predict IO, plus the CPU cost of decompressing a full block, which is probably also hard to predict considering how much compression ratios depend on the content.

What I am thinking is that in a perfect world, we would only charge the user for loading a block when it actually happens. For example, it looks like we have 16 kB blocks right now. So if a contract can keep its data within 16 kB, then no matter how often it reads from that data, it should pay the large overhead of loading the block into memory only once. But it is probably not that easy to implement; a rough sketch of the idea follows below.
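
As a rough sketch of that idea (purely hypothetical, with made-up names and costs, not a nearcore API), the runtime could track which blocks a transaction has already paid to load:

```rust
use std::collections::HashSet;

/// Hypothetical per-transaction accounting: the expensive block-load
/// cost is charged only the first time a 16 kB block is touched;
/// later reads within the same block pay only the cheap base cost.
struct BlockChargedReads {
    loaded_blocks: HashSet<u64>,
}

impl BlockChargedReads {
    fn charge_read(&mut self, block_id: u64, base_cost: u64, block_load_cost: u64) -> u64 {
        if self.loaded_blocks.insert(block_id) {
            // First touch: pay for the IO and decompression of the block.
            base_cost + block_load_cost
        } else {
            // Block already in memory for this transaction.
            base_cost
        }
    }
}
```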
