
Performance implications of buffer initialization upon engine::EndStep #2532

Closed
franzpoeschel opened this issue Dec 3, 2020 · 17 comments

@franzpoeschel
Contributor

Describe the bug
Discussed yesterday with @dmitry-ganyushin and @pnorbert: ADIOS2 will (for memory safety reasons) initialize its internal serialization buffers with zeros. I noticed that this can have noticeable performance implications.

One instance is that upon engine.EndStep(), the BP4Writer will reset the BufferSTL, filling the whole buffer with zeros in preparation for the next step. This means that each process will write gigabytes of zeros to main memory in each step.
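Schematically, the cost looks like the following minimal sketch (illustrative only, not the actual ADIOS2 internals):

```cpp
#include <vector>

// Minimal sketch, not the actual ADIOS2 code: a large serialization buffer
// that is re-zeroed at the end of every step. assign() touches every byte,
// so each EndStep() rewrites the full (multi-gigabyte) buffer with zeros,
// even though the allocation itself is reused across steps.
struct StepBuffer
{
    std::vector<char> data;

    void ResetOnEndStep()
    {
        data.assign(data.size(), 0); // cost scales with buffer size, every step
    }
};
```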

As a benchmark, I ran a small PIConGPU simulation on HZDR's Hemera cluster with 4 GPUs (this is not really a scalability issue anyway), dumping ~0.5TB of data across 10 ADIOS steps. For a second run, I deactivated buffer zeroing upon engine::EndStep. This gave a performance boost of ~6-7 seconds for each call to engine::EndStep on average, down from 78.2 seconds to 71.7 seconds (the rest is mostly actual I/O; in our PIConGPU workflow, serialization happens mostly before EndStep).

To Reproduce
Use any data writer that produces a lot of data per step. I tested it with PIConGPU, which is a large software stack to share; if you prefer a small example, let me know and I'll look into it. Run the data producer twice, the second time linked against a version of ADIOS2 where zero initialization upon EndStep has been deactivated (e.g. by quickly and dirtily commenting out this line). Observe the time difference for the EndStep calls.

Expected behavior
Since memory safety is a reasonable objective, erring on the side of caution is probably fine as a default. I could imagine a "fast serialization mode" provided through an engine parameter, after identifying instances of zero initialization that are not strictly necessary.
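For illustration, such a mode could be toggled through the existing engine-parameter mechanism; this is only a sketch of how it might look from the user side, and the parameter name is purely hypothetical:

```cpp
#include <adios2.h>

int main()
{
    adios2::ADIOS adios;
    adios2::IO io = adios.DeclareIO("Output");
    io.SetEngine("BP4");
    // Hypothetical parameter for a "fast serialization mode" that skips
    // zero-initialization of the internal buffers; not an existing ADIOS2 key.
    io.SetParameters({{"InitializeBuffers", "false"}});

    adios2::Engine writer = io.Open("data.bp", adios2::Mode::Write);
    // ... Put() calls per step ...
    writer.Close();
    return 0;
}
```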

Additional context
PR 2516 already eliminated a major performance drain affecting our use of ADIOS2 in PIConGPU, where roughly half of our IO time was spent in format::BPBase::ResizeBuffer (without ever actually reallocating the buffer, so this time was spent writing zeros to memory).

@williamfgc
Contributor

williamfgc commented Dec 3, 2020

@franzpoeschel thanks for reporting this. One of the motivations for making Buffer an abstract class was to try different options: BufferSTL follows STL memory safety, as you mentioned, with optional zeroing, while something like a BufferMalloc would suit PIConGPU. For the most part, memory allocation is an amortized cost, as std::vector never reduces its capacity, and when writing the same amount in every step the assign call should be optional. I remember adding assign for zeroing the buffer because we got a lot of left-over data from previous steps, which is precisely "wrong" data. Hope this helps.
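A minimal sketch of the two effects mentioned here, independent of the ADIOS2 internals: growth of a std::vector amortizes because its capacity never shrinks, while zeroing via assign() is a full-buffer write on every step:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

int main()
{
    std::vector<char> buf;
    buf.resize(std::size_t(1) << 30);   // first step: one large allocation
    const std::size_t cap = buf.capacity();

    // Later steps of the same size reuse the allocation: std::vector never
    // reduces its capacity, so allocation is an amortized cost.
    buf.clear();
    assert(buf.capacity() == cap);

    // Zeroing is not amortized: assign() rewrites every byte of the buffer
    // on every step, whether or not a reallocation happened.
    buf.assign(cap, 0);
    return 0;
}
```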

@franzpoeschel
Contributor Author

For the most part memory allocation is a amortized cost as std::vector never reduces its capacity

Yes, that's exactly the reason for this issue, since zeroing the whole buffer in every step does not amortize. Some way to make that optional, as you suggested, would hence be great for our performance :)

@williamfgc
Contributor

@franzpoeschel yes, I agree with you. I'm surprised the boolean flag is not exposed to the end user as a parameter option; perhaps I overlooked this. @pnorbert @dmitry-ganyushin, what do you think? The latter would help @franzpoeschel.

@pnorbert pnorbert self-assigned this Dec 7, 2020
@pnorbert
Contributor

pnorbert commented Dec 7, 2020

Thanks @williamfgc for pointing to the already existing solution.

@pnorbert
Contributor

pnorbert commented Dec 7, 2020

@williamfgc Do you think there is some place in adios where not zeroing the data buffer leads to corruption? Or is it only there so that, if a corruption happens due to a bug in adios code, the output data contains a lot of zeros and the corruption is easier to detect? If the former, would zeroing out a small part at the beginning, like 1MB or 1KB, fix even that issue?

@pnorbert
Contributor

pnorbert commented Dec 7, 2020

I am asking because I would prefer a fast default setting and then allow for extra safety if someone asks for it (like the old debugMode flag, but now as an engine parameter).

@williamfgc
Contributor

williamfgc commented Dec 7, 2020

@pnorbert I remember a couple of (non-critical) cases when reading BP3-generated files with adios1 bpls (which is the case you referred to). Also, when exposing the Span writers with uninitialized memory, it could signal false positives during and after the write if we don't fill the entire Shape. Perhaps we should only apply the safety by default when using Span (that was probably what I intended at the time, but I don't completely recall).

Yes, I think that boolean switch was intended for exploring the cost (since it's not a hardcoded default), and @franzpoeschel provided actual numbers for the overhead. CI should reveal other situations where reusing the Buffer without zeroing it causes problems, but those are most likely bugs. Hope this helps.
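For context, the Span case looks roughly like this (a sketch assuming the C++11 bindings' Span overload of Engine::Put; variable names and sizes are made up). With uninitialized buffers, any part of the Shape the application never fills contains stale bytes from a previous step instead of zeros:

```cpp
#include <adios2.h>
#include <cstddef>

int main()
{
    adios2::ADIOS adios;
    adios2::IO io = adios.DeclareIO("SpanExample");
    adios2::Engine writer = io.Open("span.bp", adios2::Mode::Write);

    // Illustrative 1D variable; name and sizes are made up for this sketch.
    auto var = io.DefineVariable<double>("values", {1000}, {0}, {1000});

    writer.BeginStep();
    // The Span points directly into the engine's serialization buffer.
    auto span = writer.Put(var);
    for (std::size_t i = 0; i < 500; ++i)
    {
        span[i] = static_cast<double>(i); // only half of the Shape is filled
    }
    // The remaining 500 values are whatever the buffer already contained:
    // zeros if the buffer was zero-initialized, stale data otherwise.
    writer.EndStep();
    writer.Close();
    return 0;
}
```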

@pnorbert
Contributor

pnorbert commented Dec 7, 2020

It helps, thanks. Another question: why did you use assign() here? According to this link, std::fill would be faster in optimized code and memset would always be the fastest:
https://stackoverflow.com/questions/8848575/fastest-way-to-reset-every-value-of-stdvectorint-to-0
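For reference, the three alternatives being compared, as a minimal sketch:

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

// Three ways to zero an already-sized buffer. assign() also re-establishes
// the size, so it does checks + fill; std::fill and std::memset only write
// bytes. With optimization, compilers typically lower all three to a memset
// for trivially copyable element types.
void zero_assign(std::vector<char> &buf) { buf.assign(buf.size(), 0); }
void zero_fill(std::vector<char> &buf)   { std::fill(buf.begin(), buf.end(), 0); }
void zero_memset(std::vector<char> &buf) { std::memset(buf.data(), 0, buf.size()); }
```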

@pnorbert
Contributor

pnorbert commented Dec 7, 2020

PR #2534

@williamfgc
Contributor

williamfgc commented Dec 7, 2020

@pnorbert that's a good point. When optimized, they all use memset; see the gcc-generated assembly on godbolt. assign can be slower as it does checks + resize + fill, but in this situation I don't think we resize (see below); std::fill or memset should be the same when optimized (that's reported in one of the answers of the stackoverflow link you shared).

@williamfgc
Contributor

Nevermind, there is a test that fails on Windows with Assertion failed: vector subscript out of range. If I recall correctly, MSVC catches writing between size and capacity as an error, so the call to assign might have been resizing when needed.
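A minimal sketch of what that assertion is flagging: elements between size() and capacity() are allocated but do not exist yet, so writing to them is invalid even though the memory is there, which is what assign()'s implicit resizing was covering for:

```cpp
#include <vector>

int main()
{
    std::vector<char> buf;
    buf.reserve(1024);   // capacity is 1024, but size is still 0

    // buf[100] = 42;    // invalid: within capacity but past size();
    //                   // MSVC debug builds report
    //                   // "vector subscript out of range" here

    buf.resize(1024);    // grow the size first, then the write is valid
    buf[100] = 42;
    return 0;
}
```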

@pnorbert
Contributor

@franzpoeschel This is now in master; the data buffer is not zeroed out when using BP4. Can you please test this before we do the release (first week of Jan)? Thanks.

@franzpoeschel
Contributor Author

Thanks for getting to this so fast!
The cluster that I ran this test on is scheduled for maintenance until at least Monday; I'll get back to you as soon as I can run this.

@franzpoeschel
Contributor Author

Since that maintenance has been extended, I replicated the test on another single-GPU machine, dumping ~15GB of data per step. The times in milliseconds per step for the calls to Engine::EndStep:

# on the master branch after the fix
6964
7848
7738
7141
7221
7831
7756
# …which is on average ~7.5 seconds

# on the master branch before the fix
8216
8283
8554
7764
8257
8890
8849
# …which is on average ~8.4 seconds

# with the quick and dirty fix that I described in the beginning of this issue
7040
7287
7665
6874
7075
7892
7997
# …which is on average ~7.4 seconds

So it seems the issue has been resolved for our use case. I still intend to re-run the original benchmark once our cluster is running again, but I expect similar results there as well.

@pnorbert
Contributor

The extra time should come from clearing the first and last 1kB in the data buffer, just in case.
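That bounded cost, as a minimal sketch (the 1 kB constant mirrors the value mentioned above; the actual implementation may differ):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of bounded zeroing: instead of clearing the whole buffer, only a
// small prefix and suffix are zeroed at the end of each step, so the cost no
// longer scales with the buffer size.
void ResetGuardRegions(std::vector<char> &buf)
{
    const std::size_t guard = std::min<std::size_t>(1024, buf.size());
    std::fill_n(buf.begin(), guard, 0);                                  // first 1 kB
    std::fill(buf.end() - static_cast<std::ptrdiff_t>(guard), buf.end(), 0); // last 1 kB
}
```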

@franzpoeschel
Contributor Author

Yep, that's what I suspected. But unlike previously, that's a bounded amount, so I would say that's fine.

@franzpoeschel
Contributor Author

I was able to run the original benchmark again with the current master; I'm at ~71.2 seconds for one Engine::EndStep(), which is quite close to the faster of the two original timings.
