
Performance implications of buffer initialization upon engine::EndStep #2532

Closed
franzpoeschel opened this issue Dec 3, 2020 · 17 comments

@franzpoeschel
Contributor

Describe the bug
Discussed yesterday with @dmitry-ganyushin and @pnorbert: ADIOS2 will (for memory safety reasons) initialize its internal serialization buffers with zeros. I noticed that this can have noticeable performance implications.

One instance is that upon engine.EndStep(), the BP4Writer will reset the BufferSTL, filling the whole buffer with zeros in preparation for the next step. This means that each process will write gigabytes of zeros to main memory in each step.
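Schematically, the cost looks like the following minimal sketch (illustrative only, not the actual ADIOS2 internals):

```cpp
#include <vector>

// Minimal sketch, not the actual ADIOS2 code: a large serialization buffer
// that is re-zeroed at the end of every step. assign() touches every byte,
// so each EndStep() rewrites the full (multi-gigabyte) buffer with zeros,
// even though the allocation itself is reused across steps.
struct StepBuffer
{
    std::vector<char> data;

    void ResetOnEndStep()
    {
        data.assign(data.size(), 0); // cost scales with buffer size, every step
    }
};
```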

As a benchmark, I ran a small PIConGPU simulation on HZDR's Hemera cluster with 4 GPUs (this is not really a scalability issue anyway), dumping ~0.5TB of data across 10 ADIOS steps. For a second run, I deactivated buffer zeroing upon engine::EndStep. This gave a performance boost of ~6-7 seconds for each call to engine::EndStep on average, down from 78.2 seconds to 71.7 seconds (the rest is mostly actual I/O; in our PIConGPU workflow, serialization happens mostly before EndStep).

To Reproduce
Use any data writer that produces a lot of data per step. I tested it with PIConGPU, which is a large software stack to share; if you prefer a small example, let me know and I'll look into it. Run the data producer twice, the second time linked against a version of ADIOS2 where zero initialization upon EndStep has been deactivated (e.g. by quickly and dirtily commenting out this line). Observe the time difference for the EndStep calls.

Expected behavior
Since memory safety is a reasonable objective, erring on the side of caution is probably fine as a default. I could imagine a "fast serialization mode" provided through an engine parameter, after identifying instances of zero initialization that are not strictly necessary.
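For illustration, such a mode could be toggled through the existing engine-parameter mechanism; this is only a sketch of how it might look from the user side, and the parameter name is purely hypothetical:

```cpp
#include <adios2.h>

int main()
{
    adios2::ADIOS adios;
    adios2::IO io = adios.DeclareIO("Output");
    io.SetEngine("BP4");
    // Hypothetical parameter for a "fast serialization mode" that skips
    // zero-initialization of the internal buffers; not an existing ADIOS2 key.
    io.SetParameters({{"InitializeBuffers", "false"}});

    adios2::Engine writer = io.Open("data.bp", adios2::Mode::Write);
    // ... Put() calls per step ...
    writer.Close();
    return 0;
}
```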

Additional context
PR 2516 already eliminated a major performance drain affecting our use of ADIOS2 in PIConGPU, where roughly half of our IO time was spent in format::BPBase::ResizeBuffer (without ever actually reallocating the buffer, so this time was spent writing zeros to memory).

@williamfgc
Contributor

williamfgc commented Dec 3, 2020

@franzpoeschel thanks for reporting this. One of the motivations for making Buffer an abstract class was to try different options: BufferSTL follows STL memory safety, as you mentioned, with optional zeroing, while something like a BufferMalloc would suit PIConGPU. For the most part, memory allocation is an amortized cost, as std::vector never reduces its capacity, and when writing the same amount in every step the assign call should be optional. I remember adding assign for zeroing the buffer because we got a lot of left-over data from previous steps, which is precisely "wrong" data. Hope this helps.
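A minimal sketch of the two effects mentioned here, independent of the ADIOS2 internals: growth of a std::vector amortizes because its capacity never shrinks, while zeroing via assign() is a full-buffer write on every step:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

int main()
{
    std::vector<char> buf;
    buf.resize(std::size_t(1) << 30);   // first step: one large allocation
    const std::size_t cap = buf.capacity();

    // Later steps of the same size reuse the allocation: std::vector never
    // reduces its capacity, so allocation is an amortized cost.
    buf.clear();
    assert(buf.capacity() == cap);

    // Zeroing is not amortized: assign() rewrites every byte of the buffer
    // on every step, whether or not a reallocation happened.
    buf.assign(cap, 0);
    return 0;
}
```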

@franzpoeschel
Contributor Author

For the most part memory allocation is a amortized cost as std::vector never reduces its capacity

Yes, that's exactly the reason for this issue, since zeroing the whole buffer in every step does not amortize. Some way to make that optional, as you suggested, would hence be great for our performance :)

@williamfgc
Contributor

@franzpoeschel yes, I agree with you. I'm surprised the boolean flag is not exposed to the end user as a parameter option; perhaps I overlooked this. @pnorbert @dmitry-ganyushin, what do you think? The latter would help @franzpoeschel.

@pnorbert pnorbert self-assigned this Dec 7, 2020
@pnorbert
Contributor

pnorbert commented Dec 7, 2020

Thanks @williamfgc for pointing to the already existing solution.

@pnorbert
Contributor

pnorbert commented Dec 7, 2020

@williamfgc Do you think there is some place in adios where not zeroing the data buffer leads to corruption? Or is it only there so that, if a corruption happens due to a bug in adios code, the output data contains a lot of zeros and the corruption is easier to detect? If the former, would zeroing out a small part at the beginning, like 1MB or 1KB, fix even that issue?

@pnorbert
Contributor

pnorbert commented Dec 7, 2020

I am asking because I would prefer a fast default setting and then allow for extra safety if someone asks for it (like the old debugMode flag, but now as an engine parameter).

@williamfgc
Contributor

williamfgc commented Dec 7, 2020

@pnorbert I remember a couple of (non-critical) cases when reading BP3-generated files with adios1 bpls (which is the case you referred to). Also, when exposing the Span writers with uninitialized memory, it could signal false positives during and after the write if we don't fill the entire Shape. Perhaps we should only apply the safety by default when using Span (that was probably what I intended at the time, but I don't completely recall).

Yes, I think that boolean switch was intended for exploring the cost (since it's not a hardcoded default), and @franzpoeschel provided actual numbers for the overhead. CI should reveal other situations where reusing the Buffer without zeroing it causes problems, but those are most likely bugs. Hope this helps.
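For context, the Span case looks roughly like this (a sketch assuming the C++11 bindings' Span overload of Engine::Put; variable names and sizes are made up). With uninitialized buffers, any part of the Shape the application never fills contains stale bytes from a previous step instead of zeros:

```cpp
#include <adios2.h>
#include <cstddef>

int main()
{
    adios2::ADIOS adios;
    adios2::IO io = adios.DeclareIO("SpanExample");
    adios2::Engine writer = io.Open("span.bp", adios2::Mode::Write);

    // Illustrative 1D variable; name and sizes are made up for this sketch.
    auto var = io.DefineVariable<double>("values", {1000}, {0}, {1000});

    writer.BeginStep();
    // The Span points directly into the engine's serialization buffer.
    auto span = writer.Put(var);
    for (std::size_t i = 0; i < 500; ++i)
    {
        span[i] = static_cast<double>(i); // only half of the Shape is filled
    }
    // The remaining 500 values are whatever the buffer already contained:
    // zeros if the buffer was zero-initialized, stale data otherwise.
    writer.EndStep();
    writer.Close();
    return 0;
}
```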

@pnorbert
Contributor

pnorbert commented Dec 7, 2020

It helps, thanks. Another question: why did you use assign() here? According to this link, std::fill would be faster in optimized code and memset would always be the fastest:
https://stackoverflow.com/questions/8848575/fastest-way-to-reset-every-value-of-stdvectorint-to-0
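For reference, the three alternatives being compared, as a minimal sketch:

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

// Three ways to zero an already-sized buffer. assign() also re-establishes
// the size, so it does checks + fill; std::fill and std::memset only write
// bytes. With optimization, compilers typically lower all three to a memset
// for trivially copyable element types.
void zero_assign(std::vector<char> &buf) { buf.assign(buf.size(), 0); }
void zero_fill(std::vector<char> &buf)   { std::fill(buf.begin(), buf.end(), 0); }
void zero_memset(std::vector<char> &buf) { std::memset(buf.data(), 0, buf.size()); }
```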

@pnorbert
Contributor

pnorbert commented Dec 7, 2020

PR #2534

@williamfgc
Contributor

williamfgc commented Dec 7, 2020

@pnorbert that's a good point. When optimized, they all use memset; see the gcc-generated assembly on godbolt. assign can be slower as it does checks + resize + fill, but in this situation I don't think we resize (see below); std::fill or memset should be the same when optimized (that's reported in one of the answers of the stackoverflow link you shared).

@williamfgc
Contributor

Nevermind, there is a test that fails on Windows with Assertion failed: vector subscript out of range. If I recall correctly, MSVC catches writing between size and capacity as an error, so the call to assign might have been resizing when needed.
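A minimal sketch of what that assertion is flagging: elements between size() and capacity() are allocated but do not exist yet, so writing to them is invalid even though the memory is there, which is what assign()'s implicit resizing was covering for:

```cpp
#include <vector>

int main()
{
    std::vector<char> buf;
    buf.reserve(1024);   // capacity is 1024, but size is still 0

    // buf[100] = 42;    // invalid: within capacity but past size();
    //                   // MSVC debug builds report
    //                   // "vector subscript out of range" here

    buf.resize(1024);    // grow the size first, then the write is valid
    buf[100] = 42;
    return 0;
}
```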

@pnorbert
Contributor

@franzpoeschel This is now in master; the data buffer is not zeroed out when using BP4. Can you please test this before we do the release (first week of Jan)? Thanks.

@franzpoeschel
Contributor Author

Thanks for getting to this so fast!
The cluster that I ran this test on is scheduled for maintenance until at least Monday; I'll get back to you as soon as I can run this.

@franzpoeschel
Contributor Author

Since that maintenance has been extended, I replicated the test on another single-GPU machine, dumping ~15GB of data per step. The times in milliseconds per step for the calls to Engine::EndStep:

# on the master branch after the fix
6964
7848
7738
7141
7221
7831
7756
# …which is on average ~7.5 seconds

# on the master branch before the fix
8216
8283
8554
7764
8257
8890
8849
# …which is on average ~8.4 seconds

# with the quick and dirty fix that I described in the beginning of this issue
7040
7287
7665
6874
7075
7892
7997
# …which is on average ~7.4 seconds

So it seems the issue has been resolved for our use case. I still intend to re-run the original benchmark once our cluster is running again, but I expect similar results there as well.

@pnorbert
Contributor

The extra time should come from clearing the first and last 1kB in the data buffer, just in case.
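That bounded cost, as a minimal sketch (the 1 kB constant mirrors the value mentioned above; the actual implementation may differ):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of bounded zeroing: instead of clearing the whole buffer, only a
// small prefix and suffix are zeroed at the end of each step, so the cost no
// longer scales with the buffer size.
void ResetGuardRegions(std::vector<char> &buf)
{
    const std::size_t guard = std::min<std::size_t>(1024, buf.size());
    std::fill_n(buf.begin(), guard, 0);                                  // first 1 kB
    std::fill(buf.end() - static_cast<std::ptrdiff_t>(guard), buf.end(), 0); // last 1 kB
}
```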

@franzpoeschel
Contributor Author

Yep, that's what I suspected. But unlike previously, that's a bounded amount, so I would say that's fine.

@franzpoeschel
Contributor Author

I was able to run the original benchmark again with the current master; I'm at ~71.2 seconds for one Engine::EndStep(), which is quite close to the faster of the two original timings.
