-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance implications of buffer initialization upon engine::EndStep #2532
Comments
@franzpoeschel thanks for reporting this. One of the motivations for making Buffer an abstract class is to try different options, |
Yes, that's exactly the reason for this issue, since zeroing the whole buffer in every step does not amortize. Some way to make that optional, as you suggested, would hence be great for our performance :) |
@franzpoeschel yes, I agree with you. I'm surprised the boolean flag is not exposed to the end user as a parameter option, perhaps I overlooked this. @pnorbert @dmitry-ganyushin what do you think? the latter would help @franzpoeschel. |
Thanks @williamfgc for pointing to the already existing solution. |
@williamfgc Do you think there is some place in adios where not zeroing the data buffer leads to corruption? Or is it only there to see a lot of zeros in the output data in case of a corruption happens due to a bug in adios code and to detect the corruption easier? If the former, would zeroing out a small part in the beginning, like 1MB or 1KB would fix even that issue? |
I am asking because I would prefer a fast default settings and then allow for extra safety if someone is asking for it. (Like the old debugMode flag, but now an engine parameter). |
@pnorbert I remember a couple of cases (non-critical), when reading with adios1 bpls those BP3 generated files (which is the case you referred to). Also when exposing the Span writers with uninitialized memory it could signal false positives during write and after write if we don't fill the entire Shape. Perhaps we should only be using safety by default when using Span (it was probably what I intended at that time, but I don't completely recall). Yes, I think that boolean switch had that intention of exploring the cost (since it's not a hardcoded default) as @franzpoeschel provided actual numbers for the overhead. CI must reveal other situations when reusing the Buffer without zeroing it, but they are most likely bugs. Hope this helps. |
It helps, thanks. Another question, why did you use assign() here? According to this link, std::fill would be faster (in optimized code) and memset would always be faster. |
PR #2534 |
@pnorbert that's a good point. When optimized they all use |
Nevermind, there is a test that fails on Windows with |
@franzpoeschel This is now in master, the data buffer is not zeroed out when using BP4. Can you please test this before we do the release (first week of Jan). Thanks. |
Thanks for getting to this so fast! |
Since that maintenance has been extended, I replicated the test on another single-GPU machine, dumping per step ~15GB of data. The times in milliseconds per step for the calls to
So it seems, the issue has been resolved for our use case. I still intend to re-run the original benchmark once our cluster is running again, but I expect similar results for then. |
The extra time should come from clearing the first and last 1kB in the data buffer, just in case. |
Yep, that's what I suspected. But that's a bounded amount unlike previously, so I would say that's fine. |
I was able to run the original benchmark again, with the current master, I'm at ~71.2 seconds for one |
Describe the bug
Discussed yesterday with @dmitry-ganyushin and @pnorbert: ADIOS2 will (for memory safety reasons) initialize its internal serialization buffers with zeros. I noticed that this can have noticeable performance implications.
One instance is that upon
engine.EndStep()
, theBP4Writer
will reset theBufferSTL
, filling the whole buffer with zeros in preparation for the next step. This means that each process will write gigabytes of zeros to main memory in each step.As a benchmark, I ran a little PIConGPU simulation on HZDR's Hemera cluster with 4 GPUs (this is not really an issue of scalability anyway), dumping ~0.5TB of data across 10 ADIOS steps. For a second run, I deactivated buffer zeroing upon engine::EndStep. This gave me a performance boost of ~6-7 seconds for each call to
engine::EndStep
on average, down from 78.2 seconds to 71.7 seconds (the rest is mostly actual I/O, in our workflow in PIConGPU, serialization happens mostly beforeEndStep
)To Reproduce
Use any data writer that produces a lot of data per step. I tested it with PIConGPU which is a large software stack to share, if you prefer a small example, just give me notice and I'll look into it. Run the data producer twice, the second time by linking to a version of ADIOS2 where zero initialization upon
EndStep
has been deactivated (e.g. by quickly and dirtily commenting this line). Observe time differences forEndStep
call.Expected behavior
Since memory safety is a reasonable objective, being safe rather than sorry is probably fine. I can imagine a "fast serialization mode" could be provided through an engine parameter, after identifying instances of zero initialization that might not be strictly necessary.
Additional context
PR 2516 already eliminated a major performance drain affecting our use of ADIOS2 in PIConGPU where roughly half of our IO time was spent in
format::BPBase::ResizeBuffer
(without actually ever reallocating the buffer, so this time was spent writing zeros to memory).The text was updated successfully, but these errors were encountered: