Benchmark IO #606
@youldrouis could you update the pull request so that we can see what you are doing and also test your changes?
You just have to commit to your repository and push to GitHub to do that.
Yes, as soon as I finish the last tests and optimizations today.
IO PERF OPTIMIZATION: ENSIGHT GOLD EXPORTER

Further analysis, through Score-P instrumentation and profiling, identified the bottleneck: the choice and use of the MPI-IO writing procedure. The procedure used was the collective operation with an implicit (shared) file pointer, MPI_File_write_ordered(). Collective operations are generally recommended for IO, but in our code, in more than 60% of the calls only the master rank has something to write. This needlessly multiplied the accesses to the write pointer.

I - A first step consisted in benchmarking the IO with different MPI-IO options at file opening (collective buffering, data sieving, striping factor and striping unit). The results showed an improvement of 10 to 20% when the collective buffering option is enabled. When reading is not necessary, execution is also slightly faster in write-only mode. This was not enough to solve the problem.

II - A second step consisted in refactoring some contiguous write calls where only the master rank had something to write. This confirmed the observations and made the code 10% faster. This kind of optimization is limited, and a lot of time was still wasted.

III - A third step consisted in reconsidering the choice of write operation: individual operations clearly fit the writing algorithm better. The choice was then:

Solution 1 was implemented, with explicit management of the offsets for each process. The resulting writing times are much better:
The modifications were applied in:

Some tests are still needed, especially on multi-timestep cases, which I have not tried yet.
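To illustrate the explicit offset management mentioned in step III, here is a minimal Python sketch. It is not the exporter code from this pull request: it simulates the ranks in a single process, and the names `explicit_offsets`, `write_blocks` and `demo`, as well as the sample data, are hypothetical. The idea it assumes is that each rank's start offset is the exclusive prefix sum of the block sizes of the lower ranks (the value an MPI code would typically obtain with MPI_Exscan and then pass to an individual write such as MPI_File_write_at), and that a rank with nothing to write makes no IO call at all.

```python
import os
import tempfile


def explicit_offsets(block_sizes, base_offset=0):
    """Return (offsets, end): the start offset of each rank's block and the
    position just past the last block.

    This is an exclusive prefix sum over the per-rank sizes, starting at
    base_offset.
    """
    offsets = []
    position = base_offset
    for size in block_sizes:
        offsets.append(position)
        position += size
    return offsets, position


def write_blocks(path, blocks, base_offset=0):
    """Write one block per simulated rank at its explicit offset.

    Returns the new end position, which every rank can compute locally
    instead of consulting a shared file pointer.
    """
    offsets, end = explicit_offsets([len(b) for b in blocks], base_offset)
    with open(path, "r+b") as f:
        for offset, block in zip(offsets, blocks):
            if not block:
                continue  # a rank with nothing to write makes no IO call
            f.seek(offset)
            f.write(block)
    return end


def demo():
    """Two write steps on four simulated ranks (hypothetical data)."""
    step1 = [b"part\n", b"", b"", b""]  # only the master rank writes
    step2 = [b"AA", b"BBB", b"", b"C"]  # rank 2 has nothing to write
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        end = write_blocks(path, step1)
        end = write_blocks(path, step2, base_offset=end)
        with open(path, "rb") as f:
            return f.read(), end
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The resulting file layout is the same as with rank-ordered writes, but in the first step only one simulated rank touches the file, which is the case reported for more than 60% of the calls.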
testing the pull request procedure with Alexandre (no code is included)