Multi-GPU Data Parallelism (with Parallel Data Layers) #2903
Conversation
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand (see the sketch below)
- CHECK if start and stop succeed instead of returning an error
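A minimal sketch, assuming boost::thread semantics similar to Caffe's InternalThread, of how a looping thread can be interrupted before join and exit on demand; the class and method names here are illustrative, not the PR's exact code:

```cpp
#include <boost/shared_ptr.hpp>
#include <boost/thread.hpp>
#include <glog/logging.h>

class LoopingThread {
 public:
  void Start() {
    CHECK(!thread_) << "Thread already started";
    thread_.reset(new boost::thread(&LoopingThread::Entry, this));
    CHECK(thread_) << "Failed to start thread";
  }
  void Stop() {
    if (!thread_) return;
    thread_->interrupt();  // request exit before waiting on join
    thread_->join();
    CHECK(!thread_->joinable()) << "Failed to stop thread";
    thread_.reset();
  }
  // Looping code polls this to exit on demand.
  bool must_stop() { return thread_ && thread_->interruption_requested(); }

 private:
  void Entry() {
    try {
      while (!must_stop()) {
        // ... one unit of work; blocking boost calls also act as
        // interruption points, so a stuck wait is woken by interrupt() ...
      }
    } catch (boost::thread_interrupted&) {
      // Expected when Stop() interrupts a blocking call; exit quietly.
    }
  }
  boost::shared_ptr<boost::thread> thread_;
};
```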
- Make sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetch a configurable amount of data to host memory
- Distribute data to solvers in round-robin way for determinism (sketched after this note)
thanks to discussion by @thatguymike and @flx42
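A rough sketch of the round-robin idea described in that commit note: one reader walks the DB sequentially, prefetches a bounded amount into host memory, and hands items to per-solver queues so each solver sees a disjoint, deterministic subset. The `BoundedQueue` type and function names are hypothetical, not the PR's actual DataReader:

```cpp
#include <deque>
#include <string>
#include <vector>

#include <boost/thread.hpp>

// Bounded producer/consumer queue: Push blocks once 'capacity' items are
// already prefetched, which caps host-memory usage.
template <typename T>
class BoundedQueue {
 public:
  explicit BoundedQueue(size_t capacity) : capacity_(capacity) {}
  void Push(const T& item) {
    boost::mutex::scoped_lock lock(mutex_);
    while (items_.size() >= capacity_) not_full_.wait(lock);
    items_.push_back(item);
    not_empty_.notify_one();
  }
  T Pop() {
    boost::mutex::scoped_lock lock(mutex_);
    while (items_.empty()) not_empty_.wait(lock);
    T item = items_.front();
    items_.pop_front();
    not_full_.notify_one();
    return item;
  }

 private:
  const size_t capacity_;
  std::deque<T> items_;
  boost::mutex mutex_;
  boost::condition_variable not_full_, not_empty_;
};

// Reader side: 'records' stands in for a sequential LMDB/LevelDB cursor.
// Solver k receives records k, k + N, k + 2N, ... for N solvers, so the
// assignment is deterministic across runs.
void DistributeRoundRobin(const std::vector<std::string>& records,
                          std::vector<BoundedQueue<std::string>*>* queues) {
  for (size_t i = 0; i < records.size(); ++i) {
    (*queues)[i % queues->size()]->Push(records[i]);
  }
}
```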
- Parallelize batches among GPUs and tree-reduce the gradients
- The effective batch size scales with the number of devices (illustrated after this list)
- Batch size is multiplied by the number of devices
- Split batches between GPUs, and tree-reduce the gradients
- Detect machine topology (twin-GPU boards, P2P connectivity)
- Track device in syncedmem (thanks @thatguymike)
- Insert a callback in the solver for minimal code change
- Accept list for gpu flag of caffe tool, e.g. '-gpu 0,1' or '-gpu all'. Run on default GPU if no ID given.
- Add multi-GPU solver test
- Deterministic architecture for reproducible runs
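To make the batch-size point concrete (the numbers below are only an illustration, not from this PR): with batch_size: 64 in the training prototxt and -gpu 0,1,2,3, each GPU still processes 64 images per forward/backward pass, but one iteration consumes 4 × 64 = 256 images, so the effective batch size, and any learning-rate or epoch accounting based on it, scales with the device count.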
- Start with distant nodes in broadcast (the reduce/broadcast schedule is sketched below)
- Fix outside loop to loop for full tree depth
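The reduce/broadcast schedule those commit notes describe can be pictured as a binary tree over solver ranks. The snippet below only prints the pairing order (rank 0 as root, outer loop running over the full tree depth, broadcast starting with the most distant ranks); it is an illustration of the idea, not the PR's actual P2PSync code:

```cpp
#include <cstdio>

int main() {
  const int num_solvers = 8;  // e.g. 8 GPUs

  // Reduce: at each level, odd multiples of 'step' send into their even peer.
  for (int step = 1; step < num_solvers; step *= 2) {
    for (int src = step; src < num_solvers; src += 2 * step) {
      std::printf("reduce:    rank %d -> rank %d\n", src, src - step);
    }
  }

  // Broadcast: reverse order, largest step first, so the first transfers go
  // to the most distant ranks; the outer loop covers the full tree depth.
  int top = 1;
  while (top * 2 < num_solvers) top *= 2;
  for (int step = top; step >= 1; step /= 2) {
    for (int dst = step; dst < num_solvers; dst += 2 * step) {
      std::printf("broadcast: rank %d -> rank %d\n", dst - step, dst);
    }
  }
  return 0;
}
```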
Well, tests pass, but training runs seem to hang in the data prefetch queue. Not sure the new data reader code is behaving correctly.
@thatguymike I'll look into this issue shortly and see why training hangs. I expect to do a rebase tonight and test on my data with multiple GPUs.
It's a great thing to get rid of the data reader and unify all data layer types. One thing I'm concerned about in this design, though, is the ordering of threads on the lock. It might not be absolutely required, but if we want runs to be reproducible at the numerical-precision level, each solver needs to take data items in the same order, which I don't believe the lock can enforce as it is. Each run might see items distributed to solvers differently. The gradient sum should be the same, but with slight numerical differences since items would have been added in a different order.
Regarding Michael Houston's concern:
In this PR, only a single DataLayer is shared among all worker solvers. Since data in lmdb/leveldb is read in this DataLayer's prefetch thread rather than in the worker solver threads, the data prefetch behavior doesn't deviate from single GPU.
@cypof I thought about this issue. However, I am not too concerned about it, since in general this PR produces more consistent and numerically identical results than #2870 for all data layers other than leveldb/lmdb. In #2870 you get random behavior if a data layer supports and turns on shuffling, or otherwise effectively get, e.g., a 4X learning rate. In both situations the behavior is clearly worse than this PR and deviates from single-GPU training with an increased batch size. The latter behavior also defeats the purpose of multi-GPU data parallelism.
542e087 to 406448a
Travis CI fails because NVCC generates a warning over boost/thread.hpp included in layer.hpp (see the Travis CI build details).
@shelhamer any suggestions to fix/suppress this warning?
@thatguymike I made some updates, removed the data reader, and successfully trained on MNIST. I am also training on ILSVRC-2012-CLS with this PR. Can you test again? Since data in lmdb/leveldb is read in the DataLayer prefetch thread rather than in the worker solver threads, the data prefetch behavior shouldn't deviate from single GPU.
Seems to work functionally, but scaling performance took a significant hit at 4 GPUs for AlexNet for some reason. Quite a significant slowdown.
@thatguymike I'll look into this today.
@thatguymike To be specific, are you seeing a lot of the following log messages?
How many transform threads are created by the shared data layer?
@cypof There should be only a single prefetch thread in which the transform is performed. Only forward is done from multiple threads (one per solver), serialized via a lock. @thatguymike I'm looking into the drift issue you mentioned.
I am seeing a few notices of the data layer prefetch queue being empty that, in theory, I shouldn't be seeing. I don't see them with #2870 because I'm on fast SSDs and my LMDB should be in the kernel file cache.
@thatguymike Thanks! Just now I used device IDs -gpu 0,2,4,6; the problem is partly solved, but the speedup is still terrible (GoogleNet, quick_solver, mini-batch=64: device_id=0, iter20=9s; device_id=0,2, iter20=12s; device_id=0,2,4,6, iter20=23s). What speedup do you get on the DIGITS DevBox (4 Titan X)? Our server is a Tyan B7079, the GPUs are Titan X, the CPUs are Intel E2650v3 (x2), memory is 32G DDR4 (x24), and the hard disks are all SSDs. It now seems there are still some problems with our server's system BIOS; we have called the manufacturer. Thanks again!
Remember that your effective batch size scales up as well, so your 2-device speedup doesn't look too bad, though it's clearly not great. Note from your P2P bandwidth test results that your server has about half the bandwidth between boards of the DIGITS DevBox, so you are going to be MUCH more communication-bound on scaling than some other systems. I will note that issues with scaling performance and performance stability are exactly why my team designed the DevBox the way we did. You can replicate most of our build from online documents if you wish.

You can try larger batches to see how your performance changes, but something is up with your server. You might want to check the server logs for PCIe errors and definitely check on the system BIOS. You can also systematically try different combinations of devices to see if you can find the fast and slow pairs, and then the fast and slow sets of 4 boards. 8 boards on that machine is not going to perform well with the current code, if ever, because you have to cross the PCIe bridge (especially as one of your links is only 1 GB/s according to your bandwidth test results).

You might also want to validate the scaling you achieve with AlexNet, as there is more published work on that. Also, running Titan X's in a server chassis at that density is likely not going to behave how you want in the long run without careful cooling design (note the modifications we had to make in the DIGITS DevBox to keep 4 Titan X's thermally happy without crazy fan setups).
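For reference, a quick standalone check with the standard CUDA runtime API (not part of this PR) that maps which device pairs can access each other over P2P, which helps when systematically trying device combinations:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      if (i == j) continue;
      int can_access = 0;
      // Reports whether device i can directly access memory on device j.
      cudaDeviceCanAccessPeer(&can_access, i, j);
      std::printf("GPU %d -> GPU %d: P2P %s\n", i, j,
                  can_access ? "available" : "unavailable");
    }
  }
  return 0;
}
```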
Okay, my numbers for GoogleNet with cudnnv3 on the DIGITS DevBox (X99-E WS chipset and 4x Titan X), weak scaling (default behavior of master). My P2P bidirectional perf: Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
@thatguymike Thanks for your suggestions; we have solved the P2P bandwidth problem between GPU IDs 0 and 1. The system BIOS version was too old; after updating it, the P2P bandwidth values look normal.
I tried it and got quite poor scaling:
Are test iterations also distributed across the GPUs?
I want to predict images with caffe-windows, but the result is the same for every different image. I don't know how to do prediction correctly.
Test iterations are run on a single GPU.
Yes, it is single GPU.
@ronghanghu This PR is great!! Any hint on how to modify the code to do testing on multi-GPU as well?
Does this enable multi-GPU detection when executing? |
This is my repackaging of #2870 (and, originally, #2114).
Modification: Allow data layers (and also PythonLayer when used as a data layer) to be shared among worker solvers' training nets, and also test nets as future-proofing in case one wants to do Multi-GPU testing. Data layers are locked during forward to ensure sequential forwards. Now all worker solvers fetch data from one single data layer.
This ensures that single-GPU training is consistent with multi-GPU training, and allows the tests in #2870 to pass. Otherwise, as in #2870 (#2114), multiple data layers are created for the worker solvers, and these data layers are unaware of each other. This can be a serious issue if one uses deterministic data layers or turns off shuffling. In that case, since the data layers in each worker solver read the same data, one eventually gets the same gradient on each solver, so it is almost equivalent to multiplying the learning rate by the number of GPUs. This is definitely not the desired behavior of Multi-GPU data parallelism, since one should train on different subsets of the dataset. Although #2114 provides a DataReader, it only applies to leveldb and lmdb and is hardly extensible to other data layers.
The DataReader is preserved in this PR, and the LMDB/LEVELDB DataLayer is not shared.
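A minimal sketch of the shared, locked data layer idea described above, assuming one layer instance shared by all worker solvers; the class and method names are illustrative rather than the PR's exact code:

```cpp
#include <boost/thread.hpp>

class SharedDataLayer {
 public:
  // Called concurrently from every worker solver's forward pass.
  void Forward(/* const vector<Blob*>& bottom, const vector<Blob*>& top */) {
    boost::mutex::scoped_lock lock(forward_mutex_);
    // Pop one prefetched batch and copy it into this solver's top blobs.
    // Because the lock serializes access, solvers consume disjoint batches
    // instead of all reading the same data.
    // ... load_batch / copy to top ...
  }

 private:
  boost::mutex forward_mutex_;  // serializes forward across worker solvers
};
```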
TODOs

- Remove DataReader. Restore old behavior of DataLayer. (DataReader is kept.)
- make runtest on a multi-GPU machine.

Drawback
Multi-GPU training is numerically non-deterministic on data layers except for the LMDB/LEVELDB DataLayer; see #2903 (comment).