Question on bad perf when concurrent copy on single GPU #237
Hi @Zhaojp-Frank, I would like to know more about your setup before we dive deeper. Some questions are just to make sure that we have already eliminated external factors.
In general, do you think it's abnormal (i.e., not by design)? More info below:
$ lspci -tv | grep -i nvidia
(output omitted)

Diff applied to copylat's main() (hunk bodies omitted):
@@ -276,6 +275,7 @@ int main(int argc, char *argv[])
@@ -290,6 +290,7 @@ int main(int argc, char *argv[])
For size=64KB, if the two processes target two different GPUs, each process reports ~6.4 usec:
CUDA_VISIBLE_DEVICES=0 numactl -l ./copylat -w 10000 -r 0 -s 65536 &
GDRCopy, by design, is for low-latency CPU-GPU communication at small message sizes. It uses the CPU to drive the communication -- as opposed to the GPU's DMA engines.

On your system, write combining (WC) is likely enabled. WC uses the CPU WC buffer to absorb small messages and flushes them out as one large PCIe packet, which helps performance. However, the WC buffer size and how the buffer is shared across cores depend on the CPU.

Putting a process on a far socket can increase the latency, because the transactions need to be forwarded through the CPU-CPU link; that can also cause interference with transactions that originate from the near socket.

I recommend setting the GPU clocks (SM and memory) to max. Otherwise, the GPU internal subsystem may operate at a lower frequency, which delays the response time. Setting the CPU clock to max is also recommended because the CPU is driving the communication. I don't think this is the root cause, however; using the default clocks should not cause the latency to double.
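To illustrate the WC point, here is a minimal sketch (not gdrcopy's actual internals) of a small copy into a WC-mapped BAR region: ordinary stores are absorbed by the CPU's WC buffers, and a store fence drains them so the data goes out as (ideally) fewer, larger PCIe writes. The function name wc_copy_small and the bar_ptr/src parameters are placeholders for illustration.

/* Sketch only: stores to a WC mapping are collected in the CPU's WC
 * buffers; the sfence flushes them out to PCIe. */
#include <string.h>
#include <emmintrin.h>   /* _mm_sfence */

/* bar_ptr: assumed to be a CPU pointer to WC-mapped GPU BAR1 memory,
 * e.g. as returned by gdr_map(); src: ordinary host memory. */
static void wc_copy_small(void *bar_ptr, const void *src, size_t n)
{
    memcpy(bar_ptr, src, n);   /* stores land in the WC buffers */
    _mm_sfence();              /* drain the WC buffers to PCIe */
}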
Thanks for sharing your insight. Indeed, I care about latency rather than BW in this test case. Your comment on WC makes great sense; indeed it's enabled (shown in the map info output). I just want to validate the WC impact on latency, so do you know how to disable the WC effect, e.g. on a specific device range, or by using other AVX instructions (rather than the stream* ones)?
WC mapping is enabled in the gdrdrv driver. You can comment out these lines to disable it (https://github.com/NVIDIA/gdrcopy/blob/master/src/gdrdrv/gdrdrv.c#L1190-L1197). The default on x86 should then be an uncached (UC) mapping. You will probably see higher latency with UC at the sizes you mentioned.
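For context, a hedged sketch of the kind of mapping-attribute choice the linked gdrdrv.c lines make (not the literal driver code): in an mmap handler, the BAR pages can be mapped write-combining or uncached by adjusting vm_page_prot before remapping. The function name example_mmap_bar and its parameters are hypothetical.

/* Sketch only, not gdrdrv's actual code. */
#include <linux/types.h>
#include <linux/mm.h>

static int example_mmap_bar(struct vm_area_struct *vma, unsigned long pfn,
                            unsigned long size, bool use_wc)
{
    if (use_wc)
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
    else
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); /* UC */

    /* Map the GPU BAR pages into the caller's address space. */
    return remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot);
}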
Well, if I comment out the WC enabling, the latency is terrible no matter whether it's a single process or two: 220+ usec. Wondering if there is any other clue to improve concurrent gdr_copy_to_mapping.
Have you already measured the BW? If you are limited by the BW, there is not much we can do. As mentioned, the peak BW GDRCopy can achieve may be lower than the PCIe BW on your system. You may be able to get a bit more performance by playing with the copy algorithm. Depending on the system (CPU, topology, and other factors), changing the algorithm from AVX to something else might help. But I don't expect it to completely solve your problem of the latency doubling when using two processes.
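As a reference for "changing the algorithm", here is a hedged illustration (not gdrcopy's internal copy routines) of a plain AVX copy versus one using non-temporal (streaming) stores. It assumes an AVX-capable x86 CPU, n a multiple of 32 bytes, and a 32-byte-aligned destination for the streaming variant; which variant is faster on a WC mapping depends on the CPU and topology.

/* Illustrative only; function names are placeholders. */
#include <stddef.h>
#include <immintrin.h>

static void copy_avx(void *dst, const void *src, size_t n)
{
    for (size_t i = 0; i < n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)((const char *)src + i));
        _mm256_storeu_si256((__m256i *)((char *)dst + i), v);  /* regular stores */
    }
}

static void copy_avx_stream(void *dst, const void *src, size_t n)
{
    for (size_t i = 0; i < n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)((const char *)src + i));
        _mm256_stream_si256((__m256i *)((char *)dst + i), v);  /* non-temporal stores */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}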
Ok, I'll measure BW as well and post it later |
We observe bad latency when concurrent gdr_copy_to_mapping calls target a single GPU, and want to understand the cause (a known limit?) before we dive in; a minimal sketch of the per-process copy loop is shown after the questions below.
BTW, if the 2 processes target different GPUs, the perf is OK.
Question 1: What is the major cause of such big contention or perf degradation with concurrent gdr_copy_to_mapping? Considering 32KB is not that large, I don't think PCIe bandwidth is saturated.
Question 2: Is there any plan, or is it possible, to optimize concurrent gdr_copy_to_mapping?
Thanks for any feedback.
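For reference, a minimal sketch of the per-process copy loop described above, using the gdrcopy API plus the CUDA driver API. Error handling is reduced to asserts; the 32 KB size and iteration count are placeholders, and the device buffer is assumed to be GPU-page aligned (gdrcopy's samples compute the in-page offset via gdr_get_info).

/* Sketch of one benchmark process; run two instances against the same
 * GPU to reproduce the contention described above. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>
#include <gdrapi.h>

int main(void)
{
    const size_t size = 32 * 1024;  /* 32 KB, as in the report */
    const int iters = 10000;

    CUdevice dev;
    CUcontext ctx;
    assert(cuInit(0) == CUDA_SUCCESS);
    assert(cuDeviceGet(&dev, 0) == CUDA_SUCCESS);
    assert(cuCtxCreate(&ctx, 0, dev) == CUDA_SUCCESS);

    CUdeviceptr d_buf;
    assert(cuMemAlloc(&d_buf, size) == CUDA_SUCCESS);

    gdr_t g = gdr_open();
    assert(g != NULL);

    gdr_mh_t mh;
    assert(gdr_pin_buffer(g, d_buf, size, 0, 0, &mh) == 0);

    void *map_ptr = NULL;
    assert(gdr_map(g, mh, &map_ptr, size) == 0);

    void *src = malloc(size);
    memset(src, 0xab, size);

    /* This is the loop whose latency roughly doubles when a second
     * process runs the same thing against the same GPU. */
    for (int i = 0; i < iters; ++i)
        gdr_copy_to_mapping(mh, map_ptr, src, size);

    gdr_unmap(g, mh, map_ptr, size);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cuMemFree(d_buf);
    free(src);
    return 0;
}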