[QST] Unclear how to saturate PCI for H->D and D->H #451
Comments
I'm not sure I understand what "the repro" is or what the code is trying to do. In general, the performance of the memory accesses and copies has nothing to do with RMM; it only allocates the memory.
The goal is getting RAPIDS to move data at PCIe bus speeds, e.g. to achieve line rate on streaming workloads. The original thread found memory type to be an issue here and explores aspects of that. The reason for x-posting here is that the problem doesn't clearly belong to any one repo and may touch on the allocator for part of the solution (e.g. the variance and the x_pinned perf), and folks here may have insight into the ultimate goal. Current benchmarks reach only ~25%-50% of what the hardware is rated for, with high variance across runs, and pinned memory is all over the place; we don't know why. The original thread experimented with different H<>D allocation types, and I'm guessing there may also be flags needed around alignment, priority, type, and whatever else may be stalling at the HW/OS level. I can't speak to the repro of others. For mine:
To be clear: if you're only getting ~2 GB/s, that's because you're including the time to allocate host memory, which is limited by the host kernel and typically caps things at around 2 - 2.5 GB/s. If you're getting ~8 GB/s, that's a limitation of the system you're on. It seems you're in the cloud, which adds a virtualization layer that typically impacts performance and makes the PCIe topology / traffic a bit of an unknown. John's test in the linked PR was on a bare-metal DGX-1 where he was the only user while running, which means no virtualization and an understood topology (a full 16x PCIe 3.0 link from CPU to GPU). Closing this as there's no action we can take here.
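For illustration, a minimal sketch of that distinction, assuming `numba.cuda` (this is not the benchmark code from the thread): the host buffers are allocated up front, so only the transfer itself is timed.

```python
# Sketch: time only the H->D copy, with host buffers allocated up front,
# so host-allocation cost is excluded from the measured bandwidth.
import time
import numpy as np
from numba import cuda

N_BYTES = 800 * 1024 * 1024  # ~800 MB, matching the sizes discussed in this thread

host_pageable = np.ones(N_BYTES, dtype=np.uint8)
host_pinned = cuda.pinned_array(N_BYTES, dtype=np.uint8)  # page-locked host memory
host_pinned[:] = 1
dev = cuda.device_array(N_BYTES, dtype=np.uint8)          # pre-allocated device buffer

def timed_htod(src, iters=10):
    cuda.synchronize()                    # don't count earlier async work
    t0 = time.perf_counter()
    for _ in range(iters):
        dev.copy_to_device(src)
    cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    return N_BYTES / dt / 1e9             # GB/s

print("pageable H->D:", timed_htod(host_pageable), "GB/s")
print("pinned   H->D:", timed_htod(host_pinned), "GB/s")
```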
@kkraus14 This sounds like NVIDIA thinks Azure V100s can only do 2 GB/s. I'm OK investigating further, such as trying alternate policies, but I'm not sure why the rush to accept top-of-the-line cloud GPUs doing no more than 2 GB/s.
That is not what NVIDIA thinks at all. It's also not what Keith wrote. He wrote that the 2 GB/s case is probably counting host memory allocation time in the throughput measurement. Your next step should probably be to run your test on bare metal, where PCIe is definitely not virtualized. If you can't achieve the same performance between bare metal and Azure, then it's either a configuration difference, a virtualization cost, or something similar.
Sorry, the above left a pretty bad taste in my mouth - this immediately impacts one of our big co-customers trying to prove out speed @ scale on a bare-bones "testing the wheels" setup, so not even the real version - so I'm trying to think through how to be constructive vs. closing without solving. Ideas:

-- Allocation: I'm going to try flipping the hugepages stuff. Anything else to experiment with? (Note that the "setup" step already had some preallocation.)

-- Virtualization: I can try running the benchmark on the host to eliminate docker as a concern, though of course the Azure layer remains. I don't think multitenancy is an issue here; I ran multiple times.

-- PCI expectation: It seems worth identifying the expected rate. The host

After a real effort, it sounds like there's not much obvious to folks on this thread, so it may be time to ping other Nvidia folks or Azure GPU folks. (I'm still fuzzy on whether others have achieved 75%+ utilization with RAPIDS on either of these issue threads.)
@jakirkham experimented with this and it made no material difference in his testing. He also tested with pinned vs unpinned host memory, which had minimal impact as well. The only thing I believe he didn't get around to testing is whether gdrcopy would make a material difference.
Docker doesn't run the hypervisor here; Azure does, and my guess is that the Azure hypervisor virtualizing the GPU + system is what's causing the performance degradation you're seeing. I'm by no means a cloud systems expert, though.
Even if lspci reports a 16x link, it's likely a 16x link through a PLX chip to the CPU, and you don't necessarily have visibility into the other devices on the PLX that could be sharing PCIe bandwidth with your GPU.
Apologies if I came off as impolite in the previous reply, but ultimately this isn't an RMM issue, as we just wrap standard CUDA APIs for these things.
@lmeyerov Here's what I get on my local (local as in the PC that my feet are resting on as I type this) V100:
Suggest you run the same test, as it eliminates RAPIDS as a variable. You can run it on your local footrest PC first, and then run on an equivalent cloud instance with various configuration options, and compare. This CUDA sample is included in the CUDA toolkit. I built it with
Yeah, I'm going to start with investigating PCI tests for a sense of the target. So far I haven't found a clear spec sheet for expected Azure V100 perf; we'll see.
I'd guess you're in a bit of shifting-sands territory here: the hardware doesn't change, but the virtualization layer on top of the GPU / CPU / PCIe bus / memory / etc. changes over time, which can change the performance impact vs. bare metal.
Just to clarify, I think the machine I was testing on used hugepages in all cases (system config). I would expect using hugepages to help. (Though I also don't know much about what Azure is doing with virtualization.)
Hm, in-docker bandwidthTest gives H->D
-- Docker:
-- Hardware + Host: Wasn't able to progress on theoretical + expected, as bandwidthTest / nvcc were being tricky to set up on the current instances.

For next steps, I'm thinking to drill into achieved vs. expected:
(A) revisit the RAPIDS test to double-check the returned GB/s, to understand the gap within docker between Python and bandwidthTest, if any
(B) get expected host numbers from a fresh V100 host-level bandwidthTest, and maybe ping someone at Azure for their expectations, to check for issues at the Azure / docker level
(C) try the hugepages thing as part of (A)
@lmeyerov one other thing to try is not using the "managed" allocator, as UVM can add overhead as well.
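For reference, a sketch of switching off managed memory via RMM's Python API; the keyword arguments below match the 0.14-era `rmm.reinitialize`, so check them against your installed version:

```python
# Sketch: reinitialize RMM with a plain (non-managed) device-memory pool,
# so allocations are ordinary cudaMalloc-backed memory rather than UVM.
import rmm

rmm.reinitialize(
    pool_allocator=True,        # suballocate from a pool instead of per-call cudaMalloc
    managed_memory=False,       # no UVM / cudaMallocManaged
    initial_pool_size=2 << 30,  # e.g. 2 GiB; size this to the workload
)
```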
Ah right. And the performance depends on where the data is initially and where you are trying to copy it / access it. |
Yeah, switching from managed -> default (unmanaged?) goes from ~123 ms -> ~119 ms, so it helps but isn't the main thing. Can you be a bit more concrete on 'where you are trying to access it'? See below for current.

Overall, seems to be

RE: src location, I'm copying the original reference:
Note one oddity:
Can you use
If
=>
I don't know what that copy_from_host maps to, but I guess it must be more than a cudaMemcpy. |
Implementation of
Implementation of
Maybe the bit of control flow of using the buffer protocol and checking error states is slowing things down? We'd need to get into some low-level profiling at this point, but this should all be basically free. Maybe try with larger buffers to amortize more of these costs and see if the numbers improve?
copy_ptr_to_host synchronizes the whole device! So if the benchmark runs in a loop, it is counting device sync. Not sure whether bandwidthTest does the same.
Yeah, as we would run into this issue otherwise.
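For what it's worth, a sketch of a timing loop along those lines: synchronize before starting the timer so device sync from earlier work isn't attributed to the copy, and sweep buffer sizes to amortize fixed per-call overhead. It assumes `rmm.DeviceBuffer` and `numba.cuda` and is not the benchmark code from this thread:

```python
# Sketch: time DeviceBuffer.copy_from_host across buffer sizes, draining the
# device *before* the timer so earlier queued work / device sync isn't counted.
import time
import numpy as np
from numba import cuda
import rmm

host = np.ones(1 << 30, dtype=np.uint8)             # 1 GiB pageable source

for n in (1 << 20, 1 << 24, 1 << 28, 1 << 30):      # 1 MiB .. 1 GiB
    dbuf = rmm.DeviceBuffer(size=n)
    src = host[:n]
    cuda.synchronize()                               # drain anything already queued
    t0 = time.perf_counter()
    dbuf.copy_from_host(src)                         # synchronous H->D copy
    dt = time.perf_counter() - t0
    print(f"{n / 2**20:8.0f} MiB  {n / dt / 1e9:6.2f} GB/s")
```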
I believe we do have the whole device to ourselves, so the pinned write would be the only delta for the GPU
Also interesting:
@lmeyerov no updates in a while; is this still an issue?
The particular project is on pause, so I haven't been pushing. I still suspect we're only at 25%-50% of what the HW can do, but without confirmation from Azure GPU staff on what to expect, I don't see how to make progress.
Let's close, and if you determine that this is still an RMM issue and not an Azure issue, please reopen.
What is your question?
Following up on a cudf-dask experiment on peak IO, we tried to reproduce it in an Azure env and failed to achieve similar speedups. (This is for proving out line-rate stream processing.) Any pointers, and any thoughts on why the repro failed?
Original: rapidsai/dask-cuda#106 ... examples seeing results in us instead of ms ...
I'm having some difficulty reproducing the pinned-memory results for proving out near-peak host->device transfers. This is on an Azure V100 w/ RAPIDS 0.14, indirected via docker. Our ultimate goal is setting something up for a streaming workload where we feed these at network line rate, and we're happy to reuse these buffers, etc. As-is, we're only getting ~2 GB/s on a toy version, and these cards should do ~16 GB/s over PCIe each way, afaict.
Thoughts?
Edit: 96 ms for 800 MB => ~8 GB/s? Though there are still mysteries, as we're not seeing #'s like the ones above, and I think the Azure cards are rated for 16-32 GB/s; can't tell. (V100s, nc6s_v3)
Setup CPU data
Setup CPU pinned buffers
Benchmark CPU -> GPU
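A rough sketch of that three-step flow, assuming `numba.cuda` pinned arrays and `rmm.DeviceBuffer` (not the exact code that was benchmarked):

```python
# Sketch of the three steps above: set up CPU data, set up pinned CPU
# buffers, then time the CPU -> GPU copy into a pre-allocated DeviceBuffer.
import time
import numpy as np
from numba import cuda
import rmm

SIZE = 800 * 1024 * 1024                      # ~800 MB, as in the timings above

# Setup CPU data
data = np.ones(SIZE, dtype=np.uint8)

# Setup CPU pinned buffers
pinned = cuda.pinned_array(SIZE, dtype=np.uint8)
pinned[:] = data

# Benchmark CPU -> GPU
dbuf = rmm.DeviceBuffer(size=SIZE)
cuda.synchronize()
t0 = time.perf_counter()
dbuf.copy_from_host(pinned)
dt = time.perf_counter() - t0
print(f"{SIZE / dt / 1e9:.2f} GB/s H->D (pinned, pre-allocated)")
```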