Fail to access mapped memory from CPU side(Fail data_validation tests) #279
Hi @cxinyic,
|
Hi,
|
GPUDirect requires all components in the path to work correctly. Could you check the following?
|
Hi, thanks a lot for the fast response!
I tried the regular perftest (CPU, not CUDA) and it works with all message sizes. Do you know what might cause this? |
Based on gdrcopy_pplat, small data seems to work fine. Please run
Some perftest applications do not work well with CUDA or may require additional environment variables or parameters. Please run
For network-related questions, I suggest that you ask on the NVIDIA forum or file a bug here. |
|
There seems to be an issue when the size is large. |
Hi, I checked the dmesg:
Are there any parameters I should set that may be relevant to the size? |
GPUDirect does not work properly on your system. Unfortunately, there is no clue that can help us identify the root cause. @drossetti Any suggestions? |
@cxinyic Is IOMMU enabled? Can you turn it off or set it to passthrough? Then, please try again. |
Hi, I have already set it to passthrough, since I am using RDMA with an AMD CPU. RDMA required this setting earlier.
|
Hi there, I did some further tests, given that I can only use GPUDirect RDMA with
Here is my modified function, which passes check 1 with any size:
Does that imply that there is no coherence guarantee if I simply write something into the mapped memory, and that I need to wait for some time (usleep()) to make sure the data has been written? Are there any functions I can call to make sure the data is written? |
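To make the question above concrete, here is a minimal sketch of a "write, fence, read back" check. It assumes the mapping is write-combined on x86, in which case plain stores can sit in the CPU's write-combining buffers until a store fence is issued. The helper names `store_fence` and `write_and_verify` are hypothetical and not part of gdrcopy (the library's own routine for copying into a mapping is `gdr_copy_to_mapping`). The sketch runs against any pointer, so an ordinary host buffer can stand in for the mapped pointer. It is only a diagnostic and would not be expected to fix a setup where GPUDirect itself is not working.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Hypothetical helper: make earlier stores globally visible.
 * On x86-64 this uses SFENCE; elsewhere it falls back to a full fence. */
static inline void store_fence(void)
{
#if defined(__x86_64__)
    _mm_sfence();
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Hypothetical helper: write nwords words from src to dst, fence, then
 * read dst back and return the number of words that do not match. */
static size_t write_and_verify(volatile uint32_t *dst,
                               const uint32_t *src, size_t nwords)
{
    assert(dst != NULL && src != NULL);

    for (size_t i = 0; i < nwords; ++i)
        dst[i] = src[i];

    store_fence(); /* instead of sleeping for an arbitrary time */

    size_t mismatches = 0;
    for (size_t i = 0; i < nwords; ++i)
        if (dst[i] != src[i])
            ++mismatches;
    return mismatches;
}
```

If a check like this still reports mismatches on the real mapping, the delay is probably not a CPU-side buffering effect, which would be consistent with a problem elsewhere in the path.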
Hi @cxinyic, Sorry, I missed your last comment. I don't recommend using GPUDirect RDMA if it is not fully functional. You can easily run into issues. One problem is silent data corruption, which is difficult to debug in many applications. If you want to continue debugging the GPUDirect RDMA issue, I suggest that you file a bug and formally ask for support. |
Hi @pakmarkthub, Thanks so much for your advice. Yes, I found that GPUDirect RDMA can work, but I have not yet checked whether the data is corrupted. I will continue debugging this. If you have any other ideas, please tell me. |
Hi there, I am running the sanity test and I got this error in the data_validation test:
I debugged it myself and found that in the function init_hbuf_walking_bit, the buf_ptr cannot be written. Its content is always 0xffffffff, and the corresponding GPU memory never changes (a5a5a5a5 in this case). There are no other errors during the whole process. All function calls return success.
Here are the settings on my server (bare-metal machine):
OS: Ubuntu 18.04.6 LTS
linux version: 4.15.0-213-generic
GPU: Tesla T4
Driver:
module:
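For readers unfamiliar with this kind of check, below is a minimal sketch of a walking-bit fill and compare. The functions `fill_walking_bit` and `first_mismatch` are hypothetical stand-ins; the actual init_hbuf_walking_bit in the test suite may use a different pattern. The idea is only to show what "the buffer cannot be written" looks like to such a test: the read-back never matches the pattern that was stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a walking-bit initializer:
 * word i has exactly one bit set, at position (i % 32). */
static void fill_walking_bit(volatile uint32_t *buf, size_t nwords)
{
    assert(buf != NULL);
    for (size_t i = 0; i < nwords; ++i)
        buf[i] = UINT32_C(1) << (i % 32);
}

/* Return the index of the first word that does not hold the expected
 * pattern, or nwords if every word matches. */
static size_t first_mismatch(const volatile uint32_t *buf, size_t nwords)
{
    assert(buf != NULL);
    for (size_t i = 0; i < nwords; ++i)
        if (buf[i] != (UINT32_C(1) << (i % 32)))
            return i;
    return nwords;
}
```

With a buffer that always reads back 0xffffffff, as described above, `first_mismatch` would return 0 straight after the fill.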