[BUG] MemoryError: std::bad_alloc: - with workflow.fit on 1 parquet file from Criteo dataset #1181
Comments
Would it be possible to get the full traceback?
Hi Benjamin,
@quasiben Here is the full traceback.
I am using a V100 instead of a T4. The GCP region I am working in is out of T4s right now, but the error is the same.
What are the values for …?
Yes, they have 16GB of RAM.
Sorry for all the questions here, still getting up to speed. Is this data stored on disk or read from gcsfs/another remote store? Rick recently made some improvements for nvtabular in 0.7 (#1119), which should result in better performance. Would it be possible to test with …?
No problem, I am happy to answer all the questions. Sure, I will try that.
@quasiben I am creating my cluster like this:

```python
device_size = device_mem_size()
device_limit = int(device_limit_frac * device_size)
device_pool_size = int(device_pool_frac * device_size)
rmm_pool_size = (device_pool_size // 256) * 256

cluster = LocalCUDACluster(
    device_memory_limit=device_limit,
    rmm_pool_size=rmm_pool_size
)
```

Is there a better way to create my cluster?
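For context, a self-contained version of that cluster setup might look like the sketch below; the fraction values and the `device_mem_size` import location are assumptions for illustration, not details taken from this thread.

```python
# Sketch only: fractions and the helper import path are assumed, not from the issue.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from nvtabular.utils import device_mem_size  # assumed location of this helper in nvtabular 0.7

device_limit_frac = 0.8  # assumed: start spilling device memory at 80% of GPU memory
device_pool_frac = 0.9   # assumed: size the RMM pool at 90% of GPU memory

device_size = device_mem_size()
device_limit = int(device_limit_frac * device_size)
device_pool_size = int(device_pool_frac * device_size)
rmm_pool_size = (device_pool_size // 256) * 256  # RMM pool sizes must be 256-byte aligned

cluster = LocalCUDACluster(
    device_memory_limit=device_limit,  # per-worker device memory threshold before spilling
    rmm_pool_size=rmm_pool_size,
)
client = Client(cluster)
```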
From the traceback, it looks like the new "optimized" code path is already being used. However, the read_parquet call may not be the root problem here (even when some other part of the preprocessing pipeline starts using more memory, the IO call is still the most likely place for a memory error to surface). I will try to reproduce as soon as I can, but it would be interesting to know if commenting out the …
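If the spike really does come from the parquet read, one related knob is the partition size used when the Dataset is created. The snippet below is only a sketch of that idea; the `part_size` value is an illustrative assumption, not a setting discussed in this thread.

```python
import nvtabular as nvt

# Smaller partitions mean each cudf.read_parquet task materializes less data at once.
# The values below are illustrative assumptions, not recommendations from this issue.
ds = nvt.Dataset(
    "/my-path-to/criteo/day_1.parquet",  # placeholder path
    engine="parquet",
    part_size="256MB",        # cap the in-memory size of each partition
    # part_mem_fraction=0.1,  # alternative: size partitions as a fraction of GPU memory
)
```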
Just a quick update, I increased the worker memory to 120GB and it worked.
Another detail is that the script crashes while creating the nvtabular.Dataset. I think this problem might be related to worker memory management, not to the GPU.
Sorry, my job just crashed again with 2 x T4s, 32 vCPUs, and 120GB of RAM.
I tried with 4 x T4 and only 1 Criteo file.
@leiterenato Can you run it through nsys?
@devavret Here is the profile from nsys. I will reboot the VM with a T4 and generate the same report.
This is the execution with 1 x T4.
@leiterenato - Can you share the code you are using to create and fit your workflow, and to define your dataset? Note that I am able to run the following without error:

```python
cluster = LocalCUDACluster(
    n_workers=1,
    device_memory_limit=device_limit,
    rmm_pool_size=rmm_pool_size,
)
client = Client(cluster)

ds = nvt.Dataset("/my-path-to/criteo/day_0.parquet")
workflow = nvt.Workflow(features, client=client)
workflow.fit(ds)
```

However, when I do not create the …
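The `features` object in that snippet is not shown anywhere in the thread. Purely as an assumed illustration of what a Criteo-style workflow definition can look like with NVTabular ops (not the reporter's actual transformation):

```python
import nvtabular as nvt
from nvtabular import ops

# Assumed example only: the real column names and op chain from the issue are not shown.
CONTINUOUS_COLUMNS = [f"I{i}" for i in range(1, 14)]   # Criteo integer features I1-I13
CATEGORICAL_COLUMNS = [f"C{i}" for i in range(1, 27)]  # Criteo categorical features C1-C26

cont_features = CONTINUOUS_COLUMNS >> ops.FillMissing() >> ops.Normalize()
cat_features = CATEGORICAL_COLUMNS >> ops.Categorify()

features = cont_features + cat_features + ["label"]
```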
@rjzamora
No need to apologize! I actually think that NVTabular should be picking up the global client …
@rjzamora Thank you for the clarification.
No - This is probably not necessary. For now, NVTabular will not use the distributed Dask scheduler in fit/transform unless you define your client. Overall, I think the …
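To make that distinction concrete, here is a small sketch of the two modes described above, reusing the assumed `features` definition from earlier; treat it as an illustration of the comment, not as documentation of NVTabular internals.

```python
import nvtabular as nvt
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

ds = nvt.Dataset("/my-path-to/criteo/day_0.parquet")  # placeholder path

# No client passed: fit/transform run with local (non-distributed) Dask scheduling.
workflow_local = nvt.Workflow(features)
workflow_local.fit(ds)

# Client passed: the distributed scheduler and its CUDA workers are used.
cluster = LocalCUDACluster(n_workers=1)
client = Client(cluster)
workflow_dist = nvt.Workflow(features, client=client)
workflow_dist.fit(ds)
```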
Perfect! 100% clear. Thanks again.
Describe the bug
I am trying to run Workflow.fit on 1 parquet file from the Criteo dataset (day_1.parquet).
Here is the transformation:
This code works with nvtabular version 0.5.3, but when I upgraded to version 0.7.0 I started receiving the following error:
MemoryError: std::bad_alloc: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
I am using 1 x T4.
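Since pynvml is part of the environment listed below, a quick check of how much device memory is actually free before fit runs can help confirm the out-of-memory condition; this is an illustrative sketch, not code from the original report.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first (and only) GPU on a 1 x T4 machine
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"total={mem.total / 2**30:.1f} GiB, "
      f"used={mem.used / 2**30:.1f} GiB, "
      f"free={mem.free / 2**30:.1f} GiB")
pynvml.nvmlShutdown()
```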
Steps/Code to reproduce bug
Expected behavior
Workflow.fit should compute the statistics and run without problems on a T4.
Environment details (please complete the following information):
Google Cloud Vertex AI
1 x T4
conda install -c nvidia -c rapidsai -c numba -c conda-forge pynvml dask-cuda nvtabular=0.7.0 cudatoolkit=11.0
Looking at the stack trace, it seems to be a problem in cudf's read_parquet.
I attached a screenshot with some logs.