
torch.OutOfMemoryError: CUDA out of memory while performing peft-curation-with-sdg on default configs #520

Open
mohit5tech opened this issue Feb 5, 2025 · 6 comments
Labels: bug (Something isn't working)

@mohit5tech

Steps/Code to Reproduce Bug
Please provide minimal steps or a code snippet to reproduce the bug.

Using dataset size of 749,000 samples.
Running fine-tuning on allenai/tulu-3-sft-olmo-2-mixture.
Utilizing the latest NVIDIA drivers.

2025-02-05 07:27:31,421 - distributed.worker - ERROR - Compute Failed
Key: ('lambda-619f7ac64f13a38ca6c6546e6af3af28', 10)
State: executing
Task: <Task ('lambda-619f7ac64f13a38ca6c6546e6af3af28', 10) reify(...)>
Exception: "OutOfMemoryError('CUDA out of memory. Tried to allocate 59.96 GiB. GPU 0 has a total capacity of 79.10 GiB of which 17.46 GiB is free. Process 363562 has 61.61 GiB memory in use. Of the allocated memory 60.08 GiB is allocated by PyTorch, and 284.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)')"
Traceback:
  File "/usr/local/lib/python3.10/dist-packages/dask/bag/core.py", line 1875, in reify
    seq = list(seq)
  File "/usr/local/lib/python3.10/dist-packages/dask/bag/core.py", line 2063, in next
    return self.f(*vals)
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/modules/semantic_dedup.py", line 524, in <lambda>
    lambda cluster_id: get_semantic_matches_per_cluster(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/utils/semdedup_utils.py", line 272, in get_semantic_matches_per_cluster
    M, M1 = _semdedup(cluster_reps, "cuda")
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/utils/semdedup_utils.py", line 193, in _semdedup
    triu_sim_mat = torch.triu(pair_w_sim_matrix, diagonal=1)
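For context, the 59.96 GiB request matches a single dense pairwise-similarity matrix: torch.triu materializes a second N x N copy on top of the one PyTorch already holds. A rough back-of-the-envelope check (assuming float32 entries, which is an assumption on my part):

```python
# Rough estimate of the cluster size implied by the OOM message.
# Assumption: pair_w_sim_matrix is float32 (4 bytes per element).
import math

alloc_bytes = 59.96 * 2**30              # "Tried to allocate 59.96 GiB"
n = math.isqrt(int(alloc_bytes // 4))    # N for an N x N float32 matrix
print(n)                                 # ~127,000 embeddings in one cluster
```

So a single k-means cluster appears to contain on the order of 127k documents, and its full similarity matrix cannot fit on one 80 GB GPU.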

#####################

How do I launch this script on multiple GPUs to avoid the CUDA out-of-memory error?
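A minimal sketch of how one might spread the work across several GPUs with Dask-CUDA, and apply the allocator setting suggested in the error message. The GPU count, pool sizes, and limits below are illustrative assumptions, not values from the tutorial:

```python
# Sketch: start a multi-GPU Dask-CUDA cluster before running the curation pipeline.
# Assumptions: 4 GPUs with 80 GB each; the sizes below are illustrative, not tuned defaults.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # hint from the OOM message

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0,1,2,3",   # one Dask worker per GPU
    rmm_pool_size="70GB",             # leave headroom on an 80 GB card
    device_memory_limit="70GB",       # spill to host memory before a hard OOM
)
client = Client(cluster)
# ...attach this client to the curation pipeline, then run it as usual.
```

Note that spreading clusters across workers only helps up to a point: each cluster's similarity matrix still has to fit on a single GPU, so the oversized cluster from the traceback above would fail even on a multi-GPU setup.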

mohit5tech added the bug (Something isn't working) label on Feb 5, 2025
ayushdg (Collaborator) commented Feb 6, 2025

cc: @VibhuJawa , @sarahyurick & @ruchaa-apte in case you have suggestions.

@mohit5tech (Author)

Not able to handle a large dataset:
Dataset saved at /home/jovyan/ddp_2/customize_nemo/tutorials/peft-curation-with-sdg/data/raw/downloads/AceMath-Instruct-Training-Data.json
No synthetic data generation API key provided. Skipping synthetic data generation.
cuDF Spilling is enabled
Running the initial curation pipeline on '/home/jovyan/ddp_2/customize_nemo/tutorials/peft-curation-with-sdg/data/raw/splits/AceMath-Instruct-train.jsonl'...Reading 1 files
2025-02-07 09:11:57,897 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 163.05 GiB -- Worker memory limit: 232.83 GiB
2025-02-07 09:12:18,658 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 186.27 GiB -- Worker memory limit: 232.83 GiB
2025-02-07 09:12:47,532 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:34349 (pid=342998) exceeded 95% memory budget. Restarting...
2025-02-07 09:12:55,665 - distributed.nanny - WARNING - Restarting worker
2025-02-07 09:18:04,016 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 163.10 GiB -- Worker memory limit: 232.83 GiB
2025-02-07 09:18:25,109 - distributed.worker.memory - WARNING - gc.collect() took 1.593s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.
2025-02-07 09:18:25,109 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 186.28 GiB -- Worker memory limit: 232.83 GiB
2025-02-07 09:18:52,931 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:36933 (pid=342986) exceeded 95% memory budget. Restarting...
2025-02-07 09:19:01,082 - distributed.nanny - WARNING - Restarting worker
2025-02-07 09:24:07,853 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 162.99 GiB -- Worker memory limit: 232.83 GiB
2025-02-07 09:24:29,111 - distributed.worker.memory - WARNING - gc.collect() took 1.458s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.
2025-02-07 09:24:29,111 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 186.27 GiB -- Worker memory limit: 232.83 GiB
2025-02-07 09:24:57,429 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:39433 (pid=343409) exceeded 95% memory budget. Restarting...
2025-02-07 09:25:05,560 - distributed.nanny - WARNING - Restarting worker
2025-02-07 09:30:21,883 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 163.10 GiB -- Worker memory limit: 232.83 GiB
2025-02-07 09:30:44,687 - distributed.worker.memory - WARNING - gc.collect() took 1.607s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.
2025-02-07 09:30:44,687 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 186.32 GiB -- Worker memory limit: 232.83 GiB
2025-02-07 09:31:14,930 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:41965 (pid=343655) exceeded 95% memory budget. Restarting...
2025-02-07 09:31:22,794 - distributed.scheduler - ERROR - Task ('read_single_partition-fused-toparquetdata-7079302cfbb64da02a1ac7f7d5840216', 0) marked as failed because 4 workers died while trying to run it
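The failing task is a single read_single_partition over one very large JSONL file, so each worker tries to materialize the whole file at once. One workaround is to pre-split the input into smaller shards so the pipeline reads many small partitions instead of one huge one; a sketch using only the standard library (paths and shard size are placeholders):

```python
# Sketch: split a large JSONL file into fixed-size shards before running the pipeline.
import os

src = "data/raw/splits/AceMath-Instruct-train.jsonl"   # placeholder, adjust to your path
out_dir = "data/raw/splits/sharded"
lines_per_shard = 100_000

os.makedirs(out_dir, exist_ok=True)
shard, count, out = 0, 0, None
with open(src) as f:
    for line in f:
        if out is None or count >= lines_per_shard:
            if out:
                out.close()
            out = open(os.path.join(out_dir, f"part-{shard:05d}.jsonl"), "w")
            shard, count = shard + 1, 0
        out.write(line)
        count += 1
if out:
    out.close()
```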

mohit5tech (Author) commented Feb 7, 2025

Can somebody provide a way to handle large datasets for semantic deduplication?

@tikboaHIT

I have encountered this issue too. How large is your dataset?

@mohit5tech (Author)

> I have encountered this issue too. How large is your dataset?

I have tried semantic deduplication on datasets of up to 5 million samples, but it fails once the dataset grows beyond roughly 800k samples.
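Given the traceback above, the limiting factor is the per-cluster similarity matrix, so one mitigation is to raise the k-means cluster count so each cluster stays small. A sketch of the idea; the SemDedupConfig import path and field names are my recollection of NeMo Curator's API and should be checked against the installed version:

```python
# Sketch: configure semantic dedup with more (and therefore smaller) clusters.
# Field names below are assumptions based on NeMo Curator's SemDedupConfig;
# verify them against the version you have installed.
from nemo_curator import SemDedupConfig

config = SemDedupConfig(
    cache_dir="./semdedup_cache",   # placeholder path
    n_clusters=5000,                # more clusters -> smaller N x N matrices per cluster
    embedding_batch_size=64,        # smaller embedding batches also reduce peak GPU memory
)
```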

ayushdg (Collaborator) commented Feb 11, 2025

Hi, thanks for raising this. Could you give some more details about the GPUs and environment?

How many GPUs is this running on? How much memory do you have per GPU?

Based on the thread, it seems like semdedup runs into OOM errors beyond 800k samples. What's the size of the embeddings in this dataset?
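For reference, a quick way to gather the GPU details asked for here (assumes PyTorch is available in the environment running the pipeline):

```python
# Report GPU count and per-GPU memory for the environment running the pipeline.
import torch

print("GPUs visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")
```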
