
ketos test's data loader unnecessarily uses pinned memory #510

Closed
colibrisson opened this issue May 31, 2023 · 0 comments

ketos test performs inference on the CPU, but its data loader uses pinned memory: https://github.com/mittagessen/kraken/blob/773cc00cc07df4b44056512a601f7bffba8f2ada/kraken/ketos/recognition.py#LL464C1-L468C61

This can cause various errors. For example, if a CUDA device is currently being used to train a model, torch will throw an error:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/colibri/mambaforge/envs/kraken_master/bin/ketos:10 in <module>                             │
│                                                                                                  │
│    7                                                                                             │
│    8                                                                                             │
│    9 if __name__ == "__main__":                                                                  │
│ ❱ 10 │   sys.exit(cli())                                                                         │
│   11                                                                                             │
│                                                                                                  │
│ /home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/click/core.py:1130 in    │
│ __call__                                                                                         │
│                                                                                                  │
│ /home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/click/core.py:1055 in    │
│ main                                                                                             │
│                                                                                                  │
│ /home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/click/core.py:1657 in    │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/click/core.py:1404 in    │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/click/core.py:760 in     │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/click/decorators.py:26   │
│ in new_func                                                                                      │
│                                                                                                  │
│ /home/colibri/files/kraken/kraken/ketos/recognition.py:474 in test                               │
│                                                                                                  │
│   471 │   │   │   batches = len(ds_loader)                                                       │
│   472 │   │   │   pred_task = progress.add_task('Evaluating', total=batches, visible=True if n   │
│   473 │   │   │                                                                                  │
│ ❱ 474 │   │   │   for batch in ds_loader:                                                        │
│   475 │   │   │   │   im = batch['image']                                                        │
│   476 │   │   │   │   text = batch['target']                                                     │
│   477 │   │   │   │   lens = batch['seq_lens']                                                   │
│                                                                                                  │
│ /home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/torch/utils/data/dataloa │
│ der.py:628 in __next__                                                                           │
│                                                                                                  │
│    625 │   │   │   if self._sampler_iter is None:                                                │
│    626 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/76750)                   │
│    627 │   │   │   │   self._reset()  # type: ignore[call-arg]                                   │
│ ❱  628 │   │   │   data = self._next_data()                                                      │
│    629 │   │   │   self._num_yielded += 1                                                        │
│    630 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \                          │
│    631 │   │   │   │   │   self._IterableDataset_len_called is not None and \                    │
│                                                                                                  │
│ /home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/torch/utils/data/dataloa │
│ der.py:1333 in _next_data                                                                        │
│                                                                                                  │
│   1330 │   │   │   │   self._task_info[idx] += (data,)                                           │
│   1331 │   │   │   else:                                                                         │
│   1332 │   │   │   │   del self._task_info[idx]                                                  │
│ ❱ 1333 │   │   │   │   return self._process_data(data)                                           │
│   1334 │                                                                                         │
│   1335 │   def _try_put_index(self):                                                             │
│   1336 │   │   assert self._tasks_outstanding < self._prefetch_factor * self._num_workers        │
│                                                                                                  │
│ /home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/torch/utils/data/dataloa │
│ der.py:1359 in _process_data                                                                     │
│                                                                                                  │
│   1356 │   │   self._rcvd_idx += 1                                                               │
│   1357 │   │   self._try_put_index()                                                             │
│   1358 │   │   if isinstance(data, ExceptionWrapper):                                            │
│ ❱ 1359 │   │   │   data.reraise()                                                                │
│   1360 │   │   return data                                                                       │
│   1361 │                                                                                         │
│   1362 │   def _mark_worker_as_unavailable(self, worker_id, shutdown=False):                     │
│                                                                                                  │
│ /home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/torch/_utils.py:543 in   │
│ reraise                                                                                          │
│                                                                                                  │
│   540 │   │   │   # If the exception takes multiple arguments, don't try to                      │
│   541 │   │   │   # instantiate since we don't know how to                                       │
│   542 │   │   │   raise RuntimeError(msg) from None                                              │
│ ❱ 543 │   │   raise exception                                                                    │
│   544                                                                                            │
│   545                                                                                            │
│   546 def _get_available_device_type():                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 32, in do_one_step
    data = pin_memory(data, device)
  File "/home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
    return type(data)({k: pin_memory(sample, device) for k, sample in data.items()})  # type: ignore[call-arg]
  File "/home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in <dictcomp>
    return type(data)({k: pin_memory(sample, device) for k, sample in data.items()})  # type: ignore[call-arg]
  File "/home/colibri/mambaforge/envs/kraken_master/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 53, in pin_memory
    return data.pin_memory(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I think pin_memory should be set to False.
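A minimal sketch of the proposed fix, using a toy `TensorDataset` in place of kraken's evaluation dataset (the actual loader is built in `kraken/ketos/recognition.py`; names here are illustrative only):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the evaluation dataset used by `ketos test`.
ds = TensorDataset(torch.arange(8, dtype=torch.float32).view(4, 2))

# With pin_memory=False the loader returns ordinary pageable host tensors
# and never touches the CUDA allocator, so CPU-only inference cannot fail
# with a CUDA out-of-memory error caused by another process's training run.
loader = DataLoader(ds, batch_size=2, pin_memory=False)

for (batch,) in loader:
    assert not batch.is_pinned()  # batches stay in pageable host memory
```

Pinned (page-locked) memory only pays off when batches are subsequently copied to a GPU with `non_blocking=True`; for pure CPU inference it adds allocation cost and, as the traceback shows, a hard dependency on the CUDA allocator.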
