Multiprocessing bug when using CPU only #177

Closed · sampie opened this issue Apr 27, 2023 · 13 comments · Fixed by #176
Labels: bug


sampie commented Apr 27, 2023

Hi,

I was trying Casanovo with an MGF file. The machine has no GPU.

I am running Casanovo with the following command:
casanovo --mode=denovo --peak_path=small_archive.mgf --output=casanovo_out.txt

The small_archive file is available at: https://node.dy.fi/files/small_archive.mgf

I wonder why this crash happens; are there any configuration parameters I could try?

BR,
Sami

--------------------------------------------------------------------------------------------------------------------------------------------------

distributed_backend=gloo
All distributed processes registered. Starting with 64 processes
----------------------------------------------------------------------------------------------------

Traceback (most recent call last):
  File "/home/sami/anaconda3/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/home/sami/anaconda3/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/casanovo/casanovo.py", line 165, in main
    model_runner.predict(peak_path, model, config, writer)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/casanovo/denovo/model_runner.py", line 46, in predict
    _execute_existing(peak_path, model_filename, config, False, out_writer)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/casanovo/denovo/model_runner.py", line 170, in _execute_existing
    run_trainer(model, loaders.test_dataloader())
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 892, in predict
    return call._call_and_handle_interrupt(
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/home/sami/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/sami/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 57 terminated with the following error:
Traceback (most recent call last):
  File "/home/sami/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 938, in _predict_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1190, in _run_stage
    return self._run_predict()
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1240, in _run_predict
    self.reset_predict_dataloader(self.lightning_module)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1678, in reset_predict_dataloader
    self.num_predict_batches, self.predict_dataloaders = self._data_connector._reset_eval_dataloader(
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 377, in _reset_eval_dataloader
    dataloaders = [self._prepare_dataloader(dl, mode=mode) for dl in dataloaders if dl is not None]
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 377, in <listcomp>
    dataloaders = [self._prepare_dataloader(dl, mode=mode) for dl in dataloaders if dl is not None]
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 283, in _prepare_dataloader
    sampler = self._resolve_sampler(dataloader, shuffle=shuffle, mode=mode)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 300, in _resolve_sampler
    sampler = self._get_distributed_sampler(
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 339, in _get_distributed_sampler
    sampler = cls(dataloader.sampler, **kwargs)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/overrides/distributed.py", line 116, in __init__
    super().__init__(_DatasetSamplerWrapper(sampler), *args, **kwargs)
  File "/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/overrides/distributed.py", line 90, in __init__
    assert self.num_samples >= 1 or self.total_size == 0
AssertionError
bittremieux added the bug label on Apr 27, 2023
bittremieux (Collaborator) commented

Hi Sami, I was able to reproduce the issue when there are fewer spectra than threads while running on CPU only.

Can you try doubling the number of spectra in your file (which will then exceed your 64 cores) to see whether that works?
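
For illustration, a minimal, hypothetical sketch (not Casanovo or Lightning code) of why a file with fewer spectra than DDP processes trips the assertion in the traceback above: an unpadded distributed sampler gives rank r the indices range(r, n_spectra, world_size), so with fewer spectra than processes the higher ranks receive zero samples.

# Illustrative sketch only; the spectrum count and process count are assumptions.
n_spectra = 10   # hypothetical small MGF file
world_size = 64  # one DDP process per CPU core, as in the log above

for rank in range(world_size):
    num_samples = len(range(rank, n_spectra, world_size))
    if num_samples < 1:
        # Violates `assert self.num_samples >= 1 or self.total_size == 0`.
        print(f"rank {rank}: 0 samples -> AssertionError")
        break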


sampie commented Apr 27, 2023

Hi Wout,

I did increase the number of spectra, but unfortunately I got another crash.

The larger archive is available at: https://node.dy.fi/files/medium_archive.mgf

BR,
Sami

/home/sami/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:208: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
  rank_zero_warn(
Predicting: 0it [00:00, ?it/s]Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f8b665ed630>
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fd1ccec1630>
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fa5152c5630>
Traceback (most recent call last):
Traceback (most recent call last):
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fe81115d630>
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f5dbcd51630>
  File "/home/sami/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1510, in __del__
  File "/home/sami/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1510, in __del__
Traceback (most recent call last):
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f5cdfacd630>
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f5d4d381630>
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f84ec9b9630>
Traceback (most recent call last):
Traceback (most recent call last):
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f21aa229630>
  File "/home/sami/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1510, in __del__
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f2f31fb9630>
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f34d2469630>
Traceback (most recent call last):
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f3bf1991630>
Traceback (most recent call last):
  File "/home/sami/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1510, in __del__
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f7ee1b59630>
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb36ae21630>
  File "/home/sami/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1510, in __del__
  File "/home/sami/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1510, in __del__
Traceback (most recent call last):
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fa238fe5630>
Traceback (most recent call last):
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fd548375630>
Traceback (most recent call last):

wfondrie (Collaborator) commented

@bittremieux, #176 may fix this issue, but we'll need new weights to know for sure.

bittremieux (Collaborator) commented

I agree. There have been a few CPU-only related issues in the past, which should be addressed in the next major release of Casanovo. This bug seems to be a new one, though, so we'll have to double-check it once we have the new version.

bittremieux changed the title from "Crash" to "Multiprocessing bug when using CPU only" on Apr 27, 2023

sampie commented Apr 28, 2023

In fact, when I run with this medium archive I get "out of memory" errors from the Linux kernel, after which processes are killed. The machine has 228 GB of RAM and it is still not enough.

Is there a way to reduce Casanovo's memory consumption? I can see there are lots of threads. Is it possible to reduce the number of threads, and if so, would that help reduce memory consumption?

bittremieux (Collaborator) commented

Casanovo by default uses all available cores. Unfortunately I don't have great insight into memory consumption when running on CPU only with many cores, because we haven't really used it like that. I have locally verified that Casanovo uses a few GB per core in such a situation, though, so this could indeed be problematic when running on a PC with many cores.

There currently isn't a direct config option to restrict the number of cores that Casanovo will use, but on Linux you can use taskset to set the CPU affinity when executing Casanovo.

For example, to run Casanovo on the first 8 cores only:

taskset --cpu-list 0-7 casanovo --mode=denovo --peak_path=medium_archive.mgf --output=casanovo_out.txt

Note that this can be very slow, though: the medium_archive file took 1 hour with 8 CPUs versus 14 seconds with a GPU.
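
A further, purely hypothetical sketch (this is not a Casanovo option): inside a single process, the number of intra-op threads PyTorch uses can be capped with torch.set_num_threads. This limits per-process CPU threading, but it does not change how many DDP worker processes PyTorch Lightning spawns.

import torch

# Cap PyTorch's intra-op thread pool for this process (assumed value of 8).
torch.set_num_threads(8)
print(torch.get_num_threads())  # -> 8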

We are already working on changes to make the number of CPUs configurable within Casanovo itself (#176), and this will be included in the next Casanovo release.


sampie commented May 4, 2023

Hi,

With taskset the run did complete without errors. I ran an MGF file with 237828 entries. It seems that Casanovo did not find anything:
https://node.dy.fi/files/casanovo_out.mztab
https://node.dy.fi/files/casanovo_out.log

Could it be that 1) the CPU run does not work, 2) the MGF file is too large, which prevents identifications for some reason, or 3) Casanovo's default model does not cover, for example, the E. coli proteins that are in this MGF file?

bittremieux (Collaborator) commented

I assume it took quite a while for that many spectra on only 8 CPUs? The log doesn't indicate that anything is awry. Did you see the timer progress on the console output?

Casanovo works independently of the species, so the final option is not relevant. I also don't think that file size should be a problem. There are known issues with running on the CPU that we are in the process of fixing, but those result in Casanovo crashing, not in the output being empty.

To diagnose the problem, can you check whether the sample MGF file works correctly or not? And please share both the log file and the console output.


sampie commented May 5, 2023

I think the timer constantly shows "Predicting: 0it [00:00, ?it/s]".

I have now run with the test data. The output file still looks quite empty. The output files can be found at: https://node.dy.fi/files/casanovo/

I managed to get a machine with a GPU; I will try with that next.


sampie commented May 5, 2023

I ran Casanovo (with sample_preprocessed_spectra.mgf) on a new machine that has an NVIDIA GA100 graphics card. The casanovo_out.mztab still did not have any identifications. Can I check somewhere whether Casanovo actually used the GPU?

bittremieux (Collaborator) commented

If the GPU was utilized correctly, that at least indicates that the problem is not due to running on CPU only, but that there is some other issue. However, it is very strange that you're getting empty output on two different systems without doing anything out of the ordinary, and it's unfortunately very hard to diagnose what the problem could be.

Normally there should be an indication that Casanovo used the GPU in the log and on the console output. Additionally, while Casanovo is running, you can use watch nvidia-smi to verify that it is using the GPU and check GPU consumption.
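
As a quick sanity check (independent of Casanovo), you can also verify that PyTorch sees a CUDA device in the same environment that Casanovo runs in:

import torch

# True if a usable CUDA device is visible to PyTorch.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # Name of the first CUDA device, e.g. the GA100 card mentioned above.
    print(torch.cuda.get_device_name(0))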


sampie commented May 10, 2023

The GPU drivers were not properly installed, so Casanovo was running on the CPU. Once the GPU drivers were properly installed, Casanovo ran successfully and returned the results as expected.

bittremieux (Collaborator) commented

Thanks for verifying. We've now merged updates that should hopefully fix previous CPU issues, which will be included in a new release soon.
