reuse of checkpoint file #266
Comments
Hi @qianzhengzong , I think your best option would be to retry training on the colab GPU. Based on the log file, I think what's happening is the error in #251, related to the work in #263. That "SIGKILL: 9" error likely means the machine ran out of memory. If you can, please try to install cellbender (in the notebook) with the following command instead:
pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@7fd0dac8fe5c37e705cdd50fa5767064f8f4b980
(note to self: this is a commit from the [...])
Hopefully these code changes will help the process complete successfully on the colab GPU.
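For readers wondering how to read the `died with <Signals.SIGKILL: 9>` message: Python's `subprocess` module reports a child that was terminated by a signal with a negative return code, so `-9` corresponds to SIGKILL. The sketch below is only an illustration on a POSIX system; the helper name `describe_exit` is made up for this example, and the child process kills itself as a stand-in for whatever killed the real job. SIGKILL on its own does not prove an out-of-memory condition, it is just consistent with one.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a short description."""
    if returncode < 0:
        # On POSIX, a negative return code -N means "terminated by signal N".
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# Hypothetical stand-in for the real job: a child that sends SIGKILL to itself.
child = [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
result = subprocess.run(child)
print(describe_exit(result.returncode))
```

If the real job is killed this way, checking the machine's memory usage while the job runs is a reasonable next step.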
Hi @sjfleming , yes, the below is the log in case you want to improve the memory usage part based on it :)
CalledProcessError Traceback (most recent call last)
3 frames
CalledProcessError: Command 'eval "$(conda shell.bash hook)"
smaller .h5 colab log:
cellbender:remove-background: Saved summary plots as result.pdf
CalledProcessError Traceback (most recent call last)
3 frames
CalledProcessError: Command 'eval "$(conda shell.bash hook)"
Hi @qianzhengzong , thanks for the reply. I didn't realize colab had a CPU memory limit of 13GB. Yes this will be hard to use for large samples...
Okay so your real question was: if you do the training on a colab GPU, can you finish the job on CPU by re-using the checkpoint. Let me try to answer that question!
The answer is, yes, I hope so! But it might not be quite so simple in practice. I use a "workflow hash code" to try to ensure that a checkpoint that's being re-used is appropriate for re-use (because the run parameters and cellbender source code are identical). I don't actually know if this will work appropriately if you run on one machine and then try to run on another machine. It might! I hope it does! Try using the [...]
If it says "workflow hashcode does not match" and starts to re-do the training, then we will have to hack our way around it. The easiest way to hack around it would be the following:
The log file starts with lines that look like this
Then it will show the workflow hash of the checkpoint file when it tries to open the checkpoint:
When you run on CPU, look for the workflow hash. If you manually go in and change those filenames to match the CPU workflow hash ([...]
I hope it "just works" automatically by specifying [...]
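The renaming hack described above could be sketched roughly as follows. This is not part of CellBender: the function name `rehash_checkpoint` and its arguments are hypothetical, and the sketch assumes that the checkpoint is a gzip-compressed tarball whose file names start with the workflow hash (as in `312105cc26_params.pyro` from the log in this issue). Whether the renamed checkpoint actually loads and runs on another machine is not guaranteed.

```python
import tarfile
import tempfile
from pathlib import Path


def rehash_checkpoint(src_tar: str, dst_tar: str,
                      old_hash: str, new_hash: str) -> list[str]:
    """Copy a checkpoint tarball, renaming files that start with old_hash.

    Assumption: file names inside the tarball begin with the workflow hash.
    Returns the sorted list of new file names.
    """
    renamed = []
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        # Unpack the original checkpoint into a scratch directory.
        with tarfile.open(src_tar, "r:gz") as tar:
            tar.extractall(tmp_path)
        # Swap the hash prefix on every matching file.
        for path in sorted(tmp_path.rglob("*")):
            if path.is_file() and path.name.startswith(old_hash):
                new_name = new_hash + path.name[len(old_hash):]
                path.rename(path.with_name(new_name))
                renamed.append(new_name)
        # Pack everything into a new tarball, keeping relative paths.
        with tarfile.open(dst_tar, "w:gz") as tar:
            for path in sorted(tmp_path.rglob("*")):
                if path.is_file():
                    tar.add(path, arcname=path.relative_to(tmp_path).as_posix())
    return sorted(renamed)
```

A call might look like `rehash_checkpoint("ckpt.tar.gz", "ckpt_cpu.tar.gz", old_hash, new_hash)`, where `old_hash` is the prefix found in the checkpoint file names and `new_hash` is the workflow hash printed in the log of the CPU run. Writing to a new file keeps the original checkpoint untouched in case the hack does not work.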
hi @sjfleming , many thanks for the reply. I'll hack around it when I get time, but I think the better choice is to get a GPU machine with high CPU memory to run this tool properly :)
hi @sjfleming , after obtaining the ckpt.tar.gz file in colab using a GPU, and in order to use it on a local machine with CPU only, I followed your advice to manually change the filenames of the checkpoint files. However, it seems a GPU is required to calculate the posterior.
hi great authors,
I ran CellBender on colab. The training part succeeded but output generation failed, so I downloaded the "ckpt.tar.gz" file to generate the output locally. However, it seems the checkpoint file cannot be reused locally, even with the "--checkpoint" and "--fpr" flags specified; it fails with the error "Workflow hash does not match that of checkpoint".
Since I have no GPU on my local machine, any idea how to reuse the checkpoint file? :)
Many thanks!
piece of local log:
/tmp/tmpncpo07v2/312105cc26_params.pyro
/tmp/tmpncpo07v2/312105cc26_train.loaderstate
/tmp/tmpncpo07v2/312105cc26_test.loaderstate
/tmp/tmpncpo07v2/312105cc26_args.npy
cellbender:remove-background: Workflow hash does not match that of checkpoint.
cellbender:remove-background: No checkpoint loaded.
cellbender:remove-background: Running inference...
^Ccellbender:remove-background: Inference procedure stopped by keyboard interrupt... will save a checkpoint.
cellbender:remove-background: Saving a checkpoint...
^Ccellbender:remove-background: Keyboard interrupt: will not save checkpoint
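The extracted file names in the log above share the prefix `312105cc26`, which looks like the hash stored with the checkpoint. As a rough way to inspect a checkpoint before trying to reuse it, the following sketch lists the prefixes found in a tarball. The helper name `checkpoint_prefixes` is hypothetical, and treating the text before the first underscore as the workflow hash is an assumption based only on the file names shown here.

```python
import posixpath
import tarfile


def checkpoint_prefixes(tar_path: str) -> set[str]:
    """Collect the text before the first underscore of each file name.

    Assumption: checkpoint files are named "<hash>_<suffix>".
    """
    prefixes = set()
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            base = posixpath.basename(member.name)
            prefix, sep, _ = base.partition("_")
            if sep:  # skip names that contain no underscore
                prefixes.add(prefix)
    return prefixes
```

Comparing the returned prefix with the workflow hash reported in the local log would show whether the two runs disagree.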
piece of colab log:
cellbender:remove-background: Working on chunk (21/255)
cellbender:remove-background: Working on chunk (22/255)
cellbender:remove-background: Working on chunk (23/255)
cellbender:remove-background: Working on chunk (24/255)
cellbender:remove-background: Working on chunk (25/255)
cellbender:remove-background: Working on chunk (26/255)
cellbender:remove-background: Working on chunk (27/255)
cellbender:remove-background: Working on chunk (28/255)
cellbender:remove-background: Working on chunk (29/255)
cellbender:remove-background: Working on chunk (30/255)
CalledProcessError Traceback (most recent call last)
in <cell line: 1>()
----> 1 get_ipython().run_cell_magic('shell', '', 'eval "$(conda shell.bash hook)"\nconda activate cellbender\npython --version\npip install -q cellbender\ncellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5\n')
3 frames
/usr/local/lib/python3.10/dist-packages/google/colab/_system_commands.py in check_returncode(self)
135 def check_returncode(self):
136 if self.returncode:
--> 137 raise subprocess.CalledProcessError(
138 returncode=self.returncode, cmd=self.args, output=self.output
139 )
CalledProcessError: Command 'eval "$(conda shell.bash hook)"
conda activate cellbender
python --version
pip install -q cellbender
cellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5
' died with <Signals.SIGKILL: 9>.