reuse of checkpoint file #266

Open
qianzhengzong opened this issue Aug 31, 2023 · 6 comments
Assignees
Labels
user question User question about a specific dataset

Comments

@qianzhengzong

Hi great authors,
I ran CellBender on Colab. Training succeeded but output generation failed, so I downloaded the "ckpt.tar.gz" file to generate the output locally. However, the checkpoint cannot be reused locally, even with the "--checkpoint" and "--fpr" flags specified: the run reports "Workflow hash does not match that of checkpoint".
Since I have no GPU on my local machine, is there any way to reuse the checkpoint file? :)
Many thanks!

Excerpt from the local log:
/tmp/tmpncpo07v2/312105cc26_params.pyro
/tmp/tmpncpo07v2/312105cc26_train.loaderstate
/tmp/tmpncpo07v2/312105cc26_test.loaderstate
/tmp/tmpncpo07v2/312105cc26_args.npy
cellbender:remove-background: Workflow hash does not match that of checkpoint.
cellbender:remove-background: No checkpoint loaded.
cellbender:remove-background: Running inference...
^Ccellbender:remove-background: Inference procedure stopped by keyboard interrupt... will save a checkpoint.
cellbender:remove-background: Saving a checkpoint...
^Ccellbender:remove-background: Keyboard interrupt: will not save checkpoint

Excerpt from the Colab log:
cellbender:remove-background: Working on chunk (21/255)
cellbender:remove-background: Working on chunk (22/255)
cellbender:remove-background: Working on chunk (23/255)
cellbender:remove-background: Working on chunk (24/255)
cellbender:remove-background: Working on chunk (25/255)
cellbender:remove-background: Working on chunk (26/255)
cellbender:remove-background: Working on chunk (27/255)
cellbender:remove-background: Working on chunk (28/255)
cellbender:remove-background: Working on chunk (29/255)
cellbender:remove-background: Working on chunk (30/255)

CalledProcessError Traceback (most recent call last)
in <cell line: 1>()
----> 1 get_ipython().run_cell_magic('shell', '', 'eval "$(conda shell.bash hook)"\nconda activate cellbender\npython --version\npip install -q cellbender\ncellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5\n')

3 frames
/usr/local/lib/python3.10/dist-packages/google/colab/_system_commands.py in check_returncode(self)
135 def check_returncode(self):
136 if self.returncode:
--> 137 raise subprocess.CalledProcessError(
138 returncode=self.returncode, cmd=self.args, output=self.output
139 )

CalledProcessError: Command 'eval "$(conda shell.bash hook)"
conda activate cellbender
python --version
pip install -q cellbender
cellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5
' died with <Signals.SIGKILL: 9>.

@sjfleming
Member

Hi @qianzhengzong , I think your best option would be to retry training on the colab GPU. I think what's happening (based on the log file) is this error #251 related to the work here #263 .

That "SIGKILL: 9" error likely means the machine ran out of memory. If you can, please try to install cellbender (in the notebook), with the following command instead of pip install cellbender:

pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@7fd0dac8fe5c37e705cdd50fa5767064f8f4b980

(note to self: this is a commit from the sf_memory_efficient_posterior_generation branch on Aug 30, 2023)

Hopefully these code changes will help the process complete successfully on the colab GPU.
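
As a side note on reading that error: on POSIX systems, Python's subprocess module reports a child killed by a signal as a negative return code, so SIGKILL shows up as -9. The sketch below is only an illustration of that convention; the helper name describe_exit and the self-killing child process are made up for the example, and SIGKILL can have causes other than memory exhaustion.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode > 0:
        return f"exited with error code {returncode}"
    # On POSIX, a negative return code is the negated number of the fatal signal.
    signum = -returncode
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = f"signal {signum}"
    # SIGKILL is what the kernel out-of-memory killer sends, but not only that.
    hint = " (possibly the kernel out-of-memory killer)" if signum == signal.SIGKILL else ""
    return f"killed by {name}{hint}"


if __name__ == "__main__":
    # Hypothetical stand-in for the real command: a child that SIGKILLs itself.
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    result = subprocess.run(child)
    print(describe_exit(result.returncode))
```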

@sjfleming sjfleming self-assigned this Aug 31, 2023
@qianzhengzong
Author

Hi @sjfleming, yes, the sf_memory_efficient_posterior_generation branch got me further on Colab, but the process still gets killed after all chunks have finished.
Since Colab has a limit of about 13 GB of free CPU memory, I tried a smaller .h5 file. The result_posterior.h5 file was produced, but the run still did not complete.
Finally, with a much smaller .h5 file, the run completed successfully.
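
For anyone who wants to check how much CPU memory a Linux machine (such as a Colab VM) actually has free before starting a run, here is a minimal sketch that reads MemAvailable from /proc/meminfo. It is Linux-only, the function name is made up for the example, and it says nothing about how much memory CellBender will need for a given input.

```python
from pathlib import Path


def parse_mem_available_gib(meminfo_text: str) -> float:
    """Return MemAvailable from /proc/meminfo-style text, converted to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # The value is reported in kB (really KiB), e.g. "MemAvailable: 13631488 kB".
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable not found in meminfo text")


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        available = parse_mem_available_gib(meminfo.read_text())
        print(f"{available:.1f} GiB of CPU memory available")
    else:
        print("/proc/meminfo not found (not a Linux system?)")
```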

The logs are below, in case you want to use them to improve memory usage :)
Colab log for the large .h5 file:
cellbender:remove-background: Working on chunk (254/255)
cellbender:remove-background: Working on chunk (255/255)

CalledProcessError Traceback (most recent call last)
in <cell line: 1>()
----> 1 get_ipython().run_cell_magic('shell', '', 'eval "$(conda shell.bash hook)"\nconda activate cellbender\npython --version\n# pip install -q cellbender\npip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@7fd0dac8fe5c37e705cdd50fa5767064f8f4b980\ncellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5\n')

3 frames
/usr/local/lib/python3.10/dist-packages/google/colab/_system_commands.py in check_returncode(self)
135 def check_returncode(self):
136 if self.returncode:
--> 137 raise subprocess.CalledProcessError(
138 returncode=self.returncode, cmd=self.args, output=self.output
139 )

CalledProcessError: Command 'eval "$(conda shell.bash hook)"
conda activate cellbender
python --version
pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@7fd0dac8fe5c37e705cdd50fa5767064f8f4b980
cellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5
' died with <Signals.SIGKILL: 9>.

smaller .h5 colab log:
cellbender:remove-background: Working on chunk (73/73)
cellbender:remove-background: Writing full posterior to result_posterior.h5
cellbender:remove-background: Succeeded in writing posterior to file result_posterior.h5
cellbender:remove-background: Added posterior object to checkpoint file.
cellbender:remove-background: 2023-09-01 05:59:17

cellbender:remove-background: Saved summary plots as result.pdf
cellbender:remove-background: Saved cell barcodes in result_cell_barcodes.csv
cellbender:remove-background: Computing target noise counts per gene for MCKP estimator

CalledProcessError Traceback (most recent call last)
in <cell line: 1>()
----> 1 get_ipython().run_cell_magic('shell', '', 'eval "$(conda shell.bash hook)"\nconda activate cellbender\npython --version\n# pip install -q cellbender\npip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@7fd0dac8fe5c37e705cdd50fa5767064f8f4b980\ncellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5\n')

3 frames
/usr/local/lib/python3.10/dist-packages/google/colab/_system_commands.py in check_returncode(self)
135 def check_returncode(self):
136 if self.returncode:
--> 137 raise subprocess.CalledProcessError(
138 returncode=self.returncode, cmd=self.args, output=self.output
139 )

CalledProcessError: Command 'eval "$(conda shell.bash hook)"
conda activate cellbender
python --version
pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@7fd0dac8fe5c37e705cdd50fa5767064f8f4b980
cellbender remove-background --cuda --input /content/drive/MyDrive/raw_feature_bc_matrix.h5 --output result.h5
' died with <Signals.SIGKILL: 9>.

@sjfleming
Member

Hi @qianzhengzong , thanks for the reply. I didn't realize colab had a CPU memory limit of 13GB. Yes this will be hard to use for large samples...

Okay so your real question was: if you do the training on a colab GPU, can you finish the job on CPU by re-using the checkpoint. Let me try to answer that question!

The answer is, yes, I hope so! But it might not be quite so simple in practice. I use a "workflow hash code" to try to ensure that a checkpoint that's being re-used is appropriate for re-use (because the run parameters and cellbender source code are identical). I don't actually know if this will work appropriately if you run on one machine and then try to run on another machine. It might! I hope it does! Try using the --checkpoint input argument to specify your checkpoint file saved from the colab run.
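
To make the idea of a workflow hash concrete, here is a toy sketch. It is not CellBender's actual implementation: the function name, the choice of SHA-256, and the inputs (a dict of run arguments plus a string standing in for the source code) are all assumptions made for illustration. It only shows the general principle that a hash derived from parameters and code changes whenever either of them changes.

```python
import hashlib
import json


def toy_workflow_hash(run_args: dict, source_text: str, length: int = 10) -> str:
    """Hypothetical workflow hash: digest of the run arguments plus the source code."""
    hasher = hashlib.sha256()
    # Sort keys so that the same arguments always serialize identically.
    hasher.update(json.dumps(run_args, sort_keys=True).encode("utf-8"))
    hasher.update(source_text.encode("utf-8"))
    return hasher.hexdigest()[:length]


if __name__ == "__main__":
    # Made-up argument values and a made-up source string.
    source = "def train(): ..."
    gpu_run = toy_workflow_hash({"input": "raw.h5", "cuda": True}, source)
    cpu_run = toy_workflow_hash({"input": "raw.h5", "cuda": False}, source)
    print(gpu_run, cpu_run, gpu_run == cpu_run)
```

In this toy version, any argument that differs between two runs produces a different hash. Which arguments and files go into the real hash is something only the CellBender source can answer.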

If it says "workflow hashcode does not match" and starts to re-do the training, then we will have to hack our way around it. The easiest way to hack around it would be the following:

The log file starts with lines that look like this

cellbender:remove-background: CellBender 0.3.0
cellbender:remove-background: (Workflow hash ee55b84ac9)

Then it will show the workflow hash of the checkpoint file when it tries to open the checkpoint:

cellbender:remove-background: Attempting to unpack tarball "ckpt.tar.gz" to /var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku
cellbender:remove-background: Successfully unpacked tarball to /var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_optim.pyro
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_params.pyro
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_random.pyro
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_train.loaderstate
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_optim.torch
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_args.npy
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_model.torch
/var/folders/p4/dmtz_tld60z0n_xnqfx61j64rjplbx/T/tmp73q8u8ku/284932b0a1_test.loaderstate

When you run on CPU, look for the workflow hash in the log (Workflow hash ee55b84ac9) and make a note of it. Also make a note of the workflow hash from the GPU run, which is the first part of the filenames inside the tarball (284932b0a1 here). CellBender demands that they match.
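
Both hashes can be pulled out of a saved log with a couple of regular expressions. This is a minimal sketch that assumes the log looks like the excerpts above (a "(Workflow hash ...)" line, and unpacked checkpoint files listed one per line as <hash>_<name>.<ext>); the function and pattern names are made up for the example.

```python
import re

# Matches the hash printed at the start of a run, e.g. "(Workflow hash ee55b84ac9)".
RUN_HASH_RE = re.compile(r"\(Workflow hash ([0-9a-f]+)\)")
# Matches a listed checkpoint file, e.g. ".../284932b0a1_optim.pyro".
CKPT_FILE_RE = re.compile(r"/([0-9a-f]+)_[A-Za-z]+\.[A-Za-z]+\s*$")


def find_hashes(log_text: str):
    """Return (hash of the current run or None, set of hashes seen in checkpoint filenames)."""
    run_match = RUN_HASH_RE.search(log_text)
    checkpoint_hashes = set()
    for line in log_text.splitlines():
        file_match = CKPT_FILE_RE.search(line)
        if file_match:
            checkpoint_hashes.add(file_match.group(1))
    return (run_match.group(1) if run_match else None, checkpoint_hashes)


if __name__ == "__main__":
    log = (
        "cellbender:remove-background: (Workflow hash ee55b84ac9)\n"
        "/tmp/example/284932b0a1_optim.pyro\n"
        "/tmp/example/284932b0a1_args.npy\n"
    )
    run_hash, ckpt_hashes = find_hashes(log)
    print("run:", run_hash, "checkpoint:", ckpt_hashes)
```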

If you manually rename those files so that their prefix matches the CPU workflow hash (ee55b84ac9 in this case), CellBender will be able to use that checkpoint.
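
The renaming can be scripted with Python's standard tarfile module instead of being done by hand. The sketch below assumes the tarball contains regular files named <hash>_<rest>, as in the listing above, and writes a new tarball rather than editing the original. The function name and the output filename are made up for the example. Keep in mind that this bypasses the hash check, so it is only sensible when the run parameters and CellBender version really are the same.

```python
import tarfile
from pathlib import PurePosixPath


def rehash_checkpoint(src: str, dst: str, new_hash: str) -> list:
    """Copy the gzipped tarball src to dst, replacing each file's hash prefix with new_hash.

    src and dst must be different paths. Returns the list of new member names.
    """
    renamed = []
    with tarfile.open(src, "r:gz") as tar_in, tarfile.open(dst, "w:gz") as tar_out:
        for member in tar_in.getmembers():
            if not member.isfile():
                tar_out.addfile(member)  # directories etc. are copied unchanged
                continue
            path = PurePosixPath(member.name)
            # Split "284932b0a1_optim.pyro" into the old hash and "optim.pyro".
            _old_hash, sep, rest = path.name.partition("_")
            new_name = str(path.with_name(f"{new_hash}_{rest}")) if sep else member.name
            info = tarfile.TarInfo(name=new_name)
            info.size, info.mtime, info.mode = member.size, member.mtime, member.mode
            tar_out.addfile(info, tar_in.extractfile(member))
            renamed.append(new_name)
    return renamed


if __name__ == "__main__":
    # Hypothetical filenames; use the hash printed at the start of your own CPU run.
    for name in rehash_checkpoint("ckpt.tar.gz", "newckpt.tar.gz", "ee55b84ac9"):
        print(name)
```

The new file would then be the one passed to --checkpoint on the CPU machine.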

@sjfleming
Member

I hope it "just works" automatically by specifying --checkpoint and that you don't have to go through that trouble.

@qianzhengzong
Author

Hi @sjfleming, many thanks for the reply. I'll try the workaround when I get time, but I think the better choice is to get a GPU machine with plenty of CPU memory to run this tool properly :)

@sjfleming sjfleming added the user question User question about a specific dataset label Sep 20, 2023
@kiklata

kiklata commented Nov 17, 2023

Hi @sjfleming, after obtaining the ckpt.tar.gz file on Colab with a GPU, I wanted to use it on a local CPU-only machine, so I followed your advice and manually changed the filenames of the checkpoint files. However, it seems a GPU is required to calculate the posterior.

cellbender:remove-background: Loaded partially-trained checkpoint from newckpt.tar.gz  
cellbender:remove-background: Checkpoint loaded successfully. 
cellbender:remove-background: Running inference... 
cellbender:remove-background: 2023-11-18 00:02:03 
cellbender:remove-background: Inference procedure complete. 
cellbender:remove-background: Attempting to unpack tarball "newckpt.tar.gz" to /tmp/tmpu__d5kz8 
cellbender:remove-background: Successfully unpacked tarball to /tmp/tmpu__d5kz8 
/tmp/tmpu__d5kz8/df7718350c_model.torch 
/tmp/tmpu__d5kz8/df7718350c_optim.torch
/tmp/tmpu__d5kz8/df7718350c_train.loaderstate
/tmp/tmpu__d5kz8/df7718350c_params.pyro 
/tmp/tmpu__d5kz8/df7718350c_test.loaderstate 
/tmp/tmpu__d5kz8/df7718350c_random.cuda 
/tmp/tmpu__d5kz8/df7718350c_random.pyro 
/tmp/tmpu__d5kz8/df7718350c_optim.pyro 
/tmp/tmpu__d5kz8/df7718350c_args.npy 
cellbender:remove-background: Posterior not currently included in checkpoint. 
....
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
