Move hash computation so that it is recomputed on retry, and now-inva… #258
…lid checkpoint is not loaded. If the number of tries is exhausted and the ELBO tests are still failing, allow the run to complete anyway (using the checkpoint) so that outputs are produced, but exit(1).
95baec7 to 3c8cb38
Looks good, thank you! One question about the final re-run strategy
# The following settings do not affect the results, and can change when retrying,
# so remove them.
'epoch_elbo_fail_fraction', 'final_elbo_fail_fraction',
'num_failed_attempts', 'checkpoint_filename']
Thank you! I had overlooked these.
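To illustrate why these keys are dropped before hashing, here is a minimal sketch, not the project's actual implementation. Only the four excluded key names come from the diff above; the function `settings_hash`, the use of `hashlib`/`json`, and the `epochs` and `learning_rate` settings are assumptions made for the example. The idea is that a checkpoint is keyed by a hash of the settings that affect results, so changing a retry-only setting leaves the hash (and hence the checkpoint) valid, while changing a result-affecting setting invalidates it.

```python
import hashlib
import json

# Key names taken from the diff above; everything else here is hypothetical.
RETRY_ONLY_KEYS = ['epoch_elbo_fail_fraction', 'final_elbo_fail_fraction',
                   'num_failed_attempts', 'checkpoint_filename']


def settings_hash(settings: dict, exclude=RETRY_ONLY_KEYS) -> str:
    """Hash only the settings that affect results (illustrative helper)."""
    relevant = {k: v for k, v in settings.items() if k not in exclude}
    # sort_keys makes the serialization independent of insertion order
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()[:10]


# Hypothetical settings for a first attempt
first_try = {'epochs': 150, 'learning_rate': 1e-4,
             'final_elbo_fail_fraction': 0.1, 'num_failed_attempts': 0}

# Only retry-only settings differ: the hash, and so the checkpoint, is reused
relaxed = dict(first_try, final_elbo_fail_fraction=None, num_failed_attempts=1)

# A result-affecting setting differs: the old checkpoint no longer matches
changed = dict(first_try, learning_rate=5e-5)

same_checkpoint = settings_hash(first_try) == settings_hash(relaxed)
new_checkpoint = settings_hash(first_try) != settings_hash(changed)
```

Under these assumptions, recomputing the hash on every retry is what lets the code tell the two cases apart.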
@@ -771,7 +789,12 @@ def run_inference(dataset_obj: SingleCellRNACountsDataset,
     sys.exit(0)
 else:
     logger.info(f'No more attempts are specified by --num-training-tries. '
-                f'Therefore the workflow will abort here.')
+                f'Therefore the workflow will run once more without ELBO restrictions.')
Do you want to re-run it? Does this re-run "cache", i.e. use the checkpoint? Or does it actually retrain the whole thing?
Yes, after the changes to the elements that are included in the hash computation, this uses the most recent checkpoint.
Awesome, sounds perfect
args.final_elbo_fail_fraction = None
run_remove_background(args)  # start from scratch
# non-zero exit status in order to draw user's attention to the fact that ELBO tests
# were never satisfied.
sys.exit(1)
I can see the value of still wanting to exit 1 so that the run will be flagged as a failure
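The control flow discussed in this thread can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in the PR: `run_with_retries` and the `train` callback are hypothetical names, and the real workflow calls `run_remove_background(args)` and `sys.exit(...)` directly rather than returning a status.

```python
from typing import Callable, List, Optional


def run_with_retries(train: Callable[[Optional[float]], bool],
                     num_tries: int,
                     fail_fraction: float) -> int:
    """Return the exit status the process should finish with (hypothetical helper).

    `train(limit)` stands in for one training attempt; it returns True when
    the ELBO tests pass. Passing None disables the ELBO restriction.
    """
    for _ in range(num_tries):
        if train(fail_fraction):
            return 0  # ELBO tests satisfied: normal success
    # Tries exhausted: run once more without ELBO restrictions so that
    # outputs are still produced ...
    train(None)
    # ... but report failure so the run is flagged for the user.
    return 1


# Illustrative stand-in for training that only "passes" with checks disabled
calls: List[Optional[float]] = []


def never_passes_elbo(limit: Optional[float]) -> bool:
    calls.append(limit)
    return limit is None


status = run_with_retries(never_passes_elbo, num_tries=2, fail_fraction=0.1)
```

A caller would then finish with `sys.exit(status)`, which yields the non-zero exit described above when the ELBO tests were never satisfied.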
This looks good to me! Thanks for contributing it. I think you're the main person using this functionality, so if the behavior gives you what you want in your testing, I'm happy to merge this. Unfortunately I don't have any unit tests written yet for this kind of retry functionality... some day!
Hi @sjfleming, Yes, it appears to be working fine, so please go ahead and merge. Thanks, Alec
* Add WDL input to set number of retries. (#247)
* Move hash computation so that it is recomputed on retry, and now-invalid checkpoint is not loaded. (#258)
* Bug fix for WDL using MTX input (#246)
* Memory-efficient posterior generation (#263)
* Fix posterior and estimator integer overflow bugs on Windows (#259)
* Move from setup.py to pyproject.toml (#240)
* Fix bugs with report generation across platforms (#302)
---------
Co-authored-by: kshakir <[email protected]>
Co-authored-by: alecw <[email protected]>