Fix a logical error of mask rcnn training script #1249
After this fix, training memory usage stays steady. With a very large epoch count (typically on a small dataset), this bug consumes more GPU memory each epoch until none is left. The Faster R-CNN training script has the same problem; I'll fix it later, once I get some tests passing.
If you train on a dataset of about 20 images with this script and set a very large epoch number, you will get |
The position of hybridization might affect the validation code. @Jerryzcn, can you take a look? The existing parallel module and network are hybridized inside the epoch loop.
I now move only the executor creation out of the loop. The net is passed into the ForwardBackwardTask by reference, so that part is fine.
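To illustrate the change described above, here is a minimal sketch in plain Python. It is not the actual GluonCV code: `CountingPool`, `forward_backward`, and both `train_*` functions are hypothetical stand-ins, and the standard-library `ThreadPoolExecutor` is used in place of the script's own parallel executor. It only shows the structural difference between building an executor every epoch and building it once before the loop.

```python
from concurrent.futures import ThreadPoolExecutor


class CountingPool(ThreadPoolExecutor):
    """Thread pool that counts how many instances were built (illustration only)."""
    created = 0

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        CountingPool.created += 1


def forward_backward(batch):
    # Hypothetical stand-in for one forward/backward step; returns a fake loss.
    return sum(batch)


def train_executor_in_loop(batches, epochs, workers=2):
    """Pattern before the fix: a new executor per epoch, never shut down."""
    losses = []
    for _ in range(epochs):
        pool = CountingPool(max_workers=workers)  # one more executor each epoch
        losses.extend(pool.map(forward_backward, batches))
    return losses


def train_executor_hoisted(batches, epochs, workers=2):
    """Pattern after the fix: one executor, created before the epoch loop."""
    losses = []
    pool = CountingPool(max_workers=workers)  # created exactly once
    try:
        for _ in range(epochs):
            losses.extend(pool.map(forward_backward, batches))
    finally:
        pool.shutdown()  # release the worker threads when training ends
    return losses
```

Both functions compute the same losses; the difference is that the first builds `epochs` executors while the second builds one. In this sketch the number of executors is only counted, and no claim is made about how much memory each one holds in the real script.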
The later fix may not be thread-safe. I think we should redesign the executor to avoid the memory leaks. In my tests, if the executor is created inside the loop, training even a small mask_rcnn_resnet18 gets |
@Jerryzcn, the fix has been verified by another user. Can you double-check and confirm?
Job PR-1249-3 is done.
Looks good to me. It appears this showed up after 20191016.
It seems to cause an error with validation... I need to revert this.
* move rcnn forward backward task to model zoo
* revert #1249
* fix
* fix
* docstring
* fix style
* add docs
Yes, Mask R-CNN would hit an error with the former fix. But memory consumption deserves attention, as our R-CNN series consumes far more memory than |
@Jerryzcn, with this reverted, do we have another plan to fix the training memory issue?