Fix a logical error of mask rcnn training script #1249

chinakook · 2020-04-07T07:19:39Z

After this fix, training memory usage stays steady. Without it, a very large epoch count (common with small datasets) makes the script consume more and more GPU memory as the epochs accumulate, until all of it is used.
The Faster R-CNN training script has the same problem. I'll fix it later, once I get some tests passing.


chinakook · 2020-04-07T07:31:30Z

If you train with this script on a dataset of about 20 images and set a very large epoch count, then after some epochs you will hit an out-of-memory error, a hang, or a malloc or "double free or corruption" error, regardless of the GPU hardware.

zhreshold · 2020-04-07T17:03:06Z

The position of hybridization might affect the validation code. @Jerryzcn can you take a look? The existing parallel module and network are hybridized inside the epoch loop.

chinakook · 2020-04-08T03:33:52Z

I now move only the executor creation out of the loop. The net is passed into the ForwardBackwardTask by reference, so that should be fine.
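The point about passing the net by reference can be illustrated with a minimal sketch. `ToyNet`, `ToyTask` and `run` are hypothetical stand-ins invented for this illustration, not the classes used in the training script. The sketch only shows that an object stored once before the loop still sees state changes made inside the loop.

```python
class ToyNet:
    """Hypothetical stand-in for a network; not a Gluon block."""

    def __init__(self):
        self.hybridized = False

    def hybridize(self):
        self.hybridized = True


class ToyTask:
    """Hypothetical stand-in for a forward/backward task that stores the net."""

    def __init__(self, net):
        self.net = net  # stores a reference to the same object, not a copy


def run(epochs):
    net = ToyNet()
    task = ToyTask(net)  # created once, before the epoch loop
    seen = []
    for _ in range(epochs):
        net.hybridize()  # state change made inside the loop
        seen.append(task.net.hybridized)  # the task observes the change
    return task.net is net, seen
```

This says nothing about whether the real task caches other state derived from the net, which is the kind of thing that could still go wrong.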

chinakook · 2020-04-08T03:59:23Z

The latter fix may not be thread safe. I think we should redesign the executor to avoid the memory leaks. In my tests, if the executor is created inside the loop, even training a small mask_rcnn_resnet18 hits an out-of-memory error as long as the epoch count is large.
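As a rough illustration of why creating an executor inside the epoch loop can leak, here is a sketch under the assumption that the executor starts worker threads that block on a queue and are never shut down. `ToyExecutor` and `train` are hypothetical names made up for this example, not the GluonCV implementation. The sketch counts threads rather than GPU memory.

```python
import queue
import threading


class ToyExecutor:
    """Hypothetical stand-in for a parallel executor; not the GluonCV class."""

    def __init__(self, num_workers, task):
        self._in = queue.Queue()
        self._out = queue.Queue()
        self._task = task
        for _ in range(num_workers):
            # Daemon workers block on the input queue forever, so each
            # executor instance keeps its threads, and everything the task
            # references, alive for the rest of the process.
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            item = self._in.get()
            self._out.put(self._task(item))

    def put(self, item):
        self._in.put(item)

    def get(self):
        return self._out.get()


def train(epochs, batches, executor_in_loop):
    """Run a toy training loop; return the number of worker threads left over."""
    before = threading.active_count()

    def task(x):
        return x * 2  # placeholder for a forward/backward step

    executor = None if executor_in_loop else ToyExecutor(2, task)
    for _ in range(epochs):
        if executor_in_loop:
            executor = ToyExecutor(2, task)  # a fresh set of workers per epoch
        for batch in batches:
            executor.put(batch)
        for _ in batches:
            executor.get()  # wait for every batch to finish
    return threading.active_count() - before
```

With two workers per executor, `train(5, [1, 2, 3], True)` leaves ten idle threads behind, while `train(5, [1, 2, 3], False)` leaves two. In the toy version the leftover count grows linearly with the number of epochs.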

zhreshold · 2020-04-23T00:44:42Z

@Jerryzcn the fix is verified by another user, can you double check and confirm?

mli · 2020-04-24T20:12:30Z

Job PR-1249-3 is done.
Docs are uploaded to http://gluon-vision-staging.s3-website-us-west-2.amazonaws.com/PR-1249/3/index.html
Code coverage of this PR: vs. Master:

Jerryzcn · 2020-04-25T02:12:44Z

Looks good to me. It appears this started after 20191016.

Jerryzcn · 2020-05-06T02:53:15Z

It seems to cause an error with validation... I need to revert this.


chinakook · 2020-05-07T00:22:02Z

Yes, Mask R-CNN would get an error with the former fix. But memory consumption should be a focus, as our R-CNN series uses far more memory than mmdet and detectron2 (both PyTorch), and its memory management is not stable.

zhreshold · 2020-05-07T17:38:05Z

@Jerryzcn With this reverted, do we have another plan to fix the training memory issue?



chinakook mentioned this pull request Apr 7, 2020

Excessive memory consumption when training Mask-RCNN-Res50-FPN #825

Closed

fix hybridize

c7070c7

Jerryzcn merged commit db8559f into dmlc:master Apr 25, 2020

Jerryzcn added a commit to Jerryzcn/gluon-cv that referenced this pull request May 6, 2020

revert dmlc#1249

f71a984

Jerryzcn mentioned this pull request May 6, 2020

move rcnn forward backward task to model zoo #1288

Merged

Jerryzcn added a commit that referenced this pull request May 6, 2020

move rcnn forward backward task to model zoo (#1288)

e043d56

* move rcnn forward backward task to model zoo
* revert #1249
* fix
* fix
* docstring
* fix style
* add docs
