Fix a logical error of mask rcnn training script #1249

Merged
merged 2 commits into from
Apr 25, 2020

Conversation

chinakook
Member

After this fix, training memory usage stays steady. Without it, when the number of epochs is very large (often the case with a small dataset), this error consumes more GPU memory as the epochs go by until it is all used up.
The Faster R-CNN training script has this problem too. I'll fix it later, once I get some tests passed.
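
To illustrate the pattern described above, here is a minimal sketch in plain Python. It is not the GluonCV code: `FakeExecutor`, `train_create_inside`, and `train_create_outside` are hypothetical names, and the sketch simply assumes that each executor acquires resources at construction that are not released while training runs.

```python
class FakeExecutor:
    """Hypothetical stand-in for a parallel executor that holds resources."""

    live = 0  # number of executors whose resources are still held

    def __init__(self):
        # Assumption: resources are acquired here and not released promptly.
        FakeExecutor.live += 1

    def run(self, batch):
        # Placeholder for a forward/backward step.
        return batch * 2


def train_create_inside(epochs, batches):
    """Create a new executor in every epoch (the problematic pattern)."""
    FakeExecutor.live = 0
    for _ in range(epochs):
        executor = FakeExecutor()
        for batch in batches:
            executor.run(batch)
    return FakeExecutor.live


def train_create_outside(epochs, batches):
    """Create the executor once and reuse it across epochs."""
    FakeExecutor.live = 0
    executor = FakeExecutor()
    for _ in range(epochs):
        for batch in batches:
            executor.run(batch)
    return FakeExecutor.live
```

Under that assumption, the number of held executors grows with the epoch count in the first variant and stays at one in the second, which matches the memory behavior reported here.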

@chinakook
Member Author

If you train with this script on a dataset of about 20 images and set a very large epoch number, then after some epochs you will get an out-of-memory error, a hang, or malloc / "double free or corruption" errors, whichever kind of GPU hardware you use.

@zhreshold
Member

The position of hybridization might affect the validation code. @Jerryzcn can you take a look? The existing parallel module and network are hybridized inside the epoch loop.

@chinakook
Member Author

I now move only the executor creation out of the loop. The net is passed into the forwardbackwardtask by reference, so this should work fine.
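
A small sketch of the by-reference point, using hypothetical stand-in classes (`Net` and `Task` are illustrative names, not the GluonCV classes): a task created once before the loop still sees changes made to the net afterwards, because it stores a reference rather than a copy.

```python
class Net:
    """Hypothetical stand-in for a network with a hybridize step."""

    def __init__(self):
        self.hybridized = False

    def hybridize(self):
        self.hybridized = True


class Task:
    """Hypothetical stand-in for a forward/backward task."""

    def __init__(self, net):
        self.net = net  # stores a reference to the same object, not a copy

    def sees_hybridized(self):
        return self.net.hybridized


net = Net()
task = Task(net)  # created once, outside the epoch loop

seen_before = task.sees_hybridized()
net.hybridize()  # done later, e.g. inside the epoch loop
seen_after = task.sees_hybridized()
```

This only shows Python's object-reference semantics; whether the real task and hybridization interact safely with validation is a separate question.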

@chinakook
Member Author

The later fix may not be thread safe. I think we should redesign the executor to avoid the memory leaks. In my tests, if the executor is created inside the loop, even training a small mask_rcnn_resnet18 gets an out-of-memory error as long as the epoch number is big.
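
One possible direction, sketched with the Python standard library only and not with GluonCV's executor: an executor whose resources are released deterministically at the end of each epoch. `forward_backward` and `train` are hypothetical placeholders, and the sketch does not address the thread-safety concern itself.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def forward_backward(batch):
    # Hypothetical placeholder for one training step on a batch.
    return batch * batch


def train(epochs, batches, num_workers=2):
    results = []
    for _ in range(epochs):
        # The context manager shuts the pool down and joins its worker
        # threads when the block exits, so nothing carries over to the
        # next epoch.
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            results.append(list(pool.map(forward_backward, batches)))
    return results


threads_before = threading.active_count()
losses = train(epochs=3, batches=[1, 2, 3])
threads_after = threading.active_count()
```

Here the thread count after training equals the count before it, however many epochs run. Whether the same idea carries over to GPU memory held by the real executor would need to be tested.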

@zhreshold
Member

@Jerryzcn the fix is verified by another user, can you double check and confirm?

@mli
Member

mli commented Apr 24, 2020

Job PR-1249-3 is done.
Docs are uploaded to http://gluon-vision-staging.s3-website-us-west-2.amazonaws.com/PR-1249/3/index.html
Code coverage of this PR: pr.svg vs. Master: master.svg

@Jerryzcn
Contributor

Looks good to me. It appears this issue was introduced after 20191016.

@Jerryzcn Jerryzcn merged commit db8559f into dmlc:master Apr 25, 2020
@Jerryzcn
Contributor

Jerryzcn commented May 6, 2020

It seems to cause an error with validation... I need to revert this.

Jerryzcn added a commit to Jerryzcn/gluon-cv that referenced this pull request May 6, 2020
Jerryzcn added a commit that referenced this pull request May 6, 2020
* move rcnn forward backward task to model zoo

* revert #1249

* fix

* fix

* docstring

* fix style

* add docs
@chinakook
Member Author

Yes, Mask R-CNN would get an error with the former fix. But memory consumption deserves attention: our R-CNN series uses far more memory than the PyTorch-based mmdet and detectron2, and its memory management is not stable.

@zhreshold
Member

@Jerryzcn With this reverted, do we have another plan to fix the training memory issue?

Jerryzcn added a commit that referenced this pull request Jun 17, 2020
* move rcnn forward backward task to model zoo

* revert #1249

* fix

* fix

* docstring

* fix style

* add docs

* faster rcnn estimator

* refactor

* move dataset to init

* lint

* merge

* disable sacred config for now

* logger fix

* fix fit
Jerryzcn added a commit that referenced this pull request Jun 30, 2020
* move rcnn forward backward task to model zoo

* revert #1249

* fix

* fix

* docstring

* fix style

* add docs

* faster rcnn estimator

* refactor

* move dataset to init

* lint

* merge

* disable sacred config for now

* logger fix

* fix fit

* autogluon integration

* fix small bug. training working

* lint
Jerryzcn added a commit that referenced this pull request Jul 8, 2020
* move rcnn forward backward task to model zoo

* revert #1249

* fix

* fix

* docstring

* fix style

* add docs

* faster rcnn estimator

* refactor

* move dataset to init

* lint

* merge

* disable sacred config for now

* logger fix

* fix fit

* autogluon integration

* fix small bug. training working

* lint

* sacred config for faster rcnn
Jerryzcn added a commit that referenced this pull request Jul 9, 2020
* move rcnn forward backward task to model zoo

* revert #1249

* fix

* fix

* docstring

* fix style

* add docs

* faster rcnn estimator

* refactor

* move dataset to init

* lint

* merge

* disable sacred config for now

* logger fix

* fix fit

* autogluon integration

* fix small bug. training working

* lint

* sacred config for faster rcnn

* add config docs
Jerryzcn added a commit that referenced this pull request Jul 10, 2020
* move rcnn forward backward task to model zoo

* revert #1249

* fix

* fix

* docstring

* fix style

* add docs

* faster rcnn estimator

* refactor

* move dataset to init

* lint

* merge

* disable sacred config for now

* logger fix

* fix fit

* autogluon integration

* fix small bug. training working

* lint

* sacred config for faster rcnn

* add config docs

* move all logging into base estimator logdir
zhreshold added a commit that referenced this pull request Nov 2, 2020
* [WIP] Fit api with sacred config (#1331)

* config using sacred

* update

* update base

* allow attribute access, add warning to config modification (#1348)

* Faster R-CNN estimator (#1338)

* move rcnn forward backward task to model zoo

* revert #1249

* fix

* fix

* docstring

* fix style

* add docs

* faster rcnn estimator

* refactor

* move dataset to init

* lint

* merge

* disable sacred config for now

* logger fix

* fix fit

* update full centernet example (#1349)

* Autogluon Integration (#1355)

* move rcnn forward backward task to model zoo

* revert #1249

* fix

* fix

* docstring

* fix style

* add docs

* faster rcnn estimator

* refactor

* move dataset to init

* lint

* merge

* disable sacred config for now

* logger fix

* fix fit

* autogluon integration

* fix small bug. training working

* lint

* sacred config for faster rcnn (#1358)

* move rcnn forward backward task to model zoo

* revert #1249

* fix

* fix

* docstring

* fix style

* add docs

* faster rcnn estimator

* refactor

* move dataset to init

* lint

* merge

* disable sacred config for now

* logger fix

* fix fit

* autogluon integration

* fix small bug. training working

* lint

* sacred config for faster rcnn

* Finish centernet fit estimator (#1359)

* add voc detection pipeline

* update

* fix errors

* Add docs for Faster R-CNN config (#1361)

* move rcnn forward backward task to model zoo

* revert #1249

* fix

* fix

* docstring

* fix style

* add docs

* faster rcnn estimator

* refactor

* move dataset to init

* lint

* merge

* disable sacred config for now

* logger fix

* fix fit

* autogluon integration

* fix small bug. training working

* lint

* sacred config for faster rcnn

* add config docs

* Estimator rcnn (#1366)

* move rcnn forward backward task to model zoo

* revert #1249

* fix

* fix

* docstring

* fix style

* add docs

* faster rcnn estimator

* refactor

* move dataset to init

* lint

* merge

* disable sacred config for now

* logger fix

* fix fit

* autogluon integration

* fix small bug. training working

* lint

* sacred config for faster rcnn

* add config docs

* move all logging into base estimator logdir

* raise key error when config is not found in freezed config (#1367)

* auto object detection refactor (#1371)

* auto object detection refactor

* change logdir and common config

* centernet config change

* autogluon

* fix auto object detection task

* add mask_rcnn estimator (#1398)

* fix typo (#1378)

Co-authored-by: Ubuntu <[email protected]>

* add ssd estimator (#1394)

* add ssd estimator

* modify ssd estimator

* provide options to customize network structure

* minor changes

* minor changes

* add auto detection using ssd

* add auto detection using ssd

* add custom model for ssd

* minor changes

* add tiny dataset for testing; fix errors in training auto detector

* [WIP] Add estimator(yolo) (#1380)

* yolo

* yolo

* add yolo

* [chore] update name, add fit script

Co-authored-by: Ubuntu <[email protected]>

* auto register args (#1419)

* Auto detector (#1425)

* auto detector

* auto detector

* auto detector

* auto detector

* auto detector

* auto detector

Co-authored-by: Joshua Z. Zhang <[email protected]>

* update auto detection (#1436)

* auto detector

* auto detector

* auto detector

* auto detector

* auto detector

* auto detector

* update auto detection

* add a framework for automatic suggestion of hyperparameter search space

* 1) change auto_resume default to False; 2) change input config to estimator is a pure dict

* [fix] bugs for test_auto_detection (#1438)

Co-authored-by: Ubuntu <[email protected]>

* add cls (#1381)

* update auto suggest (#1443)

* update auto suggest

* update auto suggest

* fix yolo errors (#1445)

* remove dependencies on AutoGluon non-core functions (#1452)

* remove dependency to autogluon non-core functions

* fix errors on importing estimators

* [Lint] fix pylint for estimator branch (#1451)

* fix pylint

* update mxnet build

* remove py2

* remove py2.yml

* fix jenkinsfile

* fix post_nms in rcnn

* fix

* fix doc build

* fix lint con't

* add sacred

* no tutorial yet

* estimator prototype for save/load (#1458)

* prototype for save/load

* add type check

* handle ctx

* fix

* collect

* fix classmethod self

* fix

* pickle only init args

* cast to numpy to avoid ctypes

* fix get data

* base estimator

* [WIP] Add detailed logging information for auto estimator (#1470)

* add detailed logging information for auto estimator

* fix lint error

* [WIP] Estimator data (#1471)

* dataframe for object detection

* fix pack and unpack for bboxes

* update

* refactor fit

* fix

* update pickle behavior

* update

* fix __all__

* Dataset as class property not module

* fix centernet, add image classification dataset

* fix

* fix

* fix logger not inited before init_network

* reuse weights from known classes

* add predict

* fix index

* format returned prediction

* fix id to int

* improve predict with pd.dataframe

* add numpy

* reset index

* clean up

* update image classification dataset

* dataset improvements

* valid url checker

* setup.py improve

* fix

* fix import utils

* add display to object detection

* fix

* change fit functions

* add coco import

* fix lint

* fix lint

* fix lint

* fix

* Estimator con't improvements (#1484)

* allow ssd/faster-rcnn to take in train/val dataset

* update

* fix

* update ssd

* fix ctx

* fix ctx

* fix self.datasets

* fix self.epoch

* remove async_net

* fix predict

* debug predict

* fix predict scores

* filter out invalid predictions

* fix faster_rcnn

* fix

* fix

* fix deepcopy

* fix fpn anchor generator

* fix ctx

* fix frcnn predict

* fix

* fix skipping logic

* fix yolo3

* fix import

* fix rename yoloestimator

* fix import

* fix yolo3 train

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix ctx

* fix trainer

* fix num_class < 5 for topk

* fix unpickable batch_fn

* fix print

* add predict

* fix cls predict

* fix cls predict

* fix cls predict

* fix cls predict

* improve auto fit

* improve auto fit

* fix

* fix

* fix

* fix

* fix

* debug

* fix

* fix

* fix

* fix

* fix

* fix

* fix reporter pickle

* change epochs to smaller

* update image cls search space

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* replace sacred with autocfg

* fix

* fix tuple type

* fix

* fix

* fix

* clean up

* remove sacred

* fix import

* fix import

* add types

* fix

* fix

* defaults for object detection

* fix

* fix

* update image classification

* change lr

* update

* Fix pylint

* Fix pylint

* fit summary

* pprint summary

* fix

* update

* fix single trial

* fix sample_config

* fix sample_config

* fix sample_config

* fix lint

* fix lint

* adjust batch size

* fix

* stacktrace

* fix

* fix traceback

* fix traceback

* fix train evaluation

* default networks

* default networks

* improves

* fix

* fix lint

Co-authored-by: tmwangcas <[email protected]>

* update script to master

* add unittests for auto

* update conda

* pin autogluon

* fix test

* fix

* fix ssd/yolo

* fix

* update defaults

* fix kv_store being overwriten

* fix rcnn batch size

Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Tianming Wang <[email protected]>
Co-authored-by: Chongruo Wu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
4 participants