fix bug when the target is empty in FCOS #5267
Conversation
💊 CI failures summary and remediations
As of commit 622c16c (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
🚧 9 ongoing upstream failures: these were probably caused by upstream breakages that are not fixed yet.
@@ -59,9 +59,13 @@ def compute_loss(
         all_gt_classes_targets = []
         all_gt_boxes_targets = []
         for targets_per_image, matched_idxs_per_image in zip(targets, matched_idxs):
-            gt_classes_targets = targets_per_image["labels"][matched_idxs_per_image.clip(min=0)]
+            if len(targets_per_image["labels"]) == 0:
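For context, here is a minimal sketch of what the guard in this hunk does. The wrapper function `build_gt_targets` and the toy usage are illustrative only, not the exact code inside `compute_loss`:

```python
import torch

def build_gt_targets(targets, matched_idxs):
    all_gt_classes_targets = []
    all_gt_boxes_targets = []
    for targets_per_image, matched_idxs_per_image in zip(targets, matched_idxs):
        if len(targets_per_image["labels"]) == 0:
            # No ground truth in this image: emit all-zero targets with the
            # expected shapes so the vectorized losses never index into an
            # empty tensor.
            gt_classes_targets = targets_per_image["labels"].new_zeros((len(matched_idxs_per_image),))
            gt_boxes_targets = targets_per_image["boxes"].new_zeros((len(matched_idxs_per_image), 4))
        else:
            # clip(min=0) maps unmatched anchors (index -1) to a valid index;
            # they are masked out later via matched_idxs_per_image < 0.
            gt_classes_targets = targets_per_image["labels"][matched_idxs_per_image.clip(min=0)]
            gt_boxes_targets = targets_per_image["boxes"][matched_idxs_per_image.clip(min=0)]
        all_gt_classes_targets.append(gt_classes_targets)
        all_gt_boxes_targets.append(gt_boxes_targets)
    return all_gt_classes_targets, all_gt_boxes_targets

# Tiny usage: one image with no annotations and 5 anchors, all unmatched (-1).
targets = [{"boxes": torch.zeros((0, 4)), "labels": torch.zeros((0,), dtype=torch.int64)}]
matched_idxs = [torch.full((5,), -1, dtype=torch.int64)]
cls_t, box_t = build_gt_targets(targets, matched_idxs)
print(cls_t[0].shape, box_t[0].shape)  # torch.Size([5]) torch.Size([5, 4])
```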
We might need to confirm that this works as expected and doesn't produce errors when find_unused_params=False
(see discussion at #2784 (comment))
The runtime error to look out for is something like:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
@jdsgomes Could you confirm by kicking off a run for an epoch?
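For reference, this is the DDP flag under discussion. A minimal sketch of enabling unused-parameter detection, using a toy single-process `gloo` group rather than the actual training setup:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Toy single-process process group, just so DDP can be constructed.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(10, 2)
# find_unused_parameters=True makes DDP tolerate parameters that receive no
# gradient in an iteration (e.g. a branch skipped for empty targets), at the
# cost of an extra traversal of the autograd graph every iteration.
ddp_model = DDP(model, find_unused_parameters=True)

out = ddp_model(torch.randn(4, 10))
out.sum().backward()
dist.destroy_process_group()
```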
@datumbox thanks, I had not considered this problem. By the way, I would like to know how RetinaNet handles this.
We had this issue previously with a couple of models; RetinaNet was one of them. The way we avoided it was by rewriting the loss computations so that the vectorized operations can cope with empty indices. Let's first check whether it is actually an issue before we start rewriting.
Edit: I found the PR with the patch: #3032
@datumbox I think the current implementation will not have this trouble: as you can see, the regression loss and centerness loss are already written so that the vectorized operations can cope with empty indices. But to be safe, we had better check once. I don't have a dataset with empty targets, though, so could you help check this?
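To illustrate the point about empty indices, here is a standalone snippet (not code from the PR) showing why the vectorized formulation tolerates the empty case:

```python
import torch
import torch.nn.functional as F

# Indexing with an empty index tensor simply yields an empty result...
boxes = torch.zeros((0, 4))
idx = torch.zeros((0,), dtype=torch.int64)
selected = boxes[idx]  # shape (0, 4), no IndexError

# ...and a sum-reduced element-wise loss over empty tensors is just 0
# (note that reduction="mean" would give NaN instead).
loss = F.l1_loss(selected, torch.zeros_like(selected), reduction="sum")
print(selected.shape, loss)  # torch.Size([0, 4]) tensor(0.)
```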
That's our understanding as well: you should be OK. A single run of the training scripts for 1 epoch should be enough to confirm whether it's a problem (at least that was the case previously). I'll sync with Joao to confirm. :)
Looks like the other user confirmed the patch works, see #5266 (comment).
I can confirm that the patch works and also that after training for an epoch no runtime errors are observed.
Add unittest for empty instance training
The changes look good to me, and it seems we have confirmation that the patch works from #5266 (comment). I'm approving, but I'll leave it to @jdsgomes to merge once he reviews and gets the results of the 1-epoch run.
Thanks for the update @xiaohu2015!
Looks good to me, thank you for the fix!
Summary:
* fix bug when the target is empty
* Add unittest for empty instance training

Reviewed By: kazhang
Differential Revision: D33927512
fbshipit-source-id: e92355380948d9181e135b7612596c5309afeeda
Co-authored-by: Zhiqiang Wang <[email protected]>
Co-authored-by: Joao Gomes <[email protected]>
Last week, we released the detection model FCOS in #4961.
Some users have found that there is a problem when the input has empty targets (#5266).
The same issue can also be found in detectron2: facebookresearch/detectron2#3851, facebookresearch/detectron2#3910.
We fix this bug in this PR. You can test it, for example, with the snippet below.
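A minimal sketch of how one might exercise the empty-target path (the builder arguments and toy inputs here are illustrative assumptions, not taken from the original report):

```python
import torch
import torchvision

# Build FCOS without pretrained weights so the snippet runs offline.
model = torchvision.models.detection.fcos_resnet50_fpn(
    pretrained=False, pretrained_backbone=False, num_classes=2
)
model.train()

images = [torch.rand(3, 256, 256)]
# One image with no annotations: empty "boxes" and "labels".
targets = [{
    "boxes": torch.zeros((0, 4), dtype=torch.float32),
    "labels": torch.zeros((0,), dtype=torch.int64),
}]

# With the patch this should return finite classification / regression /
# centerness losses instead of failing on the empty target.
losses = model(images, targets)
print({k: v.item() for k, v in losses.items()})
```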