add min_area and pj_crop #3435

Closed
wants to merge 11 commits into from

alemelis

What is this?

This PR brings back the data augmentation method implemented in pjreddie's repo.
Related to #3119

It also adds a new parameter, min_area, used during training to filter out annotations whose area is less than min_area times that of the training image (network size). This helps with poorly annotated datasets in which objects too small to carry any useful information are labelled anyway.

Why?

During training the image is randomly cropped and re-scaled to simulate a zoom-in/out effect. This process is different in pjreddie and AlexeyAB repos as these employ statistically different cropping behaviours (Figure 2a)
img

The resulting behaviour is depicted below (blue and pink for AB's and pjreddie's, respectively). AB's method ensures that the corners of the cropped image always fall in the dashed regions, so the image centre is always inside the crop. This comes at the expense of letting the image orientation randomly switch from landscape to portrait (squeeze effect). pjreddie's method instead allows harsher zoom-in/out while retaining the original orientation (also, the centre of the image is not forced to always be in the crop).
img

How to use

The cropping behaviours are regulated by the following parameters

  • jitter (which is defined multiple times in the [yolo] layers) is used to identify two deltas (dw and dh) from which the random crop is generated [6]. In the ice-cream plot, the jitter parameter identifies the cone base diameter.
  • scale_min and scale_max (used only in PJ's random cropping strategy) are used to define the random width range (the ice-cream cone length)

This PR makes it possible to use both cropping methods via the pj_crop switch:

  • 0 for AlexeyAB's method
  • 1 for pjreddie's,
  • 2 for random between the two on an image-by-image basis

Change the .cfg file as in yolov3-tiny_pjcrop.cfg (provided)

pj_crop = 1
scale_min = 0.25
scale_max = 4.0
min_area = 0.001
...
jitter = 0.2
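
As a rough illustration of the pj_crop switch described above, the per-image choice could be sketched as follows. The helper name use_pj_crop and its rnd argument are hypothetical and are not taken from the PR's code; the sketch only assumes that one random number is drawn per image.

```c
/* Hypothetical helper, not the PR's code: returns 1 when the pjreddie-style
   crop should be used for the current image and 0 for the AlexeyAB-style crop.
   pj_crop is the value read from the cfg file; rnd is a random number that is
   assumed to be drawn once per image. */
int use_pj_crop(int pj_crop, unsigned int rnd)
{
    if (pj_crop == 2) {
        return (int)(rnd % 2);   /* random choice, image by image */
    }
    return pj_crop == 1;         /* 1 -> pjreddie, anything else -> AlexeyAB */
}
```

With pj_crop = 0 or 1 the random number is ignored, so the choice is the same for every image.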

@alemelis
Author

@AlexeyAB @cenit
any thoughts on this?

@AlexeyAB
Owner

@alemelis Hi, Thanks!

This PR brings back the data augmentation method implemented in pjreddie's repo.

  1. Do you mean that it uses letter_box (keeping the aspect ratio, see "Resizing : keeping aspect ratio, or not" #232 (comment)), as is done in pjreddie's repo? https://github.com/AlexeyAB/darknet/pull/3435/files#diff-2ceac7e68fdac00b370188285ab286f7R890

  2. Why is min_area better than lowest_w / lowest_h? https://github.com/AlexeyAB/darknet/pull/3435/files#diff-2ceac7e68fdac00b370188285ab286f7R385 lowest_w and lowest_h are calculated automatically from the relative size of the resized object and the network size (they catch objects smaller than 1x1 pixel), instead of using a fixed value

  3. Why do we need scale_min and scale_max which are absent in pjreddie repo?

  4. All random-number generation (including your crop_style = random_gen()%2;) should be placed inside this if:

    darknet/src/data.c

    Lines 860 to 861 in 819ace3

    if (!augmentation_calculated || !track)
    {

    since it is used to train on frame sequences from video, so data augmentation must be the same for all frames of one sequence (see "Implement Yolo-LSTM (~+4-9 AP) for detection on Video with high mAP and without blinking issues" #3114 (comment))

  5. It isn't correct, since it doesn't use break; at the end of each case and doesn't have a default:, so I think it would be better to use if else instead of switch case there:

    darknet/src/data.c

    Lines 890 to 935 in 819ace3

    switch(crop_style){
    case 0:
    { // pjreddie
    float new_ar = (ow + rand_uniform(-dw, dw)) / (oh + rand_uniform(-dh, dh));
    float scale = rand_uniform(scale_min, scale_max);
    float nw, nh;
    if(new_ar < 1){
    nh = scale * h;
    nw = nh * new_ar;
    } else {
    nw = scale * w;
    nh = nw / new_ar;
    }
    pleft = rand_uniform(0, w - nw);
    ptop = rand_uniform(0, h - nh);
    dx = (float)pleft/nw;
    dy = (float)ptop/nh;
    swidth = (int)nw;
    sheight = (int)nh;
    sx = nw/ow;
    sy = nh/oh;
    }
    case 1:
    { // AlexeyAB
    int pright, pbot;
    pleft = rand_precalc_random(-dw, dw, r1);
    pright = rand_precalc_random(-dw, dw, r2);
    ptop = rand_precalc_random(-dh, dh, r3);
    pbot = rand_precalc_random(-dh, dh, r4);
    swidth = ow - pleft - pright;
    sheight = oh - ptop - pbot;
    sx = (float)swidth / ow;
    sy = (float)sheight / oh;
    dx = ((float)pleft / ow) / sx;
    dy = ((float)ptop / oh) / sy;
    }
    }

  6. Do you mean that

  • if we use pj_crop=1 then the pjreddie crop method will be used
  • if we use pj_crop=2 then one of the following will be used at random
    • pjreddie crop method (letterbox)
    • AlexeyAB crop method (resize)
  7. Did you test models using all 3 cases: (1) OLD and new changed code (2) pj_crop=1 (3) pj_crop=2? What mAP did you get?
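
To illustrate point 5, here is a minimal if/else sketch of the two branches. It is not the code that was eventually committed: crop_t, map_range and compute_crop are hypothetical names, the random draws are passed in as r[0..4] in [0,1] so the result is reproducible, and dw = ow * jitter is an assumption about how the deltas are derived. The arithmetic inside each branch mirrors the quoted snippet.

```c
/* Hypothetical container for the values the quoted snippet computes. */
typedef struct {
    int pleft, ptop, swidth, sheight;
    float sx, sy, dx, dy;
} crop_t;

/* Maps r in [0,1] onto [lo,hi]; a deterministic stand-in for
   rand_uniform() and rand_precalc_random(). */
float map_range(float lo, float hi, float r) { return lo + (hi - lo) * r; }

/* use_pj != 0 selects the pjreddie-style branch, otherwise the
   AlexeyAB-style branch. Exactly one branch runs, so there is no
   fall-through from one crop style into the other. */
crop_t compute_crop(int use_pj, int ow, int oh, int w, int h,
                    float jitter, float scale_min, float scale_max,
                    const float r[5])
{
    crop_t c;
    float dw = ow * jitter;   /* assumption: deltas derived from jitter */
    float dh = oh * jitter;

    if (use_pj) {             /* pjreddie-style crop */
        float new_ar = (ow + map_range(-dw, dw, r[0])) /
                       (oh + map_range(-dh, dh, r[1]));
        float scale = map_range(scale_min, scale_max, r[2]);
        float nw, nh;
        if (new_ar < 1) { nh = scale * h; nw = nh * new_ar; }
        else            { nw = scale * w; nh = nw / new_ar; }
        c.pleft = (int)map_range(0, w - nw, r[3]);
        c.ptop  = (int)map_range(0, h - nh, r[4]);
        c.dx = (float)c.pleft / nw;
        c.dy = (float)c.ptop / nh;
        c.swidth  = (int)nw;
        c.sheight = (int)nh;
        c.sx = nw / ow;
        c.sy = nh / oh;
    } else {                  /* AlexeyAB-style crop */
        int pright, pbot;
        c.pleft = (int)map_range(-dw, dw, r[0]);
        pright  = (int)map_range(-dw, dw, r[1]);
        c.ptop  = (int)map_range(-dh, dh, r[2]);
        pbot    = (int)map_range(-dh, dh, r[3]);
        c.swidth  = ow - c.pleft - pright;
        c.sheight = oh - c.ptop - pbot;
        c.sx = (float)c.swidth / ow;
        c.sy = (float)c.sheight / oh;
        c.dx = ((float)c.pleft / ow) / c.sx;
        c.dy = ((float)c.ptop / oh) / c.sy;
    }
    return c;
}
```

In the quoted switch, case 0 has no break, so the values computed for the pjreddie crop are overwritten by the case 1 assignments; with if/else that cannot happen.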

@AlexeyAB
Owner

@alemelis

I added the param letter_box (like your pj_crop), which can be used in the [net] section of the cfg file to train while keeping the aspect ratio: c9129c2

[net]
letter_box=1

So you can test your additional suggestions

scale_min = 0.25
scale_max = 4.0
min_area = 0.001

and if you get higher mAP in some cases by using the new parameters, then I will be happy to merge your pull request.

@alemelis
Author

@AlexeyAB
it took me a while, but I've just finished 6 trainings, setting pjcrop to 0, 1, and 2 and switching the min_area parameter ON/OFF. I used the tiny v3 model on a custom dataset of ~50k full-HD images with 13 annotated classes.

these are the 6 training logs

Imgur

two things to notice:

  1. the loss is lower when min_area is switched ON, as too small annotations are filtered out and the model can focus on the best ones
  2. the mAP curve seems to be more stable when pjcrop = 1 is set. As you can see there are random drops for pjcrop = 0 and = 2

The best mAP during training is shown below

Imgur

alemelis and others added 7 commits July 2, 2019 14:24
warning: format specifies type 'long' but the argument has type 'uint64_t'
./src/col2im.c:43:9: warning: implicitly declaring library function 'memset' with type 'void *(void *, int,
      unsigned long)' [-Wimplicit-function-declaration]
        memset(Y, 0, sizeof(float) * N);  // NOLINT(caffe/alt_fn)
@alemelis
Author

alemelis commented Jul 2, 2019

@AlexeyAB

this is to reply to your questions above

  1. yep

  2. I do not think min_area is better than lowest_w and lowest_h, but it can be manually set depending on your dataset; I found it useful

  3. scale_min and scale_max were hardcoded in pj's. Now they have been discarded, but again I found it useful to make them parameters

  4. done

  5. done

  6. yes

  7. see above

I also fixed a few compilation warnings

@AlexeyAB
Owner

AlexeyAB commented Jul 2, 2019

@alemelis Thank you very much!
Unfortunately, I currently have some problems and no time. As soon as I have time, I will definitely review your code.

Yes, I think scale_min and scale_max are a good feature.

Yes, I see that pjcrop=2 without minarea, and minarea with pjcrop=0/1, give some improvements.

@@ -380,7 +382,7 @@ void fill_truth_detection(const char *path, int num_boxes, float *truth, int cla
 ++sub;
 continue;
 }
-if ((w < lowest_w || h < lowest_h)) {
+if ((w < lowest_w || h < lowest_h) || (area < min_area)) {

lowest_w and lowest_h are both calculated relative to the network size.

Interested in why you've used a constant (w.r.t. network size) float, rather than defining it as the (integer) area in input pixels.

Author

@alemelis alemelis Jul 6, 2019

area is expressed in "yolo" units. Hence, this is independent from the network resolution and it is related to the image size

In my dataset, I mainly use 1920x1080px images, and a min_area=0.0001 will ignore all the annotations smaller than 14x14px. These are arguably useful for learning, but I can reintroduce them by changing min_area without touching the annotations.
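
The arithmetic behind that figure can be checked with a small sketch. The function names are hypothetical, and it assumes that area is the product of the box's relative width and height, which the diff does not show.

```c
/* Pixel area below which a box is dropped, for a given image size. */
double min_area_in_pixels(int img_w, int img_h, double min_area)
{
    return min_area * (double)img_w * (double)img_h;
}

/* Returns 1 if a box of box_w x box_h pixels survives the filter.
   Assumption: area is the product of the relative width and height. */
int keeps_box(int box_w, int box_h, int img_w, int img_h, double min_area)
{
    double area = ((double)box_w / img_w) * ((double)box_h / img_h);
    return area >= min_area;
}
```

For a 1920x1080 image and min_area=0.0001 the threshold is about 207 square pixels, so a 14x14 box (196 square pixels) is dropped while a 15x15 box (225 square pixels) is kept.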

A bit more detail on my thinking..

There may be two reasons for ignoring these small annotations:

a) Because of human error annotating the data.
b) Because we assume there is not enough information to accurately detect an object with a bounding area of <min_area

if (a) is true, then more data should surely fix this (law of large numbers etc). If we really want to remove these, then this is a trivial, quick data-cleanup operation on our training images prior to learning.

if (b) is the reason, then shouldn't this filter be applied after data augmentation, based on the amount of data available during training - rather than the amount of data in the input image?

Author

I agree that it is trivial to fix case a) through some data cleaning, but I found that different architectures (v2, v3, tiny v3, etc...) at different resolutions may require a slightly different min_area threshold. I suspect this indicates that the amount of information in a single annotation may or may not be relevant depending on the net we want to train. Hence, I'd be worried about discarding some annotations a priori only because a certain net couldn't learn anything from them.

This, in turn, links to b).

shouldn't this filter be applied after data augmentation, based on the amount of data available during training - rather than the amount of data in the input image?

This would probably be viable in the case of fixed augmentation, i.e., if the images were always scaled in the same manner. This is not darknet's case, as I showed in the first comment of this PR.

I hope this makes sense :)

This pull request was closed.