Blur #101
Conversation
@vrandme thanks for the submission, I've been focused on a few other things lately so I'm a bit slow on this right now. I don't see any issues with what's there so far, but ultimately (as you allude to), success depends on the integration in an actual network. What sort of testing have you done so far on this? I assume you did some basic checks, inputting simple patterns where an inspection would confirm appropriate blur. Have you tried integrating it into a basic network and training? As for the combination with the other layers (let's call it integration strategies), there was some relevant Twitter chatter I was involved in a while back on the topic...
I implemented the rest of the blur. I did not spend too much time on how best to incorporate the blur pool into ResNets since, in my view, the original authors and the researchers from clova-ai did more than enough testing already; I merely implemented the two best methods they had already found. As for independent testing, the BlurPool2d function has been tested both manually and against the original implementation for several random inputs. However, the fully implemented version is almost untested. This is obviously far from enough testing, but since I am unable to port the original weights for various reasons at this time, I think this is as far as I can go at the moment. Testing-wise, I think the following could be done:
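For the random-input comparison mentioned above, a minimal sketch could look like the following (the `blurpool` argument stands in for the layer in this PR, and the reference is built by hand rather than imported from the original repo):

```python
import torch
import torch.nn.functional as F

def reference_blur3(x):
    # Hand-built reference: reflection pad, then a depthwise stride-2 conv with the
    # normalized 3x3 binomial kernel ([1, 2, 1] outer [1, 2, 1]) / 16.
    c = x.shape[1]
    k = torch.tensor([1., 2., 1.])
    k = (k[:, None] * k[None, :]) / 16.
    k = k[None, None].repeat(c, 1, 1, 1)
    x = F.pad(x, (1, 1, 1, 1), mode='reflect')
    return F.conv2d(x, k, stride=2, groups=c)

def check_blurpool(blurpool, n_trials=5):
    # Compare the PR's layer against the reference on several random inputs.
    for _ in range(n_trials):
        x = torch.randn(4, 8, 16, 16)
        assert torch.allclose(blurpool(x), reference_blur3(x), atol=1e-6)
```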
@vrandme thanks, I'll see if I can get a training session running at some point against a somewhat recent non-blur ResNet comparison point.
@vrandme I fixed the bottleneck and ran a training session on resnetblur50 ... 79.288 top-1 without using the JSD loss, not bad. There is some other cleanup I want to do before merging, but it definitely seems worth adding. Also noticed someone tweeting about this yesterday: https://github.com/mrT23/TResNet. It uses the AA blur, a stem modification, and Inplace ABN for BN.
I did read the paper, and here are my thoughts. Getting back to their Blur implementation, you can consider it a further cut-down and optimized version of my code. Pseudocode is in the paper; the actual code is contained here. Here is my conclusion: unless you want to maintain your code base towards the maximally experimentable direction, ... When this is merged, I'll open a TResNet issue here discussing overall changes and low-hanging fruit regarding this codebase.
@vrandme I'm in no rush to replicate TResNet, just curious if you'd seen it since it's in the same lineage of ResNet experiments. I'm not sure if JIT will actually have an impact for the blur, but that's pretty easy to test. Agreed that some of the changes are hard to integrate nicely in an on/off fashion that won't impact backwards weight compatibility. I was thinking of fiddling with the ABN impl before reading this paper; I may try a BnAct factory version of ResNet and see if I can maintain weight compat. I made some changes to the blur filter generation in this PR so it works for any filter size. Also got the filters out of the state_dict without too much of a hack.
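As a sketch of what size-agnostic filter generation can look like (names are illustrative, not the exact code in this PR; the 1D coefficients come from expanding (0.5 + 0.5x)^(n-1), so the 2D kernel already sums to 1 for any size):

```python
import numpy as np
import torch

def make_blur_filter(filt_size: int, channels: int) -> torch.Tensor:
    # Normalized row of Pascal's triangle, e.g. filt_size=3 -> [0.25, 0.5, 0.25]
    coeffs = (np.poly1d((0.5, 0.5)) ** (filt_size - 1)).coeffs
    a = torch.tensor(coeffs, dtype=torch.float32)
    filt = a[:, None] * a[None, :]                       # 2D binomial kernel
    return filt[None, None].repeat(channels, 1, 1, 1)    # shape for a depthwise conv

# One way to keep the fixed filter out of the state_dict is a non-persistent buffer
# (the persistent flag needs PyTorch >= 1.6); recomputing it in forward() is another.
# self.register_buffer('filt', make_blur_filter(filt_size, channels), persistent=False)
```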
Initial implementation of blur layer; currently tests as correct against the Downsample of the original GitHub repo
clean up code for PR
1. add ResNet argument blur=''
2. implement blur for maxpool and strided convs in downsampling blocks
…and tweaks to go...
Haven't forgotten about this; since it impacts the main ResNet class, I wanted to tweak a few things and run a full regression test of all the other dependent models. I did try a filter size 5 run; it was worse than the default 3.
Hmm, considering that the recent TResNet merge included an anti-aliasing layer... Although I feel there is enough evidence that filter size 3 is the best compromise for the models and datasets that are widespread, if you would like to test more, here are some pointers. There is an obvious trade-off between resource use (runtime and memory) and accuracy as filter size increases. All in all, I am personally content with the anti-aliasing layer pulled in via #121 with the JIT-optimized code.
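As a purely illustrative sketch of measuring the runtime side of that trade-off (a memory comparison would look similar with torch.cuda.max_memory_allocated on GPU):

```python
import time
import torch
import torch.nn.functional as F

# Binomial kernels of increasing size (rows of Pascal's triangle), applied as a
# depthwise stride-2 conv, the same way the blur pool layer does.
coeffs = {3: [1., 2., 1.], 5: [1., 4., 6., 4., 1.], 7: [1., 6., 15., 20., 15., 6., 1.]}
x = torch.randn(32, 64, 56, 56)
for size, row in coeffs.items():
    a = torch.tensor(row)
    filt = a[:, None] * a[None, :]
    filt = (filt / filt.sum())[None, None].repeat(64, 1, 1, 1)
    pad = [size // 2] * 4
    with torch.no_grad():
        F.conv2d(F.pad(x, pad, mode='reflect'), filt, stride=2, groups=64)  # warm-up
        t0 = time.perf_counter()
        for _ in range(50):
            F.conv2d(F.pad(x, pad, mode='reflect'), filt, stride=2, groups=64)
        print(f'filt_size={size}: {time.perf_counter() - t0:.3f}s for 50 iters')
```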
Continuing my attempts at Issue #90,
I implemented a blur pool layer for future integration.
Although I wanted to copy Kornia's implementation, I found that it tightly couples max pooling with the downsampling and also has an internal dependency for pyramid downsampling.
As the original implementation suggests three different strategies for anti-aliasing (each involving max pooling, average pooling, and strided convolution), I had to decouple the code from both max pooling and the internal dependency.
I ended up re-implementing the whole thing.
In doing so, I made several design choices.
First, I only implemented the two most common binomial filters (sizes 3 and 5) for downsampling.
(Kornia seems to implement only one, while the original repo implements seven.)
This is because the original paper says the most commonly used filter sizes would be 3 and 5, and the code base I'm working up to (#90) tested only 3 and 5 (and settled on 3). It would not be too hard to implement other filter sizes should the need arise.
Secondly, I removed the options for alternative padding strategies, since neither Kornia, the original anti-aliasing paper, nor #90 discusses the trade-offs between them, and I just used the original reflective padding strategy.
I thought about the consequences of the various padding strategies, but it seems that only the original one makes sense.
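For reference, here is a minimal sketch of the layer as described above (reflective padding, fixed binomial filter, depthwise strided conv); the names and exact signature are illustrative, not necessarily the final code in this PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Anti-aliased downsampling: blur with a fixed binomial filter, then subsample."""
    _COEFFS = {3: [1., 2., 1.], 5: [1., 4., 6., 4., 1.]}  # only the two common sizes

    def __init__(self, channels: int, filt_size: int = 3, stride: int = 2):
        super().__init__()
        assert filt_size in self._COEFFS, 'only filter sizes 3 and 5 are implemented'
        self.channels = channels
        self.stride = stride
        self.padding = [filt_size // 2] * 4
        a = torch.tensor(self._COEFFS[filt_size])
        filt = a[:, None] * a[None, :]
        filt = filt / filt.sum()                        # normalize so the blur preserves the mean
        self.register_buffer('filt', filt[None, None].repeat(channels, 1, 1, 1))

    def forward(self, x):
        x = F.pad(x, self.padding, mode='reflect')      # the original reflective padding
        return F.conv2d(x, self.filt, stride=self.stride, groups=self.channels)
```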
Currently, the blur pool layer is not yet used by ResNet as intended.
This is because further design choices are necessary.
Although potentially every max pool, average pool, and stride-2 convolution could be blurred with this,
the original paper and #90 show that doing so is both unnecessary and inefficient.
So my next step would be to implement this in the style of #90 (certain downsampling strided convs only), exposing only a boolean argument somewhere.
Next, it could be extended to a value argument that selects between vanilla, the #90 style, and the original paper's style (all? strided convs and max pools).
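As a rough sketch of the #90-style wiring (a single boolean flag; the helper names here are hypothetical, not the actual ResNet code in this repo), the two replacement points might look like this:

```python
import torch.nn as nn

def make_stem_pool(blur: bool = False):
    # Stem: the max pool keeps its 3x3 window but loses its stride; BlurPool2d
    # (sketched above) does the actual 2x downsampling when blur is enabled.
    if blur:
        return nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            BlurPool2d(channels=64, filt_size=3, stride=2),  # 64 = typical ResNet stem width
        )
    return nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

def make_block_conv(in_ch, out_ch, stride=1, blur: bool = False):
    # Downsampling 3x3 conv inside a block: run it at stride 1 and let BlurPool2d
    # do the stride-2 subsampling afterwards.
    if blur and stride > 1:
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1, bias=False),
            BlurPool2d(channels=out_ch, filt_size=3, stride=stride),
        )
    return nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
```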
I guess I'd need to make some design decisions and then see how it works.
In the meantime, I would like you to look at the code and point out any improvements or changes you would like to see before I move forward.