Aggregated adamw update #16398
Conversation
@eric-haibin-lin FYI
LGTM. @eric-haibin-lin do you have any more comments?
The comment below is not specific to this PR:
It looks like a lot of code changes are needed to register operators for the multi_xx_update ops. What do you think would help reduce the development cycle for such ops? As we add more ops like this, the C++ code becomes less readable. Maybe TVM could help generate multi-tensor kernels?
Generally I would opt for cleaning the optimizers so that only the …
* Trigger CI
* MXNet operator for aggregated Adam update
* Fixing problem with getRescaleGrad(...) call in Python2 and some minor changes requested by Przemek
* Fix a problem appearing in Python2
* Minor cleanup
* Changing function name
* Trigger CI
* Eliminating "asnumpy()" conversion
* Trigger CI
Description
MXNet operator for aggregated Adam update.
Checklist
Essentials
Changes
- Tests, (and when applicable, API doc): the tests cover the `clip_gradient` parameter and random variations for `lr`, `eta`, `wd` and `shape`.