Adding GRUv3 support. #1675

Merged: 10 commits into FluxML:master on Aug 2, 2021

Conversation

@mkschleg (Contributor) commented Jul 23, 2021

As per the starting discussion in #1671, we should provide support for variations on the GRU and LSTM cells.

In this PR, I added support for the GRU variant found in v3 of the original GRU paper. Flux currently supports only the v1 formulation. TensorFlow supports several variations, and this is one of them.
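
For context, a minimal usage sketch (hypothetical dimensions; it assumes GRUv3 mirrors the constructor and Recur-based call pattern of the existing GRU, which is what this PR does):

using Flux

m = GRUv3(3, 5)         # 3 input features, 5 hidden units
x = rand(Float32, 3)    # a single time step
h = m(x)                # advance the state; returns the 5-element hidden state
Flux.reset!(m)          # reset the hidden state between sequences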

While the feature is added and usable in this PR, this is only a first pass at a design and could use further iteration. Some questions I have:

  • Should we have new types for each variation of these cells? (Another possibility is parametric options.)
  • Should we have a shared constructor similar to TensorFlow/PyTorch? (It might make sense to rename the current GRU to GRUv1 if we do this.)

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • API changes require approval from a committer (different from the author, if applicable)

@mkschleg (Contributor, Author)

I added tests to test/layers/recurrent.jl; I'm not sure whether there are more tests I should add. I also added a docstring for the new method and updated the docstring for the original version to clarify that it implements v1 of the arXiv paper.
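
For reference, a hypothetical sketch of the kind of test added (sizes and structure are illustrative, not the exact code in test/layers/recurrent.jl):

using Flux, Test

@testset "GRUv3 smoke test" begin
  m = GRUv3(3, 5)
  x = rand(Float32, 3)
  @test size(m(x)) == (5,)                        # forward pass yields the hidden state
  gs = gradient(() -> sum(m(x)), Flux.params(m))
  @test !isempty(gs.grads)                        # gradients flow through the new cell
end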

@DhairyaLGandhi (Member) left a comment

Thanks for the contribution! Looking forward to it!

I've left a couple of starter comments. Could you please also add CUDA tests?

state0::S
end

GRUv3Cell(in, out; init = glorot_uniform, initb = zeros32, init_state = zeros32) =
Member:

Needs an activation

Contributor (Author):

I'm following the exact constructor for the GRU currently in Flux. If we want to add activations here, it would make sense to add them for the original GRU and LSTM as well, for consistency.

Member:

AFAIK the activations in LSTM/GRU are very specifically chosen. That's why they are currently not options, and we should probably keep that consistent.

@ToucheSir (Member), Jul 23, 2021:

Worth pointing out that TF and JAX-based libraries do allow you to customize the activation. I presume PyTorch doesn't because it lacks a non-CuDNN path for its GPU RNN backend. That said, this would be better as a separate PR that changes every RNN layer.

Member:

Interesting, is it just the output activation or all of them?

Contributor (Author):

All of them, I believe: TensorFlow GRU.

Member:

All of them, IIUC. There's a distinction between "activation" functions (by default tanh) and "gate" functions (by default sigmoid).

Member:

For a non-Google implementation, here's MXNet.

Member:

Prior art in Flux: #964

Member:

Yeah, I agree that we should have the activations in Flux generally across all layers.
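
To make that concrete, a purely hypothetical sketch of what exposing the two functions could look like; the type, field, and keyword names are illustrative and not part of this PR or of Flux's current API:

using Flux: glorot_uniform, zeros32, σ

struct MyGRUCell{A,V,S,F1,F2}
  Wi::A
  Wh::A
  b::V
  state0::S
  activation::F1        # candidate-state nonlinearity, tanh by default
  gate_activation::F2   # reset/update gate nonlinearity, sigmoid by default
end

MyGRUCell(in, out; activation = tanh, gate_activation = σ, init = glorot_uniform) =
  MyGRUCell(init(out * 3, in), init(out * 3, out), zeros32(out * 3),
            zeros32(out, 1), activation, gate_activation)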

See [this article](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
for a good overview of the internals.
"""
GRUv3(a...; ka...) = Recur(GRUv3Cell(a...; ka...))
Member:

How do the versions differ API-wise? Does this need any extra terms?

Contributor (Author):

It shouldn't need any. The main difference is that the Wh matrix needs to be split so we can apply the reset vector appropriately, but this doesn't require any extra parameters in the constructor.

@CarloLucibello (Member)

There seems to be a lot of code duplication. Adding a symbol parameter to the type, GRU{..., Mode}, and dispatching only the call method on it seems like a better alternative.
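
A rough sketch of that idea, for concreteness (hypothetical names; not the design this PR ends up adopting):

struct MyGRUCell{Mode,A,V,S}
  Wi::A
  Wh::A
  b::V
  state0::S
end

# Only the forward pass dispatches on the mode parameter:
function (m::MyGRUCell{:v1})(h, x)
  # v1 forward pass goes here
end

function (m::MyGRUCell{:v3})(h, x)
  # v3 forward pass goes here, with Wh handled per the split discussed below
end

Constructors would then pin the mode, e.g. GRU(...) building a MyGRUCell{:v1} and GRUv3(...) a MyGRUCell{:v3}.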

@mkschleg (Contributor, Author) commented Jul 23, 2021

Right. That is definitely the question I had. The conversation was derailed a bit in #1671, as we started discussing CuDNN support.

If we were to go with the parametric option, we would have to figure out how to do the operation in the line I tagged you on (it is this line) with the current struct layout. I'm not sure how that would work, to be honest, without extra unnecessary operations or adding a new type parameter for Wh.

gx, gh = m.Wi*x, m.Wh*h
r = σ.(gate(gx, o, 1) .+ gate(gh, o, 1) .+ gate(b, o, 1))
z = σ.(gate(gx, o, 2) .+ gate(gh, o, 2) .+ gate(b, o, 2))
h̃ = tanh.(gate(gx, o, 3) .+ (m.Wh_h̃ * (r .* h)) .+ gate(b, o, 3))
Contributor (Author):

@CarloLucibello This line.

@mkschleg (Contributor, Author)

@DhairyaLGandhi CUDA tests added.

@DhairyaLGandhi (Member)

It might make sense to have a function that is a no-op for the regular GRU layer (without changing the struct) and splits the array for v3.
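
A hypothetical sketch of that helper (the name and slicing layout are assumptions: Wh of size 3*out × out for the regular GRU, with the last out rows acting as the candidate block for v3):

_split_gates(Wh, ::Val{:v1}) = (Wh, nothing)      # no-op for the regular GRU layer

function _split_gates(Wh, ::Val{:v3})
  o = size(Wh, 2)                                 # hidden size
  return Wh[1:2o, :], Wh[2o+1:3o, :]              # reset/update block, candidate block
end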

@mkschleg (Contributor, Author)

I remember having issues with views + Zygote a while ago. Is that still the case? (I haven't tried since v0.10.x.) If we can't use views, the split function would create a larger memory footprint, no? I think @darsnack had some thoughts on this in terms of API clarity.

@DhairyaLGandhi (Member)

Should be the same memory-wise.

@mkschleg (Contributor, Author)

I think I misunderstood what you suggested. This would be in the constructor, right? Then I agree memory would be the same.

We would have to add an extra type parameter to separate Wi and Wh (as they could be different depending on the mode).

So the struct would be:

struct GRUCell{M, Ai, Ah, V, S}
  Wi::Ai
  Wh::Ah
  b::V
  state0::S
end

Where M is the mode. My instinct is to put this first to make dispatch very clear, but we could put it elsewhere.

@darsnack (Member)

Repeating what I mentioned in the other thread: GRU(...; mode = :v3)/GRU{:v3}(...) are just as verbose as GRUv3(...). At first glance, the forward passes seem quite different for the line @mkschleg highlighted. Different forward pass == different types makes sense to me here. Having conditional forward passes based on a parameter is messier IMO, and I don't think there is a function argument we can pass in that runs the two alternate passes (without that function being complex).

We can still reduce code duplication. As an example:

function _gru_output(Wi, Wh, b, x, h)
   o = size(h, 1)    # hidden size
   gx, gh = Wi*x, Wh*h
   r = σ.(gate(gx, o, 1) .+ gate(gh, o, 1) .+ gate(b, o, 1))
   z = σ.(gate(gx, o, 2) .+ gate(gh, o, 2) .+ gate(b, o, 2))

   return r, z
end

Then use _gru_output in all the variants. IMO it makes sense to reduce duplication via the API when it results in an intuitive interpretation of what's going on under the hood. I don't think that's the case here.
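
For instance, a hypothetical sketch of how the v3 forward pass could reuse it (not the code in this PR; it assumes the _gru_output helper above and the field names Wi, Wh, b, Wh_h̃ from the diff earlier in this thread, and recomputes gx only to keep the sketch short):

function (m::GRUv3Cell)(h, x)
  b, o = m.b, size(h, 1)
  r, z = _gru_output(m.Wi, m.Wh, b, x, h)
  gx = m.Wi * x
  h̃ = tanh.(gate(gx, o, 3) .+ m.Wh_h̃ * (r .* h) .+ gate(b, o, 3))
  h′ = (1 .- z) .* h̃ .+ z .* h
  return h′, h′
end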

@ToucheSir (Member)

Exploring the other extreme for a second, I wonder if we could make GRUv3Cell the default GRU cell and set Wh_h̃ to Zeros in the GRUCell constructor. In theory, this would reduce .+ (m.Wh_h̃ * (r .* h)) to a no-op and allow us to use one codepath.

@mkschleg (Contributor, Author) commented Jul 23, 2021

Maybe I'm misunderstanding. That line of the forward for the GRUv3Cell would then look like

h̃ = tanh.(gate(gx, o, 3) .+ r .* gate(gh, o, 3) .+ (m.Wh_h̃ * (r .* h)) .+ gate(b, o, 3))

How would we turn off r .* gate(gh, o, 3) for the v1 version?

@darsnack (Member)

If I understand it correctly, Wh in v1 == [Wh Wh_h̃] in v3. So always adopting gate(Wh * h, o, 1) .+ Wh_h̃ * (r .* h) means that Wh takes on a slightly different interpretation each time. In v1, Wh is plain Wh and Wh_h̃ == Zeros(). In v3, we would initialize Wh to be the first two slices and Wh_h̃ to be the last.

@ToucheSir (Member)

Sorry, that was my mistake! Missed that the v1 also had a middle term and assumed it was another case of https://julialang.zulipchat.com/#narrow/stream/238249-machine-learning/topic/Elman.20RNN.20definition.

GRUv3Cell(init(out * 3, in), init(out * 2, out), initb(out * 3),
init(out, out), init_state(out,1))

function (m::GRUv3Cell{A,V,<:AbstractMatrix{T}})(h, x::Union{AbstractVecOrMat{T},OneHotArray}) where {A,V,T}
Member:

Do we need to remove the types on the input and parameters?

Member:

I believe these were introduced as part of #1521. We should tackle them separately for all recurrent cells.

Contributor (Author):

Maybe we should open a central issue that collects all the updates to recurrent cells that the discussion in this PR and the related issue has raised?

Contributor (Author):

I'm definitely invested in the recurrent architectures for Flux, so I would like to help. But tracking all the outstanding issues is out of scope for the time I can commit right now.

Member:

Yeah, let's get this through and litigate general changes to the RNN interface in a separate issue.

@mkschleg (Contributor, Author)

What needs to be done to push this PR through?

mkschleg and others added 6 commits July 31, 2021 12:10
@ToucheSir (Member)

Running CI to see what it thinks. Otherwise I think the only other change would be to squash some of those intermediate update commits.

@darsnack (Member) left a comment

Can you add an entry to NEWS.md?

@mkschleg (Contributor, Author) commented Aug 2, 2021

@darsnack Where should I put the announcement in NEWS.md?

@darsnack (Member) commented Aug 2, 2021

As a new bullet under 0.12.7 (we're missing a 0.12.6 entry for some reason).

@mkschleg (Contributor, Author) commented Aug 2, 2021

Also, how would I squash those commits? I've not had to do that before.

@darsnack (Member) commented Aug 2, 2021

https://stackoverflow.com/questions/35703556/what-does-it-mean-to-squash-commits-in-git

It's only 3 more commits than necessary, so I don't think you need to bother. In the future, you can go to the "Files" tab of the PR to batch review suggestions into a single commit (instead of accepting each suggestion as a separate commit).

@darsnack (Member) left a comment

Looks good, just gotta wait for CI to pass.

@darsnack (Member) commented Aug 2, 2021

bors r+

@bors (bot) commented Aug 2, 2021

Build succeeded:

bors bot merged commit 5d2a955 into FluxML:master on Aug 2, 2021.