Error when training simple Flux model #1777

Closed
egolep opened this issue Nov 23, 2021 · 9 comments

@egolep

egolep commented Nov 23, 2021

I'm training a pretty simple model:

```julia
shallow_net = Chain(
    Dense(18, 256, relu),
    Dense(256, 512, relu),
    Dense(512, 1024, relu),
    Dense(1024, 1)
)
```

The dataset is a plain Matrix of Float64 (as reported in the error below). I'm using the ADAM optimizer and a DataLoader:

```julia
train_loader = DataLoader((data=train_X, label=train_y),
                          batchsize=batch_size,
                          shuffle=true)
```

I'm training with Flux's default train! function and the @epochs macro.
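Roughly, the training code looks like this (a minimal sketch against the Flux 0.12 API; the mse loss and epoch count are placeholders, not the exact code from my repo):

```julia
using Flux
using Flux: @epochs
using Flux.Data: DataLoader

# Placeholder loss: each batch yielded by the DataLoader above is a
# named tuple with fields `data` and `label`.
loss(batch) = Flux.mse(shallow_net(batch.data), batch.label)

opt = ADAM()
ps  = Flux.params(shallow_net)

# Placeholder epoch count.
@epochs 10 Flux.train!(loss, ps, train_loader, opt)
```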

The code was working with no problem on my office machine, but on my home machine I keep getting:

ERROR: Need an adjoint for constructor Base.SkipMissing{Matrix{Float64}}. Gradient is of type Vector{Float64}

I don't get what the problem is (it's literally the same code, since I pulled it from a private git repo).
Is it a problem with a new version of Flux?

The two machines are pretty similar; both have an AMD Ryzen 5 CPU and a 2xxx-series NVIDIA GPU.

@DhairyaLGandhi
Member

What versions of Julia, Flux and Zygote are being used on the two machines?

@egolep
Author

egolep commented Nov 23, 2021

I updated my system this morning and now it is not working (same error) even on my office machine.
The system is:
- Julia: 1.6.4
- Flux: 0.12.8
- Zygote: 0.6.30

I was probably running Julia 1.6.3 before the update.

@ToucheSir
Member

As with any issue, the two things we need are:

  1. A full stacktrace. Just the first line is not nearly enough to see where the error propagated from.
  2. A runnable MWE. Dummy data is fine, but it must reproduce the error.

For this issue specifically, I find the presence of Base.SkipMissing extremely suspect. Perhaps some package code is now calling skipmissing under the hood where it wasn't before?

@DhairyaLGandhi
Member

Yeah, maybe @nograd Base.SkipMissing is all we need, but I don't know exactly how skipmissing is implemented in the latest Julia and whether it ends up messing with iterate.
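Something like the following (an untested sketch: it marks the skipmissing function rather than the SkipMissing constructor, and since @nograd simply drops gradients through that call it might hide the problem rather than fix it):

```julia
using Zygote

# Untested sketch: declare skipmissing as non-differentiable so Zygote
# stops trying to derive a pullback through SkipMissing. Any gradient
# flowing through the skipped values is silently dropped.
Zygote.@nograd skipmissing
```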

@egolep
Author

egolep commented Nov 23, 2021

Forgive me for reporting only the first line, but the error looked so strange that I thought it would be easy to recognize.

The full stack trace is here:
https://pastebin.com/gN1JSTc6

An extract of the dataset can be found here:
https://drive.google.com/file/d/1myc1P-JESrq24m4yhUWV2yzdckwgcn2e/view?usp=sharing

The full code is here:
https://pastebin.com/iMmth8xn (sorry, there's a mistake in it: MLJ is actually imported as a whole, since I also use rms)

(It's one of the first times I've tried to use Flux, so the code is probably garbage; any suggestion beyond understanding the error is very welcome.)

The versions of the other libraries are:

  • DataFrames: 1.2.2
  • CSV: 0.9.11
  • CUDA: 3.5.0
  • MLJ: 0.16.11

@ToucheSir
Member

Thanks. For future reference, you can use gists or Markdown code blocks for smaller snippets (including attaching CSV data files). A couple of questions:

  1. Where is rms defined? I don't see it anywhere in the imported packages or by searching on JuliaHub.
  2. What are the types of train_X, train_y, test_X and test_y? I would assume they are plain Arrays because of collect, but there may be other wrappers leaking through.

@egolep
Author

egolep commented Nov 23, 2021

  1. rms is defined in MLJ; that's why the whole package is imported rather than just using it to import partition.
  2. They are Matrix{Float64}. I added the collect call because I saw "adjoint" in the error message, and I initially thought the problem could be that X is transposed, so that train_X, train_y, test_X and test_y were adjoint(::Matrix{Float64}) with eltype Float64 (see the snippet below).
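For reference, this is what I mean (an illustrative snippet, not the exact code from the repo):

```julia
# Illustrative only: X' is a lazy Adjoint wrapper around the original data;
# collect materialises it into a plain Matrix{Float64}.
X = rand(100, 18)
X_adj   = X'            # 18×100 adjoint(::Matrix{Float64}) with eltype Float64
train_X = collect(X')   # 18×100 Matrix{Float64}
```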

Thanks for the suggestions about gists and code blocks; I will definitely use them!

@egolep
Author

egolep commented Nov 23, 2021

And it looks like the problem really was rms: using Flux's mse and applying a square root to it resolved the issue.
Thanks for your advice and for your time.
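Concretely, the working loss now looks roughly like this (a sketch, assuming the same named-tuple batches as the DataLoader above):

```julia
# Sketch of the fix: compute the RMS via Flux's mse plus a square root,
# which avoids MLJ's rms (and its internal skipmissing call) inside the gradient.
rmse_loss(batch) = sqrt(Flux.mse(shallow_net(batch.data), batch.label))
```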

@ToucheSir
Member

Glad you managed to solve it! For posterity's sake, MLJBase's rms calls https://github.com/JuliaAI/MLJBase.jl/blob/5c2a98cba32c094414c71ed01d028e4be5dee865/src/measures/measures.jl#L163, which calls https://github.com/JuliaAI/MLJBase.jl/blob/v0.18.26/src/data/data.jl#L388, which calls skipmissing. We ought to figure out how to make that function work with Zygote, but for this case calling sqrt(Flux.mse(...)) as you describe is definitely the best (and fastest) way to go.
