Error when training simple Flux model #1777

Closed
egolep opened this issue Nov 23, 2021 · 9 comments

@egolep

egolep commented Nov 23, 2021

I'm training a pretty simple model:

```julia
shallow_net = Chain(
    Dense(18, 256, relu),
    Dense(256, 512, relu),
    Dense(512, 1024, relu),
    Dense(1024, 1)
)
```

The dataset is a plain Matrix of Float64 (as reported in the error below). I'm using the ADAM optimizer and a DataLoader:

```julia
train_loader = DataLoader((data=train_X, label=train_y),
                          batchsize=batch_size,
                          shuffle=true)
```

I'm training with Flux's default train! function and the @epochs macro.
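Roughly, the training code looks like this (a minimal sketch against the Flux 0.12 API; the mse loss and epoch count are placeholders, not the exact code from my repo):

```julia
using Flux
using Flux: @epochs
using Flux.Data: DataLoader

# Placeholder loss: each batch yielded by the DataLoader above is a
# named tuple with fields `data` and `label`.
loss(batch) = Flux.mse(shallow_net(batch.data), batch.label)

opt = ADAM()
ps  = Flux.params(shallow_net)

# Placeholder epoch count.
@epochs 10 Flux.train!(loss, ps, train_loader, opt)
```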

The code was working with no problem on my office machine, but on my home machine I keep getting:

ERROR: Need an adjoint for constructor Base.SkipMissing{Matrix{Float64}}. Gradient is of type Vector{Float64}

I don't get what the problem is (it's literally the same code, since I pulled it from a private git repo).
Is it a problem with a new version of Flux?

The two machines are pretty similar; both have an AMD Ryzen 5 CPU and a 2xxx-series NVIDIA GPU.

@DhairyaLGandhi
Member

What versions of Julia, Flux and Zygote are being used on the two machines?

@egolep
Author

egolep commented Nov 23, 2021

I updated my system this morning and now it is not working (same error) even on my office machine.
The system is:
- Julia: 1.6.4
- Flux: 0.12.8
- Zygote: 0.6.30

I was probably running Julia 1.6.3 before the update.

@ToucheSir
Member

As with any issue, the two things we need are:

  1. A full stacktrace. Just the first line is not nearly enough to see where the error propagated from.
  2. A runnable MWE. Dummy data is fine, but it must reproduce the error.

For this issue specifically, I find the presence of Base.SkipMissing extremely suspect. Perhaps some package code is now calling skipmissing under the hood where it wasn't before?

@DhairyaLGandhi
Member

Yeah, maybe @nograd Base.SkipMissing is all we need, but I don't know exactly how skipmissing is implemented in the latest Julia and whether it ends up messing with iterate.
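Something like the following (an untested sketch: it marks the skipmissing function rather than the SkipMissing constructor, and since @nograd simply drops gradients through that call it might hide the problem rather than fix it):

```julia
using Zygote

# Untested sketch: declare skipmissing as non-differentiable so Zygote
# stops trying to derive a pullback through SkipMissing. Any gradient
# flowing through the skipped values is silently dropped.
Zygote.@nograd skipmissing
```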

@egolep
Author

egolep commented Nov 23, 2021

Forgive me for reporting only the first line, but the error looked so strange that I thought it would be easy to recognize.

The full stack trace is here:
https://pastebin.com/gN1JSTc6

An extract of the dataset can be found here:
https://drive.google.com/file/d/1myc1P-JESrq24m4yhUWV2yzdckwgcn2e/view?usp=sharing

The full code is here:
https://pastebin.com/iMmth8xn (sorry, there's a mistake in it: MLJ is actually imported as a whole, since I also use rms)

(It's one of the first times I've tried to use Flux, so the code is probably garbage; any suggestion beyond understanding the error is very welcome.)

The versions of the other libraries are:

  • DataFrames: 1.2.2
  • CSV: 0.9.11
  • CUDA: 3.5.0
  • MLJ: 0.16.11

@ToucheSir
Member

Thanks. For future reference, you can use gists or Markdown code blocks for smaller snippets (including attaching CSV data files). A couple of questions:

  1. Where is rms defined? I don't see it anywhere in the imported packages or by searching on JuliaHub.
  2. What are the types of train_X, train_y, test_X and test_y? I would assume they are plain Arrays because of collect, but there may be other wrappers leaking through.

@egolep
Author

egolep commented Nov 23, 2021

  1. rms is defined in MLJ; that's why the whole package is imported rather than just using it to import partition.
  2. They are Matrix{Float64}. I added the collect call because I saw "adjoint" in the error message, and I initially thought the problem could be that X is transposed, so that train_X, train_y, test_X and test_y were adjoint(::Matrix{Float64}) with eltype Float64 (see the snippet below).
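For reference, this is what I mean (an illustrative snippet, not the exact code from the repo):

```julia
# Illustrative only: X' is a lazy Adjoint wrapper around the original data;
# collect materialises it into a plain Matrix{Float64}.
X = rand(100, 18)
X_adj   = X'            # 18×100 adjoint(::Matrix{Float64}) with eltype Float64
train_X = collect(X')   # 18×100 Matrix{Float64}
```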

Thanks for the suggestions about gists and code blocks; I will definitely use them!

@egolep
Author

egolep commented Nov 23, 2021

And it looks like the problem really was rms: using Flux's mse and applying a square root to it resolved the issue.
Thanks for your advice and for your time.
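Concretely, the working loss now looks roughly like this (a sketch, assuming the same named-tuple batches as the DataLoader above):

```julia
# Sketch of the fix: compute the RMS via Flux's mse plus a square root,
# which avoids MLJ's rms (and its internal skipmissing call) inside the gradient.
rmse_loss(batch) = sqrt(Flux.mse(shallow_net(batch.data), batch.label))
```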

@ToucheSir
Member

Glad you managed to solve it! For posterity's sake, MLJBase's rms calls https://github.com/JuliaAI/MLJBase.jl/blob/5c2a98cba32c094414c71ed01d028e4be5dee865/src/measures/measures.jl#L163, which calls https://github.com/JuliaAI/MLJBase.jl/blob/v0.18.26/src/data/data.jl#L388, which calls skipmissing. We ought to figure out how to make that function work with Zygote, but for this case calling sqrt(Flux.mse(...)) as you describe is definitely the best (and fastest) way to go.
