
Make gpu(::DataLoader) work or error loudly if it doesn't #1974

Closed

ToucheSir opened this issue May 27, 2022 · 6 comments · Fixed by #1976

Comments

@ToucheSir
Member

This has come up a couple of times, most recently in https://discourse.julialang.org/t/training-with-flux-jl-on-the-gpu-causes-argumenterror-cannot-take-the-cpu-address-of-a-cuarray/81751/2. RFC on what the desired approach should be.

@DhairyaLGandhi
Member

Is DataLoader(gpu(..)) sufficiently succinct syntax?
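
For concreteness, that pattern would look roughly like this (a sketch with placeholder data, assuming the whole dataset fits in GPU memory):

using Flux, CUDA

xtrain, ytrain = rand(Float32, 10, 100), rand(Float32, 1, 100)  # placeholder data
# move the whole dataset to the GPU once, then batch over device arrays
train_loader = Flux.DataLoader((gpu(xtrain), gpu(ytrain)); batchsize = 16, shuffle = true)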

@darsnack
Member

I don't think gpu(DataLoader(...)) should be made to work, because we don't know whether the entire dataset fits in GPU memory or not. Depending on your dataset, you might want gpu.(...) or mapobs(gpu, ...). A loud error suggesting either would be good.
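
For example, the two options might look like this (a sketch; mapobs comes from MLUtils, and the data names are placeholders):

using Flux, MLUtils, CUDA

data = (rand(Float32, 10, 100), rand(Float32, 1, 100))  # placeholder dataset
# eager: the dataset fits in GPU memory, so move it all at once
eager_loader = Flux.DataLoader(gpu.(data); batchsize = 16)
# lazy: move observations to the GPU only when they are fetched
lazy_loader = Flux.DataLoader(mapobs(gpu, data); batchsize = 16)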

@DhairyaLGandhi
Member

Right: suggesting that the data be moved prior to constructing the DataLoader lets users express their intent, namely whether they want to move the whole dataset or not. Beyond that, allowing DataLoader to accept gpu as a preprocessing step would allow expressing "load minibatches onto the GPU". CuIterator is another option; see the sketch below.
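
For reference, the CuIterator route would look roughly like this (a sketch; CuIterator comes from CUDA.jl and eagerly frees each batch's device memory once the next batch is fetched):

using Flux, CUDA

xtrain, ytrain = rand(Float32, 10, 256), rand(Float32, 1, 256)  # placeholder data
loader = Flux.DataLoader((xtrain, ytrain); batchsize = 64, shuffle = true)
for (x, y) in CuIterator(loader)
    # x and y are CuArrays that are only valid for this iteration
    @assert x isa CuArray && y isa CuArray
end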

@lfenzo
Contributor

lfenzo commented May 28, 2022

When I created this post I knew that my data would fit in GPU memory (it was not a very big dataset), but I didn't really understand why the data was not being loaded onto the GPU. I thought that if the whole DataLoader object was fed to the GPU (using dataloader |> gpu), then so would be the data inside it. I also thought that transferring the whole training dataset to the GPU, since there was enough memory for it, would be better than iteratively transferring the training batches, but my tests showed no significant difference for that particular dataset and training configuration.

IMO loading the whole object with its corresponding data onto the GPU with gpu(::DataLoader) is fine. If the GPU runs out of memory during allocation, an error will be thrown, and the stacktrace will make it possible to identify its source; alternatively, the gpu methods could point out which object caused the allocation that triggered the error.

The real issue, though, was that I couldn't find any good examples of GPU usage in Flux until I decided to dig into the model zoo for more practical usage scenarios. It would have been easier to solve this issue if the "GPU Support" page in the docs presented some workflows, similar to the ones in the model zoo, showing how to transfer artifacts between CPU and GPU. Common operations such as transferring data and models, exemplified and explained, would be extremely helpful!

Some of the examples could include, for instance:

  • Loading data to the GPU to perform training
train_loader = Flux.DataLoader((xtrain, ytrain), batchsize = 64, shuffle = true)
# ... model, optimizer and loss definitions
for epoch in 1:nepochs
    for (xtrain_batch, ytrain_batch) in train_loader
        x = gpu(xtrain_batch)
        y = gpu(ytrain_batch)
        gradients = gradient(() -> loss(x, y), parameters)
        Flux.Optimise.update!(optimizer, parameters, gradients)
    end
end
  • Or returning the model back to the CPU in order to save it
let model = cpu(model)
    BSON.@save modelpath model epoch
end

@ToucheSir
Member Author

I think a big argument for not exposing gpu(::DataLoader) is that you can't tell whether or not it worked. gpu is necessarily a no-op for any types it doesn't understand, and the internals of DataLoader are completely opaque to the user. Therefore, it's much easier to ask that data be moved before it is passed to or after it is pulled from the data loader. If I had to guess, this is also part of why PyTorch exposes .to(device) on Datasets but not DataLoaders.
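
To illustrate the silent failure mode (a sketch of the pre-#1976 behaviour as I understand it, with placeholder data):

using Flux, CUDA

loader = Flux.DataLoader(rand(Float32, 10, 100); batchsize = 10)
moved = gpu(loader)       # no error is raised...
first(moved) isa CuArray  # ...but this can still be false, because gpu
                          # falls back to the identity on the opaque DataLoader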

But as you pointed out, the first-order issue is a lack of good docs. If you have a rough idea of how you'd like them to look, please do drop a PR and we'll make sure it gets in. One reason I think the status quo has persisted for so long is that the most efficient method we have for GPU data movement is https://cuda.juliagpu.org/stable/usage/memory/#Batching-iterator, but it's not quite general enough for all datasets people want to use. However, that can be remedied by just having slightly longer documentation which presents multiple alternatives.

@DhairyaLGandhi
Member

DhairyaLGandhi commented May 29, 2022 via email
