When your dataset generates random numbers (e.g. for data augmentation), the seeding in dataloader workers is only handled correctly for torch; other libraries like numpy end up with an identical seed in all workers [1]. The result is that data augmentations are the same across workers, which can obviously hurt performance.
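A minimal sketch of how the issue shows up (the dataset and sizes here are made up for illustration). With the default fork start method on Linux, each worker inherits a copy of numpy's global RNG state, so the numpy draws repeat across workers:

```python
import numpy as np
from torch.utils.data import DataLoader, Dataset

class AugmentedDataset(Dataset):
    """Stand-in dataset whose 'augmentation' draws from numpy's global RNG."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # e.g. a random crop offset; every worker draws the same sequence
        return np.random.randint(0, 1000)

loader = DataLoader(AugmentedDataset(), num_workers=4)
# With batch_size=1 the workers serve items round-robin, so the first
# four values come out identical, then the next four, and so on.
print([int(batch) for batch in loader])
```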
Lightning could provide a default worker init fn that takes care of this issue. Even if we don't add a default, the question is how a user would do it properly in Lightning today. The tricky part is that the seed needs to be reset every epoch (whenever the dataloader is consumed completely).
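One way to do it manually today, as a sketch following the pattern from the PyTorch docs: derive numpy's (and `random`'s) seed from torch's per-worker seed. Since torch draws a fresh base seed each time a new dataloader iterator is created, this also covers the per-epoch reset (caveat: with `persistent_workers=True` the workers, and thus their seeds, survive across epochs):

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id: int) -> None:
    # Inside a worker, torch.initial_seed() returns base_seed + worker_id,
    # and base_seed changes every time a new iterator is created (i.e.
    # every epoch), so numpy/random get a unique seed per worker and epoch.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# reusing the AugmentedDataset from the sketch above
loader = DataLoader(AugmentedDataset(), num_workers=4, worker_init_fn=seed_worker)
```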
Note: An easy way out of this is to not use numpy at all to generate random numbers in the dataloader. When using torch, you're fine.
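For completeness, a sketch of that torch-only alternative: torch's per-worker RNG is already seeded correctly, so no worker init fn is needed at all.

```python
import torch
from torch.utils.data import Dataset

class TorchAugmentedDataset(Dataset):
    """Same stand-in dataset, but drawing from torch's RNG instead of numpy's."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # torch seeds each worker (and each epoch) differently out of the box
        return int(torch.randint(0, 1000, (1,)))
```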
[1] https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects/