fused layernorm #1105
Conversation
yang commented on Dec 24, 2023
- Add simple util for timings
- Add fused layernorm kernel from Megatron
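For context, a minimal sketch of what a CUDA-event-based timing helper might look like is below. This is purely illustrative (assuming PyTorch and a CUDA device); it is not the actual util added in this PR, and all names are hypothetical.

```python
# Hypothetical sketch of a simple CUDA-event timing helper; not the PR's actual util.
import torch


class CudaTimer:
    """Times a code region on the GPU using CUDA events."""

    def __init__(self):
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)
        self.elapsed_ms = None

    def __enter__(self):
        self.start_event.record()
        return self

    def __exit__(self, *exc):
        self.end_event.record()
        torch.cuda.synchronize()  # wait for queued kernels so elapsed_time is valid
        self.elapsed_ms = self.start_event.elapsed_time(self.end_event)


# Usage: time one forward pass of a (non-fused) layer norm.
layer_norm = torch.nn.LayerNorm(4096).cuda()
x = torch.randn(8, 2048, 4096, device="cuda")
with CudaTimer() as t:
    _ = layer_norm(x)
print(f"layer norm forward: {t.elapsed_ms:.3f} ms")
```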
Force-pushed from 6af218c to e9eed53
Howdy! Upon testing, we found no speedup, and an eventual slowdown, when training a 1-3B model for 320 iterations. You can see my test here.
If anything, it looks like the baseline is the one with anomalous behavior here. Did you double- and triple-check by running it multiple times? How sure are you that something weird didn't just happen by magic to cause the baseline to speed up?
@StellaAthena I agree, so I ran some extra tests, still available from the link above. Within the run-to-run variance, I see no difference between fused layer norm on and off.
I looked over these results with @jahatef and agree the difference falls within run-to-run variance. I'm going to run some CUDA profiling to check whether the layernorm kernel reduces data movement, and will merge if so. The rationale is that as newer GPUs spend progressively less time on GEMMs, data movement becomes increasingly critical.
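As a rough illustration of that kind of check, the sketch below uses `torch.profiler` to compare per-kernel CUDA time and memory activity for a layer norm module. It is only an assumed setup, not the actual profiling run behind this decision, and per-kernel DRAM traffic would typically be read from a dedicated profiler such as Nsight Compute rather than from this table.

```python
# Hypothetical profiling sketch; not the actual setup used for this PR's merge decision.
import torch
from torch.profiler import profile, ProfilerActivity


def profile_layer_norm(module, x, label):
    # Warm up so one-time allocations don't pollute the trace.
    for _ in range(3):
        module(x)
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA], profile_memory=True) as prof:
        for _ in range(10):
            module(x)
        torch.cuda.synchronize()
    print(f"--- {label} ---")
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))


x = torch.randn(8, 2048, 4096, device="cuda")
baseline = torch.nn.LayerNorm(4096).cuda()
profile_layer_norm(baseline, x, "torch.nn.LayerNorm")
# The fused kernel from this PR would be profiled the same way, e.g.:
#   profile_layer_norm(fused_layer_norm_module, x, "fused layer norm")
```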
Yeah, that looks correct to me as well.
I see a meaningful decrease in data movement and have confirmed that accuracy is preserved. Merging. Thanks a ton for this, @yang.
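For reference, an accuracy check like the one mentioned could look roughly like the sketch below, which compares a fused layer norm module against `torch.nn.LayerNorm` on random input. The `fused_ln` argument and the assumption that it exposes the same `weight`/`bias` parameter names are hypothetical; this is not the PR's actual test.

```python
# Hypothetical equivalence check between a fused layer norm and torch.nn.LayerNorm.
import torch


def check_layernorm_equivalence(fused_ln, hidden=4096, atol=1e-5, rtol=1e-5):
    """Compare a fused layer norm module against torch.nn.LayerNorm on random input."""
    baseline = torch.nn.LayerNorm(hidden).cuda()
    fused_ln = fused_ln.cuda()
    # Assumes the fused module uses the same "weight"/"bias" parameter names.
    fused_ln.load_state_dict(baseline.state_dict())
    x = torch.randn(8, 2048, hidden, device="cuda")
    out_ref = baseline(x)
    out_fused = fused_ln(x)
    max_err = (out_ref - out_fused).abs().max().item()
    ok = torch.allclose(out_ref, out_fused, atol=atol, rtol=rtol)
    print(f"max abs error: {max_err:.2e}, allclose: {ok}")
    return ok
```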