fused layernorm #1105

Merged · 5 commits merged into EleutherAI:main on Jan 26, 2024

Conversation

@yang (Contributor) commented Dec 24, 2023

  • Add simple util for timings
  • Add fused layernorm kernel from Megatron
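
For readers unfamiliar with the first piece, here is a minimal sketch of what a CUDA-event timing helper can look like; the `cuda_timeit` name, signature, and call site are hypothetical illustrations, not the actual util added in this PR.

```python
import torch

def cuda_timeit(fn, *args, warmup=10, iters=100):
    """Time a CUDA callable with CUDA events; returns mean milliseconds per call.

    Hypothetical helper for illustration -- not necessarily the util added in this PR.
    """
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Example: time stock LayerNorm on a transformer-sized activation (shapes illustrative).
if torch.cuda.is_available():
    x = torch.randn(8, 2048, 4096, device="cuda", dtype=torch.float16)
    ln = torch.nn.LayerNorm(4096).half().cuda()
    print(f"LayerNorm: {cuda_timeit(ln, x):.3f} ms")
```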

@CLAassistant commented Dec 24, 2023

CLA assistant check
All committers have signed the CLA.

@yang force-pushed the fused-layernorm branch 2 times, most recently from 6af218c to e9eed53 on January 5, 2024 at 01:43
@jahatef (Collaborator) commented Jan 20, 2024

Howdy,

Upon testing, we found no speedup and eventual slowdown when training a 1-3B model for 320 iterations. You can see my test here.

@StellaAthena (Member) commented Jan 20, 2024

> Howdy,
>
> Upon testing, we found no speedup and eventual slowdown when training a 1-3B model for 320 iterations. You can see my test here.

If anything, it looks like the baseline is the one with anomalous behavior here. Did you double and triple check by running it multiple times? How sure are you that something weird didn't just happen by magic to cause the baseline to speed up?

@jahatef (Collaborator) commented Jan 21, 2024

@StellaAthena I agree, so I ran some extra tests, still available from the link above. Within the variance of the runs, I see no difference between fused layer norm being on or off.
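
As an aside, a kernel-level micro-benchmark can complement the full training runs above by isolating the layernorm op itself from run-to-run training variance. A minimal sketch, assuming apex is installed for the fused path; the shapes and iteration counts are illustrative, and this is not the linked test:

```python
import torch
from torch.utils import benchmark
# Assumes apex is installed; the kernel in this PR is Megatron's variant of the
# same idea, not necessarily this exact class.
from apex.normalization import FusedLayerNorm

hidden = 4096  # illustrative transformer hidden size
x = torch.randn(8, 2048, hidden, device="cuda", dtype=torch.float16)

unfused = torch.nn.LayerNorm(hidden).half().cuda()
fused = FusedLayerNorm(hidden).half().cuda()

for name, ln in [("unfused", unfused), ("fused", fused)]:
    t = benchmark.Timer(stmt="ln(x)", globals={"ln": ln, "x": x})
    print(name, t.timeit(200))  # Timer handles CUDA synchronization internally
```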

@Quentin-Anthony (Member) commented:

I looked over these results with @jahatef and agree that the difference falls within run-to-run variance. I'm going to run some CUDA profiling to check whether data movement decreases with the layernorm kernel, and will merge if so, on the grounds that as newer GPUs spend progressively less time on GEMMs, data movement becomes more critical.
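
For anyone wanting to reproduce a data-movement check, one rough approach is the PyTorch profiler's per-kernel table; this is only a sketch (the actual profiling referred to here may have used Nsight or other tools), and `ln` below is the stock unfused baseline for comparison:

```python
import torch
from torch.profiler import profile, ProfilerActivity

hidden = 4096  # illustrative shapes
x = torch.randn(8, 2048, hidden, device="cuda", dtype=torch.float16)
ln = torch.nn.LayerNorm(hidden).half().cuda()

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        ln(x)
    torch.cuda.synchronize()

# The table shows how many distinct CUDA kernels the implementation launches and
# how long they run; an unfused layernorm typically launches more kernels, each
# re-reading and re-writing the activation in global memory.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```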

@StellaAthena (Member) commented:
Yeah that looks correct to me as well.

@Quentin-Anthony (Member) commented:
I see a meaningful decrease in data movement and confirmed that accuracy is preserved. Merging. Thanks a ton for this, @yang.

@Quentin-Anthony merged commit 3d8fec0 into EleutherAI:main on Jan 26, 2024 (2 of 5 checks passed).