I noticed that when adding the token embedding and the position embedding, llm.c uses the float4 type:
```cuda
// use of float4 leads to using 128-bit LDG / STG instructions in SASS,
// very helpful in memory-bound kernels like encoder_forward
__global__ void encoder_forward_kernel3(float4* out,
                                        const int* inp, const float4* wte, const float4* wpe,
                                        int B, int T, int C) {
    int C4 = C / 4;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int N = B * T * C4;
    if (idx < N) {
        int bt = idx / C4;
        int b = bt / T;
        int t = bt % T;
        int c4 = idx % C4;
        int ix = inp[b * T + t];
        out[b * T * C4 + t * C4 + c4] = add_float4(wte[ix * C4 + c4], wpe[t * C4 + c4]);
    }
}
```
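For context, `add_float4` is defined elsewhere in llm.c; it is presumably a simple elementwise add along these lines (a sketch, not the exact source):

```cuda
// Sketch of an elementwise float4 add; llm.c defines a similar helper.
__device__ inline float4 add_float4(float4 a, float4 b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
```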
Is float4 really faster than plain float? I'm curious because, as far as I can tell, float4 is just a struct containing four float elements, so I'd expect the extra indirection to cost a little time.
Also, computing C4 and the extra index arithmetic means the kernel does more computation anyway.