I noticed that when adding the token embedding and the position embedding, llm.c uses the float4 type:
```cuda
// use of float4 leads to using 128-bit LDG / STG instructions in SASS,
// very helpful in memory-bound kernels like encoder_forward
__global__ void encoder_forward_kernel3(float4* out,
                                        const int* inp, const float4* wte, const float4* wpe,
                                        int B, int T, int C) {
    int C4 = C / 4;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int N = B * T * C4;
    if (idx < N) {
        int bt = idx / C4;
        int b = bt / T;
        int t = bt % T;
        int c4 = idx % C4;
        int ix = inp[b * T + t];
        out[b * T * C4 + t * C4 + c4] = add_float4(wte[ix * C4 + c4], wpe[t * C4 + c4]);
    }
}
```
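For context, `add_float4` is defined elsewhere in llm.c; it is presumably a simple elementwise add along these lines (a sketch, not the exact source):

```cuda
// Sketch of an elementwise float4 add; llm.c defines a similar helper.
__device__ inline float4 add_float4(float4 a, float4 b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
```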
Is float4 really faster than plain float? I'm curious because, as far as I can tell, float4 is just a struct containing four float elements, so I'd expect the extra indirection to cost a little time.
Also, computing C4 and the extra index arithmetic means the kernel does more computation anyway.