Fix more int overflow during quant (PPL/CUDA). #6563

Merged (3 commits) on Apr 28, 2024

@dranger003
Contributor

dranger003 commented Apr 9, 2024

Running perplexity on Command-R+ using CUDA is currently broken without this commit (more info here #6491 (comment)).
Although perplexity now works with all tested quants, I may have moved more vars to int64_t than needed.

@slaren
Collaborator

slaren commented Apr 9, 2024

It would be good to have a set of tests in test-backend-ops that use very large tensors to check for overflows. That will also allow testing other backends. These tests will probably take too long to be enabled by default, but they can be left behind an #ifdef or command line parameter.
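
One way such opt-in tests could be gated is sketched below in C++. The `--large-tensors` flag and all helper names are hypothetical, not existing test-backend-ops options. The sketch only covers the gate itself and a size check that tells whether a tensor shape is large enough to exercise 32-bit index overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical opt-in switch: true only when "--large-tensors" is among the arguments.
static bool large_tests_enabled(const std::vector<std::string> & args) {
    for (const std::string & arg : args) {
        if (arg == "--large-tensors") {
            return true;
        }
    }
    return false;
}

// Element count of a 4D tensor, computed in 64 bits.
static int64_t n_elements(int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3) {
    assert(ne0 >= 0 && ne1 >= 0 && ne2 >= 0 && ne3 >= 0);
    return ne0 * ne1 * ne2 * ne3;
}

// True when a flat index into the tensor can exceed the range of a 32-bit int.
static bool exceeds_int32(int64_t n) {
    return n > std::numeric_limits<int32_t>::max();
}
```

A test driver could build the argument list with `std::vector<std::string>(argv, argv + argc)` and skip every shape for which `exceeds_int32` is true unless the flag is present.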

Contributor

github-actions bot commented Apr 9, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 435 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10823.95ms p(95)=29142.4ms fails=, finish reason: stop=380 truncated=55
  • Prompt processing (pp): avg=122.92tk/s p(95)=555.33tk/s
  • Token generation (tg): avg=26.03tk/s p(95)=38.23tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=ppl-int-overflow-fix commit=0258f9bd3ddbcbfafcfd8019e8902f4cecc9c276

Charts (bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 435 iterations; x-axis timestamps 1714344693 --> 1714345317):
  • llamacpp:prompt_tokens_seconds
  • llamacpp:predicted_tokens_seconds
  • llamacpp:kv_cache_usage_ratio
  • llamacpp:requests_processing

@jxy
Contributor

jxy commented Apr 10, 2024

Given blockIdx.x, blockDim.x, and threadIdx.x are all basically uint32_t, we could keep some of those as uint32_t and only cast them to uint64_t or int64_t when actually necessary.
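
To illustrate where the cast matters, here is a host-side C++ sketch that mimics the index arithmetic with plain `uint32_t` values. The `launch_pos` struct and the function names are made up for this example, and it assumes `unsigned int` is 32 bits wide. Declaring the result as `int64_t` is not enough on its own: the product is still evaluated in 32 bits unless one operand is widened first.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-ins for blockDim.x, blockIdx.x and threadIdx.x (hypothetical struct).
struct launch_pos {
    uint32_t block_dim;
    uint32_t block_idx;
    uint32_t thread_idx;
};

// The product is evaluated in 32 bits and wraps modulo 2^32 before being widened.
static int64_t flat_index_widen_late(launch_pos p) {
    return p.block_dim * p.block_idx + p.thread_idx;
}

// One operand is widened first, so the whole expression is evaluated in 64 bits.
static int64_t flat_index_widen_early(launch_pos p) {
    return (int64_t) p.block_dim * p.block_idx + p.thread_idx;
}

// Self-check: 256 * 2^24 == 2^32, which does not fit in 32 bits.
static void check_flat_index() {
    const launch_pos p = {256, 16777216, 5};
    assert(flat_index_widen_late(p)  == 5);
    assert(flat_index_widen_early(p) == 4294967301LL);
}
```

The same pattern applies inside a kernel: a single `(int64_t)` cast on `blockDim.x` is enough to promote the rest of the expression.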

@randoentity

randoentity commented Apr 10, 2024

Edit: Ignore below. I was using the wrong environment. Retesting.
Edit 2: Either I'm doing something wrong with my environment or there's some regression because I keep getting a segmentation fault in text-generation-webui where before it was working. Most likely the former. It's working fine with llama-cpp-python serving to SillyTavern. No repetition issues that way either!

I'm still getting a segmentation fault when running inference using both the latest master as well as this branch. I've tried ggml-c4ai-command-r-plus-104b-iq3_xs.gguf and ggml-c4ai-command-r-plus-104b-iq4_xs.gguf (I know about gguf --merge). At an earlier commit inference did work, although the model went into a repetition loop (I've seen this mentioned on Reddit as well).
I'm running inference on text-generation-webui out of habit.
Sorry if I'm missing some key detail.

@JohannesGaessler
Collaborator

> Given blockIdx.x, blockDim.x, and threadIdx.x are all basically uint32_t, we could keep some of those as uint32_t and only cast them to uint64_t or int64_t when actually necessary.

There are two disadvantages with 64 bit integers over 32 bit integers: they need 2 registers and they are slower. But for dequantize kernels I would intuitively assume that this is not going to matter because you need very few registers and you're going to be heavily IO bound anyways. So for simplicity I would say to just use 64 bits throughout unless someone can demonstrate that this actually makes a performance difference (I'm not seeing any performance difference on my RTX 3090, my other GPUs are currently busy).

@dranger003
Contributor Author

Just saw @JohannesGaessler's comment (after I pushed the revert). I can revert the revert if that is decided to be the right approach.

@JohannesGaessler
Collaborator

I personally would in this case prefer to just consistently use 64 bit ints, but ultimately I would say either way is fine. The biggest issue would have been the additional effort from actually changing the code but this has already been done anyways.

@JohannesGaessler
Collaborator

I completely forgot about this PR. @slaren even without the tests, do you think we should just merge it, given that it seems to fix the issue for at least one backend?

@slaren
Collaborator

slaren commented Apr 28, 2024

Yes absolutely, we should merge this now if it solves the immediate problem. The changes look good to me.

ggml-cuda.cu Outdated
@@ -1225,7 +1225,7 @@ static void ggml_cuda_op_mul_mat_cublas(

 // the main device has a larger memory buffer to hold the results from all GPUs
 // ldc == nrows of the matrix that cuBLAS writes into
-int64_t ldc = id == ctx.device ? ne0 : row_diff;
+int ldc = id == ctx.device ? ne0 : row_diff;
Collaborator

Wait, why is this being changed? I thought the problem was that certain ints had too few bits for large models.

Collaborator

Did you maybe, in response to one of my earlier comments, accidentally change more places than just the ones originally touched in this PR?

Contributor Author

@JohannesGaessler This one was reverted following an earlier comment questioning why it was changed in the first place. As previously mentioned, I have limited knowledge about these vars and rely on others' expertise for the review. Because of the large number of ints that were overflowing, I had to guess and change them in batches until all the crashes were fixed, so I most likely changed more than needed.

Collaborator

It's fine to change more int to int64_t than necessary. But this is a change where a value that was int64_t on master becomes int with your PR. I think this was done by accident when you reverted some of your other changes.

Contributor Author

That is because my previous PR was merged into master, this is a subsequent PR. I can revert them back if needed.

Contributor Author

The revert is in a single commit dranger003@9acb43d so if these are all fine I can delete that one commit.

Collaborator

Just delete the commit I'd say. Using int64_t has no disadvantages other than maybe slightly worse performance and I was not able to measure any performance difference whatsoever.

Contributor Author

I pushed a rebase to remove the revert commit.

Collaborator

It seems this particular change is still there. Revert it and I'll merge.

Collaborator

cublasGemmEx takes an int anyway, so this doesn't really matter. There is a 64-bit interface to cublas, but I don't think there are any cases where a single dimension is larger than 2^31-1.
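
When an `int64_t` dimension has to be handed to an interface that takes `int`, a checked narrowing keeps that assumption explicit. The helpers below are a hypothetical C++ sketch, not part of ggml or cuBLAS.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: true when a 64-bit dimension is representable as int.
static bool fits_in_int(int64_t v) {
    return v >= std::numeric_limits<int>::min() &&
           v <= std::numeric_limits<int>::max();
}

// Hypothetical helper: narrow a dimension, asserting that no bits are lost.
static int narrow_dim(int64_t v) {
    assert(fits_in_int(v));
    return static_cast<int>(v);
}
```

With this in place, a call site could keep `ldc` as `int64_t` and pass `narrow_dim(ldc)` to the 32-bit interface, so an oversized dimension fails loudly in debug builds instead of being truncated.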

ggml-cuda.cu Outdated
Comment on lines 1709 to 1710
int i13 = blockIdx.x * blockDim.x + threadIdx.x;
int i12 = blockIdx.y * blockDim.y + threadIdx.y;
Collaborator

Same question, why the int64_t -> int change?

@@ -5,16 +5,16 @@

 template <int qk, int qr, dequantize_kernel_t dequantize_kernel, typename dst_t>
 static __global__ void dequantize_block(const void * __restrict__ vx, dst_t * __restrict__ y, const int64_t k) {
-    const int64_t i = 2*(blockDim.x*blockIdx.x + threadIdx.x);
+    const int i = 2*(blockDim.x*blockIdx.x + threadIdx.x);
Collaborator

Same question.

Comment on lines 320 to 323
-const int64_t tid = threadIdx.x;
-const int64_t ip = tid/32; // ip is 0 or 1
-const int64_t il = tid - 32*ip; // 0...32
-const int64_t is = 8*ip + il/16;
+const int tid = threadIdx.x;
+const int ip = tid/32; // ip is 0 or 1
+const int il = tid - 32*ip; // 0...32
+const int is = 8*ip + il/16;
Collaborator

Same question.
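
The range comments in this hunk can be checked with a short host-side C++ sketch. It assumes a block of 64 threads, which is inferred from the `ip is 0 or 1` comment rather than stated in the thread. The struct and function names are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Largest value reached by each derived per-thread index (illustrative struct).
struct local_index_range {
    int64_t max_ip;
    int64_t max_il;
    int64_t max_is;
};

// Walk every thread id in a block and record the maximum of ip, il and is,
// using the same expressions as the hunk above.
static local_index_range scan_block(int64_t n_threads) {
    assert(n_threads >= 1);
    local_index_range r = {0, 0, 0};
    for (int64_t tid = 0; tid < n_threads; ++tid) {
        const int64_t ip = tid/32;
        const int64_t il = tid - 32*ip;
        const int64_t is = 8*ip + il/16;
        r.max_ip = std::max(r.max_ip, ip);
        r.max_il = std::max(r.max_il, il);
        r.max_is = std::max(r.max_is, is);
    }
    return r;
}
```

Under that assumption `scan_block(64)` reports maxima of 1, 31 and 9 (the `0...32` comment is a loose upper bound). These per-thread values are nowhere near a 32-bit limit, so the choice between `int` and `int64_t` here is about consistency rather than correctness. The values that can overflow are the offsets that scale with the tensor size.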

Comment on lines 340 to 342
-const int64_t tid = threadIdx.x;
-const int64_t ip = tid/16; // 0 or 1
-const int64_t il = tid - 16*ip; // 0...15
+const int tid = threadIdx.x;
+const int ip = tid/16; // 0 or 1
+const int il = tid - 16*ip; // 0...15
Collaborator

Same question.

-const int64_t i = (int64_t)blockDim.x*blockIdx.x + threadIdx.x;
+const int i = blockDim.x*blockIdx.x + threadIdx.x;
Collaborator

Same question.

Contributor Author

These were all originally int and I reverted them to avoid changing more than needed.

Collaborator

No they were not. Go to the "files changed" tab and look at the combined changes of all of your commits relative to master.

Contributor Author

I changed them in PR #6491.

dranger003 force-pushed the ppl-int-overflow-fix branch from 4947778 to 91c10ef on April 28, 2024 22:21
JohannesGaessler merged commit e00b4a8 into ggerganov:master on Apr 28, 2024
51 of 58 checks passed
@dranger003
Contributor Author

Closes #6948.

dranger003 deleted the ppl-int-overflow-fix branch on May 1, 2024 11:29
nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* Fix more int overflow during quant.

* Fix some more int overflow in softmax.

* Revert back to int64_t.