This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

[LLM Runtime] Control printing information using NEURAL_SPEED_VERBOSE #1054

Closed
wants to merge 11 commits into from

Conversation

zhenwei-intel
Contributor

@zhenwei-intel zhenwei-intel commented Dec 21, 2023

Type of Change

feature
API changed

Description

Adds NEURAL_SPEED_VERBOSE support for the C++ and Python APIs.

Enable verbose mode and control tracing information using the NEURAL_SPEED_VERBOSE environment variable.

Available modes:

  • 0: Print all tracing information (comprehensive output: both evaluation time and operator profiling).
  • 1: Print evaluation time, i.e. the time taken for each evaluation.
  • 2: Profile individual operators to identify performance bottlenecks within the model.
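
The actual level handling lives in the C++ runtime; as a rough illustration only, the mode mapping described above could be sketched in Python like this (`parse_verbose_level` is a hypothetical helper, not part of the Neural Speed API):

```python
# Hypothetical sketch of the NEURAL_SPEED_VERBOSE mode mapping described above.
# Returns (print_eval_time, profile_ops) flags for a given environment mapping.
def parse_verbose_level(env: dict) -> tuple:
    value = env.get("NEURAL_SPEED_VERBOSE")
    if value is None:
        return (False, False)   # variable unset: verbose disabled
    level = int(value)
    if level == 0:              # 0: print all tracing information
        return (True, True)
    if level == 1:              # 1: print evaluation time only
        return (True, False)
    if level == 2:              # 2: profile individual operators only
        return (False, True)
    return (False, False)       # unknown level: stay quiet

print(parse_verbose_level({"NEURAL_SPEED_VERBOSE": "1"}))  # (True, False)
```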

Example:

NEURAL_SPEED_VERBOSE=1 ./build/bin/run_llama -m runtime_outs/ne_llama_q_int4_jblas_cint8_g32.bin -p "once upon a time, a little girl" -n 10
...................................................................................................
model_init_from_file: support_jblas_kv = 0
model_init_from_file: kv self size =  128.00 MB

system_info: n_threads = 56 / 112 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 10, n_keep = 0


 once upon a time, a little girl named Lily lived in a small village nestled
model_print_timings:        load time =  2233.65 ms
model_print_timings:      sample time =     7.48 ms /    10 runs   (    0.75 ms per token)
model_print_timings: prompt eval time =   222.95 ms /     9 tokens (   24.77 ms per token)
model_print_timings:        eval time =   408.06 ms /     9 runs   (   45.34 ms per token)
model_print_timings:       total time =  2653.31 ms
========== eval time log of each prediction ==========
prediction   0, time: 222.95ms
prediction   1, time: 43.97ms
prediction   2, time: 43.83ms
prediction   3, time: 43.74ms
prediction   4, time: 43.80ms
prediction   5, time: 50.83ms
prediction   6, time: 45.16ms
prediction   7, time: 46.75ms
prediction   8, time: 44.81ms
prediction   9, time: 45.19ms
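
The per-prediction log printed under mode 1 is easy to post-process. As a sketch (the log format is taken from the output above; the first prediction includes prompt evaluation, so it is excluded from the decode average):

```python
import re

# Excerpt of the eval-time log emitted with NEURAL_SPEED_VERBOSE=1.
log = """\
prediction   0, time: 222.95ms
prediction   1, time: 43.97ms
prediction   2, time: 43.83ms
"""

# Pull out each per-prediction time in milliseconds.
times = [float(m.group(1)) for m in re.finditer(r"time: ([\d.]+)ms", log)]

# Prediction 0 covers prompt evaluation; average only the decode steps.
decode_avg = sum(times[1:]) / len(times[1:])
print(f"avg decode time: {decode_avg:.2f} ms")  # avg decode time: 43.90 ms
```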

Expected Behavior & Potential Risk

the expected behavior triggered by this PR

How has this PR been tested?

how to reproduce the test (including hardware information)

Dependency Change?

any library dependency introduced or removed

Signed-off-by: zhenwei-intel <[email protected]>
Copy link
Contributor

@a32543254 a32543254 left a comment


LGTM

@zhenwei-intel
Contributor Author

zhenwei-intel commented Dec 22, 2023

NEURAL_SPEED_VERBOSE

  • 0: print all
  • 1: print eval time
  • 2: profile op

@a32543254, please help review this design.

Contributor

@a32543254 a32543254 left a comment


LGTM

@zhenwei-intel zhenwei-intel changed the title [LLM Runtime] Print eval time in python api [LLM Runtime] Control printing information using NEURAL_SPEED_VERBOSE Dec 25, 2023
@zhenwei-intel zhenwei-intel marked this pull request as ready for review December 25, 2023 02:17
@zhenwei-intel
Contributor Author

@hshen14 @kevinintel @airMeng @DDEle, the log design has been reworked; please help review again~

Comment on lines +596 to +598
- 0: Print all tracing information. Comprehensive output, including: evaluation time and operator profiling.
- 1: Print evaluation time. Time taken for each evaluation.
- 2: Profile individual operator. Identify performance bottleneck within the model.
Contributor


Why is 0 the most comprehensive level? That is quite rare.

Contributor


It's the same as a log level, where 0 means debug and larger values mean less info. Not rare.

Contributor


How about 1, 2, 3 instead? With VERBOSE=0 we usually expect verbose mode to be disabled.

Contributor


I get your point, agreed~
@zhenwei-intel, could you change this so that 0 disables verbose output and 1 prints all?
