This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

[LLM Runtime] Control printing information using NEURAL_SPEED_VERBOSE #1054

Closed
wants to merge 11 commits into from

Conversation

zhenwei-intel
Contributor

@zhenwei-intel zhenwei-intel commented Dec 21, 2023

Type of Change

feature
API changed

Description

Adds NEURAL_SPEED_VERBOSE support for the C++ and Python APIs.

Enable verbose mode and control tracing information using the NEURAL_SPEED_VERBOSE environment variable.

Available modes:

  • 0: Print all tracing information (comprehensive output: both evaluation time and operator profiling).
  • 1: Print evaluation time, i.e. the time taken for each evaluation.
  • 2: Profile individual operators to identify performance bottlenecks within the model.
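
The actual level handling lives in the C++ runtime; as a rough illustration only, the mode mapping described above could be sketched in Python like this (`parse_verbose_level` is a hypothetical helper, not part of the Neural Speed API):

```python
# Hypothetical sketch of the NEURAL_SPEED_VERBOSE mode mapping described above.
# Returns (print_eval_time, profile_ops) flags for a given environment mapping.
def parse_verbose_level(env: dict) -> tuple:
    value = env.get("NEURAL_SPEED_VERBOSE")
    if value is None:
        return (False, False)   # variable unset: verbose disabled
    level = int(value)
    if level == 0:              # 0: print all tracing information
        return (True, True)
    if level == 1:              # 1: print evaluation time only
        return (True, False)
    if level == 2:              # 2: profile individual operators only
        return (False, True)
    return (False, False)       # unknown level: stay quiet

print(parse_verbose_level({"NEURAL_SPEED_VERBOSE": "1"}))  # (True, False)
```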

Example:

NEURAL_SPEED_VERBOSE=1 ./build/bin/run_llama -m runtime_outs/ne_llama_q_int4_jblas_cint8_g32.bin -p "once upon a time, a little girl" -n 10
...................................................................................................
model_init_from_file: support_jblas_kv = 0
model_init_from_file: kv self size =  128.00 MB

system_info: n_threads = 56 / 112 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 10, n_keep = 0


 once upon a time, a little girl named Lily lived in a small village nestled
model_print_timings:        load time =  2233.65 ms
model_print_timings:      sample time =     7.48 ms /    10 runs   (    0.75 ms per token)
model_print_timings: prompt eval time =   222.95 ms /     9 tokens (   24.77 ms per token)
model_print_timings:        eval time =   408.06 ms /     9 runs   (   45.34 ms per token)
model_print_timings:       total time =  2653.31 ms
========== eval time log of each prediction ==========
prediction   0, time: 222.95ms
prediction   1, time: 43.97ms
prediction   2, time: 43.83ms
prediction   3, time: 43.74ms
prediction   4, time: 43.80ms
prediction   5, time: 50.83ms
prediction   6, time: 45.16ms
prediction   7, time: 46.75ms
prediction   8, time: 44.81ms
prediction   9, time: 45.19ms
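
The per-prediction log printed under mode 1 is easy to post-process. As a sketch (the log format is taken from the output above; the first prediction includes prompt evaluation, so it is excluded from the decode average):

```python
import re

# Excerpt of the eval-time log emitted with NEURAL_SPEED_VERBOSE=1.
log = """\
prediction   0, time: 222.95ms
prediction   1, time: 43.97ms
prediction   2, time: 43.83ms
"""

# Pull out each per-prediction time in milliseconds.
times = [float(m.group(1)) for m in re.finditer(r"time: ([\d.]+)ms", log)]

# Prediction 0 covers prompt evaluation; average only the decode steps.
decode_avg = sum(times[1:]) / len(times[1:])
print(f"avg decode time: {decode_avg:.2f} ms")  # avg decode time: 43.90 ms
```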

Expected Behavior & Potential Risk

the expected behavior triggered by this PR

How has this PR been tested?

how to reproduce the test (including hardware information)

Dependency Change?

any library dependency introduced or removed

Signed-off-by: zhenwei-intel <[email protected]>
Copy link
Contributor

@a32543254 a32543254 left a comment


LGTM

@zhenwei-intel
Contributor Author

zhenwei-intel commented Dec 22, 2023

NEURAL_SPEED_VERBOSE

  • 0: print all
  • 1: print eval time
  • 2: profile op

@a32543254, please help review this design.

Contributor

@a32543254 a32543254 left a comment


LGTM

@zhenwei-intel zhenwei-intel changed the title [LLM Runtime] Print eval time in python api [LLM Runtime] Control printing information using NEURAL_SPEED_VERBOSE Dec 25, 2023
@zhenwei-intel zhenwei-intel marked this pull request as ready for review December 25, 2023 02:17
@zhenwei-intel
Contributor Author

@hshen14 @kevinintel @airMeng @DDEle, the log design has been reworked; please help review again~

Comment on lines +596 to +598
- 0: Print all tracing information. Comprehensive output, including: evaluation time and operator profiling.
- 1: Print evaluation time. Time taken for each evaluation.
- 2: Profile individual operator. Identify performance bottleneck within the model.
Contributor


Why is 0 the most comprehensive level? That is quite rare.

Contributor


It's the same as a log level, where 0 means debug and larger values mean less info. Not rare.

Contributor


How about 1, 2, 3 instead? With VERBOSE=0 we usually expect verbose mode to be disabled.

Contributor


I get your point, agreed~
@zhenwei-intel, could you change this so that 0 disables verbose output and 1 prints all?
