This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

Control printing information using NEURAL_SPEED_VERBOSE #26

Merged
merged 2 commits into from
Jan 4, 2024

zhenwei-intel
Contributor

Type of Change

feature
API changed

Description

Cherry-picked from intel/intel-extension-for-transformers#1054.
Adds NEURAL_SPEED_VERBOSE support for the C++ and Python APIs.

Enable verbose mode and control tracing information using the NEURAL_SPEED_VERBOSE environment variable.

Available modes:

  • 0: Print all tracing information, including evaluation time and operator profiling.
  • 1: Print evaluation time: the time taken for each evaluation.
  • 2: Profile individual operators to identify performance bottlenecks within the model.
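The mode table above can be read as a mapping from the variable's value to two independent kinds of output. The following is a minimal sketch of that mapping, not the code from this PR: the function name `tracing_flags` and the flag names are hypothetical, and treating an unset or unrecognized value as "no tracing" is an assumption that the description does not confirm.

```python
import os
from typing import Dict, Mapping, Optional


def tracing_flags(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Map NEURAL_SPEED_VERBOSE to the two kinds of tracing listed above.

    Hypothetical helper for illustration only; it is not part of Neural Speed.
    """
    env = os.environ if env is None else env
    flags = {"eval_time": False, "op_profiling": False}

    raw = env.get("NEURAL_SPEED_VERBOSE")
    if raw is None:
        # Assumption: an unset variable disables all tracing.
        return flags

    mode = raw.strip()
    if mode == "0":
        # Mode 0: all tracing information.
        flags["eval_time"] = True
        flags["op_profiling"] = True
    elif mode == "1":
        # Mode 1: evaluation time only.
        flags["eval_time"] = True
    elif mode == "2":
        # Mode 2: operator profiling only.
        flags["op_profiling"] = True
    # Assumption: any other value is ignored and leaves tracing off.
    return flags
```

Passing an explicit mapping, for example `tracing_flags({"NEURAL_SPEED_VERBOSE": "1"})`, keeps the helper easy to exercise without touching the real process environment.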

Example:

NEURAL_SPEED_VERBOSE=1 ./build/bin/run_llama -m runtime_outs/ne_llama_q_int4_jblas_cint8_g32.bin -p "once upon a time, a little girl" -n 10
...................................................................................................
model_init_from_file: support_jblas_kv = 0
model_init_from_file: kv self size =  128.00 MB

system_info: n_threads = 56 / 112 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 10, n_keep = 0


 once upon a time, a little girl named Lily lived in a small village nestled
model_print_timings:        load time =  2233.65 ms
model_print_timings:      sample time =     7.48 ms /    10 runs   (    0.75 ms per token)
model_print_timings: prompt eval time =   222.95 ms /     9 tokens (   24.77 ms per token)
model_print_timings:        eval time =   408.06 ms /     9 runs   (   45.34 ms per token)
model_print_timings:       total time =  2653.31 ms
========== eval time log of each prediction ==========
prediction   0, time: 222.95ms
prediction   1, time: 43.97ms
prediction   2, time: 43.83ms
prediction   3, time: 43.74ms
prediction   4, time: 43.80ms
prediction   5, time: 50.83ms
prediction   6, time: 45.16ms
prediction   7, time: 46.75ms
prediction   8, time: 44.81ms
prediction   9, time: 45.19ms
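In the log above, prediction 0 matches the reported prompt eval time (222.95 ms), and predictions 1 to 9 add up to roughly the reported eval time (408.08 ms summed from the rounded per-prediction values, against 408.06 ms reported). A small script can pull these numbers out of captured output. This is a sketch that assumes the `prediction N, time: X ms` line format shown in the example stays the same; the function names are hypothetical and not part of the project.

```python
import re
from typing import Dict, List

# Matches lines such as "prediction   5, time: 50.83ms" from the example log.
PREDICTION_RE = re.compile(r"prediction\s+(\d+), time:\s*([0-9.]+)\s*ms")


def parse_prediction_times(log_text: str) -> List[float]:
    """Return per-prediction times in milliseconds, in log order."""
    return [float(m.group(2)) for m in PREDICTION_RE.finditer(log_text)]


def summarize(times: List[float]) -> Dict[str, float]:
    """Separate the first prediction (prompt evaluation) from the rest."""
    if not times:
        raise ValueError("no prediction lines found in the log")
    first, rest = times[0], times[1:]
    rest_total = sum(rest)
    return {
        "first_ms": first,
        "rest_total_ms": rest_total,
        "rest_mean_ms": rest_total / len(rest) if rest else 0.0,
    }
```

To use it, capture the program output to a file or string, pass it to `parse_prediction_times`, and feed the result to `summarize`. Keeping the first prediction separate avoids skewing the per-token average with the prompt evaluation.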

Expected Behavior & Potential Risk

the expected behavior triggered by this PR

How has this PR been tested?

how to reproduce the test (including hardware information)

Dependency Change?

any library dependency introduced or removed

Contributor

@a32543254 a32543254 left a comment

LGTM

@VincyZhang VincyZhang merged commit a8d9e7d into main Jan 4, 2024
9 checks passed
@zhenwei-intel zhenwei-intel deleted the lzw/loginfo branch January 8, 2024 01:11
DDEle pushed a commit to DDEle/neural-speed that referenced this pull request Feb 15, 2024
4 participants