Support qwen2 #1820
Conversation
Could we get a CTranslate2 4.5.1 release, please? I can confirm that conversion (int8 at least, which is all I had time to test) works for the Qwen 2.5 0.5B, 1.5B, 3B, and 7B instruct models.

Platform: Microsoft Windows 10
Dependencies: nvidia-cublas-cu12==12.4.2.65

Preliminary benchmarks (int8, RTX 4090, CUDA):
- Qwen--Qwen2.5-0.5B-Instruct-ct2-int8
- Qwen--Qwen2.5-1.5B-Instruct-ct2-int8
- Qwen--Qwen2.5-3B-Instruct-ct2-int8
- Qwen--Qwen2.5-7B-Instruct-ct2-int8
- Qwen--Qwen2.5-Coder-3B-Instruct-ct2-int8
- Qwen--Qwen2.5-Coder-14B-Instruct-ct2-int8

The 32B model running at int8 does not fit in 24 GB of VRAM. However, for reasons I don't fully understand, it's necessary to add this line to the benchmarking script:
Here is the benchmarking script. Note that it relies on pip-installing the necessary CUDA libraries, which I find far more convenient than having to install/reinstall CUDA system-wide just to test different versions. BENCHMARKING SCRIPT IS HERE
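For anyone wanting to reproduce the conversions above, a minimal sketch using CTranslate2's stock converter CLI follows; the output directory name mirrors the naming used in the benchmark list and is my own choice, not something the tool requires:

```shell
# Requires: pip install ctranslate2 transformers
# Convert a Qwen 2.5 instruct checkpoint from the Hugging Face Hub
# to an int8-quantized CTranslate2 model directory.
ct2-transformers-converter \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --quantization int8 \
  --output_dir Qwen--Qwen2.5-0.5B-Instruct-ct2-int8
```

The same invocation with a different `--model` repo id covers the other sizes; only VRAM limits what fits (as noted, 32B at int8 exceeds 24 GB).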
I will release CTranslate2 when some features are done. As for the other point: setting KMP_DUPLICATE_LIB_OK can prevent OpenMP errors about runtime library duplication. Ensure you use the correct supported version.
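A minimal sketch of that workaround, assuming a Python benchmarking script: set the variable before importing any library that bundles its own OpenMP runtime.

```python
import os

# Tell the Intel OpenMP runtime to tolerate a duplicated copy of itself
# (e.g. when two pip-installed packages each ship libiomp) instead of
# aborting. This must run before importing packages such as ctranslate2
# or torch that load OpenMP.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
```

Note this only suppresses the duplicate-runtime check; it does not resolve the underlying duplication, which is why aligning on a single supported dependency version is the cleaner fix.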