Support qwen2 #1820

Merged: 2 commits into OpenNMT:master on Nov 25, 2024

Conversation

minhthuc2502 (Collaborator)

No description provided.

@BBC-Esq commented Nov 25, 2024

Ctranslate2==4.5.1 release please?

I can confirm that conversion works (for int8 at least, which is all I had time to test; a conversion sketch follows the model list) for:

Qwen 2.5:

0.5B, 1.5B, 3B, and 7B instruct models
and
3B, 14B, and 32B coder instruct models
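
For reference, conversion along these lines should work. This is only a minimal sketch using the CTranslate2 Transformers converter API; the model ID and output directory are placeholders, and int8 matches the quantization tested above:

import ctranslate2

# Placeholder model ID and output directory; adjust for whichever Qwen2.5 model is being converted.
converter = ctranslate2.converters.TransformersConverter("Qwen/Qwen2.5-7B-Instruct")
converter.convert("Qwen--Qwen2.5-7B-Instruct-ct2-int8", quantization="int8")

The equivalent command line is ct2-transformers-converter with --model, --output_dir, and --quantization int8.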

I used:

Platform:

Microsoft Windows 10

Dependencies:

nvidia-cublas-cu12==12.4.2.65
nvidia-cuda-nvrtc-cu12==12.4.99
nvidia-cuda-runtime-cu12==12.4.99
nvidia-cudnn-cu12==9.1.0.70
ctranslate2==4.5.0
copy/pasted this PR's version of transformers.py, overriding the one provided in ctranslate2==4.5.0
downgraded to numpy==1.26.4
pip installed the other necessary dependencies such as transformers, accelerate, etc.

Preliminary benchmarks (int8, RTX 4090, CUDA):

Qwen--Qwen2.5-0.5B-Instruct-ct2-int8
Average characters per second: 250.77
Average VRAM Usage: 1011.93 MB

Qwen--Qwen2.5-1.5B-Instruct-ct2-int8
Average characters per second: 270.17
Average VRAM Usage: 2179.37 MB

Qwen--Qwen2.5-3B-Instruct-ct2-int8
Average characters per second: 200.67
Average VRAM Usage: 3728.93 MB

Qwen--Qwen2.5-7B-Instruct-ct2-int8
Average characters per second: 170.69
Average VRAM Usage: 8277.05 MB

Qwen--Qwen2.5-Coder-3B-Instruct-ct2-int8
Average characters per second: 172.26
Average VRAM Usage: 3731.43 MB

Qwen--Qwen2.5-Coder-14B-Instruct-ct2-int8
Average characters per second: 96.48
Average VRAM Usage: 15872.86 MB

32B RUNNING AT int8 DOES NOT FIT IN 24 GB OF VRAM.

However, for reasons I don't fully understand, it's necessary to add this line to the benchmarking script:

os.environ['KMP_DUPLICATE_LIB_OK']='TRUE'

Here is the benchmarking script. NOTE: it relies on pip-installed CUDA libraries, which I find far more convenient than installing/re-installing CUDA system-wide just to test different versions.

Benchmarking script:
import sys
import os
os.environ['KMP_DUPLICATE_LIB_OK']='TRUE' # only necessary when trying to run a version of ctranslate2 that supports torch 2.2.2 but torch greater than 2.2.2 is installed (not recommended)
from pathlib import Path

def set_cuda_paths():
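    # Prepend the pip-installed NVIDIA wheel "bin" directories to CUDA_PATH, CUDA_PATH_V12_4, and PATH
    # so the CUDA runtime, cuBLAS, cuDNN, and NVRTC DLLs resolve without a system-wide CUDA install.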
    venv_base = Path(sys.executable).parent.parent
    nvidia_base_path = venv_base / 'Lib' / 'site-packages' / 'nvidia'
    cuda_path = nvidia_base_path / 'cuda_runtime' / 'bin'
    cublas_path = nvidia_base_path / 'cublas' / 'bin'
    cudnn_path = nvidia_base_path / 'cudnn' / 'bin'
    # nvcc_path = nvidia_base_path / 'cuda_nvcc' / 'bin'
    nvrtc_path = nvidia_base_path / 'cuda_nvrtc' / 'bin'
    
    paths_to_add = [
        str(cuda_path), # CUDA runtime
        str(cublas_path), # cuBLAS
        str(cudnn_path), # cuDNN
        # str(nvcc_path), # NVIDIA CUDA compiler
        str(nvrtc_path), # NVIDIA runtime compiler
    ]

    env_vars = ['CUDA_PATH', 'CUDA_PATH_V12_4', 'PATH']
    
    for env_var in env_vars:
        current_value = os.environ.get(env_var, '')
        new_value = os.pathsep.join(paths_to_add + [current_value] if current_value else paths_to_add)
        os.environ[env_var] = new_value

set_cuda_paths()

import ctranslate2
import gc
import torch
from time import perf_counter
from transformers import AutoTokenizer
import pynvml
import threading
import time
from constants import user_message, system_message, user_message_long

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# model_dir = r"D:\Scripts\bench_chat\models\Qwen--Qwen2.5-0.5B-Instruct-ct2-int8"
# model_dir = r"D:\Scripts\bench_chat\models\Qwen--Qwen2.5-1.5B-Instruct-ct2-int8"
# model_dir = r"D:\Scripts\bench_chat\models\Qwen--Qwen2.5-3B-Instruct-ct2-int8"
# model_dir = r"D:\Scripts\bench_chat\models\Qwen--Qwen2.5-7B-Instruct-ct2-int8"

# model_dir = r"D:\Scripts\bench_chat\models\Qwen--Qwen2.5-Coder-3B-Instruct-ct2-int8"
model_dir = r"D:\Scripts\bench_chat\models\Qwen--Qwen2.5-Coder-14B-Instruct-ct2-int8"

def build_prompt():
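    # Qwen2.5 instruct models use the ChatML prompt format: <|im_start|>{role}\n{content}<|im_end|>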

    prompt = f"""<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
"""

    return prompt

def poll_vram_usage(stop_event, vram_readings):
    while not stop_event.is_set():
        memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        vram_usage = memory_info.used / 1024**2
        vram_readings.append(vram_usage)
        time.sleep(0.1)

#def main(num_runs=1, beam_sizes=[1, 2, 3, 4, 5]):
def main(num_runs=1, beam_sizes=[1]):
    start_time = perf_counter()
    model_name = os.path.basename(model_dir)
    
    metrics = []
    
    for beam_size_value in beam_sizes:
        print(f"\033[32mLoading the model: {model_name}...\033[0m")
        
        load_start_time = perf_counter()
        intra_threads = max(os.cpu_count() - 4, 4)

        baseline_vram_usage = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
        
        generator = ctranslate2.Generator(
            model_dir,
            device="cuda",
            compute_type="int8",
            intra_threads=intra_threads
        )
        
        tokenizer = AutoTokenizer.from_pretrained(model_dir, add_prefix_space=None)
        load_end_time = perf_counter()
        load_time = load_end_time - load_start_time
        
        prompt = build_prompt()
        
        tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
        
        total_generation_time = 0
        total_characters_generated = 0
        total_vram_usage = 0
        
        for i in range(num_runs):
            print(f"\nRun {i+1} (Beam Size: {beam_size_value}):")
            
            stop_event = threading.Event()
            vram_readings = []
            poll_thread = threading.Thread(target=poll_vram_usage, args=(stop_event, vram_readings))
            poll_thread.start()
            
            generation_start_time = perf_counter()
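            # Deterministic decoding: with the default sampling_topk=1, generation is greedy
            # (or beam search when beam_size > 1); max_length caps the tokens generated per run.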
            
            results_batch = generator.generate_batch(
                [tokens],
                include_prompt_in_result=False,
                max_batch_size=4096,
                batch_type="tokens",
                beam_size=beam_size_value,
                num_hypotheses=1,
                max_length=512,
                sampling_temperature=0.0,
            )
        
            generation_end_time = perf_counter()
            generation_time = generation_end_time - generation_start_time
            
            stop_event.set()
            poll_thread.join()
            
            output = tokenizer.decode(results_batch[0].sequences_ids[0])
            generated_characters = len(output)
            max_vram_usage = max(vram_readings) - baseline_vram_usage if vram_readings else 0
            
            print("\nGenerated response:")
            print(output)
            print(f"\nResponse generation time: {generation_time:.4f} seconds")
            print(f"Generated characters: {generated_characters}")
            print(f"Max VRAM Usage: {max_vram_usage:.2f} MB")
            
            total_generation_time += generation_time
            total_characters_generated += generated_characters
            total_vram_usage += max_vram_usage
        
        average_generation_time = total_generation_time / num_runs
        characters_per_second = total_characters_generated / total_generation_time
        average_vram_usage = total_vram_usage / num_runs
        
        print(f"\033[32m\nSummary for Beam Size {beam_size_value}:\n\n{model_name}\033[0m")
        print(f"\033[32mAverage characters per second: {characters_per_second:.2f}\033[0m")
        print(f"\033[32mAverage VRAM Usage: {average_vram_usage:.2f} MB\033[0m")
        
        beam_metrics = {
            "model_name": model_name,
            "beam_size": beam_size_value,
            "characters_per_second": characters_per_second,
            "average_vram_usage": average_vram_usage
        }
        metrics.append(beam_metrics)
        
        del generator
        del tokenizer
        torch.cuda.empty_cache()
        gc.collect()
    
    end_time = perf_counter()
    total_time = end_time - start_time
    
    print("\n\033[32mFinal Summary:\033[0m")
    print("\033[32m{:<20} {:<10} {:<25} {:<20}\033[0m".format("Model", "Beam Size", "Characters per Second", "VRAM Usage (MB)"))
    for metric in metrics:
        print("{:<20} {:<10} {:<25.2f} {:<20.2f}".format(
            metric["model_name"],
            metric["beam_size"],
            metric["characters_per_second"],
            metric["average_vram_usage"]
        ))
    
if __name__ == "__main__":
    num_runs = 1
    main(num_runs)

@minhthuc2502 (Collaborator, Author)

I will release CTranslate2 when some features are done.

For the next point, setting KMP_DUPLICATE_LIB_OK only suppresses the OpenMP error about duplicate runtime libraries; make sure you are using the correctly supported versions of the dependencies.

@minhthuc2502 minhthuc2502 merged commit 2870fe3 into OpenNMT:master Nov 25, 2024
13 checks passed
@minhthuc2502 minhthuc2502 mentioned this pull request Nov 25, 2024