Skip to content

Latest commit

 

History

History
648 lines (568 loc) · 22.2 KB

File metadata and controls

648 lines (568 loc) · 22.2 KB
blogpost date author tags category language
true
17 April 2024
Douglas Jia, Michael Wootton
Profiling, HIP, ROCm, Tracing, HPC
Software tools & optimizations
English

Unveiling Performance Insights: Profiling and Tracing Applications on AMD GPU with rocmProfileData

In this blog, we delve into the capabilities of rocmProfileData, a powerful tool developed by AMD for profiling and tracing applications across various programming languages on AMD GPU. Our goal is to equip developers and data scientists with the insights and resources to fully leverage the performance potential of their GPU-accelerated applications in production, using a straightforward code example.

Introduction

In the dynamic landscape of GPU-accelerated computing, achieving optimal performance and efficiency is key to improve user experience and increase revenue. This pursuit often leads engineers and data scientists to delve into the realms of profiling and tracing, two indispensable techniques for gaining deep insights into the behavior and performance characteristics of applications running on GPUs.

Profiling focuses on analyzing performance metrics to quantify the behavior of the application during execution. By profiling key performance indicators such as execution time, memory usage, and kernel occupancy, developers can pinpoint areas of inefficiency and prioritize optimization efforts. Profiling provides actionable insights into the runtime behavior of the application, guiding developers towards optimizations that yield tangible performance improvements.

Tracing, on the other hand, involves monitoring and recording the sequence of operations performed by an application as it executes on the GPU. This detailed log provides invaluable visibility into the inner workings of the application, allowing developers to understand how data flows through the computation pipeline, identify potential bottlenecks, and optimize algorithmic implementations. Tracing essentially offers a "birds-eye view" of the application's execution, enabling developers to diagnose performance issues and fine-tune their code for maximum efficiency.

The importance of profiling and tracing cannot be overstated in the context of GPU-accelerated computing. As GPUs continue to play a pivotal role in a wide range of applications, specifically generative AI applications serving large models, understanding and optimizing their performance can generate more revenue by improving end user experience and lowering model serving cost. Profiling and tracing empower developers to unlock the full potential of GPU hardware, enabling them to harness its parallel computing power efficiently and effectively.

Developed in-house at AMD by Michael Wootton and his colleagues, rocmProfileData is specifically tailored to profile and trace applications running with ROCm on AMD GPUs. Throughout this blog, we will demonstrate how easily the core functionalities of this package can be executed on a simple code example.

Environment setup

We run the code example in a PyTorch ROCm 6.02 docker container with an AMD GPU in Ubuntu. For a list of supported OS and AMD hardware refer to System Requirements.

Pull and run the docker container with the following code in a Linux shell:

docker run -it --ipc=host --network=host --device=/dev/kfd --device=/dev/dri \
           --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
           --name=blog-rpd602 rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2 /bin/bash

You can verify the number of GPUs detected by PyTorch on your machine by executing the following two lines of code in the Python console. If the Docker configuration is correct, the detected number of GPUs should match the number of GPUs installed on your machine.

import torch
torch.cuda.device_count()

Implementation

Package installation

Install rocmProfileData and other required softwares by running the following commands in your Linux Shell.

apt-get update
apt-get install libfmt-dev sqlite3 libsqlite3-dev

git clone https://github.com/ROCm/rocmProfileData.git
cd rocmProfileData
make; make install
cd ..

Profiling a PyTorch multiplication function

In this code example, we'll profile a Python script featuring matrix multiplication implemented using PyTorch. You can locate the script matrix_mult.py in the src folder of this blog's GitHub repository: Link. Alternatively, you can create your own matrix_mult.py file by copying the code provided below.

import argparse
import torch

def matmult_gpu(input_data, weights):
    """
    Perform matrix multiplication of two tensors on GPU.
    
    Args:
    input_data (torch.Tensor): Input tensor.
    weights (torch.Tensor): Weight tensor.
    
    Returns:
    torch.Tensor: Result of matrix multiplication.
    """
    # Creating tensors on GPU
    input_data = input_data.to('cuda')
    weights = weights.to('cuda')
    
    # Optimized matrix multiplication using torch.matmul
    output = torch.matmul(input_data, weights)
    
    return output

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Perform matrix multiplication of two tensors.')
    parser.add_argument('--x_shape', nargs=2, type=int, default=[1000, 500], metavar=('N', 'M'), help='Shape of input data matrix')
    parser.add_argument('--w_shape', nargs=2, type=int, default=[500, 500], metavar=('J', 'K'), help='Shape of weight matrix')
    args = parser.parse_args()

    input_data = torch.randn(*args.x_shape)
    weights = torch.randn(*args.w_shape)

    output = matmult_gpu(input_data, weights)    
    print(f'Shape of input data matrix: {args.x_shape}, weight matrix: {args.w_shape}, result matrix:{output.shape}')
    print(output)

The primary command for rocmProfileData is runTracer.sh. To profile a Python script, execute the following command:

runTracer.sh <-o filename.rpd> python python_script.py <arguments_to_the_python_script>

where the optional -o filename.rpd flag specifies the output file name. The output file, formatted as .rpd, can be queried using SQLite3. In our case, to profile the function specified in matrix_mult.py, we'll execute the following command:

runTracer.sh -o matmul_result.rpd python matrix_mult.py --x_shape 50000 10000 --w_shape 10000 800

The output file matmul_result.rpd should be located in the directory where you executed the above command. You can directly query the table in the shell by running sqlite3 matmul_result.rpd, or in Python using the sqlite3 class. In this blog, we'll utilize the latter approach.

Before executing any Python code, ensure you have installed pandas in the shell:

pip install pandas

Then, import the required packages in Python:

import sqlite3
import pandas as pd

Before delving into the profiling metrics, let's first examine all the tables and views in the .rpd file and understand how the schemas are defined.

conn = sqlite3.connect('matmul_result.rpd')
# Execute SQL query to get the table names
tables = conn.execute("SELECT name, type, sql FROM sqlite_master where type='table' or type='view';").fetchall()
table_view = pd.DataFrame(data=tables, columns=['name', 'type', 'schema']).set_index('name')
conn.close()
table_view
type schema
name
rocpd_kernelcodeobject table CREATE TABLE "rocpd_kernelcodeobject" ("id" in...
sqlite_sequence table CREATE TABLE sqlite_sequence(name,seq)
rocpd_string table CREATE TABLE "rocpd_string" ("id" integer NOT ...
rocpd_barrierop table CREATE TABLE "rocpd_barrierop" ("op_ptr_id" in...
rocpd_copyapi table CREATE TABLE "rocpd_copyapi" ("api_ptr_id" int...
rocpd_op_inputSignals table CREATE TABLE "rocpd_op_inputSignals" ("id" int...
rocpd_op table CREATE TABLE "rocpd_op" ("id" integer NOT NULL...
rocpd_api table CREATE TABLE "rocpd_api" ("id" integer NOT NUL...
rocpd_api_ops table CREATE TABLE "rocpd_api_ops" ("id" integer NOT...
rocpd_kernelapi table CREATE TABLE "rocpd_kernelapi" ("api_ptr_id" i...
rocpd_metadata table CREATE TABLE "rocpd_metadata" ("id" integer NO...
rocpd_monitor table CREATE TABLE "rocpd_monitor" ("id" integer NOT...
api view CREATE VIEW api AS SELECT rocpd_api.id,pid,tid...
op view CREATE VIEW op AS SELECT rocpd_op.id,gpuId,que...
busy view CREATE VIEW busy AS select A.gpuId, GpuTime, W...
ktop view CREATE VIEW ktop as select C.string as Name, c...
top view CREATE VIEW top as select C.string as Name, co...
kernel view CREATE VIEW kernel AS SELECT B.id, gpuId, queu...
copy view CREATE VIEW copy AS SELECT B.id, pid, tid, sta...
copyop view CREATE VIEW copyop AS SELECT B.id, gpuId, queu...

You can find the queries that defined these tables/views in the schema column. Let's take a look at several key tables/views that are essential for profiling your application:

  • rocpd_op: GPU operations are stored in the rocpd_op table, serving as a base class for GPU operations.

  • rocpd_api: CPU-based calls are recorded in the rocpd_api table, typically comprising HIP API calls and ROCTX marks/ranges, among other entries. For instance, PyTorch's internal profiler emits time ranges for its internal operators, facilitating the analysis of interactions between PyTorch operators and HIP. Additional "subclass tables" exist to accommodate extra details for specific operation types, such as size for Copy operations and grid size for Kernel operations. Entries in these subclass tables reference the base operation entry for common information, such as GPU, stream, begin, and end.

  • rocpd_kernelapi: API calls launching kernels can store additional parameters in the rocpd_kernelapi table. Rows in this table reference entries in the rocpd_api table for base class fields and can be joined using rocpd_kernelapi.api_ptr_id = rocpd_api.id.

  • rocpd_copyapi: API calls performing copies can store extra parameters in the rocpd_copyapi table. Similar to rocpd_kernelapi, entries in this table reference entries in the rocpd_api table for base class fields and can be joined using rocpd_copyapi.api_ptr_id = rocpd_api.id.

  • api: The rocpd_api table with expanded strings for the 'apiName' and 'args' columns.

  • op: The rocpd_op table with expanded strings for the 'description' and 'opType' columns.

  • busy: Displays the percentage of GPU utilization for each GPU, averaged over the entire trace. For accuracy, the trace should not include "warmup" or should be sufficiently long.

  • top: Presents a list of operations consuming the most GPU time.

  • ktop: Lists kernel operations consuming the most time, excluding async copies and barriers.

  • kernel: Displays kernel launch parameters for each kernel.

  • copy: Shows all copy API calls with their parameters, including CPU timestamps.

  • copyop: Presents copy API calls and parameters for asynchronous copies, i.e., copies resulting in GPU operations. Includes GPU timestamps and represents a subset of all copies.

Execute the following code block to load the tables/views into pandas dataframes. You can revise the code to load and explore other tables/views you are interested in.

conn = sqlite3.connect("matmul_result.rpd")
df_op = pd.read_sql_query("SELECT * from op", conn)
df_top = pd.read_sql_query("SELECT * from top", conn)
df_ktop = pd.read_sql_query("SELECT * from ktop", conn)
df_busy = pd.read_sql_query("SELECT * from busy", conn)
conn.close()
df_op.head()
id gpuId queueId sequenceId start end description opType
0 1 2 0 0 8306017670485923 8306017825375030 CopyHostToDevice
1 2 2 0 0 8306017826047669 8306017828336625 CopyHostToDevice
2 3 2 0 0 8306020565922417 8306020592134373 Cijk_Ailk_Bljk_SB_MT128x64x16_MI32x32x2x1_SN_1... KernelExecution
3 4 2 0 0 8306020592134373 8306020592140613 void at::native::(anonymous namespace)::CatArr... KernelExecution
4 5 2 0 0 8306020592140613 8306020592146693 void at::native::(anonymous namespace)::CatArr... KernelExecution
df_top.head()
Name TotalCalls TotalDuration Ave Percentage
0 CopyHostToDevice 2 157178 78589 85.460920
1 Cijk_Ailk_Bljk_SB_MT128x64x16_MI32x32x2x1_SN_1... 1 26211 26211 14.251975
2 CopyDeviceToHost 53 334 6 0.182015
3 void at::native::(anonymous namespace)::CatArr... 6 37 6 0.020182
4 void at::native::reduce_kernel<512, 1, at::nat... 1 19 19 0.010787
df_ktop.head()
Name TotalCalls TotalDuration Ave Percentage
0 void at::native::index_elementwise_kernel<128,... 1 9410 9410 14.442580
1 void at::native::(anonymous namespace)::CatArr... 6 8390 1398 12.877698
2 void at::native::reduce_kernel<512, 1, at::nat... 1 5417 5417 8.315085
3 void at::native::modern::elementwise_kernel<at... 1 5334 5334 8.187123
4 void at::native::modern::elementwise_kernel<at... 2 5122 2561 7.861807
df_busy
gpuId GpuTime WallTime Busy
0 2 183918056 2964070817 0.062049

Upon reviewing the df_top and df_busy metrics, you may observe that the actual computation time on the GPU is significantly shorter compared to the total running time, with the actual matrix computation time being even shorter. This discrepancy is primarily attributed to the substantial overhead incurred by data movement between different hardware components. Consequently, there are instances where applications running solely on the CPU may outperform those utilizing the GPU, as they eliminate the need for data transfer between CPU and GPU. However, as the size of the data increases, this overhead becomes proportionally smaller, highlighting the superior performance capabilities of the GPU. To further illustrate this point, let's profile the same operation with much larger data and reexamine the metrics:

runTracer.sh -o matmul_result_large.rpd python matrix_mult.py --x_shape 100000 50000 --w_shape 50000 800
import sqlite3
import pandas as pd
conn = sqlite3.connect("matmul_result_large.rpd")
df_top = pd.read_sql_query("SELECT * from top", conn)
df_busy = pd.read_sql_query("SELECT * from busy", conn)
conn.close()
df_top.head()
Name TotalCalls TotalDuration Ave Percentage
0 CopyHostToDevice 2 1577218 788609 74.727344
1 Cijk_Ailk_Bljk_SB_MT64x64x16_MI32x32x2x1_SN_1L... 1 532802 532802 25.243742
2 CopyDeviceToHost 53 420 7 0.019931
3 void at::native::(anonymous namespace)::CatArr... 6 34 5 0.001637
4 void at::native::reduce_kernel<512, 1, at::nat... 1 20 20 0.000963
df_busy
gpuId GpuTime WallTime Busy
0 2 2110631504 4933119382 0.427849

Please note that after increasing the size of the X matrix by 10 times and the W matrix by 5 times, the runtime of the matrix multiplication kernel on the GPU has escalated from 14.25% to 25.24%, and the overall GPU time has surged from 6.2% to 42.78%. In essence, these tables furnish the essential metrics required for profiling your application.

How to trace an application

You might not find it surprising that we can trace an application using the .rpd file, as it already contains all the necessary information for tracing an application. All you need to do is convert the .rpd file to a .json file using the following command, allowing it to be imported into trace viewers like Chrome Trace. Please ensure to adjust the path to the rpd2tracing.py script for it to run properly. Once the command completes, you'll find the output matmul_result.json file in your working directory.

python3 ../rocmProfileData/tools/rpd2tracing.py matmul_result.rpd matmul_result.json

Now, you can download the matmul_result.json file, open Chrome, then go to "chrome://tracing/" and import the downloaded file into Chrome Trace to explore the traces. Below is a snippet of the trace displaying the matrix multiplication kernel.

trace_file

In this blog, we've introduced the primary functionalities of rocmProfileData. Feel free to explore additional useful features and functionalities by visiting its Github page.

Disclaimers

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.