This library allows:
- Profiling PyTorch neural networks on a CPU,
- Simulating the execution of the neural network on a target hardware accelerator.
Import the library and typer. Create a typer app to handle the user input for the profiler.
# file:
import typer
import upmem_llm_framework as upmem_layers

app = typer.Typer(callback=upmem_layers.initialize_profiling_options)
Define your main function and add the desired user input. Initialize the library before creating or importing the neural network:
@app.command()
def profile(my_input: str):
    upmem_layers.profiler_init()
    # Create or import the neural network
    model = ...
    # Define the input tensor
    myTensor = ...
Call the profiler when doing a forward pass / inference:
upmem_layers.profiler_start()
prediction = model.forward(myTensor)
upmem_layers.profiler_end()
Call the app:
if __name__ == "__main__":
    app()
See the available options:
python --help
Run the app:
python --some-option profile my_input
You can find usage examples with a custom PyTorch model in
with a model from HuggingFace in
PyTorch model
python3 profile
Expected output:
Options: Options(report_layers=False, report_functions=False, print_log=False, print_log_summary=False, simulation=False, sim_compute=False, sim_data_type=<DataType.bfloat16: 'bfloat16'>, sim_num_key_value_heads=-1, sim_sliding_window=-1, sim_verbose=False, extra_archs=None)
The model:
(linear1): UPM_Linear(in_features=100, out_features=200, bias=True)
(activation): ReLU()
(linear2): UPM_Linear(in_features=200, out_features=10, bias=True)
Total time (SUM + GEN): 0.002975238 s, with data type: bfloat16, batch size: 1
Generated tokens: 0 in 0.002175651 s, with tokens/s: 0.0
Summarization step took: 0.000799587 s, weight in the execution: SUM: 0.268747239716621%, GEN: 0.7312527602833789%
tensor([0.0983, 0.0919, 0.1012, 0.0836, 0.0796, 0.1157, 0.1202, 0.0996, 0.0930,
0.1168], grad_fn=<SoftmaxBackward0>)
HuggingFace model
python3 profile
Expected output:
Options: Options(report_layers=False, report_functions=False, print_log=False, print_log_summary=False, simulation=False, sim_compute=False, sim_data_type=<DataType.bfloat16: 'bfloat16'>, sim_num_key_value_heads=-1, sim_sliding_window=-1, sim_verbose=False, extra_archs=None)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.03it/s]
Total time (SUM + GEN): 42.124470486 s, with data type: bfloat16, batch size: 1
Generated tokens: 57 in 41.404485553 s, with tokens/s: 1.3766624373834302
Summarization step took: 0.719984933 s, weight in the execution: SUM: 0.017091845302584535%, GEN: 0.9829081546974154%
How to prepare coffee?
There are many ways to prepare coffee, and the method you choose will depend on your personal preferences and the equipment you have available. Here are some common methods for preparing coffee:
1. Drip brewing: This is one of the most common methods of prepar
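The derived figures in the summary lines can be cross-checked by hand. The sketch below recomputes the tokens/s value and the SUM and GEN weights from the times printed in the HuggingFace run above. The helper name `summarize` is hypothetical and not part of the library. Note that the printed weights are fractions of 1, even though the log labels them with a percent sign.

```python
def summarize(total_s: float, sum_s: float, gen_s: float, tokens: int) -> dict:
    """Recompute the derived figures of the profiler summary (hypothetical helper)."""
    return {
        # Throughput only counts the generation (decoding) phase.
        "tokens_per_s": tokens / gen_s if gen_s > 0 else 0.0,
        # Share of the total time spent in each phase, as a fraction of 1.
        "sum_weight": sum_s / total_s,
        "gen_weight": gen_s / total_s,
    }


# Times copied from the HuggingFace expected output above.
stats = summarize(
    total_s=42.124470486,
    sum_s=0.719984933,
    gen_s=41.404485553,
    tokens=57,
)
print(stats)
```

The same arithmetic applied to the PyTorch run gives 0 tokens/s, since no tokens are generated there.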
The profiler records the start time and end time of a computation layer or function. Currently, the profiler doesn't track the real power consumption of the CPU.
The profiler identifies a layer or function by 4 parameters:
- the layer type (e.g. module) or function (e.g. softmax),
- the context in which the layer or function is called, meaning the variable name assigned to the layer or function (e.g. q_proj = torch.nn.Linear(...) has a context of q_proj),
- the input dimensions of the layer or function,
- specifically for layers, a unique ID assigned at layer initialization.
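To make the identification scheme concrete, here is a minimal sketch of how start and end times could be accumulated under a four-part key. `ProfileKey` and `TinyProfiler` are hypothetical names used for illustration only; they are not the library's classes, and the library's internal bookkeeping may differ.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProfileKey:
    """Hypothetical key mirroring the four identifying parameters."""
    kind: str                       # layer type or function name, e.g. "Linear" or "softmax"
    context: str                    # variable name the layer is assigned to, e.g. "q_proj"
    input_shape: Tuple[int, ...]    # input dimensions
    layer_id: Optional[int] = None  # unique ID, only set for layers


class TinyProfiler:
    """Accumulates elapsed wall-clock time per key (illustrative only)."""

    def __init__(self) -> None:
        self.elapsed_ns = defaultdict(int)
        self.calls = defaultdict(int)

    def record(self, key: ProfileKey, fn, *args):
        # Record the start and end time around a single call.
        start = time.perf_counter_ns()
        result = fn(*args)
        end = time.perf_counter_ns()
        self.elapsed_ns[key] += end - start
        self.calls[key] += 1
        return result


profiler = TinyProfiler()
key = ProfileKey(kind="Linear", context="q_proj", input_shape=(1, 100), layer_id=0)
# Stand-in workload; a real layer call would go here.
profiler.record(key, sum, range(1000))
profiler.record(key, sum, range(1000))
print(profiler.calls[key], profiler.elapsed_ns[key])
```

Because the key is a frozen dataclass, two calls with the same type, context, input dimensions, and ID land in the same bucket, while a change in any one of them creates a separate entry.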
By default, the profiler reports a summary with execution time, energy (when simulating), and power consumption (when simulating) at the end of its execution.
When simulating, this summary breaks down into the summarization (encoding) phase and the generation (decoding) phase.
You can enable the following flags to show more information:
- --report-layers: reports the layers created in the neural network with their associated parameters
- --report-functions: reports the functions called during the forward pass of the neural network with their associated parameters
- --print-log: prints a detailed, time-ordered log of each layer and function executed during the forward pass of the neural network
To run a simulation, library users need to provide a dictionary mapping layers with a device or hardware accelerator.
This dictionary contains name_of_layer:device,options key-value pairs.
The name of the layer corresponds to the context concept introduced before.
The device corresponds to one of the accelerators defined in the hardware accelerator profiles.
Currently supported options:
- 't' or transfer point: the input of a layer with this option comes from the CPU, which means that the last device sent its results back to the CPU and the CPU is sending them back as input to the layer's device.
- 'm' or MoE transfer point: the input of a layer with this option comes from the CPU but only once since the input is shared across different MoEs.
For instance, for a neural network composed of 2 Linear layers that execute sequentially in different chips:
layer_mapping = {
prediction = model.forward(myTensor)
This mapping corresponds to the following scheme:
graph LR
**CPU** -->|input of *linear1* is sent to **PIM-AI-1chip1** device| PIM-AI-1chip1["`**PIM-AI-1chip1**
Execute *linear1*`"]
PIM-AI-1chip1 -->|output of *linear1* is sent to **CPU**| **CPU**
**CPU** -->|input of *linear2* is sent to **PIM-AI-1chip2** device| PIM-AI-1chip2["`**PIM-AI-1chip2**
Execute *linear2*`"]
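As an illustration of the name_of_layer:device,options format, the sketch below builds a mapping for the two-layer scheme and splits each value into its device and options. The device names are taken from the diagram and the choice of the 't' option for both layers is an assumption based on the CPU round trip shown there; `parse_entry` is a hypothetical helper, not a library function.

```python
from typing import Dict, Tuple

# Hypothetical mapping in the name_of_layer: "device,options" format.
# Device names come from the diagram; the exact names the library
# accepts are an assumption here.
layer_mapping: Dict[str, str] = {
    "linear1": "PIM-AI-1chip1,t",
    "linear2": "PIM-AI-1chip2,t",
}

KNOWN_OPTIONS = {"t", "m"}  # transfer point, MoE transfer point


def parse_entry(value: str) -> Tuple[str, Tuple[str, ...]]:
    """Split "device,opt1,opt2" into (device, options); hypothetical helper."""
    device, *options = [part.strip() for part in value.split(",")]
    unknown = set(options) - KNOWN_OPTIONS
    if not device or unknown:
        raise ValueError(f"invalid mapping entry: {value!r}")
    return device, tuple(options)


for layer, value in layer_mapping.items():
    device, options = parse_entry(value)
    print(f"{layer}: device={device}, options={options}")
```

Validating the option letters up front is a design choice of this sketch: a typo in the mapping fails early instead of silently changing the simulated data path.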
After specifying the layer mapping, to run a simulation:
python3 --simulation profile
The file sim_architectures.yaml contains hardware accelerator profiles.
To add a new hardware accelerator profile, create a YAML file with the following structure:
# yaml-language-server: $schema=<path-to-this-library>/architectures_schema.json
# (The above line is optional, it will enable autocompletion and validation in editors that support
# the YAML language server)
# * Required parameters:
# - HOST communication
host_to_device_bw_GBs: 22
host_to_device_pj_per_bit: 200
device_to_host_bw_GBs: 88
device_to_host_pj_per_bit: 50
# - Device memory (shared memory like)
mem_bw_GBs: 6553.6
mem_pj_per_bit: 0.95
# - Compute
tflops: 320
pj_per_tflop: 0.4e+12
# * Optional parameters:
softmax_ns_per_element: 6.25e-03
SiLU_ns_per_element: 9.375e-03
RMSNorm_ns_per_element: 1.625e-02
Note: underscores in device names, such as new_device, are converted to hyphens, resulting in new-device in the layer mapping.
This library makes two assumptions to simplify execution modelling across hardware profiles:
- Interconnect communication latency is ignored: the library assumes that communication between devices finishes fast enough to overlap with compute and stay hidden. For instance, when simulating more than one GPU, it doesn't model the required data exchange between them. For a PIM-AI device (DIMM), it doesn't model communication within a DIMM.
- Devices always reach peak performance: all hardware profiles perform operations at their peak performance. This is unrealistic in some scenarios. Adding a performance ratio to model this is left to future work.
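Under the peak-performance assumption, a first-order cost estimate follows directly from the profile parameters. The sketch below is not the library's simulation model; it is a hedged illustration that assumes 1 GB/s means 1e9 bytes per second and that pj_per_tflop is the energy in picojoules per 1e12 floating-point operations. `DeviceProfile`, `transfer_cost`, and `compute_cost` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DeviceProfile:
    """Subset of the required profile parameters from the YAML example."""
    host_to_device_bw_GBs: float
    host_to_device_pj_per_bit: float
    tflops: float
    pj_per_tflop: float


def transfer_cost(profile: DeviceProfile, num_bytes: int) -> Tuple[float, float]:
    """Host-to-device time (s) and energy (pJ), assuming peak bandwidth."""
    seconds = num_bytes / (profile.host_to_device_bw_GBs * 1e9)
    picojoules = num_bytes * 8 * profile.host_to_device_pj_per_bit
    return seconds, picojoules


def compute_cost(profile: DeviceProfile, flops: float) -> Tuple[float, float]:
    """Compute time (s) and energy (pJ), assuming peak throughput."""
    seconds = flops / (profile.tflops * 1e12)
    picojoules = flops / 1e12 * profile.pj_per_tflop
    return seconds, picojoules


# Values copied from the YAML example above.
profile = DeviceProfile(
    host_to_device_bw_GBs=22,
    host_to_device_pj_per_bit=200,
    tflops=320,
    pj_per_tflop=0.4e12,
)

# Linear layer with 100 inputs and 200 outputs, batch size 1, bfloat16 (2 bytes).
input_bytes = 100 * 2
flops = 2 * 100 * 200  # one multiply and one add per weight

print(transfer_cost(profile, input_bytes))
print(compute_cost(profile, flops))
```

For a layer this small, the host transfer dominates both time and energy in this estimate, which is why transfer points in the layer mapping matter for the simulated result.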
This library expects Python 3.10.
If your distribution doesn't provide it, you can use pyenv to install it, or any other Python version manager:
pyenv install 3.10
pyenv shell 3.10
Now your shell runs Python 3.10.
Preferably, create a virtual environment to install the library:
python -m venv venv
source venv/bin/activate
This avoids conflicts with other Python libraries in your system.
To install the library in your current Python environment:
python -m pip install .
To install the library for editing in your current Python environment, with the necessary development dependencies:
python -m pip install -e '.[dev]'
Run the tests with:
python -m pytest
This project uses black formatting. Please make sure to run it before submitting your changes:
python -m black src/upmem_llm_framework/*.py