
UPMEM LLM framework for profiling / simulation

This library allows:

  1. Profiling PyTorch neural networks on a CPU,
  2. Simulating the execution of the neural network on a target hardware accelerator.

Usage

  1. Import the upmem_llm_framework library and typer. Create a typer app to handle the user input for the profiler.

    # file: my_profiler.py
    import typer
    
    import upmem_llm_framework as upmem_layers
    
    app = typer.Typer(callback=upmem_layers.initialize_profiling_options)
  2. Define your main function and add the desired user input. Initialize the library before creating or importing the neural network:

    @app.command()
    def profile(my_input: str):
        upmem_layers.profiler_init()
        # Create or import the neural network
        model = ...
        # Define the input tensor
        myTensor = ...
  3. Call the profiler when doing a forward pass / inference:

        upmem_layers.profiler_start()
        prediction = model.forward(myTensor)
        upmem_layers.profiler_end()
  4. Call the app:

    if __name__ == "__main__":
        app()
  5. See the available options:

    python my_profiler.py --help
  6. Run the app:

    python my_profiler.py --some-option profile my_input
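
Putting steps 1 to 4 together, a minimal my_profiler.py could look like the sketch below; the two-layer model and tensor shape are illustrative, not part of the library:

# file: my_profiler.py -- minimal sketch combining steps 1 to 4
import torch
import typer

import upmem_llm_framework as upmem_layers

app = typer.Typer(callback=upmem_layers.initialize_profiling_options)


@app.command()
def profile(my_input: str):
    # Initialize the library before creating or importing the network
    upmem_layers.profiler_init()
    # Illustrative model: any PyTorch module works here
    model = torch.nn.Sequential(
        torch.nn.Linear(100, 200),
        torch.nn.ReLU(),
        torch.nn.Linear(200, 10),
    )
    myTensor = torch.randn(1, 100)
    upmem_layers.profiler_start()
    prediction = model.forward(myTensor)
    upmem_layers.profiler_end()
    print(prediction)


if __name__ == "__main__":
    app()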

Examples

You can find usage examples with a custom PyTorch model in nn_example.py and with a model from HuggingFace in hf_example.py.

PyTorch model
python3 nn_example.py profile

Expected output:

Options: Options(report_layers=False, report_functions=False, print_log=False, print_log_summary=False, simulation=False, sim_compute=False, sim_data_type=<DataType.bfloat16: 'bfloat16'>, sim_num_key_value_heads=-1, sim_sliding_window=-1, sim_verbose=False, extra_archs=None)
The model:
TinyModel(
  (linear1): UPM_Linear(in_features=100, out_features=200, bias=True)
  (activation): ReLU()
  (linear2): UPM_Linear(in_features=200, out_features=10, bias=True)
)
##### UPMEM PROFILER OUTPUT #####
Total time (SUM + GEN): 0.002975238 s, with data type: bfloat16, batch size: 1
Generated tokens: 0 in 0.002175651 s, with tokens/s: 0.0
Summarization step took: 0.000799587 s, weight in the execution: SUM: 0.268747239716621%, GEN: 0.7312527602833789%
##### END UPMEM PROFILER OUTPUT #####
tensor([0.0983, 0.0919, 0.1012, 0.0836, 0.0796, 0.1157, 0.1202, 0.0996, 0.0930,
        0.1168], grad_fn=<SoftmaxBackward0>)
HuggingFace model
python3 hf_example.py profile

Expected output:

Options: Options(report_layers=False, report_functions=False, print_log=False, print_log_summary=False, simulation=False, sim_compute=False, sim_data_type=<DataType.bfloat16: 'bfloat16'>, sim_num_key_value_heads=-1, sim_sliding_window=-1, sim_verbose=False, extra_archs=None)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.03it/s]
torch.Size([6])
##### UPMEM PROFILER OUTPUT #####
Total time (SUM + GEN): 42.124470486 s, with data type: bfloat16, batch size: 1
Generated tokens: 57 in 41.404485553 s, with tokens/s: 1.3766624373834302
Summarization step took: 0.719984933 s, weight in the execution: SUM: 0.017091845302584535%, GEN: 0.9829081546974154%
##### END UPMEM PROFILER OUTPUT #####
How to prepare coffee?

There are many ways to prepare coffee, and the method you choose will depend on your personal preferences and the equipment you have available. Here are some common methods for preparing coffee:

1. Drip brewing: This is one of the most common methods of prepar

Profiler

The profiler records the start time and end time of a computation layer or function. Currently, the profiler doesn't track the real power consumption of the CPU.

The profiler identifies a layer or function by 4 parameters:

  1. Layer type (e.g. a Linear module) or function (e.g. softmax),
  2. Context in which the layer or function is called, meaning the variable name assigned to the layer or function (e.g. q_proj = torch.nn.Linear(...) has a context of q_proj; see the sketch below),
  3. The input dimensions of the layer or function,
  4. For layers specifically, a unique ID assigned at layer initialization.
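
For illustration, the context of a layer is simply the attribute name it is bound to; the module and dimensions below are made up:

import torch

class Attention(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # This layer's context is "q_proj": the name it is assigned to
        self.q_proj = torch.nn.Linear(4096, 4096)
        # And this layer's context is "k_proj"
        self.k_proj = torch.nn.Linear(4096, 4096)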

Profiler output

By default, the profiler reports a summary with execution time, energy (when simulating), and power consumption (when simulating) at the end of its execution.

When simulating, this summary breaks down into the summarization (encoding) phase and the generation (decoding) phase.

You can enable the following flags to show more information:

  • --report-layers: reports the layers created in the neural network, with their associated parameters
  • --report-functions: reports the functions called during the forward pass of the neural network, with their associated parameters
  • --print-log: prints a time-ordered, detailed log of each layer and function executed during the forward pass of the neural network
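
For example, to profile the bundled PyTorch model with layer reporting and the detailed log enabled:

python3 nn_example.py --report-layers --print-log profile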

Simulation

To run a simulation, library users need to provide a dictionary mapping layers to a device or hardware accelerator.

This dictionary contains name_of_layer:device,options key-value pairs. The name of the layer corresponds to the context concept introduced before. The device corresponds to one of the accelerators defined in sim_architectures.yaml.

Currently supported options:

  • 't' or transfer point: the input of a layer with this option comes from the CPU, meaning the previous device sent its results back to the CPU, and the CPU forwards them as input to this layer's device.
  • 'm' or MoE transfer point: the input of a layer with this option comes from the CPU, but only once, since the input is shared across the different MoE experts.

For instance, for a neural network composed of 2 Linear layers that execute sequentially on different chips:

layer_mapping = {
    "linear1": "PIM-AI-1chip,t",
    "linear2": "PIM-AI-1chip,t",
}

upmem_layers.profiler_start(layer_mapping)
prediction = model.forward(myTensor)
upmem_layers.profiler_end()

This mapping corresponds to the following scheme:

graph LR
    CPU -->|input of linear1 is sent to PIM-AI-1chip1| PIM1["PIM-AI-1chip1: execute linear1"]
    PIM1 -->|output of linear1 is sent back to CPU| CPU
    CPU -->|input of linear2 is sent to PIM-AI-1chip2| PIM2["PIM-AI-1chip2: execute linear2"]
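
The 'm' option works analogously for mixture-of-experts layers. A hedged sketch with hypothetical layer names, where the shared input is transferred from the CPU only once and reused by every expert mapped with 'm':

layer_mapping = {
    "gate":    "PIM-AI-1chip,t",
    "expert0": "PIM-AI-1chip,m",
    "expert1": "PIM-AI-1chip,m",
}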

Running a simulation

After specifying the layer mapping, run a simulation with:

python3 hf_example.py --simulation profile
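
The Options dump shown earlier hints at further simulation knobs (sim_data_type, sim_verbose, ...). Assuming typer's usual hyphenated flag naming, which you should confirm with python3 hf_example.py --help, an invocation could look like:

python3 hf_example.py --simulation --sim-verbose profile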

Adding a hardware accelerator

The file sim_architectures.yaml contains hardware accelerator profiles.

To add a new hardware accelerator profile, create a YAML file with the following structure:

# yaml-language-server: $schema=<path-to-this-library>/architectures_schema.json
# (The above line is optional, it will enable autocompletion and validation in editors that support
# the YAML language server)

My_accelerator:
    # * Required parameters:
    #   - HOST communication
    host_to_device_bw_GBs: 22
    host_to_device_pj_per_bit: 200
    device_to_host_bw_GBs: 88
    device_to_host_pj_per_bit: 50
    #   - Device memory (shared memory like)
    mem_bw_GBs: 6553.6
    mem_pj_per_bit: 0.95
    #   - Compute
    tflops: 320
    pj_per_tflop: 0.4e+12
    # * Optional parameters:
    softmax_ns_per_element: 6.25e-03
    SiLU_ns_per_element: 9.375e-03
    RMSNorm_ns_per_element: 1.625e-02

My_accelerator2:
    <...>

Note: underscores in device names such as new_device are converted to hyphens, resulting in new-device in the layer mapping.
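
As a back-of-the-envelope illustration of what these parameters mean (this is not the library's exact cost model), the sketch below estimates the time and energy of a single 4096x4096 matrix-vector product on My_accelerator:

# Rough estimate only; the simulator's real model may differ.
m, k = 4096, 4096                      # GEMV: (1 x k) @ (k x m)
bytes_per_el = 2                       # bfloat16

flops = 2 * m * k                      # multiply-accumulate count
compute_s = flops / (320 * 1e12)       # tflops: 320

weight_bytes = m * k * bytes_per_el
mem_s = weight_bytes / (6553.6 * 1e9)  # mem_bw_GBs: 6553.6

# Energy: pj_per_tflop for compute, mem_pj_per_bit for weight traffic
compute_j = (flops / 1e12) * 0.4e12 * 1e-12
mem_j = weight_bytes * 8 * 0.95 * 1e-12

print(f"compute {compute_s * 1e6:.3f} us, memory {mem_s * 1e6:.3f} us")
print(f"compute {compute_j * 1e3:.3f} mJ, memory {mem_j * 1e3:.3f} mJ")

On this profile the operation is memory-bound: streaming the weights (about 5.1 us) dominates the compute time (about 0.1 us).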

Notes on simulation

This library makes two assumptions to simplify execution modelling across hardware profiles:

  1. Interconnect communication latency is ignored: the library assumes that communication between devices finishes fast enough to overlap with compute and be hidden. For instance, when simulating more than one GPU, it doesn't model the required data exchange between them; for an AI-PIM device (DIMM), it doesn't model communication within the DIMM.
  2. Devices always reach peak performance: every hardware profile performs its operations at peak performance, which is unrealistic in some scenarios. Adding a performance ratio to model this is left to future work.

Installation

Environment setup

Python version

This library expects Python 3.10.

If your distribution doesn't provide it, you can install it with pyenv or any other Python version manager:

pyenv install 3.10
pyenv shell 3.10

Now your shell runs Python 3.10.

Virtual environment

Preferably, create a virtual environment to install the library:

python -m venv venv
source venv/bin/activate

This avoids conflicts with other Python libraries in your system.

User installation

To install the library in your current Python environment:

python -m pip install .

Developer installation

To install the library for editing in your current Python environment, with the necessary development dependencies:

python -m pip install -e '.[dev]'

Running tests

Run the tests with:

python -m pytest

Formatting

This project uses the black formatter. Please make sure to run it before committing:

python -m black src/upmem_llm_framework/*.py
