
Llama 3.1 8b f16 sharded TP8 compiles but fails to run #19428

Open
aviator19941 opened this issue Dec 9, 2024 · 4 comments
Labels: bug 🐞 Something isn't working

@aviator19941 (Contributor)

What happened?

After compiling the llama 3.1 8b f16 TP8 IR and running it with iree-run-module, we get this error:

EXEC @prefill_bs4
Memory access fault by GPU node-3 (Agent handle: 0x55fbb87af7b0) on address 0x761d4b2f8000. Reason: Unknown.
Aborted (core dumped)

Steps to reproduce your issue

  1. Download the llama 3.1 8b f16 TP8 IR.
  2. Download the stripped weights and numpy inputs: https://gist.github.com/aviator19941/21b8bfdff635a2deabc79077c339857f.
  3. Compile IR:

../iree-build-no-trace/tools/iree-compile 8b_f16_bs4_tp8_nondecomposed_prefill.mlir --iree-hip-target=gfx942 -o=prefill_8b_tp8_12_5.vmfb --iree-hal-target-device=hip[0] --iree-hal-target-device=hip[1] --iree-hal-target-device=hip[2] --iree-hal-target-device=hip[3] --iree-hal-target-device=hip[4] --iree-hal-target-device=hip[5] --iree-hal-target-device=hip[6] --iree-hal-target-device=hip[7] --iree-dispatch-creation-enable-aggressive-fusion=true --iree-global-opt-propagate-transposes=true --iree-opt-aggressively-propagate-transposes=true --iree-opt-data-tiling=false --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' --iree-hal-indirect-command-buffers=true --iree-stream-resource-memory-model=discrete --iree-hip-legacy-sync=false --iree-hal-memoization=true --iree-opt-strip-assertions

  4. Run the vmfb:

ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ../iree-build-no-trace/tools/iree-run-module --hip_use_streams=true --device_allocator=caching --module=prefill_8b_tp8_12_5.vmfb --parameters=model=llama3.1_8b_fp16_tp8_parameters.irpa --parameters=model=llama3.1_8b_fp16_tp8_parameters.rank0.irpa --parameters=model=llama3.1_8b_fp16_tp8_parameters.rank1.irpa --parameters=model=llama3.1_8b_fp16_tp8_parameters.rank2.irpa --parameters=model=llama3.1_8b_fp16_tp8_parameters.rank3.irpa --parameters=model=llama3.1_8b_fp16_tp8_parameters.rank4.irpa --parameters=model=llama3.1_8b_fp16_tp8_parameters.rank5.irpa --parameters=model=llama3.1_8b_fp16_tp8_parameters.rank6.irpa --parameters=model=llama3.1_8b_fp16_tp8_parameters.rank7.irpa --device=hip://0 --device=hip://1 --device=hip://2 --device=hip://3 --device=hip://4 --device=hip://5 --device=hip://6 --device=hip://7 --function=prefill_bs4 --input=@/data/llama3.1/weights/8b/prefill_args_bs4_128/random_tokens.npy --input=@/data/llama3.1/weights/8b/prefill_args_bs4_128/seq_lens.npy --input=@/data/llama3.1/weights/8b/prefill_args_bs4_128/seq_block_ids.npy --input=@/data/llama3.1/weights/8b/prefill_args_bs4_128/cs_f16_shard_0.npy --input=@/data/llama3.1/weights/8b/prefill_args_bs4_128/cs_f16_shard_1.npy --input=@/data/llama3.1/weights/8b/prefill_args_bs4_128/cs_f16_shard_2.npy --input=@/data/llama3.1/weights/8b/prefill_args_bs4_128/cs_f16_shard_3.npy --input=@/data/llama3.1/weights/8b/prefill_args_bs4_128/cs_f16_shard_4.npy --input=@/data/llama3.1/weights/8b/prefill_args_bs4_128/cs_f16_shard_5.npy --input=@/data/llama3.1/weights/8b/prefill_args_bs4_128/cs_f16_shard_6.npy --input=@/data/llama3.1/weights/8b/prefill_args_bs4_128/cs_f16_shard_7.npy

  5. See the error:
EXEC @prefill_bs4
Memory access fault by GPU node-3 (Agent handle: 0x55fbb87af7b0) on address 0x761d4b2f8000. Reason: Unknown.
Aborted (core dumped)

What component(s) does this issue relate to?

Runtime

Version information

7dd6fa6 + commits from #18790

Additional context

No response

aviator19941 added the bug 🐞 Something isn't working label on Dec 9, 2024
@AWoloszyn (Contributor)

I took a quick look at this. It looks like data is being copied between devices by simply calling device.queue_copy (or similar), passing a reference to a buffer created on one device to a different device to perform the copy.

(This is based on observed behavior, not on inspecting the output IR.)

I think this is what is happening because, if I enable peering across all of the devices that are created, the problem goes away (since device_ptrs are then valid across devices). However, from my understanding of the API, this is not really how we expect it to be used: instead of 8 logical devices (each backed by a single physical device), we should be using a single logical device backed by 8 physical devices (where each physical device is represented by a queue on the logical device). That would let us set up peering explicitly between the requested devices.

@benvanik Is my understanding of this correct?

@AWoloszyn (Contributor)

Here is the hack I tried that gets things passing: https://gist.github.com/AWoloszyn/630716214cf64e936c4c6cca077c2752
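For reference, a minimal sketch (not the actual contents of the gist above) of what enabling peering across all visible HIP devices looks like with the HIP runtime API; the `enable_all_peer_access` helper is made up for illustration:

```c
// Sketch only: enable peer access between every pair of visible HIP devices
// so that device pointers allocated on one device are usable from the others.
#include <hip/hip_runtime.h>
#include <stdio.h>

static void enable_all_peer_access(void) {
  int device_count = 0;
  (void)hipGetDeviceCount(&device_count);
  for (int src = 0; src < device_count; ++src) {
    (void)hipSetDevice(src);  // peer access is enabled from the current device
    for (int dst = 0; dst < device_count; ++dst) {
      if (src == dst) continue;
      int can_access = 0;
      (void)hipDeviceCanAccessPeer(&can_access, src, dst);
      if (can_access) {
        // Returns hipErrorPeerAccessAlreadyEnabled if peering was already on.
        hipError_t err = hipDeviceEnablePeerAccess(dst, /*flags=*/0);
        if (err != hipSuccess && err != hipErrorPeerAccessAlreadyEnabled) {
          fprintf(stderr, "enabling peer %d -> %d failed: %d\n", src, dst,
                  (int)err);
        }
      }
    }
  }
}
```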

@benvanik (Collaborator)

benvanik commented Dec 10, 2024

We want to be using multiple physical devices via one logical HAL device in most cases, but all existing frameworks do things the other way around today, so we'll likely have to support both well enough :(

I believe the API we want to use is cuMemcpyPeerAsync, which supports device-to-device transfers of pointers that are local to each device. #19160 (which I need to land!) adds tracking of which device and queue affinity a buffer was allocated for, and that could be used to detect peer transfers and route them to cuMemcpyPeerAsync when appropriate.

(this support would be a new feature instead of a bug fix, so something that can happen after the bigger rewrite PR that's outstanding)
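As a rough illustration of that routing (not IREE's actual HAL code): assuming per-buffer device tracking along the lines of what #19160 adds, and assuming hipMemcpyPeerAsync as the HIP counterpart of cuMemcpyPeerAsync, a queue copy could be dispatched as below. The `tracked_buffer_t` struct and `queue_copy` helper are hypothetical:

```c
// Sketch only: route a copy to the peer-copy API when the source and
// destination buffers were allocated on different devices.
#include <hip/hip_runtime.h>
#include <stddef.h>

typedef struct {
  void* ptr;
  int device_id;  // assumed bookkeeping: device ordinal of the allocation
} tracked_buffer_t;

static hipError_t queue_copy(tracked_buffer_t dst, tracked_buffer_t src,
                             size_t size_bytes, hipStream_t stream) {
  if (dst.device_id != src.device_id) {
    // Cross-device transfer: each pointer is only valid on its own device,
    // so use the peer copy API.
    return hipMemcpyPeerAsync(dst.ptr, dst.device_id, src.ptr, src.device_id,
                              size_bytes, stream);
  }
  // Same-device transfer: a regular async device-to-device copy suffices.
  return hipMemcpyAsync(dst.ptr, src.ptr, size_bytes,
                        hipMemcpyDeviceToDevice, stream);
}
```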

@benvanik (Collaborator)

I dug through HIP to figure out where cuMemcpyPeerAsync bottoms out; it looks like it's here: https://github.com/ROCm/clr/blob/7c9c7a6332f69f740f59aaaeb83ad6eeff4598d6/rocclr/device/rocm/rocvirtual.cpp#L2240

If the devices are peered, it just does a normal copy, which explains why your patch works! I'm honestly not sure of the risk of enabling peering between devices we may not use as peers, beyond memory allocation overhead, since the virtual address spaces are unified (but that may be the case anyway). So if we find that peering has no cost when it isn't used, we could just peer all the devices.
