How to enable Intel AMX in asm! on Linux? #107795

Closed
jczaja opened this issue Feb 8, 2023 · 13 comments · Fixed by #113525
Labels
A-target-feature Area: Enabling/disabling target features like AVX, Neon, etc. O-linux Operating system: Linux O-x86_64 Target: x86-64 processors (like x86_64-*)

Comments

@jczaja

jczaja commented Feb 8, 2023

Hi,

I want to use the Intel AMX instruction set in Rust (via inline assembly).
AMX support is disabled by default in the Linux kernel because of the significant amount of state (~10KB) that has to be saved on the stack on context switches for programs using AMX. To enable AMX we need a processor
with this capability (Sapphire Rapids), a recent enough Linux kernel (5.16+), and FPU and sigaltstack stacks large enough to store the AMX tiles (registers). Article on enabling AMX is here.
I have implemented two programs to enable and test AMX: one in C++ and the other in Rust. The C++ one initializes AMX properly, but the Rust one does not (likely because the stack sizes are not big enough, see "PROGRAM EXITS HERE" in the Rust example). A similar problem was described for the Python programming language. Please advise how to enable AMX support in Rust on Sapphire Rapids under Linux.

Details:

C++ program:

#include <cstdio>        // puts, printf
#include <unistd.h>      // syscall
#include <sys/syscall.h> // SYS_arch_prctl

namespace {

#define XFEATURE_XTILECFG 17
#define XFEATURE_XTILEDATA 18
#define XFEATURE_MASK_XTILECFG (1 << XFEATURE_XTILECFG)
#define XFEATURE_MASK_XTILEDATA (1 << XFEATURE_XTILEDATA)
#define XFEATURE_MASK_XTILE (XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA)
#define ARCH_GET_XCOMP_PERM 0x1022
#define ARCH_REQ_XCOMP_PERM 0x1023

bool init() {
    unsigned long bitmask = 0;
    long status = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &bitmask);
    if (0 != status) return false;
    if (bitmask & XFEATURE_MASK_XTILEDATA) return true;

    status = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
    if (0 != status)
        return false; // XFEATURE_XTILEDATA setup is failed, TMUL usage is not allowed
    status = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &bitmask);

    // XFEATURE_XTILEDATA setup is failed, can't use TMUL
    if (0 != status || !(bitmask & XFEATURE_MASK_XTILEDATA)) return false;

    // XFEATURE_XTILEDATA set successfully, TMUL usage is allowed
    return true;
}
}

int main(int argc, char **argv) {

    puts("Using system call to enable AMX...");
    if (!init()) { 
      printf("Error: AMX is not available\n");
      return 1;
    }
    puts("...AMX is now enabled!\n");
}

Rust:

main.rs:
use syscalls::*;

fn initialize_amx_if_available() -> bool {
    const ARCH_GET_XCOMP_PERM: usize = 0x1022;
    const ARCH_REQ_XCOMP_PERM: usize = 0x1023;
    const XFEATURE_XTILECFG: usize = 17;
    const XFEATURE_XTILEDATA: usize = 18;
    const XFEATURE_MASK_XTILEDATA: usize = 1 << XFEATURE_XTILEDATA;
    const XFEATURE_MASK_XTILECFG: usize = 1 << XFEATURE_XTILECFG;
    const XFEATURE_MASK_XTILE: usize = XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA;

    // The kernel writes the permission bitmask through this pointer, so the
    // buffer must be mutable and passed as a mutable pointer.
    let mut bitmask: [usize; 1] = [0; 1];
    let mut status: usize = 0;
    unsafe {
        let maybe_status = syscall!(Sysno::arch_prctl, ARCH_GET_XCOMP_PERM, bitmask.as_mut_ptr());
        match maybe_status {
            Ok(s) => status = s,
            Err(_) => {
                println!("AMX not supported!");
                return false;
            }
        }
    }

    if (bitmask[0] & XFEATURE_MASK_XTILEDATA) != 0 {
        return true;
    }

    unsafe {
        let maybe_status = syscall!(Sysno::arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
        match maybe_status {
            Ok(s) => status = s,
            Err(err) => {
                println!("AMX Error: XFEATURE_XTILEDATA setup is failed, TMUL usage is not allowed! Error: {}",err);
                return false;          //<========================================== PROGRAM EXITS HERE!!!
            }
        }
    }

    unsafe {
        status = syscall!(Sysno::arch_prctl, ARCH_GET_XCOMP_PERM, bitmask.as_mut_ptr())
            .expect("Error: ARCH_PRCTL syscall failed!");
    }
    }
    if status != 0 || ((bitmask[0] & XFEATURE_MASK_XTILEDATA) == 0) {
        println!("AMX not supported!");
        return false;   
    }

    // XFEATURE_XTILEDATA set successfully, TMUL usage is allowed
    true
}

fn main() {
  if initialize_amx_if_available() {
      println!("Success: AMX Enabled!");
  } else {
      println!("ERROR: Could not enable AMX!");
  }
}
Cargo.toml:
[package]
name = "test-enable-amx"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
syscalls = "0.6.7"

Building: RUSTFLAGS='-C target-cpu=sapphirerapids -C target-feature=+amx-int8,+amx-bf16,+amx-tile' cargo build
Toolchains used: 1.67.0, 1.69.0-nightly
Linux kernel: 5.19.0-1.el8.elrepo.x86_64
OS: CentOS 8.5

@jczaja jczaja changed the title from "Problem enabling SapphireRapids instruction set (AMX) for Rust in Linux?" to "Problem enabling SapphireRapids instruction set (AMX) for Rust in Linux." Feb 8, 2023
@saethlin
Member

saethlin commented Feb 8, 2023

If the problem is stack size, set the environment variable RUST_MIN_STACK to the size in bytes that you want for the main thread's stack. You could also spawn a new thread and set the stack size you want with https://doc.rust-lang.org/std/thread/struct.Builder.html#method.stack_size
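
For reference, a minimal sketch of the `Builder::stack_size` route mentioned above. The helper name `run_with_stack` and the 16 MiB figure are illustrative assumptions, not something from this thread, and the setting only affects the stack of the spawned thread.

```rust
use std::thread;

/// Runs `work` on a freshly spawned thread whose stack is at least
/// `stack_bytes` large, and returns the closure's result.
/// (`run_with_stack` is a hypothetical helper name.)
fn run_with_stack<T, F>(stack_bytes: usize, work: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(work)
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}

fn main() {
    // A 4 MiB array lives on the worker's stack, which was sized at 16 MiB.
    let sum = run_with_stack(16 * 1024 * 1024, || {
        let big = [1u8; 4 * 1024 * 1024];
        big.iter().map(|&b| b as u64).sum::<u64>()
    });
    println!("sum of stack buffer = {}", sum);
}
```

This controls the thread's ordinary stack only; it says nothing about the alternate signal stack.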

@workingjubilee workingjubilee changed the title from "Problem enabling SapphireRapids instruction set (AMX) for Rust in Linux." to "How to enable Intel AMX in asm! on Linux?" Feb 8, 2023
@workingjubilee
Member

RUST_MIN_STACK is documented as not setting the main thread's stack size, is that true? https://doc.rust-lang.org/std/thread/

In that case this recommendation is incorrect?

This doesn't mean the problem isn't stack size: it's quite plausible the issue is that Rust is consuming an abnormally large amount of the stack and a missed optimization is happening here. I believe you may programmatically increase the stack size via setrlimit to something more than the 8 megabytes that Linux and gcc will default to.

You may also need to use #[inline(never)] on the initialize_amx_if_available function, if I am reading this correctly:

Also, while it shouldn't matter here, if you want upgrades to more recent kernels and userlands which have better support in general for AMX, you may want to switch from CentOS 8 to CentOS Stream 8 via the directions here and then upgrade to CentOS Stream 9. Most notably, CentOS Stream 9 should have binutils support for Intel AMX.

@workingjubilee workingjubilee added O-linux Operating system: Linux O-x86_64 Target: x86-64 processors (like x86_64-*) labels Feb 9, 2023
@saethlin
Member

saethlin commented Feb 9, 2023

In that case this recommendation is incorrect?

Yup, I'm wrong. I must have been misremembering the code I used last time I was experimenting with a stack size issue.

@workingjubilee
Member

Also if you want further help with debugging this, @jczaja, it might help if you describe the message you got from running this code, exactly. You say it exits but... what happens, exactly?

@jczaja
Author

jczaja commented Feb 9, 2023

@workingjubilee, @saethlin I must apologize, as I labelled the wrong line with "PROGRAM EXITS HERE". I updated the code (snippet below) to print the error message:

 unsafe {
      let maybe_status = syscall!(Sysno::arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
      match maybe_status {
          Ok(s) => status = s,
          Err(err) => {
              println!("AMX Error: XFEATURE_XTILEDATA setup is failed, TMUL usage is not allowed! Error: {}",err);
              return false;          //<========================================== PROGRAM EXITS HERE!!!
          }
      }
  }

Error message:

AMX Error: XFEATURE_XTILEDATA setup is failed, TMUL usage is not allowed! Error: -28 ENOSPC (No space left on device)                                                                        
ERROR: Could not enable AMX!

Output of strace of Rust program:

arch_prctl(0x1022 /* ARCH_??? */, 0x7fff984dc7c0) = 0
arch_prctl(0x1023 /* ARCH_??? */, 0x12) = -1 ENOSPC (No space left on device)
write(1, "AMX Error: XFEATURE_XTILEDATA se"..., 118AMX Error: XFEATURE_XTILEDATA setup is failed, TMUL usage is not allowed! Error: -28 ENOSPC (No space left on device)
) = 118
write(1, "ERROR: Could not enable AMX!\n", 29ERROR: Could not enable AMX!
) = 29

Output of strace of C++ program:

arch_prctl(0x1022 /* ARCH_??? */, 0x7ffe4b737ee0) = 0
arch_prctl(0x1023 /* ARCH_??? */, 0x12) = 0
arch_prctl(0x1022 /* ARCH_??? */, 0x7ffe4b737ee0) = 0
write(1, "...AMX is now enabled!\n", 23...AMX is now enabled!
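
A note on reading the traces: the `0x12` passed with `0x1023` (`ARCH_REQ_XCOMP_PERM`) is the feature number 18 (`XFEATURE_XTILEDATA`), while the programs above test the value filled in by `0x1022` (`ARCH_GET_XCOMP_PERM`) as a bitmask in which that feature is bit 18. A small sketch of that relationship, using the constants from the programs above; the helper name `tile_data_permitted` is made up for illustration:

```rust
// Constants as used by the C++ and Rust programs in this thread.
const XFEATURE_XTILECFG: u32 = 17;
const XFEATURE_XTILEDATA: u32 = 18;
const XFEATURE_MASK_XTILECFG: u64 = 1 << XFEATURE_XTILECFG;
const XFEATURE_MASK_XTILEDATA: u64 = 1 << XFEATURE_XTILEDATA;

/// True if a permission bitmask (as filled in by ARCH_GET_XCOMP_PERM)
/// has the tile data bit set. Hypothetical helper for illustration.
fn tile_data_permitted(bitmask: u64) -> bool {
    (bitmask & XFEATURE_MASK_XTILEDATA) != 0
}

fn main() {
    // The request takes the feature number, the query returns a mask.
    println!("request argument: {:#x}", XFEATURE_XTILEDATA);
    println!("tile config mask: {:#x}", XFEATURE_MASK_XTILECFG);
    println!("tile data mask:   {:#x}", XFEATURE_MASK_XTILEDATA);
    println!(
        "both bits set -> permitted: {}",
        tile_data_permitted(XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA)
    );
}
```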

@ChangSeokBae

ENOSPC looks to be relevant here: https://github.com/torvalds/linux/blob/master/arch/x86/kernel/fpu/xstate.c#L1579

It looks like sigaltstack(2) was called somewhere with a small size that is not enough for the AMX states.

Then, I found this -- #69533:
Rust’s runtime seems to install a SIGSEGV handler via sigaltstack with SIGSTKSZ – 8KB.

At least, this constant should be replaced by some dynamic value like getauxval(AT_MINSIGSTKSZ)
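
A rough sketch of what such a dynamic lookup could look like, under stated assumptions. To stay dependency-free it reads `/proc/self/auxv` instead of calling `getauxval` (the two should report the same entry on Linux), `AT_MINSIGSTKSZ` is taken to be 51 as in the Linux headers, and the function names are hypothetical:

```rust
use std::fs;
use std::mem::size_of;

// Auxiliary-vector key for the minimum signal stack size on Linux.
// Redefined by hand here; assumed to be 51 as in <linux/auxvec.h>.
const AT_MINSIGSTKSZ: usize = 51;

/// Decodes one native-endian machine word from exactly word-sized bytes.
fn word_at(bytes: &[u8]) -> usize {
    let mut buf = [0u8; size_of::<usize>()];
    buf.copy_from_slice(bytes);
    usize::from_ne_bytes(buf)
}

/// Searches a raw auxiliary vector (a sequence of key/value word pairs)
/// for `key` and returns the matching value, if any.
fn auxv_lookup(raw: &[u8], key: usize) -> Option<usize> {
    let word = size_of::<usize>();
    raw.chunks_exact(2 * word)
        .map(|pair| (word_at(&pair[..word]), word_at(&pair[word..])))
        .find(|&(k, _)| k == key)
        .map(|(_, v)| v)
}

/// Minimum signal stack size reported by the kernel for this process,
/// or None if the entry is missing or the file cannot be read.
fn reported_min_sigstksz() -> Option<usize> {
    let raw = fs::read("/proc/self/auxv").ok()?;
    auxv_lookup(&raw, AT_MINSIGSTKSZ)
}

fn main() {
    match reported_min_sigstksz() {
        Some(size) => println!("kernel reports AT_MINSIGSTKSZ = {} bytes", size),
        None => println!("AT_MINSIGSTKSZ not reported on this system"),
    }
}
```

Older kernels may not provide the entry at all, so a caller would still need some fallback size.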

@workingjubilee
Member

workingjubilee commented Feb 10, 2023

Ahh, okay, so it's definitely the sigaltstack part here, and that will be harder to fix given our own signal handlers are right there, yes.

Yes, if this diagnosis is correct, we should probably make something like that change. We will want to also be prepared for... "zaniness" like Arm SVE or RV64V_Zvl1024b.

@jczaja
Author

jczaja commented Feb 15, 2023

@workingjubilee

It seems that the solution suggested by @ChangSeokBae works.
Here is a snippet that makes AMX work in Rust:

    // request kernel to allocate larger signal-stack sizes, so the
    // amx state can be saved (via XSAVE) when a signal arrives
    unsafe {
        // or could the needed size be derived from getauxval(AT_MINSIGSTKSZ) + SIGSTKSZ?
        let size = 1024 * 1024;
        let st_mem = libc::malloc(size);
        let new_sig_stack = libc::stack_t {
            ss_flags: 0,
            ss_size: size,
            ss_sp: st_mem,
        };
        let res = libc::sigaltstack(&new_sig_stack, std::ptr::null_mut());
        println!("sigaltstack res = {res:?}, stack addr = {:?}", st_mem);
        if res != 0 {
            panic!("ERROR: Failed to change sigaltstack size");
        }
    }

@jczaja
Author

jczaja commented Feb 15, 2023

@workingjubilee , @ChangSeokBae , @saethlin

Here is a full dummy example of using AMX from Rust (as of the stable 1.67.0 toolchain):

main.rs:

use std::arch::asm;
use syscalls::*;

const TILE_BYTES_PER_ROW: u16 = 8; // N (4x due to dword)
const TILE_ROWS_T0: u8 = 3; // M
const TILE_ROWS_T1: u8 = 3; // M
const TILE_ROWS_T2: u8 = 2; //

#[repr(C, packed)] // repr(C) fixes the field order that ldtilecfg reads
struct amx_memory_layout {
    palette: u8, // leaving these values undefined causes a segmentation fault
    start_row: u8,
    reserved: [u8; 14],
    tiles_bytes_per_row: [u16; 8], // at most 64 bytes per tile row
    reserved2: [u16; 8],
    tiles_rows: [u8; 8], // at most 16 rows per tile
    reserved3: [u8; 8],
}
impl amx_memory_layout {
    fn new() -> Self {
        amx_memory_layout {
            palette: 1,
            start_row: 0,
            reserved: [0; 14],
            tiles_bytes_per_row: [TILE_BYTES_PER_ROW; 8],
            reserved2: [0; 8],
            tiles_rows: [TILE_ROWS_T0, TILE_ROWS_T1, TILE_ROWS_T2, 8, 8, 8, 8, 8],
            reserved3: [0; 8],
        }
    }
}

fn load_amx_config() {
    // lets initialize palette
    let mycfg: [amx_memory_layout; 1] = [amx_memory_layout::new()];
    unsafe {
        asm!(
        "ldtilecfg [{cfg}]",
        cfg = in(reg)  mycfg.as_ptr(),
        )
    }
}

pub unsafe fn bench_amx() {
    const D_NUM_ELEMENTS: usize = TILE_BYTES_PER_ROW as usize * TILE_ROWS_T0 as usize / 4 as usize;
    const S1_NUM_ELEMENTS: usize = TILE_BYTES_PER_ROW as usize * TILE_ROWS_T1 as usize;
    const S2_NUM_ELEMENTS: usize = TILE_BYTES_PER_ROW as usize * TILE_ROWS_T2 as usize;
    let s1buf: [u8; S1_NUM_ELEMENTS] = [1; S1_NUM_ELEMENTS];
    let s2buf: [u8; S2_NUM_ELEMENTS] = [1; S2_NUM_ELEMENTS];
    // tilestored writes into this buffer, so it must be mutable
    let mut dbuf: [u32; D_NUM_ELEMENTS] = [0; D_NUM_ELEMENTS];
    println!("INITIAL OUTPUT BUFFER: {:?}", dbuf);
    println!("FIRST ARG: {:?}", s1buf);
    println!("SECOND ARG: {:?}", s2buf);
    asm!(
        "tilezero tmm3",
        "tileloadd tmm1, [{s1buf}]",
        "tileloadd tmm2, [{s2buf}]",
        "tdpbuud tmm0, tmm1, tmm2",
        "tilestored [{dbuf}], tmm0",
        s1buf = in(reg) s1buf.as_ptr(),
        s2buf = in(reg) s2buf.as_ptr(),
        dbuf = in(reg) dbuf.as_mut_ptr(),
        out("tmm0") _,
        out("tmm1") _,
        out("tmm2") _,
        out("tmm3") _,
    );
    println!("FINAL OUTPUT BUFFER: {:?}", dbuf);
}

fn initialize_amx_if_available() -> bool {
    const ARCH_GET_XCOMP_PERM: usize = 0x1022;
    const ARCH_REQ_XCOMP_PERM: usize = 0x1023;
    const XFEATURE_XTILECFG: usize = 17;
    const XFEATURE_XTILEDATA: usize = 18;
    const XFEATURE_MASK_XTILEDATA: usize = 1 << XFEATURE_XTILEDATA;
    const XFEATURE_MASK_XTILECFG: usize = 1 << XFEATURE_XTILECFG;
    const XFEATURE_MASK_XTILE: usize = XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA;

    // request kernel to allocate larger signal-stack sizes, so the
    // amx state can be saved (via XSAVE) when a signal arrives
    unsafe {
        // or could the needed size be derived from getauxval(AT_MINSIGSTKSZ) + SIGSTKSZ?
        let size = 1024 * 1024;
        let st_mem = libc::malloc(size);
        let new_sig_stack = libc::stack_t {
            ss_flags: 0,
            ss_size: size,
            ss_sp: st_mem,
        };
        let res = libc::sigaltstack(&new_sig_stack, std::ptr::null_mut());
        println!("sigaltstack res = {res:?}, stack addr = {:?}", st_mem);
        if res != 0 {
            panic!("ERROR: Failed to change sigaltstack size");
        }
    }

    // The kernel writes the permission bitmask through this pointer, so the
    // buffer must be mutable and passed as a mutable pointer.
    let mut bitmask: [usize; 1] = [0; 1];
    let mut status: usize = 0;
    unsafe {
        let maybe_status = syscall!(Sysno::arch_prctl, ARCH_GET_XCOMP_PERM, bitmask.as_mut_ptr());
        match maybe_status {
            Ok(s) => status = s,
            Err(_) => {
                println!("AMX not supported!");
                return false;
            }
        }
    }

    if (bitmask[0] & XFEATURE_MASK_XTILEDATA) != 0 {
        return true;
    }

    unsafe {
        let maybe_status = syscall!(Sysno::arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
        match maybe_status {
            Ok(s) => status = s,
            Err(err) => {
                println!("AMX Error: XFEATURE_XTILEDATA setup is failed, TMUL usage is not allowed! Error: {}",err);
                return false;
            }
        }
    }
    unsafe {
        status = syscall!(Sysno::arch_prctl, ARCH_GET_XCOMP_PERM, bitmask.as_mut_ptr())
            .expect("Error: ARCH_PRCTL syscall failed!");
    }
    if status != 0 || ((bitmask[0] & XFEATURE_MASK_XTILEDATA) == 0) {
        println!("AMX not supported!");
        return false;
    }

    // XFEATURE_XTILEDATA set successfully, TMUL usage is allowed
    true
}

fn main() {
    if initialize_amx_if_available() {
        println!("Success: AMX Enabled!");
        println!("Configuring TMUL tiles:");
        load_amx_config();
        println!("Running dummy dot products");
        unsafe {
            bench_amx();
        }
        println!("Success!");
    } else {
        println!("ERROR: Could not enable AMX!");
    }
}


Cargo.toml:

[package]
name = "test-enable-amx"
version = "0.1.0"
edition = "2021"

[dependencies]
syscalls = "0.6.7"
libc = "0.2"

Building:

RUSTFLAGS='-C target-cpu=sapphirerapids -C target-feature=+amx-int8,+amx-bf16,+amx-tile' cargo build

Output:

sigaltstack res = 0, stack addr = 0x7f4cc36ff010
Success: AMX Enabled!
Configuring TMUL tiles:
Running dummy dot products
INITIAL OUTPUT BUFFER: [0, 0, 0, 0, 0, 0]
FIRST ARG: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
SECOND ARG: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
FINAL OUTPUT BUFFER: [8, 8, 0, 0, 0, 0]
Success!
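
To sanity-check the numbers, here is a scalar model of what `tdpbuud` is understood to compute for these tile shapes (3 rows of 8 bytes for the first source, 2 rows of 8 bytes for the second, 3 rows of 2 dwords for the destination). It is a sketch based on the instruction's documented pseudocode, assumes densely packed row-major buffers, and the function name `tdpbuud_reference` is hypothetical.

```rust
/// Scalar model of TDPBUUD on packed row-major buffers:
/// c[m][n] += sum over k and i in 0..4 of a[m][4k + i] * b[k][4n + i],
/// where a is rows_m x dwords_k dwords and b is dwords_k x dwords_n dwords.
fn tdpbuud_reference(
    c: &mut [u32],
    a: &[u8],
    b: &[u8],
    rows_m: usize,
    dwords_k: usize,
    dwords_n: usize,
) {
    for m in 0..rows_m {
        for k in 0..dwords_k {
            for n in 0..dwords_n {
                let mut acc = c[m * dwords_n + n];
                // Dot product of the four unsigned bytes in each dword
                for i in 0..4 {
                    let x = a[(m * dwords_k + k) * 4 + i] as u32;
                    let y = b[(k * dwords_n + n) * 4 + i] as u32;
                    acc = acc.wrapping_add(x * y);
                }
                c[m * dwords_n + n] = acc;
            }
        }
    }
}

fn main() {
    // Same shapes and contents as the example: all-ones inputs, zeroed output
    let s1buf = [1u8; 24];
    let s2buf = [1u8; 16];
    let mut dbuf = [0u32; 6];
    tdpbuud_reference(&mut dbuf, &s1buf, &s2buf, 3, 2, 2);
    println!("REFERENCE OUTPUT BUFFER: {:?}", dbuf);
}
```

The first row of the model agrees with the `[8, 8]` at the start of the output above. The trailing zeros in the posted output are plausibly because the `tileloadd`/`tilestored` operands in the example carry no index (stride) register, so every tile row would map to the same address; that reading is an assumption, not something confirmed in this thread.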

@workingjubilee
Member

It is best to make the relevant auxval constant available in libc, since other Unix-y platforms are likely to introduce a matching constant (with either the same or a different value), as they sometimes do for these things when they cannot think of a better interface than the glibc one. So I have opened rust-lang/libc#3125

@ChangSeokBae

Alternatively, recent glibc versions (>=2.34) may work for you as they have non-constant (MIN)SIGSTKSZ:
https://sourceware.org/glibc/wiki/Release/2.34#Non-constant_MINSIGSTKSZ_and_SIGSTKSZ

I was told that it will eventually reference AT_MINSIGSTKSZ. At the moment, it calculates the size based on CPUID, IIRC.

@workingjubilee
Member

@ChangSeokBae That probably won't work for us. Rust interacts with C by knowing how to handle the platform's C ABI, so it can call functions, but it is totally blind to macros: rustc does not have the C preprocessor. That's why we redefine these constants in things like our libc crate.

@workingjubilee workingjubilee added the A-target-feature Area: Enabling/disabling target features like AVX, Neon, etc. label Mar 3, 2023
@workingjubilee
Member

I have opened:

workingjubilee added a commit to workingjubilee/rustc that referenced this issue Mar 7, 2024
…igstksz, r=m-ou-se

Dynamically size sigaltstk in std

On modern Linux with Intel AMX and 1KiB matrices,
Arm SVE with potentially 2KiB vectors,
and RISCV Vectors with up to 16KiB vectors,
we must handle dynamic signal stack sizes.

We can do so unconditionally by using getauxval,
but assuming it may return 0 as an answer,
thus falling back to the old constant if needed.

Fixes rust-lang#107795
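
The fallback described in the commit message can be sketched as a pure function. This is an illustration of the stated logic, not the merged code; the name `sigstack_size` and the 8192-byte legacy value (the 8KB `SIGSTKSZ` mentioned earlier in the thread) are assumptions here.

```rust
// Legacy fixed size, assumed to be the 8 KiB SIGSTKSZ discussed in the thread.
const LEGACY_SIGSTKSZ: usize = 8192;

/// Picks the alternate signal stack size from the value the kernel reports.
/// `reported_min` is what getauxval(AT_MINSIGSTKSZ) returned; 0 means the
/// entry is unavailable, in which case the legacy constant is used.
fn sigstack_size(reported_min: usize) -> usize {
    LEGACY_SIGSTKSZ.max(reported_min)
}

fn main() {
    // No auxv entry: fall back to the old constant
    println!("unreported      -> {}", sigstack_size(0));
    // Small reported minimum: the old constant is still large enough
    println!("reported 2048   -> {}", sigstack_size(2048));
    // Large reported minimum (for example with big vector or tile state)
    println!("reported 16384  -> {}", sigstack_size(16384));
}
```

Taking the maximum means the size never shrinks below what existing programs already got, while still growing on machines that report larger state.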
@bors bors closed this as completed in b81678e Mar 10, 2024
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Mar 10, 2024
Rollup merge of rust-lang#113525 - workingjubilee:handle-dynamic-minsigstksz, r=m-ou-se

github-actions bot pushed a commit to rust-lang/miri that referenced this issue Mar 12, 2024