Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New insn: Carry-less multiply (CLMUL) #1388

Open
smilingthax opened this issue Nov 2, 2020 · 2 comments
Open

New insn: Carry-less multiply (CLMUL) #1388

smilingthax opened this issue Nov 2, 2020 · 2 comments
Labels

Comments

@smilingthax
Copy link

For implementing efficient algorithms, efficient bit-level operations are often important. While there are many bit-twiddling hacks for certain operations, others are not efficiently reducible to more common operations, while being easily implemented in hardware. WebAssembly already has some instructions of that kind, e.g. POPCNT.

One such operation, which has many uses as a fundamental building block, is the carry-less multiplication (CLMUL), which became part of many CPU instruction sets for that very reason. Its uses range from efficient, table-less CRC-computations, universal hash functions to encryption routines (AES!), i.e. when calculations in finite fields (e.g. GF(2^n)) are involved.

I propose to add this as a i32 and i64 variant.
A consideration for multiplication also relevant for CLMUL is that the result of (e.g) i32 x i32 will use up to 64 bits (i.e. i64, or two i32).

There might be other interesting bit-ops that could be included in a proposal, e.g. PEXT / PDEP (parallel bits extract / deposit [aka. bit compress/expand, bit scatter/gather, bit pack/unpack]) seems to be very potent (used e.g. in chess engines: Bitboards), but at least AMD CPUs seem to execute them very slow compared to Intel, and software emulations are also expensive, thus my current focus is on the (uncontroversial) CLMUL only.

@comex
Copy link

comex commented Nov 2, 2020

What would a software emulation of this instruction look like for CPUs that don't have native support?

@smilingthax
Copy link
Author

A partial implementation in JS:

function clmul32(a, b) {  // 32 x 32 -> 32  (does not compute high part, JS bitops are only 32 bit!)
  let ret = 0;
  for (let i = 0; i < 32; i++) {
    if (a << i) {
      ret ^= (b & (1 << i));
    }
  }
  return ret;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants