-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
addition/subtraction acceleration via inline assembly #314
base: master
Are you sure you want to change the base?
Conversation
Can you share your benchmark code somewhere? I'd like to compare results with various Intel and AMD machines, and also when compared to In general, I would rather see Rust-native changes that coax LLVM toward the codegen we want, if that's possible, because that's more likely to play well with Also, I'm pretty sure this needs more |
native rust concernI've tried to play around in godbolt, but I could not generate the assembly I want, which is quite brittle: as soon as any instruction touches the carry flag ( x32thanks, I did not know benchmark codeMy benchmark code is below, it is not particularly fancy: use num_bigint::BigUint;
use rand::distributions::Standard;
use rand::prelude::Distribution;
use rand::{thread_rng, Rng};
use criterion::{black_box, criterion_group, criterion_main, Criterion};
fn gen_n_random_values<T>(n: usize) -> Vec<T>
where
Standard: Distribution<T>,
{
let mut ret = Vec::<T>::with_capacity(n);
for _ in 0..n {
ret.push(thread_rng().gen::<T>());
}
ret
}
pub fn add<const N: usize>(c: &mut Criterion) {
let mut name = "num-bigint add ".to_string();
name.push_str(&N.to_string());
println!("nb digits: {:?}", (N / 32) * 32);
let n1 = BigUint::new(gen_n_random_values::<u32>(N / 32));
let n2 = BigUint::new(gen_n_random_values::<u32>(N / 32));
c.bench_function(name.as_str(), |b| b.iter(|| black_box(&n1 + &n2)));
}
criterion_group!(
num_bigint_add,
add<10>,
add<30>,
add<100>,
add<300>,
add<1_000>,
add<3_000>,
add<10_000>,
add<30_000>,
add<100_000>,
add<300_000>,
add<1_000_000>,
add<3_000_000>,
add<10_000_000>,
);
pub fn sub<const N: usize>(c: &mut Criterion) {
let mut name = "num-bigint sub ".to_string();
name.push_str(&N.to_string());
let n1 = BigUint::new(gen_n_random_values::<u32>(N / 32 + 1));
let n2 = BigUint::new(gen_n_random_values::<u32>(N / 32));
c.bench_function(name.as_str(), |b| b.iter(|| black_box(&n1 - &n2)));
}
criterion_group!(
biguint_sub,
sub<1_000>,
sub<3_000>,
sub<10_000>,
sub<30_000>,
sub<100_000>,
sub<300_000>,
sub<1_000_000>,
sub<3_000_000>,
sub<10_000_000>,
);
pub fn mul<const N: usize>(c: &mut Criterion) {
let mut name = "num-bigint mul ".to_string();
name.push_str(&N.to_string());
println!("nb digits: {:?}", (N / 32) * 32);
let n1 = BigUint::new(gen_n_random_values::<u32>(N / 32));
let n2 = BigUint::new(gen_n_random_values::<u32>(N / 32));
c.bench_function(name.as_str(), |b| b.iter(|| black_box(&n1 * &n2)));
}
criterion_group!(
biguint_mul,
mul<10>,
mul<30>,
mul<100>,
mul<300>,
mul<1_000>,
mul<3_000>,
mul<10_000>,
mul<30_000>,
);
criterion_main!(num_bigint_add, biguint_sub, biguint_mul); |
I've pushed the fix for 32bit pointer width targets. |
By x32, I mean the odd target of Thanks for your benchmark code - I'll take a closer look later. |
Thank you for the precision ! It did compile fine for Thank you for taking the time to look at this ! |
Have you had the time to look at benchmarks ? I'm very curious about the behavior on other machines ! |
For reference, this equivalent experimental implementation (without the All in all, the current PR still has the best performance by far of the approaches I tried. #[cfg(target_arch = "x86_64")]
cfg_64!(
/// Performs a part of the addition. Returns a tuple containing the carry state
/// and the number of integers that were added
///
/// By using as many registers as possible, we treat digits 5 by 5
unsafe fn schoolbook_add_assign_x86_64(
lhs: &mut [u64],
rhs: &[u64],
mut size: usize,
) -> (bool, usize) {
if size < 5 {
return (false, 0);
}
size -= 5;
let mut c = 0u8;
let mut idx = 0;
while idx < size {
c = unsafe {
arch::_addcarry_u64(
c,
*lhs.get_unchecked(idx),
*rhs.get_unchecked(idx),
lhs.get_unchecked_mut(idx),
)
};
idx += 1;
c = unsafe {
arch::_addcarry_u64(
c,
*lhs.get_unchecked(idx),
*rhs.get_unchecked(idx),
lhs.get_unchecked_mut(idx),
)
};
idx += 1;
c = unsafe {
arch::_addcarry_u64(
c,
*lhs.get_unchecked(idx),
*rhs.get_unchecked(idx),
lhs.get_unchecked_mut(idx),
)
};
idx += 1;
c = unsafe {
arch::_addcarry_u64(
c,
*lhs.get_unchecked(idx),
*rhs.get_unchecked(idx),
lhs.get_unchecked_mut(idx),
)
};
idx += 1;
c = unsafe {
arch::_addcarry_u64(
c,
*lhs.get_unchecked(idx),
*rhs.get_unchecked(idx),
lhs.get_unchecked_mut(idx),
)
};
idx += 1;
}
(c > 0, idx)
}
); Assembly generated: schoolbook_add_assign_x86_64:
cmp r8, 5
jb .LBB1_1
add r8, -5
je .LBB1_1
mov rax, rdx
mov edx, 4
xor ecx, ecx
.LBB1_4:
mov rsi, rdx
mov rdx, qword ptr [rax + 8*rdx - 32]
add cl, -1
adc qword ptr [rdi + 8*rsi - 32], rdx
mov rcx, qword ptr [rax + 8*rsi - 24]
adc qword ptr [rdi + 8*rsi - 24], rcx
mov rcx, qword ptr [rax + 8*rsi - 16]
adc qword ptr [rdi + 8*rsi - 16], rcx
mov rcx, qword ptr [rax + 8*rsi - 8]
adc qword ptr [rdi + 8*rsi - 8], rcx
mov rcx, qword ptr [rax + 8*rsi]
adc qword ptr [rdi + 8*rsi], rcx
setb cl
lea rdx, [rsi + 5]
inc rsi
cmp rsi, r8
jb .LBB1_4
test cl, cl
setne al
add rdx, -4
ret
.LBB1_1:
xor edx, edx
xor eax, eax
ret In particular, I cannot make the assembly use more registers in a single batch, so the additions are not pipelined together. On the other end, I think was wrong thinking overwriting the CF flag is a branching issue: it is reset via the |
This PR has been open for quite some time. This could be closed if nothing is ever gonna happen ? In any case, this proposition is implemented and maintained here, an experimental repo with much the same features as |
AFAICS, we already get the same kind of unrolled optimization as your non-asm version, only it chooses 4 at a time instead of your 5, and when I use I want to keep this open to investigate further though. There's clearly some performance difference here, but I'm not sure of the true reason behind it.
I'm not sure why we need/want to force more named registers. Shouldn't the CPU's out-of-order execution with register renaming already handle this efficiently? There's a data-dependency chain on the carry flag anyway, but the stores should still work with the old version of the register even while the next one is computed. |
I understand why the PR is still open. The benchmark should definitely run on the same integers! The version I previously sent did not take that into account yet. Tbh I'm not entirely convinced about the explanation for the performance difference either. I guess it must be related to the pipelining of the additions, but I'm not confident enough to be sure of any explanations |
During the addition/subtraction loops, there is performance to be gained by specifying the exact assembly instructions. The performance gain start to be felt after 1,000bit integers, and are maximal around 10,000bits, where we are near 2x performance.
Documentation for assembly instructions can be found here
Added assembly
Principle
The main idea is to avoid any instruction that are modifying the carry flags, except for
adc
andsbb
of course, so that the carry is propagated through the whole loop. This is doable if we use a decreasing index for the loop counter and ajnz
instruction. The indices we use have to be updated usinginc
anddec
instructions (those instructions aren't modifying the carry flag).Batching
To avoid overhead by looping digits one by one, we fill the 16 registers as much as possible. This enables us to treat digits 5 by 5 in one step (enabling efficient CPU pipelining).
A consequence of this is that a number of digits is still not added after the inline assembly code is run (the number is strictly less than 5). So we keep the existing code to handle the remaining digits.
Safety
Obviously this is very much unsafe. On my personal project, where I want to let people the option to rely on
safe
Rust, I put these functions behind a feature gate. If necessary, I can take a similar approach here.Performance
I've measured performance by using the
criterion
crate, and adding 2 BigUint of different sizes. The results are below. On some scales, this gives close a 2x gain ! To me the performance gain is clear enough to be non-ambiguous, though I'm not a benchmarking expert, and I'm sure a finer evaluation could be interesting.Style
For easy review, the 2 added functions are in the files
addition
andsubtraction
, but I'd argue in the end, it'd be better to put them in a dedicated sub-module, they look far too different.Also I'm notoriously average at naming things, so suggestions are welcome ^^