Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using intrinsics to optimize counting HyperLogLog trailing bits #846

Merged
merged 18 commits into from
Aug 28, 2024

Conversation

mapleFU
Copy link
Contributor

@mapleFU mapleFU commented Jul 30, 2024

Godbolt link: https://godbolt.org/z/3YPvxsr5s

__builtin_ctz would generate shorter code than hand-written loop.

@mapleFU mapleFU force-pushed the using-ctz-for-hll branch from 61e0bc4 to 2e3178b Compare July 30, 2024 16:25
Copy link

codecov bot commented Jul 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.44%. Comparing base (08aaeea) to head (ea6505c).
Report is 22 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #846      +/-   ##
============================================
- Coverage     70.59%   70.44%   -0.16%     
============================================
  Files           112      114       +2     
  Lines         61512    61725     +213     
============================================
+ Hits          43427    43481      +54     
- Misses        18085    18244     +159     
Files Coverage Δ
src/hyperloglog.c 91.17% <100.00%> (-0.05%) ⬇️
src/intrinsics.h 100.00% <100.00%> (ø)

... and 26 files with indirect coverage changes

@mapleFU mapleFU force-pushed the using-ctz-for-hll branch from 311a9cd to 4ea2bd3 Compare July 31, 2024 05:04
@mapleFU mapleFU changed the title Minor: Using __builtin_ctz to optimize counting HyperLogLog trailing bits Using __builtin_ctz to optimize counting HyperLogLog trailing bits Jul 31, 2024
Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, can you try to do some benchmarks? I guess it may not be measurable, or the improvement is negligible.

Minor adjustments.

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 2, 2024
fix format and trigger the extra ci

Signed-off-by: Binbin <[email protected]>
@mapleFU
Copy link
Contributor Author

mapleFU commented Aug 2, 2024

I don't have end-to-end benchmark

#include <random>
#include <cstdint>
#include <bit>

constexpr int n = 1000;

static void LoopBased(benchmark::State& state) {
    std::mt19937 rng(0);
    std::uniform_int_distribution<int64_t> uniform_dist(1, 1125899906842624);
    std::vector<int64_t> value;
    for (int i = 0; i < n; ++i) {
       value.push_back(uniform_dist(rng));
    }
    int cnt = 0;
    for (auto _ : state) {
       int count = 1;
       int64_t hash = value[cnt % n];
       int bit = 1;
       while ((hash & bit) == 0) {
         count++;
         bit <<= 1;
       }
       ::benchmark::DoNotOptimize(count);
       ++cnt;
    }
}
// Register the function as a benchmark
BENCHMARK(LoopBased);

static void CtzBased(benchmark::State& state) {
    std::mt19937 rng(0);
    std::uniform_int_distribution<int64_t> uniform_dist(1, 1125899906842624);
    std::vector<int64_t> value;
    for (int i = 0; i < n; ++i) {
       value.push_back(uniform_dist(rng));
    }
    int cnt = 0;
    for (auto _ : state) {
       int count = 1;
       int64_t hash = value[cnt % n];
       count += std::countl_zero(static_cast<uint64_t>(hash));
       ::benchmark::DoNotOptimize(count);
       ++cnt;
    }
}
BENCHMARK(CtzBased);

Quickbench has some error message and I run it on my x86 3800x cpu with gcc12 and Release -O2

-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
LoopBased        1.36 ns         1.36 ns    474427671
CtzBased        0.479 ns        0.479 ns   1000000000

@mapleFU
Copy link
Contributor Author

mapleFU commented Aug 2, 2024

Also this is quickbench in x86 platform: https://quick-bench.com/q/7dlzmaBJDz9Xnfzhsk-1Iw7TP0I
I believe the loop based impl is slower

@enjoy-binbin
Copy link
Member

LGTM, @zuiderkwast @PingXie any other ideas before the merge?

@madolson
Copy link
Member

madolson commented Aug 7, 2024

It might make sense to have a #define for __builtin_ctzll incase it's not supported in the compiler.

@mapleFU
Copy link
Contributor Author

mapleFU commented Aug 9, 2024

It might make sense to have a #define for __builtin_ctzll incase it's not supported in the compiler.

#if defined(__clang__) || defined(__GNUC__)
  return static_cast<int>(__builtin_clzll(value));
#elif defined(_MSC_VER)
  unsigned long index;
  i_BitScanReverse64(&index, value);
  return 63 - static_cast<int>(index);
#else
  int bitpos = 0;
  while (value != 0) {
    value >>= 1;
    ++bitpos;
  }
  return 64 - bitpos;
#endif

I do this because server uses __builtin_clz ( see:

valkey/src/dict.c

Line 1624 in 7424620

return 8 * sizeof(long) - __builtin_clzl(size - 1);
,
int clz = mem > 0 ? __builtin_clzl(mem) : size_in_bits;
)

I also find some code checks that:

return __builtin_clzll(value); /* smallest power of 2 containing value */

So I don't know the idiom here :-(

@madolson do you think something like this is ok? ( borrowed from https://github.com/apache/arrow/blob/b33f040640c7ccb3e6a8406e4d3158608c597025/cpp/src/arrow/util/bit_util.h#L198 )

@PingXie
Copy link
Member

PingXie commented Aug 12, 2024

Yeah, using __builtin_ctz is good.

So I don't know the idiom here :-(

We need an indirection in config.h. Here is an example

158 #if __GNUC__ >= 3
159 #define likely(x) __builtin_expect(!!(x), 1)
160 #define unlikely(x) __builtin_expect(!!(x), 0)
161 #else
162 #define likely(x) (x)
163 #define unlikely(x) (x)
164 #endif

@mapleFU mapleFU force-pushed the using-ctz-for-hll branch from 7bb0823 to fd6f283 Compare August 12, 2024 04:52
Signed-off-by: mwish <[email protected]>
@mapleFU
Copy link
Contributor Author

mapleFU commented Aug 12, 2024

@PingXie I've add an ad-hoc counting_trailing_bits. I didn't add it in config.h because this doesn't like likely, which is an empty operation when compiler doesn't have the hint. Instead, it should fall into loop here.

@mapleFU mapleFU changed the title Using __builtin_ctz to optimize counting HyperLogLog trailing bits Using intrinsics to optimize counting HyperLogLog trailing bits Aug 12, 2024
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@PingXie
Copy link
Member

PingXie commented Aug 13, 2024

@PingXie I've add an ad-hoc counting_trailing_bits. I didn't add it in config.h because this doesn't like likely, which is an empty operation when compiler doesn't have the hint. Instead, it should fall into loop here.

You should be able to move this function to config.h. Make sure you keep inline so that you won't upset the linker with multiple copies of the same function, which also hurts performance, even if the linker doesn't complain (as in the MSVC case).

I would highly recommend moving any intrinsic function wrappers or their "emulation" to config.h. This maximizes their re-usability.

@zuiderkwast
Copy link
Contributor

@zuiderkwast do you have a strong opinion against intrinsic.c?

No, I don't mind.

@mapleFU
Copy link
Contributor Author

mapleFU commented Aug 17, 2024

So, from the discussion here, I need add a intrinsics.h and put CountingTrailingZeros in intrinsics.h

My remaining questions:

  1. Like Using intrinsics to optimize counting HyperLogLog trailing bits #846 (comment) , should I put other intrinsics here?

@mapleFU mapleFU force-pushed the using-ctz-for-hll branch from 2500f10 to 42849b0 Compare August 21, 2024 13:48
@mapleFU mapleFU force-pushed the using-ctz-for-hll branch from 3755406 to 7f13f22 Compare August 21, 2024 13:50
@mapleFU
Copy link
Contributor Author

mapleFU commented Aug 21, 2024

@PingXie I've added the intrinsics.h now

Signed-off-by: mwish <[email protected]>
Signed-off-by: mwish <[email protected]>
Copy link
Member

@PingXie PingXie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mapleFU!

@mapleFU mapleFU requested a review from PingXie August 22, 2024 05:46
@mapleFU mapleFU force-pushed the using-ctz-for-hll branch from 6507ddc to df3ebca Compare August 22, 2024 05:47
Signed-off-by: mwish <[email protected]>
@mapleFU mapleFU force-pushed the using-ctz-for-hll branch from df3ebca to 82a8038 Compare August 22, 2024 05:50
@mapleFU
Copy link
Contributor Author

mapleFU commented Aug 23, 2024

@PingXie Comment resolved, mind take a look again?

Copy link
Member

@PingXie PingXie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good to me. I do have a question about one of your comments but I don't think it would affect the correctness. Thanks for your patience @mapleFU!

@mapleFU
Copy link
Contributor Author

mapleFU commented Aug 27, 2024

@enjoy-binbin @madolson could we move forward and check in this? Or there're more things I need to do?

@madolson
Copy link
Member

madolson commented Aug 27, 2024

I think @PingXie will merge it, although there are a bunch of test failures I am looking through. (Seem like some old issues, let me re-trigger them)

Signed-off-by: Madelyn Olson <[email protected]>
@PingXie PingXie merged commit 744b13e into valkey-io:unstable Aug 28, 2024
54 of 56 checks passed
@PingXie
Copy link
Member

PingXie commented Aug 28, 2024

There are still some test failures but I don't think they are related.

@mapleFU
Copy link
Contributor Author

mapleFU commented Aug 28, 2024

Thanks all!

@mapleFU mapleFU deleted the using-ctz-for-hll branch August 28, 2024 07:54
madolson added a commit that referenced this pull request Sep 2, 2024
Godbolt link: https://godbolt.org/z/3YPvxsr5s

__builtin_ctz would generate shorter code than hand-written loop.

---------

Signed-off-by: mwish <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Binbin <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
madolson added a commit that referenced this pull request Sep 3, 2024
Godbolt link: https://godbolt.org/z/3YPvxsr5s

__builtin_ctz would generate shorter code than hand-written loop.

---------

Signed-off-by: mwish <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Binbin <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants