Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Popcount function is under performing #59

Open
unzvfu opened this issue Feb 8, 2018 · 3 comments
Open

Popcount function is under performing #59

unzvfu opened this issue Feb 8, 2018 · 3 comments

Comments

@unzvfu
Copy link

unzvfu commented Feb 8, 2018

In branch hlaw-fix-issue56 the raw popcount function performance is reported to be about 12 GiB/s (on my machine), however (essentially) the same code performs at 18 GiB/s when called from a basic wrapper function thus:

// -*- compile-command: "g++ -ggdb -Wall -Wextra -std=c++11 -O3 -mssse3 -mpopcnt -fvisibility=hidden -o bench bench.cc dice_one_against_many.cpp" -*-

#include <iostream>
#include <stdint.h>
using namespace std;

double popcount_1024_array(const char *many, int n, uint32_t *counts_many);

int main(int argc, char *argv[]) {
    static constexpr int KEYWORDS = 16;
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);
    uint64_t *buf = new uint64_t[KEYWORDS * n];
    uint32_t *counts = new uint32_t[n];
    long nbytes = n*KEYWORDS*sizeof(uint64_t);
    double t = popcount_1024_array(reinterpret_cast<const char *>(buf), n, counts);
    cout << "Time: " << t << "ms => " << (n/1e6) * (1e3/t) << " 1e6 popc/s => "
         <<  (nbytes / (double)(1 << 20)) * (1e3/t) << " MiB/s" << endl;
    delete[] counts;
    delete[] buf;
    return 0;
}

Aha! Link: https://csiro.aha.io/features/ANONLINK-71

@unzvfu unzvfu added this to the Sprint 2018-02-12 milestone Feb 8, 2018
@unzvfu unzvfu self-assigned this Feb 8, 2018
@unzvfu
Copy link
Author

unzvfu commented Mar 9, 2018

So, it turns out that at least part of the disparity between the observed throughput of (about 12-14 GiB/s) and the actual (expected) bandwidth (about 20 GiB/s on a 2.4 GHz CPU) of popcounting is explained by the difference between doing a "lots of popcounts on small arrays" (the first case) and doing "fewer popcounts on larger arrays" (the second case). This is presumably because the "output pipeline" is stalling/stumbling at the point where the popcount is calculated and written back to memory, which happens much more often in the first case compared to the second.

There is no obvious fix to this. Further investigation would be necessary to find a solution to this (which would almost certainly be a fiddly and low-level one).

@hardbyte
Copy link
Collaborator

I'm having a look at switching to using https://github.com/kimwalisch/libpopcnt instead of hand rolling our own - this should also help us support other platforms and take advantage of newer/different instruction sets too.

https://github.com/data61/anonlink/tree/windows-support

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants