
General discussion about XXH3 #175

Closed
easyaspi314 opened this issue Mar 13, 2019 · 268 comments

Comments

@easyaspi314 (Contributor)

This is going to be a tracker for discussion, questions, feedback, and analyses about the new XXH3 hashes, found in the xxh3 branch.

@Cyan4973's comments (from xxhash.h):

XXH3 is a new hash algorithm, featuring vastly improved speed performance for both small and large inputs.

A full speed analysis will be published, it requires a lot more space than this comment can handle.

In general, expect XXH3 to run about 2x faster on large inputs, and more than 3x faster on small ones, though the exact difference depends on the platform.

The algorithm is portable: it generates the same hash on all platforms. It benefits greatly from vectorization units, but does not require them.

XXH3 offers 2 variants, _64bits and _128bits.
The first 64-bit field of the _128bits variant is the same as the _64bits result. However, if only 64 bits are needed, prefer calling the _64bits variant: it reduces the amount of mixing, resulting in faster speed on small inputs.

The XXH3 algorithm is still considered experimental. It's possible to use it for ephemeral data, but avoid storing long-term values for later re-use. While labelled experimental, the produced result can still change between versions.

The API currently supports one-shot hashing only. The full version will include streaming capability and canonical representation. Long-term optional features may include custom secret keys, and secret key generation.

There are still a number of open questions that the community can influence during the experimental period. I'm trying to list a few of them below, though don't consider this list complete.

  • 128-bit output type : currently defined as a structure of 2 64-bit fields.
    • That's because 128-bit values do not exist in the C standard.
    • Note that this means that, at the byte level, the result differs depending on endianness.
    • However, at the field level, results are identical on all platforms.
    • The canonical representation will solve the issue of identical byte-level representation across platforms, which is necessary for serialization.
    • Would there be a better representation for a 128-bit hash result ?
    • Are the names of the inner 64-bit fields important ? Should they be changed ?
  • Canonical representation : for the 64-bit variant, the canonical representation is the same as XXH64()'s (aka big-endian).
    • What should it be for the 128-bit variant ?
    • Since it's no longer a scalar value, a big-endian representation is no longer an obvious choice.
    • One possibility : represent it as the concatenation of two 64-bit canonical representations (aka 2x big-endian).
    • Another one : represent it in the same order as the natural order in the struct on little-endian platforms.
    • Less consistent with the existing convention for XXH32/XXH64, but may be more natural for little-endian platforms.
  • Associated functions for 128-bit hash : simple things, such as checking if 2 hashes are equal, become more difficult with a struct.
    • Granted, it's not terribly difficult to create a comparator, but it's still a workload.
    • Would it be beneficial to declare and define a comparator function for XXH128_hash_t ?
    • Are there other operations on XXH128_hash_t which would be desirable ?
  • Variant compatibility : The first 64-bit field of the _128bits variant is the same as the result of _64bits.
    • This is not a compulsory behavior. It just felt that it "wouldn't hurt", and might even help in some (unidentified) cases.
    • But it might influence the design of XXH128_hash_t, in ways which may block other possibilities.
    • Good idea, bad idea ?
  • Seed type for 128-bit variant : currently, it's a single 64-bit value, like the 64-bit variant.
    • It could be argued that it's more logical to offer a 128-bit seed input parameter for a 128-bit hash.
    • Although it's also more difficult to use, since it requires declaring and passing a structure instead of a value.
    • It would either replace the current choice, or add a new one.
    • Farmhash, for example, offers both variants (the 128-bit seed variant is called doubleSeed).
    • If both 64-bit and 128-bit seeds are possible, which variant should be called XXH128?
  • Result for len==0 : Currently, the result of hashing a zero-length input is the seed.
    • This mimics the behavior of a crc : in that case, the seed is effectively an accumulator, so it's not updated if the input is empty.
    • Consequently, by default, when no seed is specified, it returns zero. That part seems okay (it used to be a request for XXH32/XXH64).
    • But is it still fine to return the seed when the seed is non-zero ?
    • Are there use cases which would depend on this behavior, or would prefer a mixing of the seed ?
@easyaspi314 (Contributor, Author)

Canonical representation : for the 64-bit variant, canonical representation is the same as XXH64() (aka big-endian).

Nothing is wrong with big-endian output. I suggest that for XXH3_64b, it look like this (big endian):

0123456789abcdef-3  file.c

and XXH3_128b, this:

00112233445566778899aabbccddeeff-3  file.c

I say 2x big endian. That would be easiest to parse (in the case of being in a text file),

fscanf(hashfile, "%16llx%16llx-3", &hashLo, &hashHi);

and the easiest to output:

printf("%016llx%016llx-3\n", hashLo, hashHi);

Similar to how my XXH32a's canonical representation ended in -a, I think it makes sense for XXH3 to end in -3.

@easyaspi314 (Contributor, Author)

  • Associated functions for 128-bit hash : simple things, such as checking if 2 hashes are equal, become more difficult with struct.
    • Granted, it's not terribly difficult to create a comparator, but it's still a workload.
/* pointer or value? also, should we perhaps give a saturated subtraction for sorting
 * functions that check values? */
XXH_PUBLIC_API int
XXH128_hash_cmp(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
    if (a->ll2 == b->ll2) {
        if (a->ll1 == b->ll1) {
            return 0;
        } else if (a->ll1 < b->ll1) {
            return -1;
        } else {
            return 1;
        }
    }
    if (a->ll2 < b->ll2) {
        return -1;
    } else {
        return 1;
    }
}

/* xxhash.h */
#ifdef __cplusplus
static inline bool operator==(const XXH128_hash_t &a, const XXH128_hash_t &b) {
    return XXH128_hash_cmp(&a, &b) == 0;
}
static inline bool operator!=(const XXH128_hash_t &a, const XXH128_hash_t &b) {
    return XXH128_hash_cmp(&a, &b) != 0;
}
static inline bool operator>(const XXH128_hash_t &a, const XXH128_hash_t &b) {
    return XXH128_hash_cmp(&a, &b) > 0;
}
static inline bool operator<(const XXH128_hash_t &a, const XXH128_hash_t &b) {
    return XXH128_hash_cmp(&a, &b) < 0;
}
static inline bool operator<=(const XXH128_hash_t &a, const XXH128_hash_t &b) {
    return XXH128_hash_cmp(&a, &b) <= 0;
}
static inline bool operator>=(const XXH128_hash_t &a, const XXH128_hash_t &b) {
    return XXH128_hash_cmp(&a, &b) >= 0;
}
#endif /* __cplusplus */

@Cyan4973 (Owner)

should we perhaps give a saturated subtraction for sorting

That's a great idea !

pointer or value?

I prefer by value when the structure is of limited size.
That's the case here : XXH128_hash_t is just 16 bytes long.

@42Bastian

Regarding the length == 0 return: I'd go for returning the seed (even if it is 0) instead of always returning 0. If one wants to differentiate hashes of the same input data by providing a seed, one still gets different hashes even when length == 0.

@42Bastian

Pointer or value
I'd vote for passing the XXH128_hash_t structure by pointer. Passing by value implies copying on 32-bit systems. Even on 64-bit systems, passing 2 structs might not be done in registers.

BTW: Why "ll1" and "ll2"? IMHO they should be suffixed with "h" and "l".

@42Bastian commented Mar 14, 2019

Regarding static U64 XXH3_mul128(U64 ll1, U64 ll2):
The Aarch32 code seems over-complicated for returning only the lower 64 bits.
GCC produces elegant and, as far as I can tell, correct code:
mul r3, r0, r3
mla r3, r2, r1, r3
umull r0, r1, r0, r2
add r1, r3, r1

@easyaspi314 (Contributor, Author) commented Mar 14, 2019

It doesn't return the lower 64 bits, it adds the lower bits to the higher bits. This is pretty much the best way of doing it.

You can see it in the GCC extension code:

    __uint128_t lll = (__uint128_t)ll1 * ll2;
    return (U64)lll + (U64)(lll >> 64);

or if we break it up:

    __uint128_t ll1_u128 = (__uint128_t) ll1; // cast to __uint128_t
    __uint128_t ll2_u128 = (__uint128_t) ll2; // promoted
    __uint128_t lll = ll1_u128 * ll2_u128; // long multiply
    U64 lll_hi = (U64) (lll >> 64); // high bits
    U64 lll_lo = (U64) (lll & 0xFFFFFFFFFFFFFFFFULL); // low bits
    return lll_hi + lll_lo; // 64-bit add together

In x86_64 assembly, we get this:

_mul128:
        mov     rax, rsi
        mul     rdi    # rdx:rax now holds the 128-bit product
        add     rax, rdx
        ret

The reason I use inline assembly is because neither GCC nor Clang will emit the right code and prefer to spew out nonsense, and the __uint128_t type is not available on 32-bit.

The code is quite efficient:

        umull   r12, lr, r0, r2
        mov     r5, #0
        mov     r4, #0
        umaal   r5, lr, r1, r2
        umaal   r5, r4, r0, r3
        umaal   lr, r4, r1, r3
        adds    r0, lr, r12
        adc     r1, r4, r5

TBH XXH3_mul128 should be renamed to something like XXH3_foldedMult64to128 to match XXH_mult32to64 to make it clear that this is a "folding" multiply and not just a superoptimized 64->128-bit multiply.

@easyaspi314 (Contributor, Author) commented Mar 14, 2019

I also kinda agree about the pointer thing. On 32-bit, those two structs will take up 8 registers. ARM for example can only pass up to 4 in a function call before it has to push to the stack. And x86 only has 7 registers.

@easyaspi314 (Contributor, Author) commented Mar 14, 2019

Actually, I benchmarked it via qsort, and it seems like passing by value is in fact better on both 32-bit and 64-bit (for x86; haven't tested ARM yet):

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t ll1;
    uint64_t ll2;
} XXH128_hash_t;

__attribute__((__noinline__)) int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
    if (a->ll2 == b->ll2) {
        if (a->ll1 == b->ll1) {
            return 0;
        } else if (a->ll1 < b->ll1) {
            return -1;
        } else {
            return 1;
        }
    }
    if (a->ll2 < b->ll2) {
        return -1;
    } else {
        return 1;
    }
}

__attribute__((__noinline__)) int
XXH128_hash_cmp(const XXH128_hash_t a, const XXH128_hash_t b)
{
    if (a.ll2 == b.ll2) {
        if (a.ll1 == b.ll1) {
            return 0;
        } else if (a.ll1 < b.ll1) {
            return -1;
        } else {
            return 1;
        }
    }
    if (a.ll2 < b.ll2) {
        return -1;
    } else {
        return 1;
    }
}

int XXH128_hash_cmp_ptr_qsort(const void *start, const void *end)
{
    return XXH128_hash_cmp_ptr((const XXH128_hash_t *)start, (const XXH128_hash_t *)end);
}

int XXH128_hash_cmp_qsort(const void *start, const void *end)
{
    return XXH128_hash_cmp(*(const XXH128_hash_t *)start, *(const XXH128_hash_t *)end);
}

#define ROUNDS 1000000
static XXH128_hash_t arr[ROUNDS];
static XXH128_hash_t *arr_p[ROUNDS];

int main()
{
    srand(time(NULL));
    printf("arch: %zu-bit\n", sizeof(void *) * 8);
    for (int i = 0; i < ROUNDS; i++) {
        arr[i].ll1 = rand() | (uint64_t)rand() << 32;
        arr[i].ll2 = rand() | (uint64_t)rand() << 32;
        arr_p[i] = malloc(sizeof(XXH128_hash_t));
        memcpy(arr_p[i], &arr[i], sizeof(XXH128_hash_t));
    }
    double start, end;
    start = (double)clock();
    qsort(arr, ROUNDS, sizeof(XXH128_hash_t), XXH128_hash_cmp_qsort);
    end = (double)clock();
    printf("pass by value: %lf\n", (end - start) / CLOCKS_PER_SEC);

    start = (double)clock();
    qsort(arr_p, ROUNDS, sizeof(XXH128_hash_t *), XXH128_hash_cmp_ptr_qsort);
    end = (double)clock();
    printf("pass by pointer: %lf\n", (end - start) / CLOCKS_PER_SEC);
}
arch: 32-bit
pass by value: 0.352646
pass by pointer: 0.526951

arch: 64-bit
pass by value: 0.198776
pass by pointer: 0.449999

(note: compiled with clang 7.0.1 on a 2.0 GHz 2nd Gen Core i7, -O3)

GCC 8.3:

arch: 32-bit
pass by value: 0.442356
pass by pointer: 0.547503

arch: 64-bit
pass by value: 0.175436
pass by pointer: 0.360158

@42Bastian commented Mar 14, 2019

On ARM64 there are enough registers in the ABI to pass it by value (x0..x7), but for 32-bit ARM the pointer version is better as there are only 4 argument registers.
So it highly depends on the CPU and the ABI.
Edit: I checked, the x64 GCC ABI uses four registers, so just enough to pass the two structs by value.

@easyaspi314 (Contributor, Author) commented Mar 14, 2019

Nevermind that, this is the way to do it:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t ll1;
    uint64_t ll2;
} XXH128_hash_t;

__attribute__((__noinline__)) int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
    if (a->ll2 == b->ll2) {
        if (a->ll1 == b->ll1) {
            return 0;
        } else if (a->ll1 < b->ll1) {
            return -1;
        } else {
            return 1;
        }
    }
    if (a->ll2 < b->ll2) {
        return -1;
    } else {
        return 1;
    }
}

__attribute__((__noinline__)) int
XXH128_hash_cmp(const XXH128_hash_t a, const XXH128_hash_t b)
{
    if (a.ll2 == b.ll2) {
        if (a.ll1 == b.ll1) {
            return 0;
        } else if (a.ll1 < b.ll1) {
            return -1;
        } else {
            return 1;
        }
    }
    if (a.ll2 < b.ll2) {
        return -1;
    } else {
        return 1;
    }
}

int XXH128_hash_cmp_ptr_qsort(const void *start, const void *end)
{
    return XXH128_hash_cmp_ptr(*(const XXH128_hash_t **)start, *(const XXH128_hash_t **)end);
}

int XXH128_hash_cmp_qsort(const void *start, const void *end)
{
    return XXH128_hash_cmp(*(const XXH128_hash_t *)start, *(const XXH128_hash_t *)end);
}

#define ROUNDS 10000000
static XXH128_hash_t arr[ROUNDS];
static XXH128_hash_t *arr_p[ROUNDS];
static XXH128_hash_t arr_p_alt[ROUNDS];

int main()
{
    srand(time(NULL));
    printf("arch: %zu-bit\n", sizeof(void *) * 8);
    for (int i = 0; i < ROUNDS; i++) {
        arr[i].ll1 = rand() | (uint64_t)rand() << 32;
        arr[i].ll2 = rand() | (uint64_t)rand() << 32;
        arr_p[i] = malloc(sizeof(XXH128_hash_t));
        memcpy(arr_p[i], &arr[i], sizeof(XXH128_hash_t));
        memcpy(&arr_p_alt[i], &arr[i], sizeof(XXH128_hash_t));
    }
    double start, end;
    start = (double)clock();
    qsort(arr, ROUNDS, sizeof(XXH128_hash_t), XXH128_hash_cmp_qsort);
    end = (double)clock();
    printf("pass by value: %lf\n", (end - start) / CLOCKS_PER_SEC);

    start = (double)clock();
    qsort(arr_p, ROUNDS, sizeof(XXH128_hash_t *), XXH128_hash_cmp_ptr_qsort);
    end = (double)clock();
    printf("pass by pointer: %lf\n", (end - start) / CLOCKS_PER_SEC);

    start = (double)clock();
    qsort(arr_p_alt, ROUNDS, sizeof(XXH128_hash_t), (int (*)(const void *, const void *)) XXH128_hash_cmp_ptr);
    end = (double)clock();
    printf("direct pass to qsort: %lf\n", (end - start) / CLOCKS_PER_SEC);
}

(I increased the count to make it more noticeable.)

arch: 32-bit
pass by value: 4.095442
pass by pointer: 5.924909
direct pass to qsort: 3.096856


arch: 64-bit
pass by value: 2.489569
pass by pointer: 4.165021
direct pass to qsort: 2.178582

Use the pointer variant, and to use it with qsort, say

/* To use it as a comparator for qsort, bsearch, or the like, the recommended way to use this is to
 * cast to int (*)(const void*, const void*). This has the best performance.
 *      qsort(array, size, sizeof(XXH128_hash_t), (int (*)(const void*, const void*)) XXH128_hash_cmp)
 * This assumes an array of XXH128_hash_t values. */

Because a compare function taking pointers is going to be the most important usage of this, as I presume it will be used in things like qsort and bsearch.

@easyaspi314 (Contributor, Author) commented Mar 14, 2019

As for the C++ version, we are going to want to inline the comparisons, as it makes std::sort blazing fast:

static inline bool operator ==(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
    return (a.ll1 == b.ll1) && (a.ll2 == b.ll2);
}
static inline bool operator <(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
    if (a.ll2 == b.ll2) {
        return a.ll1 < b.ll1;
    } else {
        return a.ll2 < b.ll2;
    }
}
static inline bool operator !=(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
    return !(a == b);
}
static inline bool operator >(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return b < a;
}
static inline bool operator <=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return !(a > b);
}
static inline bool operator >=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return !(a < b);
}

And by passing by &, we are also passing by pointer. It is just syntax sugar.

arch: 32-bit
pass by value: 4.056218
pass by pointer: 5.928958
direct pass to qsort: 3.117807
std::sort: 2.620486
std::sort inlined: 2.486056

arch: 64-bit
pass by value: 2.375233
pass by pointer: 4.089312
direct pass to qsort: 2.440806
std::sort: 1.826079
std::sort inlined: 1.583563

Edit: the first std::sort uses this:

static inline bool operator<(const XXH128_hash_t &a, const XXH128_hash_t &b) {
    return XXH128_hash_cmp_ptr(&a, &b) < 0;
}

@easyaspi314 (Contributor, Author) commented Mar 14, 2019

Even better: this is all C++11-compatible constexpr, which, while not that important, seems to help optimization a bit.

int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
    if (a->ll2 == b->ll2) {
        return (a->ll1 < b->ll1) ? -1 : (a->ll1 > b->ll1);
    }
    if (a->ll2 < b->ll2) {
        return -1;
    } else {
        return 1;
    }
}
static inline constexpr bool
operator ==(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
    return (a.ll1 == b.ll1) && (a.ll2 == b.ll2);
}
static inline constexpr bool
operator<(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
#if (defined(_WIN32) || defined(__LITTLE_ENDIAN__)) && (defined(__SIZEOF_INT128__) || (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128))
    /* convert to uint128_t and compare */
    return (a.ll1 | (static_cast<__uint128_t>(a.ll2) << 64))
         < (b.ll1 | (static_cast<__uint128_t>(b.ll2) << 64));
#else
    return (a.ll2 == b.ll2) ? (a.ll1 < b.ll1) : (a.ll2 < b.ll2);
#endif
}
static inline constexpr bool
operator !=(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
    return !(a == b);
}
static inline constexpr bool
operator >(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return b < a;
}
static inline constexpr bool
operator <=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return !(a > b);
}
static inline constexpr bool
operator >=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return !(a < b);
}

It is incredibly fast. Unfortunately, the uint128_t compare for the C-style sorting isn't as good as the other method.

arch: 32-bit
pass by value: 4.012220
direct pass to qsort: 3.027623
std::sort: 2.359985
std::sort by value: 2.335475
std::sort new: 2.096199

arch: 64-bit
pass by value: 2.185738
pass to qsort, u128_t compare: 2.127638
direct pass to qsort: 2.044731
std::sort: 1.629353
std::sort by value: 1.629171
std::sort new: 1.201221

On x86_64 with clang, operator < compiles to this:

__ZltRK13XXH128_hash_tS1_:              ## @operator<(XXH128_hash_t const&, XXH128_hash_t const&)
        mov     rax, qword ptr [rdi]
        mov     rcx, qword ptr [rdi + 8]
        cmp     rax, qword ptr [rsi]
        sbb     rcx, qword ptr [rsi + 8]
        setb    al
        ret

and on i386 it compiles to this:

__ZltRK13XXH128_hash_tS1_:              ## @operator<(XXH128_hash_t const&, XXH128_hash_t const&)
        push    ebp
        push    ebx
        push    edi
        push    esi
        mov     esi, dword ptr [esp + 24]
        mov     ecx, dword ptr [esp + 20]
        mov     eax, dword ptr [ecx + 8]
        mov     edx, dword ptr [ecx + 12]
        mov     ebx, dword ptr [esi + 8]
        mov     edi, dword ptr [esi + 12]
        mov     ebp, edx
        xor     ebp, edi
        mov     esi, eax
        xor     esi, ebx
        or      esi, ebp
        jne     LBB5_2
        mov     eax, dword ptr [ecx]
        mov     ecx, dword ptr [ecx + 4]
        mov     edx, dword ptr [esp + 24]
        cmp     eax, dword ptr [edx]
        sbb     ecx, dword ptr [edx + 4]
        jmp     LBB5_3
LBB5_2:
        cmp     eax, ebx
        sbb     edx, edi
LBB5_3:
        setb    al
        pop     esi
        pop     edi
        pop     ebx
        pop     ebp
        ret

Edit: GCC 8.3

arch: 32-bit
pass by value: 4.906234
direct pass to qsort: 3.260699
std::sort: 2.868732
std::sort by value: 2.862184
pass by pointer: 1.626228

arch: 64-bit
pass by value: 2.009833
pass to qsort, u128_t compare: 2.098546
direct pass to qsort: 1.861579
std::sort: 1.432066
std::sort by value: 1.489022
pass by pointer: 1.113134

@Cyan4973 (Owner)

Regarding static U64 XXH3_mul128(U64 ll1, U64 ll2)
The Aarch32 code seems to be over-complicated for returning only the lower 64 bits.

As mentioned by @easyaspi314, it's a bit more than just returning the lower bits.
As the high bits get folded into the low bits, it might be possible to write this operation more compactly, and not need a "real" 128-bit multiplication.
However, I have not figured out the math so far.

I'm fine with a rewrite of the function if it can lead to better performance.
If some gcc assembly can help reach this purpose, it's all fine.

XXH3_mul128 should be renamed to something like XXH3_foldedMult64to128

Good point

@Cyan4973 (Owner) commented Mar 14, 2019

Consequences of implementing a comparator :

In saying that one hash value is "bigger" than another, XXH128_hash_t becomes more than a plain container of bits. It is now necessary to distinguish its components, as they are no longer interchangeable : one must be "high", and the other "low".
And now the order matters. Which one is high, which one is low ?

Possible answer :
In keeping the convention that internal components of XXH128_hash_t are XXH64_hash_t, for which the convention is big-endian, it feels consistent to continue using the same convention, making the first component "high", and the second "low".
Thoughts ?

Why "ll1" and "ll2", IMHO it should be suffixed by "h" and "l".

Yes, I agree.
If we are implementing a comparator function, components must be distinguished as "high" or "low".

@42Bastian

In keeping the convention that internal components of XXH128_hash_t are XXH64_hash_t, for which the convention is big-endian, it feels consistent to continue using the same convention, making the first component "high", and the second "low".
Thoughts ?

I'd go for this. This would be compatible with a natural uint128_t if it exists.

Why "ll1" and "ll2", IMHO it should be suffixed by "h" and "l".

Yes, I agree.
If we are implementing a comparator function, components must be distinguished as "high" or "low".

Also if you consider the struct as replacement for uint128_t

@easyaspi314 (Contributor, Author) commented Mar 14, 2019

a natural uint128_t if it exists

// 0x00112233445566778899AABBCCDDEEFF
const __uint128_t x = ((__uint128_t)0x0011223344556677ULL << 64) | 0x8899AABBCCDDEEFFULL;
$ clang -O3 -c test.c
$ hexdump -C test.o
...
00000180  ff ee dd cc bb aa 99 88  77 66 55 44 33 22 11 00  |........wfUD3"..|
...

For uint128_t 0x00112233445566778899AABBCCDDEEFF, the "natural" layout for most machines would be 0xFFEEDDCCBBAA99887766554433221100.

Doing it this way would allow someone to theoretically do this on a 64-bit little endian machine:

XXH128_hash_t hash = XXH3_128b(...);
__uint128_t val;
memcpy(&val, &hash, sizeof(__uint128_t));

Also if you consider the struct as replacement for uint128_t

It is supposed to be like one, and if the C standard had 128-bit integers, we would definitely use them. But it doesn't, and the only platforms that have it are 64-bit.

@Cyan4973 (Owner) commented Mar 14, 2019

Yes, that's indeed the problem. At a binary level :

  • { hi, lo } would have the same byte order as __uint128_t on a big-endian machine.
  • { lo, hi } would have the same byte order as __uint128_t on a little-endian machine.

So it's not possible to please both.

At a canonical level, I believe it makes sense to preserve the "natural" byte order, as if a __uint128_t could be displayed directly on screen. It makes "big endian" a natural choice. It follows the convention already established by XXH64_hash_t.

But binary and canonical levels do not have to match. Later on, a pair of conversion functions, such as XXH128_canonicalFromHash(), will ensure that the representation is always correct, whatever the platform.

So let's concentrate on the binary level now. It seems a matter of selecting which platform is more important, little or big endian. And these days, little endian is pretty much the winner.

That being said, even that does not matter much : __uint128_t is not an official type, and it's doubtful if it will ever be in the foreseeable future. So a binary compatibility (which would be limited to little endian platforms) seems of doubtful usefulness.

Consistency is also a topic.
{ hi, lo } feels to match the "natural" order expected by humans. That's how we write numbers !
I'm not opposed to { lo, hi }, I just suspect a few people will find it strange to read low before high.

But at the end of the day, both solutions work. As long as low and high are clearly expressed, it should be usable.
It's mostly a matter of picking one and sticking to it.

I'm fine with both.
Just trying to find a reason to "prefer" one over the other.

@Cyan4973 (Owner) commented Mar 14, 2019

I went ahead and renamed the XXH128_hash_t members to low64 and high64.
This will hopefully make it clearer for comparison purposes.

I used the little-endian binary convention, hence low first. I still don't have a strong reason to prefer one over the other, so I went with the easiest one.

I also believe it's not opposed to using big-endian for the canonical representation.

@Cyan4973 (Owner)

"by value" vs "by reference" :

I would really not feel too concerned by a memcpy(,,16) operation. From a performance perspective, this should be unnoticeable. I have actually read reports that passing by value is better from a cache-locality perspective, as pointers into different memory areas keep more cache lines active, thus degrading system performance.

Moreover, whenever the comparator function can be inlined, it all becomes moot. One can expect the comparison to be performed directly on source data. Btw, maybe it's important to offer an inlinable comparator.

In contrast, I would feel more concerned by branches, which can be costly when they mispredict.
Hopefully, in this case, since 64-bit fields are compared, the likelihood of finding hashes with equal high64 fields becomes very low, making the branch extremely predictable.

If we do not constrain the comparator to return { -1, 0, 1 }, but more generally { <0, ==0, >0 }, we may be able to reduce the number of branches in the comparator.

@Cyan4973 (Owner)

In this godbolt example, there is a comparison between a classical "branchy" comparator and a branchless one.

On 64-bit, the branchless variant looks good. It uses fewer instructions than the "branchy" one and avoids a cmov. I would expect it to run faster.

There is also a fully branchless comparator proposed at compare128_1(). It uses the same idea.

However, as mentioned before, it's unlikely that the high 64 bits of two XXH128_hash_t values are identical.
If that happens, it probably means the hashed objects were effectively identical.
As predictable branches are extremely cheap, adding one to reduce the amount of work is probably a good idea.
compare128_2() uses that fact to introduce a single branch, between the high and low comparisons.

Now it's a matter of comparing these implementations...

@Cyan4973 (Owner) commented Mar 14, 2019

I've used @easyaspi314's qsort benchmark script, and adapted it to measure differences between comparators. Here is the modified source code : https://gist.github.com/Cyan4973/e1baa1a9a32b690508df76be64b34082

Some conclusions :
I couldn't find any significant difference between the comparators. They all end up "within noise margin". I guess the comparator is only a small part of the sorting operation itself, so it doesn't matter much.

One indirection vs. two indirections makes a difference, though a moderate one (~10%).
Indeed, invoking the byPtr comparator directly is faster than invoking a wrapper taking ptr arguments which itself calls the byValue comparator while ensuring this one is not inlined.
If that sounds like a win for byPtr, it's only because the test case mandates the use of byPtr. And when the byValue function can be inlined into the wrapper, it erases all differences.

So, the byPtr variant is a better fit for qsort, but is that an isolated example ?

I still believe that passing arguments by value feels more natural,
and that inlining is actually the most important key to performance.
There's an argument, though, for providing a comparator function "ready to use" with qsort. Even if isolated, it's still an important use case by itself.

@easyaspi314 (Contributor, Author)

typedef union {
    struct {
        uint64_t ll1;
        uint64_t ll2;
    };
    struct {
        /* don't use it directly! */
        size_t __val[(sizeof(uint64_t)*2)/sizeof(size_t)];
    };
} XXH128_hash_t;


int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
#if defined(__LITTLE_ENDIAN__) || defined(_WIN32)
    /* Only compare word-sized chunks, highest first. This only works on little endian. */
    size_t i = (sizeof(a->__val)/sizeof(a->__val[0]));
    while (i-- > 0) {
        int cmp = (a->__val[i] > b->__val[i]) - (a->__val[i] < b->__val[i]);
        if (cmp) {
            return cmp;
        }
    }
    return 0;
#else
    /* normal comparison, don't really care */
    if (a->ll2 == b->ll2) {
        return (a->ll1 > b->ll1) - (a->ll1 < b->ll1);
    }
    return (a->ll2 > b->ll2) ? 1 : -1;
#endif
}

That has the best performance on 32-bit.

XXH128_hash_cmp_ptr_alt:
   mov rax, qword ptr [rsi+0x8]
   cmp qword ptr [rdi+0x8], rax
   je L7
   sbb eax, eax
   or eax, 0x1
L1:
   ret
L7:
   mov rdx, qword ptr [rsi]
   mov eax, 0x1
   cmp qword ptr [rdi], rdx
   ja L1
   sbb eax, eax
   ret

That clever bit of assembly is the best I've seen on 64-bit, and it is what GCC emits for the naive implementation:

__attribute__((__noinline__)) int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
    if (a->high64 == b->high64) {
        if (a->low64 > b->low64) {
            return 1;
        } else if (a->low64 < b->low64) {
            return -1;
        } else {
            return 0;
        }
    }
    if (a->high64 < b->high64) {
        return -1;
    } else {
        return 1;
    }
}

@42Bastian commented Mar 15, 2019

As for qsort: I guess the biggest impact comes from swapping the 128-bit values if the sort does not use an array of pointers.
So for machines with enough memory to hold thousands of hashes, the advantage of one way of calling the compare function over the other disappears in the noise.

@sergeevabc
Copy link

Have you compared it with SeaHash (sources, design)?

@ifduyue
Copy link

ifduyue commented Mar 16, 2019

Will there be createState/update/freeState APIs for XXH3?

@Cyan4973
Copy link
Owner

Will there be createState/update/freeState APIs for XXH3?

Yes, that's basically next step.

@Cyan4973
Copy link
Owner

Cyan4973 commented Mar 16, 2019

Have you compared it with SeaHash

I wasn't aware of this hash, thanks for the link !
From its description, it's a bit faster than XXH64 and aimed at checksumming, hence probably not optimized for small inputs. Also, 10% faster than XXH64 doesn't seem enough to compete with XXH3 on speed for large inputs.

@easyaspi314
Copy link
Contributor Author

#180

There is currently a huge bug in the implementation, so some parts need to be redesigned.

@Cyan4973
Copy link
Owner

I have to disagree with the mention of "huge bug" here.
I think we do not share the same view of the scope of a non-cryptographic hash function.

I maintain that a non-cryptographic hash should not even try to "resist" an attack, by adding layers of obfuscations that will just be defeated later on. In the meantime, there will always be a temptation to think "I don't understand what's going on in this algorithm; sure, it's labelled non-cryptographic, but (wink) it seems it's almost as good ", then it gets used in places where it should not, and a bit later, we've got an effective attack ready to roll on all these brittle implementations.

A non-cryptographic algorithm should not hide the fact that it's non-cryptographic, not even in its implementation. In the case of XXH3, it's trivial to figure out how to intentionally generate collisions, on condition of knowing the secret key and the seed. So sure, if one targets the default key and the default seed, it's just plain simple. For other algorithms, it may take a study to get there, but once the method is known, it's only a few clicks away, and infinite collision generation is a matter of a simple script.

The objective of a non-cryptographic hash algorithm is just to quickly generate a bunch of bits, to be used in a hash table, bloom filter, etc. The changes I'm implementing are aimed at the space reduction problem, which is far from severe. They will, as a side effect, make it a bit more difficult to generate intentional collisions, but it certainly does not make the algorithm any more secure, and is not the goal.
From a user perspective, the impact of this space reduction improvement is going to be minimal, if not completely unnoticeable.
The way I see it, we are tuning (and hopefully improving) the quality / performance trade-off. Which is precisely the point of this test period.

@easyaspi314
Copy link
Contributor Author

@Cyan4973 might be a good idea to put that patch into master as well — it is a rather serious bug.

@easyaspi314
Copy link
Contributor Author

Just went to go update my XXH3 implementation in easyaspi314/xxhash-clean:xxh3.

Oh my how XXH3 has changed…

@easyaspi314
Copy link
Contributor Author

easyaspi314 commented Mar 12, 2020

Just broke the 6 GB/s barrier on my phone with Clang 9 by moving the veorq_u64 below the if block.

./xxhsum 0.7.3 (64-bit aarch64 + NEON little endian), Clang 9.0.1 , by Yann Collet
Sample of 100 KB...
XXH32                         :     102400 ->    28142 it/s ( 2748.3 MB/s)
XXH32 unaligned               :     102400 ->    23074 it/s ( 2253.3 MB/s)
XXH64                         :     102400 ->    31512 it/s ( 3077.4 MB/s)
XXH64 unaligned               :     102400 ->    31543 it/s ( 3080.4 MB/s)
XXH3_64b                      :     102400 ->    62141 it/s ( 6068.5 MB/s)
XXH3_64b unaligned            :     102400 ->    57487 it/s ( 5614.0 MB/s)
XXH3_64b w/seed               :     102400 ->    61413 it/s ( 5997.3 MB/s)
XXH3_64b w/seed unaligned     :     102400 ->    57456 it/s ( 5611.0 MB/s)
XXH3_64b w/secret             :     102400 ->    59019 it/s ( 5763.6 MB/s)
XXH3_64b w/secret unaligned   :     102400 ->    53962 it/s ( 5269.7 MB/s)
XXH128                        :     102400 ->    51385 it/s ( 5018.1 MB/s)
XXH128 unaligned              :     102400 ->    49295 it/s ( 4813.9 MB/s)
XXH128 w/seed                 :     102400 ->    52260 it/s ( 5103.5 MB/s)
XXH128 w/seed unaligned       :     102400 ->    49710 it/s ( 4854.5 MB/s)
XXH128 w/secret               :     102400 ->    49710 it/s ( 4854.5 MB/s)
XXH128 w/secret unaligned     :     102400 ->    46561 it/s ( 4547.0 MB/s)

Clang seems to interleave it better that way.

@easyaspi314
Copy link
Contributor Author

easyaspi314 commented Mar 12, 2020

GCC 9 is slightly slower on this change, though it is slower in general (4.5 GB/s) because its arm_neon.h uses inline assembly everywhere, which defeats the compiler's pattern recognition.

However, screw GCC AArch64. Everybody uses clang anyways. 😂

Add proper builtins and then I might change my mind

@easyaspi314
Copy link
Contributor Author

easyaspi314 commented Mar 14, 2020

Updated xxhash-clean with the latest XXH3.

https://github.com/easyaspi314/xxhash-clean

Not planning to do streaming — not yet anyways.

The other cool thing is that it can compile really small, especially since XXH3_64bits and XXH3_128bits are isolated.

aarch64-linux-android-clang 9.0.1 -Oz
xxh3-64b-ref.o  :
section           size   addr
.text             1440      0
.rodata            256      0
.comment            22      0
.note.GNU-stack      0      0
.debug_frame        48      0
Total             1766

xxh3-128b-ref.o  :
section           size   addr
.text             2028      0
.rodata            256      0
.comment            22      0
.note.GNU-stack      0      0
.debug_frame        72      0
Total             2378

XXH_INLINE_ALL wrapping the same functions:

xxh3-64b.o  :
section           size   addr
.text             1792      0
.rodata            256      0
.comment            22      0
.note.GNU-stack      0      0
.debug_frame        88      0
Total             2158

xxh3-128b.o  :
section           size   addr
.text             2324      0
.rodata            256      0
.comment            22      0
.note.GNU-stack      0      0
.debug_frame        48      0
Total             2650

ABI-compatible with the single-shot functions.

Edit: Happy Pi Day!

@easyaspi314
Copy link
Contributor Author

easyaspi314 commented Apr 14, 2020

RISC-V 64-bit on qemu-riscv64 4.1.1 on my Pixel 2 XL (because my PC is in use)
Clang 9.0.1
musl libc

xxhsum-rv64 0.7.4 (64-bit riscv + mul little endian), Clang 9.0.1 , by Yann Collet
Sample of 100 KB...        
XXH32                         :     102400 ->     5829 it/s (  569.3 MB/s) 
XXH32 unaligned               :     102400 ->     5830 it/s (  569.3 MB/s) 
XXH64                         :     102400 ->     6212 it/s (  606.7 MB/s) 
XXH64 unaligned               :     102400 ->     6238 it/s (  609.2 MB/s) 
XXH3_64b                      :     102400 ->     5032 it/s (  491.4 MB/s) 
XXH3_64b unaligned            :     102400 ->     5043 it/s (  492.5 MB/s) 
XXH3_64b w/seed               :     102400 ->     5036 it/s (  491.8 MB/s) 
XXH3_64b w/seed unaligned     :     102400 ->     5038 it/s (  492.0 MB/s) 
XXH3_64b w/secret             :     102400 ->     2392 it/s (  233.6 MB/s) 
XXH3_64b w/secret unaligned   :     102400 ->     2392 it/s (  233.6 MB/s) 
XXH128                        :     102400 ->     5165 it/s (  504.4 MB/s) 
XXH128 unaligned              :     102400 ->     5139 it/s (  501.9 MB/s) 
XXH128 w/seed                 :     102400 ->     5190 it/s (  506.9 MB/s) 
XXH128 w/seed unaligned       :     102400 ->     5170 it/s (  504.9 MB/s) 
XXH128 w/secret               :     102400 ->     2357 it/s (  230.2 MB/s) 
XXH128 w/secret unaligned     :     102400 ->     2353 it/s (  229.8 MB/s) 

Note that Clang uses byteshift loads here. RISC-V makes unaligned access optional but recommended: it can be handled by the execution environment or implemented in hardware. This was intended to make unaligned access easier to implement, and to allow disabling it on embedded targets.

If I do XXH_FORCE_MEMORY_ACCESS=2, we get a much better speed, but I'm not 100% sure it is safe.
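For reference, the access strategies in question look roughly like this (illustrative helper names, not xxHash's actual internals): the byteshift load is safe on any alignment but compiles to many instructions when the target lacks hardware unaligned loads, while the memcpy-based load lets the compiler emit a single load where the target allows it. XXH_FORCE_MEMORY_ACCESS=2 goes one step further and uses a direct pointer cast, which is why its safety is in doubt on misaligned data.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Byteshift load: one byte at a time, safe on any alignment and always
 * little-endian, but slow where the compiler can't fuse it back into a
 * single load (as on this RISC-V target). */
static uint64_t read64_byteshift(const uint8_t* p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* memcpy load: the usual portable alternative; compilers turn it into a
 * single (possibly unaligned) native-endian load when the target allows it. */
static uint64_t read64_memcpy(const void* p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On a little-endian machine both helpers return the same value for the same bytes; the difference is purely in the generated code.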

xxhsum-rv64 0.7.4 (64-bit riscv + mul little endian), Clang 9.0.1 , by Yann Collet
Sample of 100 KB...        
XXH32                         :     102400 ->     9306 it/s (  908.8 MB/s) 
XXH32 unaligned               :     102400 ->     8457 it/s (  825.8 MB/s) 
XXH64                         :     102400 ->    17902 it/s ( 1748.3 MB/s) 
XXH64 unaligned               :     102400 ->    16735 it/s ( 1634.3 MB/s) 
XXH3_64b                      :     102400 ->    18693 it/s ( 1825.5 MB/s) 
XXH3_64b unaligned            :     102400 ->    17454 it/s ( 1704.5 MB/s) 
XXH3_64b w/seed               :     102400 ->    18399 it/s ( 1796.7 MB/s) 
XXH3_64b w/seed unaligned     :     102400 ->    17810 it/s ( 1739.3 MB/s) 
XXH3_64b w/secret             :     102400 ->    16885 it/s ( 1648.9 MB/s) 
XXH3_64b w/secret unaligned   :     102400 ->    16311 it/s ( 1592.8 MB/s) 
XXH128                        :     102400 ->    18250 it/s ( 1782.2 MB/s) 
XXH128 unaligned              :     102400 ->    17769 it/s ( 1735.3 MB/s) 
XXH128 w/seed                 :     102400 ->    18089 it/s ( 1766.5 MB/s) 
XXH128 w/seed unaligned       :     102400 ->    17102 it/s ( 1670.1 MB/s) 
XXH128 w/secret               :     102400 ->    16692 it/s ( 1630.1 MB/s) 
XXH128 w/secret unaligned     :     102400 ->    16252 it/s ( 1587.2 MB/s) 

Also, here's mips64el (MIPS VI) on qemu-mips64el 2.11.50 (because 4.4.0 isn't in the repos and I already had that version patched for Termux)

xxhsum-mips 0.7.4 (64-bit mips64 little endian), Clang 9.0.1 , by Yann Collet
Sample of 100 KB...        
XXH32                         :     102400 ->     8909 it/s (  870.0 MB/s) 
XXH32 unaligned               :     102400 ->     8545 it/s (  834.5 MB/s) 
XXH64                         :     102400 ->    20445 it/s ( 1996.6 MB/s) 
XXH64 unaligned               :     102400 ->    17705 it/s ( 1729.0 MB/s) 
XXH3_64b                      :     102400 ->    15062 it/s ( 1470.9 MB/s) 
XXH3_64b unaligned            :     102400 ->    14507 it/s ( 1416.7 MB/s) 
XXH3_64b w/seed               :     102400 ->    15050 it/s ( 1469.7 MB/s) 
XXH3_64b w/seed unaligned     :     102400 ->    14759 it/s ( 1441.3 MB/s) 
XXH3_64b w/secret             :     102400 ->    14367 it/s ( 1403.1 MB/s) 
XXH3_64b w/secret unaligned   :     102400 ->    13759 it/s ( 1343.6 MB/s) 
XXH128                        :     102400 ->    15052 it/s ( 1469.9 MB/s) 
XXH128 unaligned              :     102400 ->    14304 it/s ( 1396.8 MB/s) 
XXH128 w/seed                 :     102400 ->    14735 it/s ( 1438.9 MB/s) 
XXH128 w/seed unaligned       :     102400 ->    14463 it/s ( 1412.4 MB/s) 
XXH128 w/secret               :     102400 ->    14290 it/s ( 1395.5 MB/s) 
XXH128 w/secret unaligned     :     102400 ->    13481 it/s ( 1316.5 MB/s) 

Also, damn, generating a constant is torture on RISC-V...

int64_t load_constant(void)
{
    return 0x165667919E3779F9;
}
load_constant:
        lui     a0, 45
        addiw   a0, a0, -1331
        slli    a0, a0, 14
        addi    a0, a0, -883
        slli    a0, a0, 14
        addi    a0, a0, -913
        slli    a0, a0, 15
        addi    a0, a0, -1543
        ret
int64_t load_constant(void)
{
    int32_t a0_32 = 45 << 12;
    a0_32 = a0_32 - 1331;
    int64_t a0 = (int64_t)a0_32; // sign extend
    a0 = a0 << 14;
    a0 = a0 - 883;
    a0 = a0 << 14;
    a0 = a0 - 913;
    a0 = a0 << 15;
    a0 = a0 - 1543;
    return a0;
}
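Transcribed into runnable form, the walkthrough above can be sanity-checked against the target constant (each line maps to one lui/addiw/slli/addi step of the emitted assembly):

```c
#include <stdint.h>
#include <assert.h>

/* The RISC-V constant-materialization sequence, written out in C. */
static int64_t load_constant(void)
{
    int64_t a0 = (int32_t)((45 << 12) - 1331); /* lui a0, 45 ; addiw a0, a0, -1331 */
    a0 = (a0 << 14) - 883;                     /* slli 14 ; addi -883  */
    a0 = (a0 << 14) - 913;                     /* slli 14 ; addi -913  */
    a0 = (a0 << 15) - 1543;                    /* slli 15 ; addi -1543 */
    return a0;
}
```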

@Cyan4973
Copy link
Owner

Cyan4973 commented Jun 19, 2020

I spent the past days trying to find and exploit other weaknesses, either in implementation or in the algorithm itself, using a few hunches where I expected to find exploitable problems, but ultimately couldn't trigger any additional issue.

So I presume the scope is now good enough to contemplate a new release v0.7.4 .

By far, the most important remaining decision is: is it worth fixing the avalanche issue detected by demerphq's test when the seed is the variable (as opposed to the input)? => #395

The use case seems minor, though not necessarily "useless". And this is more or less the last chance to fix something like that, as the next planned release, v0.8.0, will officially stabilize the output, making it impossible to change function return values afterwards.

Also, indirectly, this could open the door to additional minor changes that were kept at bay because they also impacted the output, such as @easyaspi314's suggestion to change the default secret to use the decimals of Pi instead (or any other "famous" constant with known randomness qualities).

These changes are not expected to impact the planning much, a few days at most. So this is mostly about taking a decision.

If you want to voice your opinion on this topic, now is a great time.

Annex

For the record, list of other topics that can still be worked upon, but are not required for next release :

  • Introduction of a new variant _withSeed128(), suggested by @koraa.
  • Advanced variant of generateSecret(), able to generate secrets of custom length (instead of being limited to a single pre-defined length).
  • -c check capability for BSD-style checksums (produced by --tag)
  • x86_dispatch in the library
  • Mitigate random impact of instruction alignment on Intel cpus

@aras-p
Copy link
Contributor

aras-p commented Jun 19, 2020

If some tweak is done that changes the computed values anyway, then would/could #342 be rolled into the same release too?

@Cyan4973
Copy link
Owner

Cyan4973 commented Jun 19, 2020

Yes @aras-p , #342 is part of the list of "minor but breaking change" that would become opened for review and potential inclusion.

@Cyan4973
Copy link
Owner

Cyan4973 commented Jun 22, 2020

OK, time to take a decision,

I lean on making the #395 change part of the release. Here is my line of reasoning:

  • This is still development period, and basically the last chance to update the formula to address a quality concern, even for a minor scenario.
    • I'll regret it later if I don't do it
  • With latest update, using the XXH64 avalanche instead of rrmxmx, performance impact has been minimized to the point of being barely noticeable in benchmarks. So it seems like a change which purely improves quality, without performance downside.
  • General planning doesn't change. v0.7.4 is planned to be released later this week, followed by v0.8.0 a few weeks later, which will officially stabilize the output of XXH3 and XXH128.

Now, speaking of planning, if I merge this change, then other breaking changes can be introduced too.
Among them, updating the default kSecret, and reducing scrambling (#342).
But in order to keep the planning unchanged, these changes need to be completed within the next few days.

Btw, @easyaspi314 , are you still interested in updating kSecret ?

edit : created new branch newOutput074 where all changes impacting hash return values will be merged, for proper validation.

edit 2 : pushed branch Pi which modifies kSecret with decimals of Pi as provided by http://hexpi.sourceforge.net/

@Cyan4973
Copy link
Owner

Cyan4973 commented Jun 23, 2020

All breaking changes are now merged into staging branch newOutput074.
I will now initiate long resiliency and quality tests, to ensure this new variant features no hidden issue (none expected).
Should be completed within a day.

@WayneD
Copy link
Contributor

WayneD commented Jun 24, 2020

I'm trying to find a good define that will let the C preprocessor differentiate if the XXH3 routines are available via the xxhash.h include. This is so that my code can auto-enable/disable support of the newer hash algorithms no matter what xxhash-devel package the user has installed (I'm hoping to avoid a compile test via autoconf). Obviously there's not a released version that has XXH3 algorithms in the stock xxhash.h yet, so what does the future hold?

Any chance for a define that would be supported currently via xxh3.h and would also be present in the future via xxhash.h? Or is the best differentiator going to be via some future XXH_VERSION_NUMBER value check?

At the moment I settled on a combination of my own define for if xxh3.h should be included combined with a version check:

#ifdef PRE_RELEASE_XXHASH
#include <xxh3.h>
#endif
#include <xxhash.h>
#if defined PRE_RELEASE_XXHASH || XXH_VERSION_NUMBER >= 800
#define SUPPORT_XXH3 1
#endif

@easyaspi314
Copy link
Contributor Author

easyaspi314 commented Jun 24, 2020

XXH_VERSION_NUMBER is probably your best option, yeah.

xxHash installs xxh3.h as of v0.7.3. The single-shot functions are fine to use, but keep in mind that there are some nasty bugs in the streaming implementation, which will be fixed in the next release.

Also, don't include xxh3.h directly. It is likely going to disappear, and you can just do this:

#define XXH_INLINE_ALL
#include <xxhash.h>

@WayneD
Copy link
Contributor

WayneD commented Jun 24, 2020

Is there a plan for what the first version number will be that includes XXH3 & XXH128 by default? Is it going to be 1.0.0 (aka 10000)?

@easyaspi314
Copy link
Contributor Author

It will either be v0.8.0 or v1.0.0, but both will trigger on XXH_VERSION_NUMBER >= 800. 🙂

@Cyan4973
Copy link
Owner

Cyan4973 commented Jul 18, 2020

Time to clearly define the scope to upgrade to "stable" status for v0.8.0.
Not everything will reach this label. In the text below, I'll go through the different parts of the API, from the easiest to decide to the most difficult.

Stable at v0.8.0

This is the minimal scope which, I believe, is beyond dispute and will be labelled stable for next release.

64-bit one shot functions :
XXH3_64bits()
XXH3_64bits_withSeed()

State management functions, from streaming :
XXH3_createState()
XXH3_freeState()
XXH3_copyState()

Streaming functions, with equal scope as one-shot functions :
XXH3_64bits_reset()
XXH3_64bits_reset_withSeed()
XXH3_64bits_update()
XXH3_64bits_digest()

128-bit one-shot hash function :
XXH3_128bits()

Helpers for XXH128_hash_t type :
Since it's not an integral type, standard comparison operators are not provided natively by the language. They must be created, which is what these functions offer:
XXH128_isEqual()
XXH128_cmp()
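The intended behavior can be sketched with stand-in implementations (toy struct and names, not the code that ships in xxhash.h): equality compares both words, and the ordering helper takes void* parameters so it can be handed to qsort()/bsearch() directly, deciding on the high word first.

```c
#include <stdint.h>
#include <assert.h>

/* Stand-in for XXH128_hash_t. */
typedef struct { uint64_t low64; uint64_t high64; } h128;

/* Equality in the spirit of XXH128_isEqual(): structs cannot be
 * compared with == in C, so a helper is needed. */
static int h128_isEqual(h128 a, h128 b)
{
    return (a.low64 == b.low64) && (a.high64 == b.high64);
}

/* Total order in the spirit of XXH128_cmp(): void* signature for
 * qsort()/bsearch(); the high word decides first, then the low word. */
static int h128_cmp(const void* pa, const void* pb)
{
    const h128* a = (const h128*)pa;
    const h128* b = (const h128*)pb;
    int c = (a->high64 > b->high64) - (a->high64 < b->high64);
    return c ? c : (a->low64 > b->low64) - (a->low64 < b->low64);
}
```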

Not part of stabilization

The following functions will not be stabilized at v0.8.0.
They may become stabilized later, though in a future version (v0.9.0 for example).

XXH3_generateSecret() :

This function takes a custom seed of any length and converts it into a high-entropy secret of size XXH3_SECRET_DEFAULT_SIZE, suitable to be employed with the _withSecret() variants.
The primary reason not to stabilize this function is to keep the possibility of changing its output. This function is still very new, and new requirements may appear during the testing period.
For example, a recent discussion suggested a need for a special resolution for seeds of size 8, mimicking the internal secret generation of the _withSeed() variants, for reproducibility.
Another open question is the size of the generated secret: right now, it's static, set at XXH3_SECRET_DEFAULT_SIZE. Should the secret's size become a parameter? If yes, should it be a parameter of this function, or should another (advanced) function be created for this purpose?

XXH3_128bits_withSeed128() and corresponding XXH3_128bits_reset_withSeed128() :

@koraa mentioned a need for a hash using a 128-bits seed. There were not a lot of requests for this feature, but it's definitely a possible one. However, it will take a bit of time to develop. I don't want to commit to it for next version, and more importantly, even if I do, these functions will start their life in the experimental section, and won't join the "stable" section of the API in their first incarnation.

Variant(s) _withSeedAndSecret() and / or _withState() :

A recent discussion started by @gzm55 suggests the creation of a new function variant, primarily to optimize the speed of _withSeed() variants in scenarios where the seed remains constant, avoiding the reconstruction of the secret at each invocation to save some latency budget, which is measurable when the input is of "medium" size and the selected vector instruction set is AVX.
The discussion is still ongoing, about the merits and consequences of different possible approaches. There is no guarantee that it will result in the creation of new functions during this cycle, and even if it does, these functions will be too young to join the "stable" area.

Variants _256bits*()

The need for such a variant is very low, mostly confined to curiosity level, while the amount of work required is very high.
I simply don't intend to work on it in the foreseeable future, so, obviously, not part of v0.8.0.

Stabilization opened for comments

Sorted from easier to more difficult to settle.

Streaming 128-bit variants

This paragraph regroups XXH3_128bits_reset(), XXH3_128bits_update() and XXH3_128bits_digest().

I think there is no question here: of course, streaming 128-bit variants are required for the stabilized scope.
Rather, one could note that XXH3_128bits_reset() is effectively the same as XXH3_64bits_reset(), and XXH3_128bits_update() is exactly the same as XXH3_64bits_update().
This is true since v0.7.4, when the inner loop has been simplified to always employ the stronger 128-bit one.

This underlines an opportunity : could they be merged ?
In such an alternative design, one would see XXH3_reset() and XXH3_update(). The distinction between 64-bit and 128-bit output would only be specified at digest invocation.

This is mostly an API design question. In term of behavior, nothing is changed.
One maintains a form of consistency, by prefixing all 128-bit functions with XXH3_128bits_*();
the other underlines versatility, and how the choice between 64-bit and 128-bit output can be postponed up to the last moment.
Both choices are fine; it's just a matter of selecting one.
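A toy model of the merged design (hypothetical names, and a trivial stand-in state; the real XXH3_state_t and mixing are far more involved) shows how the 64/128-bit decision could be deferred until digest time:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

typedef struct { uint64_t low64; uint64_t high64; } hash128_t;
typedef struct { uint64_t acc_lo, acc_hi; } toy_state; /* stand-in for XXH3_state_t */

/* One reset and one update, shared by both output widths. */
static void toy_reset(toy_state* s) { s->acc_lo = s->acc_hi = 0; }

static void toy_update(toy_state* s, const uint8_t* p, size_t len)
{
    for (size_t i = 0; i < len; i++) { /* toy mixing, not XXH3's */
        s->acc_lo = s->acc_lo * 31 + p[i];
        s->acc_hi = s->acc_hi * 37 + p[i];
    }
}

/* The output width only appears at digest invocation. */
static uint64_t toy_digest64(const toy_state* s) { return s->acc_lo; }

static hash128_t toy_digest128(const toy_state* s)
{
    hash128_t h = { s->acc_lo, s->acc_hi }; /* toy: low half matches digest64 */
    return h;
}
```

With such an API, a caller can run a single streaming pass and decide at the end whether it wants 64 or 128 bits of output.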

XXH3_128bits_withSeed() and XXH3_128bits_reset_withSeed()

This is a follow-up of the suggestion from @koraa to introduce _withSeed128() variants.
There are 2 potential questions :

  • Should the new variant replace the existing ones, thus removing the existing _withSeed() variants?
  • Should the existing variants be renamed to _withSeed64(), to better emphasize the difference with _withSeed128()?

I believe it's possible to settle this question. And I believe the answer to both is "no".
I don't think that a variant using a 128-bit seed invalidates the need for another variant using a 64-bit seed.
As mentioned before, integral types are much easier to use. This statement goes beyond C/C++ scenarios: bindings for multiple languages will also exist, and while I don't doubt that solutions exist to manipulate XXH128_hash_t structures from higher-level languages, it's certainly more complex than with integrals.
So I believe both can exist.
As for the name, changing from _withSeed() to _withSeed64() would call the equivalent variant of XXH3_64bits*() into question. It seems both names would need to change for consistency, which would delay stabilization of the _withSeed() variants by another round.

I don't believe there's a need for that, and we can probably accept the _withSeed() variants as part of the stabilized scope for v0.8.0. We'll discover later whether a _withSeed128() actually materializes, and if it does, I believe the distinctive name (and prototype) will be enough to avoid confusing them.

Variants _withSecret()

A comment by @gzm55 underlines that these functions are targeted at "advanced" users, because it's easy to misuse them. For example, it's fairly easy to provide a secret with bad randomness properties, resulting in poor hash quality.

I presume the intended objective is to make these variants less accessible.

However, keeping them in the "experimental" section implies, by default, a risk that the return values of these functions may change in the future, strongly reducing their usefulness (session-local only: no storage, no transmission).

Thing is, I don't see any alternative scenario regarding the usage and algorithm of the _withSecret() functions. In my view they are perfectly well defined, and won't need to change in the future.

This leads us to 2 solutions:

  • Keep them in the "experimental" section, but pledge that their return values are in fact stable
  • Promote them to "stable", but add a comprehensive warning in the documentation explaining the risks of their usage (essentially around the randomness properties of the secret).

The current stance is to select option 2, which looks reasonable to me.
If one feels differently about it, now would be a good time to present your arguments.

XXH128()

XXH128() is a shortcut for the one-shot 128-bit variant, mirroring existing functions XXH32() and XXH64().
Sure, but which one ?

Currently, it's mapped to XXH3_128bits_withSeed(), because both XXH32() and XXH64() offer a seed parameter, so it maintains consistency.

However, presuming that one day a XXH3_128bits_withSeed128() variant exists, it also becomes a contender.
Therefore, which one should become the simplified XXH128() ? _withSeed() or _withSeed128() ?

One could note that the seed of XXH32() is 32-bit and the seed of XXH64() is 64-bit; consequently, it would feel somewhat logical for XXH128() to offer a 128-bit seed.

However, there are arguments in the opposite direction :

  • the _withSeed128() doesn't exist yet
  • manipulating a 128-bit seed is more complex than a 64-bit one, since it's no longer an integral type. As a shortcut, XXH128() should lean towards the variant that is simpler to use.
  • changing the signature of XXH128() would break existing code

This topic is the least decided, and I can see arguments leaning both ways.
In the worst case, if there is no clear winner, stabilization of this prototype could be delayed to v0.9.0.
But, maybe there is no need to, and it could be settled now, right on time for v0.8.0.

Other actions

Among the other actions required for the release, the API declarations in xxhash.h must obviously be updated, to move the newly stabilized API into the public section, so that it no longer requires XXH_STATIC_LINKING_ONLY.
Also, xxh3.h must be integrated into xxhash.h (implementation section). I think I will keep xxh3.h around, but more as a stub, so that programs currently including it will continue to compile correctly. This is temporary: as a file, xxh3.h will disappear in a future release. But there is no urgency.
Finally, the README.md should be updated to reflect the presence of XXH3, with new benchmarks using more recent systems, typically taken from there.

@Cyan4973
Copy link
Owner

Cyan4973 commented Jul 23, 2020

Current proposal for v0.8.0 : #435

All existing XXH3 prototypes are promoted to stable status without modification,
except XXH3_generateSecret() and XXH128()
which both remain experimental, preserving a possibility to change them in future versions.

@easyaspi314
Copy link
Contributor Author

easyaspi314 commented Jul 24, 2020

Is it okay if we set XXH64_hash_t to unsigned long on LP64?

This would match stdint.h, and fix some inconsistency annoyances, especially with NEON which uses uint64_t and warns on xxh_u64 in gnu90 mode.

/* LP64 defines uint64_t as unsigned long, try to match it. */
# if defined(__LP64__) && ULONG_MAX == 0xFFFFFFFFFFFFFFFFULL
    typedef unsigned long XXH64_hash_t;
# else
    typedef unsigned long long XXH64_hash_t;
# endif

@Cyan4973
Copy link
Owner

I presume it is

@easyaspi314
Copy link
Contributor Author

Ok, how is this? It should cover all cases.

#ifndef XXH64_HASH_T_DEFINED
#  define XXH64_HASH_T_DEFINED
#  if !defined (__VMS) \
    && (defined (__cplusplus) /* C++ */ \
    || (defined (__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) /* C99 */) \
    || defined(UINT64_MAX) /* stdint.h already included */)
#      include <stdint.h>
       typedef uint64_t XXH64_hash_t;
#  else
#    include <limits.h>
     /* Fake stdint.h */
#    if defined(__UINT64_TYPE__) /* usually predefined by GCC */
       typedef __UINT64_TYPE__ XXH64_hash_t;
#    elif defined(_MSC_VER) /* MSVC */
       typedef unsigned __int64 XXH64_hash_t;
#    elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFULL /* LP64 ABI */
       typedef unsigned long XXH64_hash_t;
#    else
       typedef unsigned long long XXH64_hash_t;
#    endif
#  endif
#endif /* XXH64_HASH_T_DEFINED */

The XXH64_HASH_T_DEFINED macro will prevent something like this from happening:

#include <xxhash.h>
#include <stdint.h>
#define XXH_INLINE_ALL
#include <xxhash.h>

@Cyan4973
Copy link
Owner

Okay, it looks complex, but I guess it indeed adapts to a lot of situations.
Presuming that the rest of the code base already does a good job of not confusing XXH64_hash_t with unsigned long long, it should be fine (note: compiling with -Wconversion will be important here).

The XXH64_HASH_T_DEFINED macro will prevent something like this from happening:

Have you checked that this is actually required ?
I would expect current code to already protect from such multi-include issue.
If it does not, that's good to know too !

@easyaspi314
Copy link
Contributor Author

easyaspi314 commented Jul 26, 2020

Actually, this still isn't good enough.

The check for stdint.h being included is overkill, and old MSVC versions don't have stdint.h even in C++ mode.

This should work:

#if defined(_MSC_VER) /* MSVC has always supported these types */
    typedef unsigned __int32 XXH32_hash_t;
    typedef unsigned __int64 XXH64_hash_t;
#elif !defined(__VMS) \
    && ( \
        (defined(__cplusplus) && __cplusplus >= 201103L) /* C++11 */ \
     || (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L) /* C99 */ \
    )
#  include <stdint.h>
 ...
#else
  ....
#endif 

I tested on VC++2005 and sure enough, when compiled in C++ mode, it errors because it can't find stdint.h.

It found two other issues:

  • _mm_castps_si128 isn't available in this version of MSVC. Why are we using it?
  • We include headers inside extern "C" instead of using it only for declarations. MSVC does not like that.

If we want to be compatible with basically any compiler we put at it, why not add a few CI checks against uncommon/really old compiler/libc combos? It is good for testing and flexing

I have figured out how to programmatically download and install the VC++2005 CLI tools from PowerShell, and I can probably set up a GCC 3.3/glibc 2.2 toolchain for Linux. We can cache these.

@Cyan4973
Copy link
Owner

If the test becomes so complex, it may be better to create a new build macro like XXH_HAVE_STDINT which could be determined externally, or set manually, so that it's easier to adjust to any configuration.

If we want to be compatible with basically any compiler we put at it

There are some limits to this exercise.
I'm generally favorable to any minor change / tweak which increases the range of compatible compilers without affecting much the code structure.
Also, if we want to claim that a given compiler is supported, it needs to be part of CI. The reason is that it's way too easy for a future contributor to introduce code which breaks one of these compilers, since most contributors won't check anything beyond their own system and the automated tests.

If an old or rare compiler requires some larger code change to adjust, with several modifications spread out into the code base, then it's best to leave that out. We don't want to impact general code readability and maintenance for the sake of a very rare scenario.

@easyaspi314
Copy link
Contributor Author

easyaspi314 commented Jul 26, 2020

Good point. It is a little complicated.

I will make a PR with the fixes for those though, as well as obeying strict aliasing as best as I can in the main loop.

@Cyan4973
Copy link
Owner

xxHash v0.8.0 has been released this morning, it stabilizes output values of XXH3.

Side-question : is it still necessary to keep this "general discussion" thread open ?
Now that XXH3 is stabilized, we may be better off using the regular issue board for more targeted topics ?

@aras-p
Copy link
Contributor

aras-p commented Jul 27, 2020

is it still necessary to keep this "general discussion" thread open

I'd say close this one; more targeted issues are better once things are stable & settled down.

@Cyan4973 Cyan4973 closed this as completed Aug 2, 2020
AralRocais pushed a commit to AralRocais/xxHash that referenced this issue Nov 3, 2022
answering : Cyan4973/xxHash#175 (comment)

The probability of receiving an empty string is larger than random (> 1 / 2^64),
making the generated hash more "common".

For some algorithms, it's an issue if this "more common" value is 0.

Maps it instead to an avalanche of an arbitrary start value (prime64).
The start value is blended with the `seed` and the `secret`,
so that the result is dependent on those ones too.