Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use FastHash64 or FNV1a as a 64-bit hash algorithm. #6138

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rodrigorc
Copy link
Contributor

This PR implements an alternative 64-bit hash algorithm instead of CRC32, to make it more resilient to hash collisions.

It has already been discussed a few times such as in #4612.

According to my calculations, using the Birthday problem approximation formula, and assuming a perfect hash, with a 32-bit hash the probability of collisions grows as follows:

  • 1% - 9 300 IDs
  • 10% - 30 100 IDs
  • 50% - 77 200 IDs
  • 90% - 140 700 IDs

With a 64-bit hash the probabilities are:

  • 1% - 610 000 000 IDs
  • 10% - 2 000 000 000 IDs
  • 50% - 5 000 000 000 IDs
  • 90% - 9 200 000 000 IDs

Which, IMO makes a non-malicious collision practically impossible.

A few additional comments:

  1. By default it keeps using CRC32. To use FastHash64 add CXXFLAGS += -DIMGUI_USE_HASH64 to your Makefile.
  2. It changes the ImGuiID typedef to be a 64-bit integer instead of 32-bits. It is an ABI change, obviously. And I would feel more comfortable doing typedef ImU64 ImGuiID; instead of using unsigned long long, but that would change the order of existing typedefs.
  3. In many places IDs are converted to/from strings using the format string %08X. With a 64-bit integer that is not correct, so I'm defining a macro IMGUI_HASH_FMT, <inttypes.h> style, to do the proper formatting. I think I have changed all of them, but third party code that does something similar would need to be updated.
  4. I'm using FastHash64 algorithm, by Zilong Tan, based on this implementation, MIT licensed. I considered using the FNV1a, as a FIXME comment suggests, but curiously enough the ShowDemoWindow() suffered from the same collision as CRC32, as commented by @opadin in the linked issue. I'm opening a separate PR for this collision, as it is not just random chance, but this shows how a better hashing may avoid nasty surprises.
  5. It is not explicitly stated, but ImHashData/ImHashStr must return the unmodified seed if they receive zero bytes. CRC32 and FNV1a do that, but FastHash64 does not, so I'm doing that check manually. It is required for some things such as BeginMenuEx() calling Selectable("", ...), or DragScalarN() calling DragScalar("", ...) to reuse the same ID.
  6. About speed, in a modern 64-bit machine, with a properly aligned buffer, and a few dozens bytes to hash, FastHash64 is about twice faster than CRC32. YMMV.
  7. Since it reads bytes in chunks of 8, the checking of ### and the ending NUL cannot be done inline. So I'm calling strlen() and strstr() beforehand. It is incompatible with the previous behavior, as I'm only checking for ### if data_size==0. This is because if the string length is passed explicitly we cannot be sure if there is a NUL byte at the end, so strsstr cannot be used. And AFAIK, memmem() or strnstr() are not standard C functions. If desired, we can implement ImMemMem/ImStrNStr easily, of course. Or use a different hashing algorithm that can chomp bytes one by one, such as FNV1a. I think ImGui itself never calls those functions with data_size != 0 so there is no harm here. With third-party libraries or other language bindings this will be an issue, so please consider this point a request for comments.
  8. These str* calls make the runtime of ImHashStr about the same as CRC32. I expect that 32-bit CPUs will be slightly slower.
  9. I've changed the signature of ImHashData/ImHashStr so that the seed argument is of type ImGuiID instead of ImU32 so they keep the same signature with both hashes.

With all that said, I've been using it for a while in a project of mine with no issues at all.

@rodrigorc
Copy link
Contributor Author

Now that f799a29 fixes the demo hash collision I see no reason not to use FNV1a hash algorithm and incompatibility commented in point 7 of my original post

So I'm force-pushing a version of this PR with this FNV1a instead of FastHash64, the code is simpler, faster and properly checks for the ### mark. My home-made measurements give about 25%-50% faster than CRC32. Hey, if it proves worthy, we could even implement the hash32 version by doing the FNV1a and folding the 64->32 bits: best of both worlds.

Anybody feel free to comment on either version of the PR, of course.

@ocornut ocornut changed the title Use FastHash64 as a 64-bit hash algorithm. Use FastHash64 or FNV1a as a 64-bit hash algorithm. Feb 3, 2023
@ocornut
Copy link
Owner

ocornut commented Feb 3, 2023

Thank you, I'll try to review next week.
Some quick thought:

  • See if the string parser in ShowDebugLogWindow() still works.
  • May want to test this with the Test Suite.
  • May need to do a cursory check about sscanf() working on all compilers/libs.
  • Unsure if %016llX works everywhere too. I don't think 64-bit integers in are well standardized. Notice this blocks in GDataTypeInfos[]:
#ifdef _MSC_VER
    { sizeof(ImS64),            "S64",  "%I64d","%I64d" },  // ImGuiDataType_S64
    { sizeof(ImU64),            "U64",  "%I64u","%I64u" },
#else
    { sizeof(ImS64),            "S64",  "%lld", "%lld"  },  // ImGuiDataType_S64
    { sizeof(ImU64),            "U64",  "%llu", "%llu"  },
#endif

BUT would need to check history/blame about this. As we raised our build requirement to a subset of C++11 (VS2012 ?) in 1.87 it might be easier now. I'll be able to check this later if you don't have millions of compilers installed (I do).

  • If this work I think we can get simply rid of CRC32.

Obviously the slight yet biggest annoyance are the IMGUI_HASH_FMT things...

Don't act on this idea yet, but I am wondering if it would make sense adding a helper relying on constructor/destructor to optimally fill a static buffer (with a small number of slots) and return a const char*, so it becomes "%s", ...., ImFormatID(id). and destructor release its slot. It would need to be a function with static return type, as a class constructor won't auto-cast to const char* in an ellipsis function.

@rodrigorc
Copy link
Contributor Author

May want to test this with the Test Suite.

There is a test suite?! I see it now, I'll try and play with it.

I'll be able to check this later if you don't have millions of compilers installed.

I don't have so many, I just tested g++ and clang++, they seem to work fine.

Personally I would prefer to define ImU32 and ImU64 in terms of uint32_t and uint64_t, from <inttypes.h> that exists in C++98, IIRC. Then, instead of "llX" we can use the proper PRIX64 that is guaranteed to exist whenever there is support for 64-bits and do the right thing. That is:

#include <inttypes.h>
#ifdef IMGUI_USE_HASH64
typedef uint64_t ImGuiID;
#define IMGUI_HASH_FMT "%016" PRIX64
#else
typedef uint32_t ImGuiID;
#define IMGUI_HASH_FMT "%08" PRIX32
#endif
// ditto for ImS8, ImU8, ImS16, ImU16, ...

If this work I think we can get simply rid of CRC32.

That would make things simpler. I did the option because I thought maybe there are embedded or old systems out there without proper support for 64 bit integers.

I am wondering if it would make sense adding a helper relying on ... static buffers ...

Why a static buffer? You can use an automatic one and avoid the destructor altogether. Something like this:

struct ImFormatID_
{
    char txt[17];
    ImFormatID_(ImGuiID id)
    {
        sprintf(txt, "%016X", id);
    }
};
#define ImFormatID(id) ImFormatID_(id).txt

This is risky because that macro returns a pointer to a temporary, but that is the beauty of C++, I guess:

printf("%s\n", ImFormatID(0x1234)); //ok
const char *txt = ImFormatID(0x1234); //dangling pointer!

@ocornut
Copy link
Owner

ocornut commented Feb 4, 2023

Didn’t think of a macro wrapper, that’s better for the risk of dangling pointer is too high with this technique.

I think PRIX64 will work just fine on any 64-bit type.
My wonder is about whether its content converged accross compilers since C++11 of if there is a single format type that willl work accross compilers for a 64-bits integer. I’ll try to investigate next week.

@ocornut
Copy link
Owner

ocornut commented Feb 7, 2023

Minor: as part of e816bc6 I merged the signature fix for ImHashXXX functions using ImGuIID seed and the use of 0u in that popup function. May trivially conflicts on your end.

@eric-2
Copy link

eric-2 commented Feb 10, 2023

I just saw this PR and thought i would share my (very similar) patch against the docking branch.
I found 2 local buffers that have to be increased in size for the longer hash values.
The path should apply cleanly to the current docking commit ea45653.
imgui_fnv.patch

@TrianglesPCT
Copy link

TrianglesPCT commented Feb 14, 2023

crc has the advantage of having intrinsic implementations available..(It would be nice if the ImGuiID were 64bit though.)
I have replaced imgui's original CRC with this:

crc_example.txt

^ You would think github of all places could properly handle displaying code, it has a add code button but apparently it does not work properly. Gave up and added it as a text file

@PathogenDavid
Copy link
Contributor

^ You would think github of all places could properly handle displaying code

The code button does inline code by default unless you have multiple lines selected.

```cpp
printf("Multi-line codeblocks use three backticks (with an optional language specifier on the first line.\n");
```

Here's your CRC example properly formatted:

//      32/64 bit test msvc and intel       32/64 bit test clang,gcc,intel(linux)                    
#if (defined(_M_IX86)||defined(_M_X64) || defined(__i386__)||defined(__x86_64__)) && (defined(__SSE4_2__) || defined(__AVX__) ) //msvc doesn't define __SSE4_2__, so use next available __AVX__, which implies sse4.2 support
ImGuiID ImHashData(const void* data_p, size_t data_size, ImU32 seed)
{
    unsigned char* data = (unsigned char*)data_p, * data_end = (unsigned char*)data_p + data_size;

    ImU32 crc = ~seed;

#if defined(_M_X64) || defined(__x86_64__) // 32bit mode doesn't have 64bit crc32
    uint64_t crc64 = crc;
    for (; (data + 8) <= data_end; data += 8)
        crc64 = _mm_crc32_u64(crc64, *(uint64_t*)data);
    crc = (ImU32)crc64;
#endif
    for (; (data + 4) <= data_end; data += 4)
        crc = _mm_crc32_u32(crc, *(ImU32*)data);

    for (; data != data_end; ++data)
        crc = _mm_crc32_u8(crc, *data);

    return ~crc;
}
#else 
//original imgui impl  here

#endif

@PathogenDavid
Copy link
Contributor

I wonder if it might be easier to just add an advanced config option that leaves ImHashData/ImHashStr undefined so that a user-provided implementation can be used instead.

FNV1a would be an improvement over CRC32, but it's hardly state of the art anymore. The 32-bit variant of xxHash is supposedly a much higher quality hash with significantly better performance while still being 32-bits.

xxHash is probably too big to distribute alongside Dear ImGui (although the main library is single header), but an add-on like imgui_freetype might be more appropriate.

@ocornut
Copy link
Owner

ocornut commented Feb 15, 2023

That CPU-assisted CRC32 change seems surprisingly viable to consider as a temporary step.

xxHash is probably too big to distribute alongside Dear ImGui

While not aiming to hard-preventing it, I'd be in favor of not encouraging people to swap that hashing function, nor adding too many compile-time features that will make their way into package managers, bindings, and complexify things etc. If we settle on a nicer hash function, we'd be good enough. Intuitively I imagine that nicer hash may be 64-bits but I haven't deeply investigated trade-offs. Also note our use case are: very small data, very frequent calls (so code size/prologue/epilogue may have meaningful effect), and things may be not benchmarked enough for that favor.

@rodrigorc
Copy link
Contributor Author

That CPU-assisted CRC32 change seems surprisingly viable to consider as a temporary step.

That intrinsic looks great, but note that for ImHashStr you need to check for the ### so we have to either do the strstr() call beforehand or advance one byte at a time with _mm_crc32_u8().

Also, AFAIK _mm_crc32_u* are only available on SSE4.1 or later. Not exactly new, but probably too risky for the default code, unless we include a run-time check and a software.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
label/id and id stack implicit identifiers, pushid(), id stack optimization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants