hashString in SRV record generation might lead to collisions #157
Comments
what about using a 64-bit CRC?
|
I guess that's the way to go. Though I question those hash names in general now. From a user point of view I would prefer the task-id (somehow sanitized) in the canonical A-record. |
ok, what if we suffixed the {sanitized-task-id} with the hash?
|
what's the sense of the hash then? |
uniqueness comes from the framework id + task id |
there was a case that christos had run into where he wanted/needed more for [...]
|
/cc @kozyraki |
What do you think about taking a sha1 hash, encoding it to the extended hex alphabet, and truncating it to 6 characters [3 bytes]? The process of generating these hashes is done "offline" in records/generator.go, so I'm not overly concerned about the performance of sha1, especially given that modern CPUs can do thousands of hashes a second and the typical reload time is under a second. Trivially testing this with the built-in test with 1e9 entries shows no obvious collisions. |
I think we need to understand the scope of this uniqueness requirement a bit more before proceeding. That said, here's my two cents:
https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function I'd like to propose that we use FNV-64 and XOR-fold the results (to mitigate the diffusion problem). Sticky-state should not be a problem for us since we'll never feed zeros into the hash. This gives us a reasonable bit space (32 bits), one that's not artificially constrained and that, once encoded, won't be too long to splice into a record name. And it's fast. I'd also rather see us go with a zbase-32 encoding: it's less "verbose" than extended-hex encoding, and if you ever had to type it in, zbase-32 was designed to minimize human error w/ respect to transcription. |
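The FNV proposal above could look roughly like the sketch below. It assumes the FNV-1a variant (`fnv.New64a`), folds the 64-bit sum to 32 bits by XOR-ing the high and low halves, and encodes the result with the z-base-32 alphabet via the standard `encoding/base32` package. The names `foldedFNV` and `encodeLabel` are hypothetical, and reusing the stdlib encoder with a custom alphabet only approximates a dedicated z-base-32 implementation.

```go
package main

import (
	"encoding/base32"
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// z-base-32 alphabet, plugged into the stdlib base32 encoder without padding.
const zbase32Alphabet = "ybndrfg8ejkmcpqxot1uwisza345h769"

var zbase32 = base32.NewEncoding(zbase32Alphabet).WithPadding(base32.NoPadding)

// foldedFNV hashes s with 64-bit FNV-1a and XOR-folds the result to 32 bits.
func foldedFNV(s string) uint32 {
	h := fnv.New64a()
	h.Write([]byte(s)) // hash.Hash writes never return an error
	sum := h.Sum64()
	return uint32(sum>>32) ^ uint32(sum)
}

// encodeLabel renders the 32-bit value as a 7-character z-base-32 string.
func encodeLabel(v uint32) string {
	var buf [4]byte
	binary.BigEndian.PutUint32(buf[:], v)
	return zbase32.EncodeToString(buf[:])
}

func main() {
	// The input string is made up for the example.
	fmt.Println(encodeLabel(foldedFNV("task-1.framework-1")))
}
```

Four bytes encode to seven base32 characters, one more than the six-character SHA-1 variant discussed in the thread.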
So, 4-byte FNV64A + xor-fold doesn't have any collisions using 1M samples. Similarly, a 24-bit SHA1 doesn't have any collisions in this sample I ran. At 10M samples, they both showed collisions. In the future, I'd like to tune the quickcheck test to use a deterministic dataset (set the seed manually). Benchmarks:
The benchmark for SHA included hashing the string, converting it to base32 encoding, and truncating. Opinions? |
fwiw, regular base32 encoding can leave strange tail chars that probably [...] is it worth testing 8-byte FNV-1a 128-bit + xor-fold for comparison (both [...]
|
The extended hex alphabet is safe for DNS. It uses - instead of = for the trailing chars as well. zbase32 seems reasonable as well. I don't have any strong opinions, as I imagine the cost and complexity of applying the encoding for either FNV or SHA are going to be roughly the same. I don't think using 128-bit FNV is a good idea, because it's not included in the Go stdlib. Although it's fairly trivial to implement our own or depend on a third-party library, there is already an excellent selection of hashing algorithms available to us in the stdlib, and we should take advantage of them rather than going off the beaten path. |
So, it seems like you need another byte for FNV64a to get the same likelihood of collisions as SHA1. Given that name length is precious in DNS, and all of this is precomputed, I lean toward using SHA1, especially because there's less code involved. In fact, we can make the hashString function much simpler:
|
ok, sgtm. we can revisit performance if it becomes a problem later |
See: #356 |
hashString uses the Fowler–Noll–Vo algorithm (http://en.wikipedia.org/wiki/Fowler–Noll–Vo_hash_function) to hash the task id to 16-bit numbers. 16 bits are prone to collisions with a few hundred tasks (by the birthday paradox, a collision becomes about 50% likely with only 300 tasks!). We have to go to at least 32 bits, but even then we have a >1% collision probability for 10000 tasks.
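The collision estimates quoted here can be checked with the usual birthday approximation, p ≈ 1 − exp(−n(n−1) / (2·2^bits)). The sketch below assumes uniformly distributed hash values. The function name `collisionProbability` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability approximates the chance that at least two of n
// uniformly random values collide in a space of 2^bits values, using
// p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)).
func collisionProbability(n float64, bits uint) float64 {
	space := math.Pow(2, float64(bits))
	// -Expm1(-x) equals 1 - exp(-x) but stays accurate for small x.
	return -math.Expm1(-n * (n - 1) / (2 * space))
}

func main() {
	fmt.Printf("16 bits,   300 tasks: %.4f\n", collisionProbability(300, 16))
	fmt.Printf("32 bits, 10000 tasks: %.4f\n", collisionProbability(10000, 32))
}
```

With these inputs the approximation gives roughly 0.50 for 16 bits with 300 tasks and roughly 0.012 for 32 bits with 10000 tasks.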