-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible to hash multiple files in language independent way? #599
Comments
Should we stick with xxHash or use MD5, which is (mostly) what is available on cloud platforms like AWS and GCP? I understand that xxHash is much faster, but if we are going to revamp hashing (related to #589) would it be more useful to store a hash that could be compared on common systems? Related to that, I am not seeing the same results as what is above for files (differences in file writing???) ## which ones are the same for MD5?
digest::digest(file = "test.txt", algo = "md5")
#> [1] "0bee89b07a248e27c83fc3d5951213c1"
digest::digest("abc", algo = "md5")
#> [1] "0aa9c59ea893e51a8cc55e8ea353e592"
digest::digest("abc", algo = "md5", serialize = FALSE)
#> [1] "900150983cd24fb0d6963f7d28e17f72"
## which ones are the same for xxHash? NONE
digest::digest(file = "test.txt", algo = "xxhash64")
#> [1] "e8a1523b824c6e2d"
digest::digest("abc", algo = "xxhash64")
#> [1] "f4acd87c52d4e62b"
digest::digest("abc", algo = "xxhash64", serialize = FALSE)
#> [1] "44bc2cf5ad770999" Created on 2023-02-23 with reprex v2.0.2 In Python, I see: import xxhash
## same as R, but not as each other:
xxhash.xxh64(open("test.txt", "rb").read()).hexdigest()
#> 'e8a1523b824c6e2d'
xxhash.xxh64(b"abc").hexdigest()
#> '44bc2cf5ad770999' import hashlib
## same as R, but not as each other:
hashlib.md5(open("test.txt", "rb").read()).hexdigest()
#> '0bee89b07a248e27c83fc3d5951213c1'
hashlib.md5(b"abc").hexdigest()
#> '900150983cd24fb0d6963f7d28e17f72' This is after doing For concatenating hashes together (to replace digest::digest("abcdef", algo = "md5", serialize = FALSE)
#> [1] "e80b5017098950fc58aad83c8c14978e"
digest::digest(glue::glue_collapse(c("abc", "def")), algo = "md5", serialize = FALSE)
#> [1] "e80b5017098950fc58aad83c8c14978e" Created on 2023-02-23 with reprex v2.0.2 |
I think that the echo command I gave adds a newline character 😬. It might need something like this?: echo -n abc > test.txt
I feel pretty paranoid about bad things that could happen by switching the overall hashing algorithm, but don't have anything specific in mind (I guess it could make comparisons between old-hash and new-hash pin versions hard?). It seems like one potentially big advantage of md5 is that we could use it for names in the pins cache. But since the pins cache isn't flat, but has the structure |
Oh of course, the newline! 🙈 I had assumed you were sharing the actual code you had run and something weird was happening on my end.
Yeah, for sure, me too. This is not a small or minor thing to consider. Dealing with #589 is a big enough project, though, that we should figure out the really most useful way to change the hashing strategy. I do think that there are enough potential advantages to MD5 to consider it; I'll work up an implementation and see how that goes. |
After some more time and seeing what folks' challenges are, we now think that this isn't likely to be a problem for typical pins R + Python use cases. It could come up as a problem when a team wants to write to the same pin from both R and Python (for example, picture someone sometimes wanting to update a parquet file using R and sometimes using Python); this is not a common pattern currently. We can revisit this in the future if it comes up for users. If we do, the main thing to look at is when the current strategy in R serializes before taking the hash (or uses ## different from Python because serializes by default:
digest::digest("abc", algo = "xxhash64")
#> [1] "f4acd87c52d4e62b"
## either of these are OK:
digest::digest(file = "test.txt", algo = "xxhash64")
#> [1] "44bc2cf5ad770999"
digest::digest("abc", algo = "xxhash64", serialize = FALSE)
#> [1] "44bc2cf5ad770999" Created on 2023-04-27 with reprex v2.0.2 In Python: import xxhash
## same:
xxhash.xxh64(open("test.txt", "rb").read()).hexdigest()
#> "44bc2cf5ad770999"
xxhash.xxh64(b"abc").hexdigest()
#> "44bc2cf5ad770999" |
I'm on board with closing this issue, but want to clarify the issue raised here. This issue only concerns differences in hashing when hashing multiple files. Here are the important points of single file hashing (which is not the issue I meant to raise here):
For writing single files, the last piece will produce issues for teams that both want to write from R and python, but is not about hashing (even though they use identical single file hashing strategies, different hashes occur because it's different data written to file). Edit: My note about |
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
Currently, pins seems to use two strategies for hashing:
digest(file=..., algo="xx64")
hash(hashes)
, orhash(url)
One challenge for porting the second hash strategy to python is that it seems
hash
serializes objects first, anddigest
will serialize by default. Below is an example.Example
R code
Python code
Possible fix
I think the following steps should be language independent (but I could be very wrong here):
hash(...)
withdigest(..., serialize=FALSE)
c("abc", "def")
would need to be concatenatedThe text was updated successfully, but these errors were encountered: