Feature Request: Re-enabling data versioning by hash only #589
Comments
+1 I'm just testing out pins, which so far has been very slick with the S3 backend, and went looking through the issues to see whether someone else had already asked about this. For my use case, even just an optional duplicate detection built into This would be a simple QoL option that does not affect current structures. |
It seems like the big challenge here for pins is that version name is made of two parts (
It seems like versioning by hash only would require a new strategy for tracking timestamp info. For example, on E.g.
It seems like it might require fairly substantial changes, though....! |
Agreed - a simpler way could be to let the user choose not to upload/update at all if there already is a version with the same hash (also keeps external references stable). |
Agreed, this capability would be very useful. Perhaps there could be a helper function, such as pin_unique_write, that only writes if the dataset is unique. |
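The "only write if the content is new" idea from the last two comments can be sketched in base R. This is a minimal illustration under stated assumptions, not pins code: `content_hash()` and `write_if_changed()` are hypothetical names, the "board" is just an in-memory list, and MD5 over the serialized object stands in for whatever hash pins actually computes.

```r
# Hypothetical helper, not part of pins: fingerprint an R object with base R only.
content_hash <- function(x) {
  path <- tempfile(fileext = ".bin")
  on.exit(unlink(path), add = TRUE)
  # serialize() gives the same bytes for the same object within one R session
  writeBin(serialize(x, connection = NULL), path)
  unname(tools::md5sum(path))
}

# Hypothetical in-memory "board": a list of versions, each with data and hash.
# A new version is appended only when the hash differs from the latest one.
write_if_changed <- function(store, x) {
  new_hash <- content_hash(x)
  n <- length(store)
  if (n > 0 && identical(store[[n]]$hash, new_hash)) {
    message("Hash unchanged; skipping write.")
    return(store)
  }
  store[[n + 1]] <- list(hash = new_hash, data = x, created = Sys.time())
  store
}

store <- list()
store <- write_if_changed(store, 1:10)
store <- write_if_changed(store, 1:10)  # skipped: same content
store <- write_if_changed(store, 1:11)  # written: content changed
length(store)
```

Comparing only against the latest version keeps existing references stable, but it does not detect content that reverts to an older version; that would need a lookup across all stored hashes.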
Exactly! Fundamentally, we have 3 metadata elements here that could be associated with a given data file
What is needed to provide full querying flexibility is to keep these metadata fields as separate key:value entries in a metadata file (within the same board), rather than combining them into the version and repeatedly saving the same file, which fills up storage with duplicate files. Pins is already capable of tracking metadata, so it shouldn't be a big undertaking to track additional metadata properly rather than making the version do double duty. I have written a wrapper around pins that uses the old pins (which used the hash as the version, rather than a time/hash combination, so the same file was not re-saved) and also keeps a separate object (a board-level pin, actually) that tracks the metadata above (and more) and gets updated with each equivalent of |
Thanks for all the confirmation that this would be useful, and added context! I haven't seen the original pins in action, so it's helpful to hear that this feels like an original feature that got taken out.
It seems like a key here is pins is using A |
Thanks, @machow for the follow up! I'd say just update the metadata (and definitely not write as it would be slow even if it is just an overwrite). Maybe if something like |
Just a note that this:
might work for your use case but is very dangerous if more than one person might be writing to the board, because you run the risk of race conditions leading to the metadata store not matching what's actually recorded on disk. This was a major flaw in the design of pins v1 that had to be fixed for v2. |
Just printing a message/warning would already be great imo. Ideal case would be a return value that contains all metadata fields of the exact written or unwritten-because-duplicated version ( PS: even just exposing the hash calculation as stand-alone function would make writing a wrapper for this trivial, given other utility functions. Maybe that's easy, haven't had the time to dive very deep yet. |
Thanks, valid point. What counts as dangerous, and what is worth the mitigation effort, depends on the use case and on preferences. Regardless, it seems pins is OK with rewriting the same content over and over again and offers to keep track of the writes, which is useful. |
@robertsehlke I don't think just exposing the hash calculation is sufficient; we'd need some consideration of the length of the hash. IIRC it's currently truncated because the timestamp provides most of the identity; if you're relying on the hash alone you'd need to ensure that the chance of spurious collisions is very low. |
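To put rough numbers on the collision concern, the birthday-bound approximation can be evaluated directly. This is only a sketch: the prefix lengths below are illustrative assumptions, not the length pins actually uses, and `collision_prob()` is a hypothetical helper that assumes uniformly distributed hashes.

```r
# Birthday-bound approximation: probability that at least two of n distinct
# contents share the same truncated hash, assuming uniformly distributed hashes.
collision_prob <- function(n, hex_chars) {
  space <- 16^hex_chars                 # number of possible truncated hashes
  -expm1(-n * (n - 1) / (2 * space))    # 1 - exp(-n(n-1)/(2N)), computed stably
}

collision_prob(100, 5)    # assumed 5-character hex prefix, 100 distinct versions
collision_prob(100, 16)   # assumed 16-character hex prefix, 100 distinct versions
```

Under these assumptions, the short prefix gives a collision chance a little under half a percent for 100 distinct versions, while the longer prefix makes it negligible, which is why a hash-only version name would likely need more characters than a timestamp-plus-hash one.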
True, and timestamps + shortened hash are generally a nice compromise for human readability. Does |
Adding some color to the issue. Pinning every version, even if the data is unchanged, is painful in two ways:
|
As far as I can tell, this was never documented behaviour in old versions of pins, and I suspect it was just an implementation detail that folks took advantage of. This isn't to say that we shouldn't make it possible, just to note that we need to carefully design it from the ground up. |
In my experience, this is very much a real concern. I have witnessed Connect filling up multiple times, leading to folks scrambling to allocate more resources (and to actual downtime). Of course, avoiding duplicates does not totally eliminate the risk of Connect filling up (and there are mitigations for teams to consider), but anything that reduces duplication does alleviate the issue quite a bit. |
We probably also need to more aggressively advertise |
Two helper functions in gist format that may assist for the time being:
|
We have made a change in #735 that addresses this issue, and we would be so happy for any of you to try it out and give feedback! You can install via The default is now not to write a new version with identical contents, but you can force the write via an optional argument:
library(pins)
b <- board_connect()
#> Connecting to Posit Connect 2023.03.0 at <https://colorado.posit.co/rsc>
b %>% pin_write(1:10, "julia.silge/my-amazing-numbers")
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/my-amazing-numbers'
b %>% pin_write(1:10, "julia.silge/my-amazing-numbers")
#> Guessing `type = 'rds'`
#> ! The hash of pin "julia.silge/my-amazing-numbers" has not changed.
#> • Your pin will not be stored.
b %>% pin_write(1:10, "julia.silge/my-amazing-numbers", force_identical_write = TRUE)
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/my-amazing-numbers'
Created on 2023-05-05 with reprex v2.0.2
For now, this new functionality checks the hash of the pin contents only; the metadata is not hashed or checked. This would mean that right now you need to do something like this to change the metadata without changing the pin contents:
library(pins)
b <- board_connect()
#> Connecting to Posit Connect 2023.03.0 at <https://colorado.posit.co/rsc>
b %>% pin_write(
  1:10,
  "julia.silge/some-amazing-numbers"
)
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/some-amazing-numbers'
## does not successfully write:
b %>% pin_write(
  1:10,
  "julia.silge/some-amazing-numbers",
  description = "These numbers are amazing!"
)
#> Guessing `type = 'rds'`
#> ! The hash of pin "julia.silge/some-amazing-numbers" has not changed.
#> • Your pin will not be stored.
## will force write:
b %>% pin_write(
  1:10,
  "julia.silge/some-amazing-numbers",
  description = "These numbers are amazing!",
  force_identical_write = TRUE
)
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/some-amazing-numbers'
Created on 2023-05-05 with reprex v2.0.2
We still plan to look at what exactly is stored as |
I've outlined the issues with hashing metadata in #739. If you have thoughts on that, please chime in! And if you have further problems or questions, please open a new issue. |
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
Desired behavior:
Data version changes when data changes. For example, repeated execution of pin_write(x, board = myboard, version = T) does not generate different data versions as long as x is the same.
Current behavior:
Data is re-written under a new version each time pin_write is executed even when content, name, description, title, etc. are all the same!
I realize hash of the data is stored so one can identify duplicates, but avoiding data duplication in the first place was at the core of the value pins offered teams.
Feature request:
Given that the time stamp was added to the version signature in version 1.0.1, I am sure it has a use-case, but would it be possible to enable an option where the signature is the hash of the data or even better a hash encompassing data, and other user defined metadata (e.g. description and other tags)?
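A signature that covers both the data and user-defined metadata could look roughly like the following. This is a sketch under assumptions, not how pins computes its hash: `signature_hash()` is a hypothetical base R helper, and MD5 over the serialized object is just a stand-in.

```r
# Hypothetical helper, not part of pins: one hash over data plus user metadata.
signature_hash <- function(x, metadata = list()) {
  # Sort metadata by name so that field order does not change the signature.
  if (length(metadata) > 0) metadata <- metadata[order(names(metadata))]
  path <- tempfile(fileext = ".bin")
  on.exit(unlink(path), add = TRUE)
  writeBin(serialize(list(data = x, metadata = metadata), connection = NULL), path)
  unname(tools::md5sum(path))
}

sig_a <- signature_hash(1:10, list(description = "These numbers are amazing!", title = "Numbers"))
sig_b <- signature_hash(1:10, list(title = "Numbers", description = "These numbers are amazing!"))
sig_c <- signature_hash(1:10, list(description = "A different description", title = "Numbers"))

identical(sig_a, sig_b)  # same data and metadata, different field order
identical(sig_a, sig_c)  # same data, different description
```

With a signature like this, changing only the description would produce a new version, while re-writing identical data and metadata would not.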