Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Re-enabling data versioning by hash only #589

Closed
afshinmhCLG opened this issue Jan 14, 2022 · 20 comments
Closed

Feature Request: Re-enabling data versioning by hash only #589

afshinmhCLG opened this issue Jan 14, 2022 · 20 comments

Comments

@afshinmhCLG
Copy link

Desired behavior:
Data version changes when data changes. For example, repeated execution of pin_write(x, board = myboard, version = T) does not generate different data versions as long as x is the same.

Current behavior:
Data is re-written under a new version each time pin_write is executed even when content, name, description, title, etc. are all the same!

I realize hash of the data is stored so one can identify duplicates, but avoiding data duplication in the first place was at the core of the value pins offered teams.

Feature request:
Given that the time stamp was added to the version signature in version 1.0.1, I am sure it has a use-case, but would it be possible to enable an option where the signature is the hash of the data or even better a hash encompassing data, and other user defined metadata (e.g. description and other tags)?

@robertsehlke
Copy link

+1

I'm just testing out pins, which so far has been very slick with s3 backend, and went looking in the issues if someone else had already asked about this.

For my use case, even just an optional duplicate detection built into pin_write() would serve the purpose. I.e. if the user chooses: calculate the hash, query existing versions, and only create a new version if it is not a duplicate.

This would be a simple QoL option that does not affect current structures.

@machow
Copy link
Collaborator

machow commented Nov 7, 2022

It seems like the big challenge here for pins is that version name is made of two parts ("{timestamp}-{hash}"), and has two jobs:

  • timestamp: allow for quick sorting of pin versions (e.g. to get the latest pin)
  • hash: allow for quick checking whether contents are identical

It seems like versioning by hash only would require a new strategy for tracking timestamp info. For example, on pin_write() we check for a duplicate, and instead a filesystem mv to rename the duplicate version to the updated timestamp.

E.g.

  • a write of data with the hash abc on 2022-02-02
  • verifies a previous version has a matching hash abc
  • moves {old_timestamp}-abc to {new_timestamp}-abc.

It seems like it might require fairly substantial changes, though....!

@robertsehlke
Copy link

Agreed - a simpler way could be to let the user choose not to upload/update at all if there already is a version with the same hash (also keeps external references stable).

@iainmwallace
Copy link

Agreed, this capability would be very useful.

perhaps there could be a helper function such as pin_unique_write that only writes if the dataset is unique.

@amashadihossein
Copy link

amashadihossein commented Nov 7, 2022

et the user choose not to upload/update at all if there already is a version with the same hash (also keeps external references stable).

Exactly! Fundamentally, we have 3 metadata elements here that could be associated with a given data file

  1. Hash (unique id which would be ideal for version)
  2. Time of original store
  3. Time of rewrite

What is needed to provide all the flexibility of querying is to keep metadata fields as separate metadata key:value tracked in a metadata file (within the same board) rather than combining them + repeatedly saving the same file (and fill up storage through proliferation of duplicate files). Pins already is capable of tracking metadata so it shouldn't be a big undertaking to track additional metadata properly rather than getting the version to do double duty.

I have done a wrapper around pins that uses the old pins (which had version as hash, rather than time/hash combo, so you wouldn't resave the same file) and also keeps a separate object (a board level pin actually) that tracks metadata above (and more) and gets updated with each equivalent of pin_write. So if you want to get the latest by time of original storage or the latest attempt at re-write (or any other metadata) you can easily do that. At the time I wrote the wrapper, I considered it a bandaid until the issue was resolved. At this point though, tbh I don't hold my breath that this will change any time soon.

@machow
Copy link
Collaborator

machow commented Nov 7, 2022

Thanks for all the confirmation that this would be useful, and added context! I haven't seen the original pins in action, so it's helpful to hear that this feels like an original feature that got taken out.

it shouldn't be a big undertaking to track additional metadata properly rather than getting the version to do double duty

It seems like a key here is pins is using {timestamp}-{hash} not just as metadata (which is already recorded in the data.txt file), but essentially as a partition (e.g. you can ls timestamp and hash info, without having to open a file, maybe loosely similar to how hive partitions work).

A pins_write_unique() function seems much easier to implement. What do y'all picture pin_unique_write() doing if it hits a duplicate? Throwing an error? (or just printing a message?)

@amashadihossein
Copy link

A pins_write_unique() function seems much easier to implement. What do y'all picture pin_unique_write() doing if it hits a duplicate? Throwing an error? (or just printing a message?)

Thanks, @machow for the follow up! I'd say just update the metadata (and definitely not write as it would be slow even if it is just an overwrite). Maybe if something like verbose = T, we would also get a cli green checkmark that says file version xxx is already pinned (although certainly w or w/o message could work). It also be fantastic if the version could come back as just the hash (no timestamp)

@hadley
Copy link
Member

hadley commented Nov 8, 2022

Just a note that this:

keeps a separate object (a board level pin actually) that tracks metadata above (and more) and gets updated with each equivalent of pin_write.

might work for your use case but is very dangerous in if more than one person might be writing to the board because you'll run the risk of race conditions leading to the metadata store not matching what's actually recorded on disk. This was a major flaw with the design of pins v1 that had to be fixed for v2.

@robertsehlke
Copy link

A pins_write_unique() function seems much easier to implement. What do y'all picture pin_unique_write() doing if it hits a duplicate? Throwing an error? (or just printing a message?)

Just printing a message/warning would already be great imo.

Ideal case would be a return value that contains all metadata fields of the exact written or unwritten-because-duplicated version ( pin_meta outputs for that version), but that would be a deeper change if you want consistency across the regular pin_write as well. Either way, likely best to leave existing metadata untouched (as @hadley noted).

PS: even just exposing the hash calculation as stand-alone function would make writing a wrapper for this trivial, given other utility functions. Maybe that's easy, haven't had the time to dive very deep yet.

@amashadihossein
Copy link

might work for your use case but is very dangerous in if more than one person might be writing to the board because you'll run the risk of race conditions

Thanks, valid point and what's dangerous and what's worth mitigation effort relates to use case/specifics and preferences. Regardless, it seems pins is ok with rewrite of the same content over and over again and offers to keep track of the writes which is useful.

@hadley
Copy link
Member

hadley commented Nov 9, 2022

@robertsehlke I don't think just exposing the hash calculation is sufficient; we'd need some consideration of the length of the hash. IIRC it's currently truncated because the timestamp provides most of the identity; if you're relying on the hash alone you'd need to ensure that the chance of spurious collisions is very low.

@robertsehlke
Copy link

True, and timestamps + shortened hash are generally a nice compromise for human readability.

Does pin_versions parse the folder name or use the data.txt files? If it returned the full hash that would solve the issue, no need to change folder/version naming.

@kmasiello
Copy link

kmasiello commented Nov 17, 2022

Adding some color to the issue. Pinning every version, even if the data is unchanged, is painful in two ways:

  1. A daily scheduled job on Connect that writes a pin will result in a pin with a very long list of versions, even if in reality, the data did not change. This obscures the visibility of when the data actually changes. Users expect that they can pull up a version history of the pin and quickly see the handful of times the data actually changed.
  2. repeated writes of the same data on the Connect server can consume resources. In practice, it's not always a practical or quick fix to just increase storage. The data scientists likely don't have visibility of the fact that their pin writes/rewrites are consuming a lot of storage space. And server admins may not be familiar enough with the data science workflows to appreciate that pin versions are a culprit in why server space is being consumed.

@hadley
Copy link
Member

hadley commented Nov 17, 2022

As far as I can tell, this was never documented behaviour with old versions by pins, and I suspect it was just an implementation detail that folks took advantage of. This isn't to say that we shouldn't make it possible, just to note that we need to carefully design it from the ground up.

@amashadihossein
Copy link

2. repeated writes of the same data on the Connect server can consume resources

In my experience, this is very much a real concern. I have witnessed multiple times connect filling up leading to folks scrambling to allocate more resources (and actual downtime). Of course, connect filling up is not totally eliminated by avoiding duplicates (and there are mitigations for teams to consider) but anything that reduces duplication does alleviate the issue quite a bit.
I understand this is not trivial to address, but (whether in pins or through other packages to come) the benefit of avoiding duplication (at least in some use-cases) makes the effort worthwhile.

@hadley
Copy link
Member

hadley commented Nov 17, 2022

We probably also need to more aggressively advertise pins::pin_versions_prune(). pins_version_prune(n = 20) would be particularly useful if we can avoid recreating pins with the same timestamp.

@kmasiello
Copy link

kmasiello commented Dec 21, 2022

Two helper functions in gist format that may assist for the time being:

  • pin_unique will only pin if the data frame is changed from the most recently-pinned version. Optionally incorporates a pruning value to keep things under control.
  • pin_compare yields the output of diffdf::diffdf() on the specified versions of a pin to see changes.

@juliasilge
Copy link
Member

We have made a change in #735 that addresses this issue, and we would be so happy for any of you to try it out and give feedback! You can install via remotes::install_github("rstudio/pins-r").

The default is now not to write a new version with identical contents, but you can force the write via an optional argument:

library(pins)
b <- board_connect()
#> Connecting to Posit Connect 2023.03.0 at <https://colorado.posit.co/rsc>

b %>% pin_write(1:10, "julia.silge/my-amazing-numbers")
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/my-amazing-numbers'

b %>% pin_write(1:10, "julia.silge/my-amazing-numbers")
#> Guessing `type = 'rds'`
#> ! The hash of pin "julia.silge/my-amazing-numbers" has not changed.
#> • Your pin will not be stored.

b %>% pin_write(1:10, "julia.silge/my-amazing-numbers", force_identical_write = TRUE)
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/my-amazing-numbers'

Created on 2023-05-05 with reprex v2.0.2

For now, this new functionality checks the hash of the pin contents only; the metadata is not hashed or checked. This would mean that right now you need to do something like this to change the metadata without changing the pin contents:

library(pins)
b <- board_connect()
#> Connecting to Posit Connect 2023.03.0 at <https://colorado.posit.co/rsc>

b %>% pin_write(
  1:10, 
  "julia.silge/some-amazing-numbers"
)
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/some-amazing-numbers'

## does not successfully write:
b %>% pin_write(
  1:10, 
  "julia.silge/some-amazing-numbers",
  description = "These numbers are amazing!"
)
#> Guessing `type = 'rds'`
#> ! The hash of pin "julia.silge/some-amazing-numbers" has not changed.
#> • Your pin will not be stored.

## will force write:
b %>% pin_write(
  1:10, 
  "julia.silge/some-amazing-numbers", 
  description = "These numbers are amazing!",
  force_identical_write = TRUE
)
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/some-amazing-numbers'

Created on 2023-05-05 with reprex v2.0.2

We still plan to look at what exactly is stored as pin_hash and see if it makes sense to include some portion of the metadata in what is hashed.

@juliasilge
Copy link
Member

I've outlined the issues with hashing metadata in #739. If you have thoughts on that, please chime in! And if you have further problems or questions, please open a new issue.

@github-actions
Copy link

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators May 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants