External output to encrypted s3 fails with ETag mismatch #2701
Comments
@Titousensei thanks for the report! Quick question: when you do this a second time from the clean repo, does it show the same ETags in the error message, or do you get new ones every time? |
I get new ETags every time (for the same file with the exact same command). |
@Titousensei unfortunately, it means that the current DVC implementation does not support external cache related logic and external dependencies (?) (when you want to version outputs that you write to S3, or you do something like We'll have to update the docs with this caveat and think about workarounds/solutions :(. I see a few alternatives here off the top of my head, and hopefully other folks will chime in:
You don't have to sacrifice encryption in this case. This is good. I'm not sure how SSE-S3 is different from other options. In this case most likely
It also really depends on your use case. Would be great if you could describe it a little bit so that we can figure out our options faster. |
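As a rough illustration of why changing ETags are a problem here, the sketch below checks whether an ETag can be treated as a content checksum at all. It relies on general S3 behavior described in AWS's documentation, not on anything DVC-specific: an ETag is the MD5 of the object data only for single-part uploads that are unencrypted or encrypted with SSE-S3, while multipart uploads carry a `-N` suffix and SSE-KMS or SSE-C objects get an ETag that is not an MD5 digest. The function names are hypothetical and this is not DVC's implementation.

```python
import hashlib
import re

# A plain MD5 digest is exactly 32 lowercase hex characters.
_MD5_HEX = re.compile(r"[0-9a-f]{32}")
# Multipart uploads produce "<32 hex chars>-<number of parts>".
_MULTIPART = re.compile(r"[0-9a-f]{32}-\d+")


def normalize_etag(etag):
    """Strip the surrounding quotes S3 returns and lowercase the value."""
    return etag.strip().strip('"').lower()


def is_multipart_etag(etag):
    """True if the ETag has the multipart form, which is not a plain MD5."""
    return _MULTIPART.fullmatch(normalize_etag(etag)) is not None


def etag_matches_content(etag, data):
    """True only if the ETag is a plain MD5 that equals the MD5 of `data`."""
    tag = normalize_etag(etag)
    if _MD5_HEX.fullmatch(tag) is None:
        return False
    return tag == hashlib.md5(data).hexdigest()
```

If the same bytes yield a different ETag on every upload, a check like `etag_matches_content` can never succeed, so the ETag cannot serve as a stable checksum for that object.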
@Titousensei Edited the previous answer a bit to be more precise. Again, thanks for sharing all this information. Let's try to figure out an option that works best for you before we find a fix for this (if it's possible w/o downloading the data and calculating checksums locally, which is a deal-breaker in a lot of cases). |
@shcheklein Thanks for the quick answers. A little bit more about my use case: I'm working for a company that does a lot of machine learning and we have a large custom script (python) to do data preparation, checkpointing, training and evaluation. We have another custom script to publish our models to s3 and metadata to git. I'm exploring DVC as an option to replace those scripts and improve our current pipeline because it's becoming hard to maintain, the checkpointing code is limited and buggy, and model versioning and reproducibility are nonexistent. DVC looks like a perfect fit, particularly because we can save large intermediate files (to s3 ideally) like the ones produced during or at the end of the data preparation steps (for example Bert features). We are required to have encryption for compliance reasons with our customers' data (our training data). I'm not sure if it's an option to change our encryption method, I'll have to ask. For the prototype we can use local files, but it would be nice to be able to share prepared data between users. If s3 is not an option, we'll have to use a different shared storage (like ssh in the office), and we'll have to figure out how to publish our models to s3 separately. I'm not sure I understand option 3. Is there a way to upload the files to s3 and let DVC know that it's an output so that the next step can use it as an input? Alternatively, is it possible to use a non-cached external output (with |
Maybe I'm using DVC wrong. I'll try not to use external outs, but instead output the data locally and push to s3 like explained in the "Sharing Data" deep dive tutorial. |
The regular way to do this is to work with files locally and then use You need to use external outputs/external dependencies (when you do stuff like
Yep! I would def try that first.
cached vs non-cached ( |
For the record, external deps ( One more thing that I would look into is mounting your s3 bucket to your local fs (e.g. https://github.com/s3fs-fuse/s3fs-fuse). That way you could reference your files as local files, which would rely on regular md5s calculated on the spot. There are some possible drawbacks (e.g. ETag is free, but MD5 takes time to calculate, plus I'm not sure how the encryption part is handled by the FUSE), but I would take a look at it just in case. |
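To make the "regular md5s calculated on the spot" point concrete, here is a minimal sketch of computing an MD5 over a local (or FUSE-mounted) file in chunks. It only illustrates the cost trade-off mentioned above, since every byte has to be read; `file_md5`, `same_content`, and the chunk size are hypothetical choices, not DVC's actual code.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same MD5."""
    return file_md5(path_a) == file_md5(path_b)
```

The same check can be used to confirm that an original file and a cached copy are identical. On a mounted bucket the read goes over the network, so hashing large files this way can be slow.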
Oh, I would be extra careful here. At the very least, we need to mention explicitly that this checksum is tied to the location as well as the content. |
Created a ticket to update our docs with these findings: iterative/dvc.org#774 |
@Titousensei Eric, any updates on your end? It looks like there is not much we can do with buckets encrypted this way in terms of supporting external dependencies/outputs. I'm closing this one for now. Please, let us know if you have more questions or other issues with DVC. |
Repro:
This is on a fresh dvc init where I configured the remote and cache following the tutorial. I also verified that the file is uploaded properly, and the original file and the cached one have the same MD5.
My config:
Please provide information about your setup
DVC version (i.e. dvc --version), platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB (Linux), RPM (Linux)):
dvc --version: 0.66.1
Mac (Catalina), installation by pip.