External output to encrypted s3 fails with ETag mismatch #2701

Closed
Titousensei opened this issue Oct 31, 2019 · 11 comments
Labels
bug Did we break something? research

Comments

@Titousensei

Repro:

$ dvc run -f license.dvc -d LICENSE.txt -o s3://serving-data/test_dvc/LICENSE.txt "aws s3 cp LICENSE.txt s3://serving-data/test_dvc/LICENSE.txt"
Running command:
	aws s3 cp LICENSE.txt s3://serving-data/test_dvc/LICENSE.txt
upload: ./LICENSE.txt to s3://serving-data/test_dvc/LICENSE.txt
ERROR: failed to run command - ETag mismatch detected when copying file to cache! (expected: '780ec3e2106ff7cad2772f088e894bef', actual: 'ef33931c766641805c921aa9edb160c8')

This is on a fresh dvc init where I configured the remote and cache following the tutorial. I also verified that the file is uploaded properly, and the original file and the cached one have the same MD5.

$ aws s3 ls s3://serving-data/test_dvc/
                           PRE cache/
2019-10-30 17:22:11         95 LICENSE.txt
$ aws s3 ls s3://serving-data/test_dvc/cache/78/
2019-10-30 17:22:13         95 0ec3e2106ff7cad2772f088e894bef
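The listing above is consistent with a cache layout keyed by the file's MD5: the "expected" value in the error message, 780ec3e2106ff7cad2772f088e894bef, splits into the prefix 78/ and the object name 0ec3e2106ff7cad2772f088e894bef. A minimal sketch of that mapping follows. It is inferred from this listing only, not taken from DVC's source, and the function names are hypothetical.

```python
import hashlib


def cache_key(md5_hex: str) -> str:
    """Split an MD5 hex digest into '<first two chars>/<remaining chars>'.

    Assumption: this mirrors the layout seen in the `aws s3 ls` output above.
    """
    return f"{md5_hex[:2]}/{md5_hex[2:]}"


def cache_key_for_bytes(data: bytes) -> str:
    """Compute the MD5 of the content and derive the cache key from it."""
    return cache_key(hashlib.md5(data).hexdigest())
```

With the digest from the error message, cache_key returns the path visible under s3://serving-data/test_dvc/cache/.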

My config:

$ cat .dvc/config
['remote "s3cache"']
url = s3://serving-data/test_dvc/cache
sse = aws:kms
[cache]
s3 = s3cache

Please provide information about your setup
DVC version (i.e. dvc --version), platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB (Linux), RPM (Linux))
dvc --version: 0.66.1
Mac (Catalina), installation by pip.

@shcheklein
Member

@Titousensei thanks for the report! Quick question: when you do this a second time from a clean repo, does the error message show the same ETags, or do you get new ones every time?

@shcheklein shcheklein added bug Did we break something? research labels Oct 31, 2019
@Titousensei
Author

I get new ETags every time (for the same file with the exact same command).

@shcheklein
Member

shcheklein commented Oct 31, 2019

@Titousensei unfortunately, this means that the current DVC implementation does not support external-cache logic, and possibly external dependencies (?), on top of buckets encrypted with SSE-C or SSE-KMS, because ETags there are not stable for the same content. This affects cases where you want to version outputs that you write to S3, or where you do something like dvc add s3://encrypted-bucket/file.

We'll have to update the docs with this caveat and think about workarounds/solutions :(.

I see a few alternatives here off the top of my head, and hopefully other folks will chime in:

  1. Try to switch to SSE-S3 encryption. From here (ETag part):

Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.

You don't have to sacrifice encryption in this case. This is good. I'm not sure how SSE-S3 is different from other options. In this case most likely ETags will be stable, but we will need to check if multipart stuff is stable as well (@MrOutis could you do some experiments with SSE-S3 encrypted buckets and large files?).

  2. Obviously, disable encryption. Not nice.

  3. Don't use external outputs at all. Write them locally, capture them and then upload to S3?

It also really depends on your use case. Would be great if you could describe it a little bit so that we can figure out our options faster.
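To illustrate the ETag behaviour quoted in option 1, here is a minimal sketch of how an ETag can be predicted locally for plaintext or SSE-S3 objects. The single-part case follows directly from the AWS text quoted above. The multipart formula (MD5 over the concatenated binary MD5 digests of the parts, plus a "-<part count>" suffix) is widely observed behaviour and is an assumption here, not something confirmed in this thread. The function names are hypothetical. Neither function applies to SSE-C or SSE-KMS buckets, where the ETag is not an MD5 of the object data.

```python
import hashlib


def simple_etag(data: bytes) -> str:
    """ETag expected for a single-part upload (plaintext or SSE-S3): plain MD5."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    """Assumed multipart ETag: MD5 of the parts' binary MD5s, then '-<parts>'."""
    # Split the payload into fixed-size parts; an empty payload is one empty part.
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)] or [b""]
    # Concatenate the raw (binary) digest of each part and hash that again.
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"
```

Note that under this assumption the multipart value depends on the part size as well as the content, which is why large files need separate checking.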

@shcheklein
Member

@Titousensei Edited the previous answer a bit to be more precise. Again, thanks for sharing all this information. Let's try to figure out an option that works best for you before we find a fix for this (if a fix is even possible without downloading the data and calculating checksums locally, which would be a deal-breaker in a lot of cases).

@Titousensei
Author

@shcheklein Thanks for the quick answers.

A little more about my use case: I'm working for a company that does a lot of machine learning, and we have a large custom script (Python) to do data preparation, checkpointing, training and evaluation. We have another custom script to publish our models to s3 and metadata to git. I'm exploring DVC as an option to replace those scripts and improve our current pipeline because it's becoming hard to maintain, the checkpointing code is limited and buggy, and model versioning and reproducibility are non-existent. DVC looks like a perfect fit, particularly because we can save large intermediate files (to s3 ideally) like the ones produced during or at the end of the data preparation steps (for example BERT features).

We are required to have encryption for compliance reasons with our customers' data (our training data). I'm not sure if it's an option to change our encryption method; I'll have to ask.

For the prototype we can use local files, but it would be nice to be able to share prepared data between users. If s3 is not an option, we'll have to use a different shared storage (like ssh in the office), and we'll have to figure out how to publish our models to s3 separately.

I'm not sure I understand option 3. Is there a way to upload the files to s3 and let DVC know that it's an output so that the next step can use it as an input?

Alternatively, is it possible to use a non-cached external output (with -O, I did not try that yet)?

@Titousensei
Author

Maybe I'm using DVC wrong. I'll try not to use external outs, but instead output the data locally and push to s3 like explained in the "Sharing Data" deep dive tutorial.

@shcheklein
Member

@Titousensei

but it would be nice to be able to share prepared data between users

The regular way to do this is to work with files locally and then use git push + dvc push to save them to remote storage (it can be an encrypted S3 bucket); your team then uses git pull + dvc pull to get them when they are needed.

You need to use external outputs/external dependencies (when you do stuff like -d s3:// or -o s3:// or dvc add s3://, etc.) only when you read/write data directly to S3 in your scripts for some reason. Usually, that is when the data (even intermediate artifacts) is too large to be cached locally.

Maybe I'm using DVC wrong. I'll try not to use external outs, but instead output the data locally and push to s3 like explained in the "Sharing Data" deep dive tutorial.

Yep! I would def try that first.

Alternatively, is it possible to use a non-cached external output

cached vs non-cached (-o vs -O): this works the same for both local and external outputs, and depends on whether you want to version them (i.e. keep, aka cache, previous versions) or you are fine overwriting them. Not using external cache might help in your case I think, but it can still be a bit fragile since ETags are not stable.

@efiop
Contributor

efiop commented Nov 4, 2019

For the record, external deps (-d) and external non-cached outputs (-O, capital O) should work fine, as they don't move things around and only rely on the ETag as an indicator of an unchanged file.
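A rough sketch of what "relying on the ETag as an indicator of an unchanged file" could look like is shown below. It is not DVC's implementation: the function names are hypothetical, and the only real API assumed is the head_object call of a boto3 S3 client, whose response carries the ETag wrapped in double quotes.

```python
def current_etag(s3_client, bucket: str, key: str) -> str:
    """Fetch the object's ETag via HeadObject and strip the surrounding quotes."""
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response["ETag"].strip('"')


def is_unchanged(s3_client, bucket: str, key: str, recorded_etag: str) -> bool:
    """Treat the object as unchanged only if its ETag equals the recorded one."""
    return current_etag(s3_client, bucket, key) == recorded_etag
```

In real use the client would come from boto3.client("s3"). As reported earlier in this thread, on an SSE-KMS bucket the ETag changes on every upload of the same file, so re-uploading identical content would be detected as a change.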

One more thing that I would look into is mounting your s3 bucket to your local fs (e.g. https://github.com/s3fs-fuse/s3fs-fuse). That way you could reference your files as local files, which would rely on regular md5s calculated on the spot. There are some possible drawbacks (e.g. ETag is free, but MD5 takes time to calculate, plus I'm not sure how the encryption part is handled by FUSE), but I would take a look at it just in case.

@shcheklein
Member

as they don't move things around, and only rely on ETag as an indicator of an unchanged file, so it should work fine.

oh, I would be extra careful here. At least we need to mention explicitly that this checksum is tied to the location as well as the content.

@shcheklein
Member

Created a ticket to update our docs with these findings: iterative/dvc.org#774

@shcheklein
Member

@Titousensei Eric, any updates on your end? It looks like there is not much we can do with buckets encrypted this way in terms of supporting external dependencies/outputs. I'm closing this one for now. Please, let us know if you have more questions or other issues with DVC.


3 participants