Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot set custom metadata with unicode chars #431

Closed
jboavida opened this issue May 5, 2017 · 6 comments
Closed

Cannot set custom metadata with unicode chars #431

jboavida opened this issue May 5, 2017 · 6 comments

Comments

@jboavida
Copy link

jboavida commented May 5, 2017

Apparently, it's not possible to use gsutil -h ... cp original gs://bucket/destination to set custom metadata including unicode chars.

I tried it from both a Debian 8 image (where $LANG is en_US.UTF-8) and a container optimized image (version 58, just released, where $LANG is C.UTF-8 and toolbox --setenv=LANG=$LANG does set LANG within the toolbox).

For example:

touch b
gsutil -h "x-goog-meta-opts: ãç" cp b gs://$SOME_BUCKET/

gives the following error message:

Failure: Field value encountered non-ASCII string ' \xc3\xa3\xc3\xa7': 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128).

I also attempted to use gsutil setmeta to set the metadata after the fact:

touch b
gsutil cp b gs://$SOME_BUCKET/
gsutil setmeta -h "x-goog-meta-opts: ãç" gs://$SOME_BUCKET/b

And this works. However, it has a different limitation:

touch b
gsutil cp b gs://$SOME_BUCKET/
gsutil setmeta -h "x-goog-meta-opts: {\"ã\":\"ç\"}" gs://$SOME_BUCKET/b

Here the error is different:

CommandException: Invalid argument: must be either header or header:value (x-goog-meta-opts: {"ã":"ç"})

In this version, : cannot appear in the metadata value. I tried, without success, to find some escaping. However, in all variants I tried the metadata (as listed with gsutil ls -L) is saved with the escapes.

As far as I can tell (for example, https://cloud.google.com/storage/docs/gsutil/addlhelp/WorkingWithObjectMetadata states “x-goog-meta- fields can have data set to arbitrary Unicode values. All other fields must have ASCII values.”) this is not the intended behavior. It is caused by https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/commands/setmeta.py#L283-L287. If a : in headers other than custom metadata would be detected later on, maybe one could use md_arg.partition(':') instead?

I was not able to figure why unicode is being rejected by gsutil -h ... cp ..., though.

The ultimate goal is to save JSON with non-ASCII strings in custom metadata, hence the need for both unicode and colons. Maybe there's some obvious workaround that I'm missing?

@houglum
Copy link
Collaborator

houglum commented May 9, 2017

After a bit of digging, it looks like this was done intentionally -- header fields and their values should be ASCII-encoded, per the HTTP spec. Because we send metadata values as headers when using the XML API (for both gs:// and s3:// buckets), and we'd like to keep behavior between the two APIs as consistent as possible, we shouldn't attempt to convert from ASCII to UTF-8. Looking at issues in the Boto3 library, it seems that S3 metadata obeys this as well: boto/botocore#861

This is a documentation inconsistency on our part. The correct documentation can be found at https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogmeta, which states:

Note: All custom headers and their associated values must contain only printable US-ASCII characters.

W.r.t the second issue, I agree that using partition would be a better approach in that this would allow : characters in the value. I'll get a commit out soon to update the gsutil docs to correspond to our other docs, and to begin using partition instead of split.

@houglum
Copy link
Collaborator

houglum commented May 9, 2017

Hm. As it is now, we actually allow these characters in setmeta, but not when specified via the top-level -h flag. Interesting. I'd like to not break backwards compatibility with previous versions of gsutil, but will also intensely frown upon supporting something that's explicitly advised against :(

Additionally, there seems to be an inconsistency for the setmeta command in how we transmit these characters... depending on if you're using the XML or JSON API. The JSON API seems to send the characters in utf-8 encoded string format, ãç (or \xc3\xa3\xc3\xa7). When using the XML API, however, it seems we're sending the URL-decoded version of this -- the ASCII characters representing the hex values for each utf-8-encoded byte: %C3%A3%C3%A7 (replacing % with \x gives us '\xc3\xa3\xc3\xa7', which is what the Python2 interpreter gives us when we type 'ãç'). And if you're consistent with your APIs across set/get operations, they print correctly to the screen (%C3%A3%C3%A7 will get converted and printed as ãç if using the XML API, but will show up as %C3%A3%C3%A7 when listing metadata with the JSON API). Neat.

@houglum
Copy link
Collaborator

houglum commented May 9, 2017

After a bit of digging, it's actually pretty deep in the Boto code that the header value is being url-encoded, via the urllib.quote() method: https://github.com/boto/boto/blob/315b76e01e65fb742cbc14170e5207d648bc649a/boto/connection.py#L372

GCS's XML API will actually handle the non-escaped characters just fine when passed directly via cURL:

$ curl -X PUT \
-H '<redacted auth header>' \
-H 'Host: storage.googleapis.com' \
-H 'x-goog-meta-opts: ãç' \
-H 'x-goog-metadata-directive: REPLACE' \
-H 'content-length: 0' \
"https://storage.googleapis.com/<bucket>/<object>"

...verified via gsutil ls -L <object> using the JSON API, which prints the characters ãç. But, given that this behavior has been around for quite some time and some folks are likely to rely on it, I'm hesitant to change it.

@houglum
Copy link
Collaborator

houglum commented May 9, 2017

Regardless of all this above, I'd say the partition() change is worth putting in there for now. But, I'd still warn against using non-ASCII characters for custom metadata, unless you're sure that nothing that depends on it will fetch the metadata using anything except the JSON API -- these values are passed in the resource body, rather than an HTTP header, via the JSON API, so it having unicode there doesn't go against any specs. While the XML API seems to work with such characters currently, encoding/decoding behavior isn't consistent across clients and seems susceptible to breaking.

@jboavida
Copy link
Author

jboavida commented May 9, 2017

Thank you for the suggestions and analysis.

I confess I had wondered why non-ascii characters would be safe in headers, and just assumed they were encoded somehow by gsutil. But even if that is (or were) the case, or if I follow your suggestion with curl and PUT (which is very tempting), as you point out I'd be depending on a fixed client. Maybe the wisest choice is if I always encode the string before passing it on to gsutil.

I'd like to not break backwards compatibility with previous versions of gsutil, but will also intensely frown upon supporting something that's explicitly advised against :(

Maybe some warning (about the reliance on a specific client behavior or other risks) in the documentation would alleviate the issues and not require changes?

Thanks again for looking into this. (I'll be happy to give feedback on documentation changes, if you find it could be helpful.)

houglum added a commit that referenced this issue May 9, 2017
This addresses part of #431, allowing colons in values for custom
metadata headers, e.g.:
    gsutil setmeta -h 'x-goog-meta-foobar: {"foo": "bãr"}'
@houglum
Copy link
Collaborator

houglum commented May 10, 2017

Fixed for the top-level -h flag, plus added clarification in the docs.

Thanks for the report!

houglum added a commit to houglum/apitools that referenced this issue May 12, 2017
Use case: for custom object metadata, gsutil uses an AdditionalProperty
message. Custom object metadata can contain unicode characters in either
the key or value fields. Previously, the key field was being implicitly
decoded as ASCII via a call to str(). When this call fails, we should
attempt to properly decode the field. See GoogleCloudPlatform/gsutil#431
for additional context.
craigcitro pushed a commit to google/apitools that referenced this issue May 18, 2017
* Allow key of AdditionalProperty to contain unicode

Use case: for custom object metadata, gsutil uses an AdditionalProperty
message. Custom object metadata can contain unicode characters in either
the key or value fields. Previously, the key field was being implicitly
decoded as ASCII via a call to str(). When this call fails, we should
attempt to properly decode the field. See GoogleCloudPlatform/gsutil#431
for additional context.

* Test decoding AdditionalProperty key w/ unicode.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants