Cannot set custom metadata with unicode chars #431
Comments
After a bit of digging, it looks like this was done intentionally -- header fields and their values should be ASCII-encoded, per the HTTP spec. Because we send metadata values as headers when using the XML API (for both gs:// and s3:// buckets), and we'd like to keep behavior between the two APIs as consistent as possible, we shouldn't attempt to convert from ASCII to UTF-8. Looking at issues in the Boto3 library, it seems that S3 metadata obeys this as well: boto/botocore#861. This is a documentation inconsistency on our part. The correct documentation can be found at https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogmeta, which states:
W.r.t. the second issue, I agree that using partition would be a better approach in that this would allow
Hm. As it is now, we actually allow these characters in setmeta, but not when specified via the top-level Additionally, there seems to be an inconsistency for the setmeta command in how we transmit these characters, depending on whether you're using the XML or JSON API. The JSON API seems to send the characters in UTF-8 encoded string format,
After a bit of digging, it's actually pretty deep in the Boto code that the header value is being URL-encoded, via the GCS's XML API will actually handle the non-escaped characters just fine when passed directly via cURL:
...verified via
Regardless of all this above, I'd say the
Thank you for the suggestions and analysis. I confess I had wondered why non-ASCII characters would be safe in headers, and just assumed they were encoded somehow by gsutil. But even if that is (or were) the case, or if I follow your suggestion with curl and PUT (which is very tempting), as you point out I'd be depending on a fixed client. Maybe the wisest choice is to always encode the string myself before passing it on to gsutil.
Maybe some warning in the documentation (about the reliance on a specific client behavior, or other risks) would alleviate the issues without requiring changes? Thanks again for looking into this. (I'll be happy to give feedback on documentation changes, if you think it would be helpful.)
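For anyone landing here with the same goal, encoding the value before handing it to gsutil could look like the sketch below. It assumes the metadata is JSON, as in the original report, and relies on json.dumps escaping non-ASCII characters as \uXXXX by default (ensure_ascii=True), so the header value stays pure ASCII and json.loads recovers the original text. The function names are illustrative only and are not part of gsutil.

```python
import json


def to_ascii_metadata(obj):
    """Serialize obj as JSON containing only ASCII characters.

    Non-ASCII characters are written as \\uXXXX escapes, which any
    JSON parser turns back into the original characters.
    """
    return json.dumps(obj, ensure_ascii=True)


def from_ascii_metadata(value):
    """Parse a metadata value produced by to_ascii_metadata."""
    return json.loads(value)


encoded = to_ascii_metadata({"foo": "bãr"})
print(encoded)  # {"foo": "b\u00e3r"}
```

The encoded value still contains colons, so this only helps with the non-ASCII half of the problem.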
This addresses part of #431, allowing colons in values for custom metadata headers, e.g.: gsutil setmeta -h 'x-goog-meta-foobar: {"foo": "bãr"}'
Fixed for the top-level -h flag, plus added clarification in the docs. Thanks for the report!
Use case: for custom object metadata, gsutil uses an AdditionalProperty message. Custom object metadata can contain unicode characters in either the key or value fields. Previously, the key field was being implicitly decoded as ASCII via a call to str(). When this call fails, we should attempt to properly decode the field. See GoogleCloudPlatform/gsutil#431 for additional context.
* Allow key of AdditionalProperty to contain unicode
* Test decoding AdditionalProperty key w/ unicode.
Apparently, it's not possible to use gsutil -h ... cp original gs://bucket/destination to set custom metadata including unicode chars.
I tried it from both a Debian 8 image (where $LANG is en_US.UTF-8) and a container optimized image (version 58, just released, where $LANG is C.UTF-8 and toolbox --setenv=LANG=$LANG does set LANG within the toolbox).
For example:
gives the following error message:
I also attempted to use gsutil setmeta to set the metadata after the fact:
And this works. However, it has a different limitation:
Here the error is different:
In this version, : cannot appear in the metadata value. I tried, without success, to find some escaping. However, in all variants I tried, the metadata (as listed with gsutil ls -L) is saved with the escapes.
As far as I can tell (for example, https://cloud.google.com/storage/docs/gsutil/addlhelp/WorkingWithObjectMetadata states “x-goog-meta- fields can have data set to arbitrary Unicode values. All other fields must have ASCII values.”) this is not the intended behavior. It is caused by https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/commands/setmeta.py#L283-L287. If a : in headers other than custom metadata would be detected later on, maybe one could use md_arg.partition(':') instead?
I was not able to figure out why unicode is being rejected by gsutil -h ... cp ..., though.
The ultimate goal is to save JSON with non-ASCII strings in custom metadata, hence the need for both unicode and colons. Maybe there's some obvious workaround that I'm missing?
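To make the partition suggestion concrete, here is a minimal sketch. It is not the code in setmeta.py; split_header_arg is a hypothetical helper used only for illustration. The point is that str.partition splits on the first colon only, so colons inside the value (as in JSON) are left alone.

```python
def split_header_arg(md_arg):
    """Split a 'name:value' header argument on the first colon only.

    Hypothetical helper, not gsutil's actual parsing code.
    """
    name, sep, value = md_arg.partition(':')
    if not sep:
        # No colon at all: there is no value to separate from the name.
        raise ValueError('expected "name:value", got %r' % md_arg)
    return name.strip(), value.strip()


name, value = split_header_arg('x-goog-meta-foobar: {"foo": "bãr"}')
```

Here name is the header name and value keeps the whole JSON text, including its inner colon.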