Cloud Storage upload_from_file does not work with files containing non-latin-1 characters. #818

Closed
bc-lee opened this issue Jun 17, 2022 · 2 comments · Fixed by #824
Labels
api: storage Issues related to the googleapis/python-storage API. type: question Request for information or clarification. Not an issue.


bc-lee commented Jun 17, 2022

Sample code:

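import io

# `bucket` is an existing google.cloud.storage Bucket instance.
def upload_file(f: io.StringIO, name: str):
    f.seek(0)
    b = bucket.blob(name)
    b.upload_from_file(f)  # fails when the text contains non-Latin-1 characters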

Then I get this error:

Traceback
  File "/opt/project/app.py", line 565, in upload_file
    b.upload_from_file(f)
  File "/usr/local/lib/python3.10/site-packages/google/cloud/storage/blob.py", line 2567, in upload_from_file
    created_json = self._do_upload(
  File "/usr/local/lib/python3.10/site-packages/google/cloud/storage/blob.py", line 2384, in _do_upload
    response = self._do_resumable_upload(
  File "/usr/local/lib/python3.10/site-packages/google/cloud/storage/blob.py", line 2228, in _do_resumable_upload
    response = upload.transmit_next_chunk(transport, timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/google/resumable_media/requests/upload.py", line 515, in transmit_next_chunk
    return _request_helpers.wait_and_retry(
  File "/usr/local/lib/python3.10/site-packages/google/resumable_media/requests/_request_helpers.py", line 148, in wait_and_retry
    response = func()
  File "/usr/local/lib/python3.10/site-packages/google/resumable_media/requests/upload.py", line 507, in retriable_request
    result = transport.request(
  File "/usr/local/lib/python3.10/site-packages/google/auth/transport/requests.py", line 549, in request
    response = super(AuthorizedSession, self).request(
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.10/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.10/http/client.py", line 1327, in _send_request
    body = _encode(body, 'body')
  File "/usr/local/lib/python3.10/http/client.py", line 166, in _encode
    raise UnicodeEncodeError(
UnicodeEncodeError: 'latin-1' codec can't encode character '\u200b' in position 202618: Body ('\u200b') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

Python and Google's python package versions are as follows:

$ python --version
Python 3.10.5
$ pip list | grep google
google-api-core               2.8.2
google-api-python-client      2.51.0
google-auth                   2.8.0
google-auth-httplib2          0.1.0
google-cloud-bigquery         3.2.0
google-cloud-bigquery-storage 2.13.2
google-cloud-core             2.3.1
google-cloud-storage          2.4.0
google-crc32c                 1.3.0
google-resumable-media        2.3.3
googleapis-common-protos      1.56.2

Interestingly, upload_from_string works fine (i.e. b.upload_from_string(f.getvalue())).
This workaround is sufficient for me as my code uploads an in-memory file (io.StringIO), and the file is not large. However, I think it is necessary to modify the behavior of the upload_from_file function somehow.
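
A minimal sketch of a variant of this workaround that still goes through upload_from_file: encode the text as UTF-8 first and pass a binary stream. The helper name `to_binary_stream` is hypothetical (it is not part of google-cloud-storage), and the upload call appears only as a comment because it needs a real blob.

```python
import io


def to_binary_stream(text_stream: io.StringIO, encoding: str = "utf-8") -> io.BytesIO:
    """Return a BytesIO holding the encoded contents of an in-memory text stream.

    Hypothetical helper for illustration; not part of google-cloud-storage.
    """
    # getvalue() returns the whole buffer regardless of the current position.
    return io.BytesIO(text_stream.getvalue().encode(encoding))


text = io.StringIO("zero-width space: \u200b")
binary = to_binary_stream(text)

# With a real blob (assumed to exist), the upload would then be:
# blob.upload_from_file(binary, content_type="text/plain; charset=utf-8")
```

Like upload_from_string(f.getvalue()), this copies the whole content into memory, so it suits small in-memory files such as the one described here.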

@bc-lee bc-lee changed the title Bigquery upload_from_file does not work with files containing non-latin-1 characters. Cloud Storage upload_from_file does not work with files containing non-latin-1 characters. Jun 17, 2022
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jun 17, 2022
@parthea parthea added type: question Request for information or clarification. Not an issue. api: storage Issues related to the googleapis/python-storage API. and removed triage me I really want to be triaged. labels Jun 20, 2022
@parthea parthea transferred this issue from googleapis/google-cloud-python Jun 20, 2022
@parthea parthea added triage me I really want to be triaged. and removed type: question Request for information or clarification. Not an issue. labels Jun 20, 2022
@cojenco cojenco added type: question Request for information or clarification. Not an issue. status: investigating The issue is under investigation, which is determined to be non-trivial. and removed triage me I really want to be triaged. labels Jun 21, 2022
@andrewsg
Contributor

Hi bc-lee,

My understanding is that, because files are stored as bytes on disk, opening a file in string mode is intended for text processing only. We also have to upload bytes, so opening a file in string mode and then uploading it to Storage involves two conversions. The second conversion, from a Unicode string to bytes, fails here because the standard library's HTTP client, citing RFC 2616 Section 3.7.1, encodes a string request body as latin-1 by default.

Can I ask more about your use case? Is there a particular reason why your application has file-like objects in string mode for this purpose? Thanks.
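
The failing conversion described above can be reproduced without any network call. This is a small illustration, not the library's code: it only shows that the character from the traceback, '\u200b' (ZERO WIDTH SPACE), has no Latin-1 encoding but does have a UTF-8 one.

```python
text = "a\u200bb"  # contains ZERO WIDTH SPACE, which Latin-1 cannot represent

try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    # Same exception type as in the traceback from http/client.py.
    latin1_ok = False

# Encoding explicitly as UTF-8 succeeds and yields bytes that can be uploaded.
utf8_bytes = text.encode("utf-8")
```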

@andrewsg andrewsg self-assigned this Jun 29, 2022
@bc-lee
Author

bc-lee commented Jun 30, 2022

Thanks for the reply.
I understand that file-like objects in byte mode are preferable because they avoid the extra string conversion. However, some third-party APIs produce strings or files in string mode, so I passed a file in string mode to the Google Cloud SDK and ran into the issue above.

Could you add a comment or warning to upload_from_file stating that a file in byte mode is preferred and that a file in string mode may raise a UnicodeEncodeError? It would help others avoid the same mistake.

:type file_obj: file
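
Until such a note exists, calling code can guard against this itself. The following is a sketch only: `is_text_mode` is a hypothetical helper written for this thread, not something upload_from_file does. It relies on the fact that io.StringIO and files opened in text mode derive from io.TextIOBase.

```python
import io


def is_text_mode(file_obj) -> bool:
    """Best-effort check for a file-like object opened in text (string) mode.

    Hypothetical helper for illustration; custom file-like objects that do
    not derive from io.TextIOBase are not detected.
    """
    return isinstance(file_obj, io.TextIOBase)


text_detected = is_text_mode(io.StringIO("text"))
bytes_detected = is_text_mode(io.BytesIO(b"bytes"))
```

A caller could check this before uploading and encode the content to bytes, or raise a clearer error, when it returns True.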

@cojenco cojenco removed the status: investigating The issue is under investigation, which is determined to be non-trivial. label Jul 13, 2022
gcf-merge-on-green bot pushed a commit that referenced this issue Jul 22, 2022
File-like objects should be opened in binary mode for `blob.upload_from_file()`
- the CPython standard library, following [RFC 2616 Section 3.7.1](https://datatracker.ietf.org/doc/html/rfc2616#section-3.7.1), uses iso-8859-1 as the default charset for text
- add clarifying notes in docstring
- update code sample

Fixes #818 🦕