Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#814 - open compressed files with binary mode #890

Closed
wants to merge 1 commit into from

Conversation

spazm
Copy link

@spazm spazm commented Aug 21, 2014

A possible fix for #814 (introduced with fix for #765?)

We need to open binary files with 'b' flag (and not attempt to translate the encoding) to avoid:

'utf8' codec can't decode byte 0x8b in position 1: invalid start byte
>>> import mimetypes
>>> mimetypes.guess_type('foo.json')
('application/json', None)
>>> mimetypes.guess_type('foo.txt')
('text/plain', None)
>>> mimetypes.guess_type('foo.txt.gz')
('text/plain', 'gzip')
>>> mimetypes.guess_type('foo.txt.tar')
('application/x-tar', None)
>>> mimetypes.guess_type('foo.txt.tar.gz')
('application/x-tar', 'gzip')
>>> mimetypes.guess_type('foo.txt')
('text/plain', None)
>>> mimetypes.guess_type('foo.bar')
(None, None)
  • for .gzip and .Z files, mimetypes.guess_type will return a non-None encoding.
  • .tar: 'application/x-tar'
  • .txt extensions: 'text/*' ctype
  • for no extension and unknown extension: (None, None)
  • json will return 'application/json'

Open to suggestions here. We definitely want Non-None encoding to trigger binary treatment. Less sure about the other cases.

mimetypes.guess_type is already used in the s3 code.

@spazm spazm force-pushed the fix_814 branch 3 times, most recently from 0b03970 to 40541c8 Compare August 21, 2014 05:43
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.05%) when pulling e1f20c0 on spazm:fix_814 into c560dfe on aws:develop.

@spazm
Copy link
Author

spazm commented Aug 21, 2014

python3 tests fail if .json files are opened in binary mode.
Narrowing scope to only use binary flag if encoding is set (e.g. .gz, .Z, .bz2).

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.0%) when pulling 1d7090e on spazm:fix_814 into c560dfe on aws:develop.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.0%) when pulling 1d7090e on spazm:fix_814 into c560dfe on aws:develop.

@jamesls
Copy link
Member

jamesls commented Aug 21, 2014

What if we make this explicit and let the user tell us if the file should be opened in binary mode or not? Right now we support file://<filename>. What if we had something like fileb://<filename> or file+b://<filename>, etc.?

@spazm
Copy link
Author

spazm commented Aug 21, 2014

@jamesls thanks for the response!

my quick thoughts:

  1. it used to 'just work' for .gz files (regression).
  2. other tools (e.g. java tools) support both text and compressed files without any extra flags
  3. if we had a flag, it seems like it should be to tag something as text that needs to be translated from the current encoding to utf8?

I like your files-as-blobs approach and to a lesser degree the --file-encoding approach, as you mentioned in #815 (comment)

Encodings can be such a pain to get right! I appreciate your work in trying to come up with a consistent plan.

@jamesls
Copy link
Member

jamesls commented Dec 2, 2014

Fixed via #1010 from @kyleknap's pull request

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants