Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utils: decode requirement files according to their BOM if present #3485

Merged
merged 5 commits into from
Mar 4, 2016

Conversation

xavfernandez
Copy link
Member

Work In Progrees, should fix #2865

Review on Reviewable

@pfmoore
Copy link
Member

pfmoore commented Feb 12, 2016

LGTM. Remarkably close to what I have in my local checkout at the moment. But cleaner :-)

@xavfernandez
Copy link
Member Author

Sorry to see we worked on the same thing :-/
But for some reason Python 2.6 does not like my solution...
Plus my encoding with utf8 and decoding with locale.getpreferredencoding(False) does not seem like a good idea.

@xavfernandez
Copy link
Member Author

https://docs.python.org/2/library/shlex.html

Prior to Python 2.7.3, this module did not support Unicode input. 😢

@pfmoore
Copy link
Member

pfmoore commented Feb 13, 2016

No problem.

I pray for the day when we can drop 2.6 support :-(
How about, if we're on 2.6 then we encode the requirements file content as ASCII (on the basis that if there's no Unicode support, ASCII is all that is valid) and report any encoding errors as "Unicode requirements files are only supported in Python 2.7 onwards"?

@xavfernandez
Copy link
Member Author

Well I tried being more forgiving and encoded to utf8 before shlex.split: tests seem happy.

I'm trying to add some code cleaning (always make get_file_content return text), hoping #1441 won't come back...

assert auto_decode(data) == "Django==1.4.2"

def test_auto_decode_no_bom(self):
data = b"foobar"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decoding text produces text, not bytes

@xavfernandez xavfernandez force-pushed the bom_detection branch 2 times, most recently from 6648558 to 6a87e0c Compare February 18, 2016 10:09
@xavfernandez xavfernandez added this to the 8.1 milestone Feb 24, 2016
@dstufft
Copy link
Member

dstufft commented Mar 3, 2016

Is this still a WIP? Or ready to go?

@xavfernandez
Copy link
Member Author

Ok, from a simple mypkg example containing only:

from setuptools import setup

setup(
    name='mypkg',
    version=0.1,
    install_requires=['httpretty==0.7.1'],
)

I could reproduce the issue of #1441 with pip 1.5.0.
With this PR, mypkg could be installed without any error so no obvious regression on this issue.

xavfernandez added a commit that referenced this pull request Mar 4, 2016
utils: decode requirement files according to their BOM if present
@xavfernandez xavfernandez merged commit cab10e4 into pypa:develop Mar 4, 2016
@xavfernandez xavfernandez deleted the bom_detection branch March 4, 2016 10:42
@aodag
Copy link
Contributor

aodag commented Mar 7, 2016

Hi, this change causes error during reading requirements.txt encoded utf-8 without BOM.
On python2, locale.getpreferredencoding(False) returns ASCII encoding.
Why not use utf-8 as default encoding?

aodag@debian:~$ uname -a
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux
aodag@debian:~$ python
Python 2.7.9 (default, Mar  1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding(False)
'ANSI_X3.4-1968'
>>>
aodag@debian:~$ python3
Python 3.4.2 (default, Oct  8 2014, 10:45:20)
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'

@xavfernandez
Copy link
Member Author

Why not use utf-8 as default encoding?
Because, it might well be encoded in something else than utf8 :) Previously it worked because in python2 we were operating on str (bytes) types.

It looks like pip should be calling locale.setlocale(locale.LC_ALL, '') before calling locale.getpreferredencoding(False).

Or maybe we could accept this kind of header: # -*- coding: utf-8 -*- in requirements.txt ?

Any thought @pfmoore @dstufft ?

@xavfernandez
Copy link
Member Author

I've implemented belt and suspenders in #3547

@pfmoore
Copy link
Member

pfmoore commented Mar 7, 2016

For the simple cases, I think pip should treat the requirements file as a text file (i.e. open it in the system default encoding), with BOM detection being a useful convenience for Windows users whose tools have a tendency to use things like UTF-16 with BOM in spite of the default encoding. I'm -0.5 on defaulting to UTF-8, as Windows tools need extra effort to specify UTF-8, so we'll likely just end up with Windows users complaining that pip isn't handling their requirements files properly.

I don't know what incantations are needed with setlocale I'll defer to your expertise there.

For cross-platform requirements files, an encoding header seems like a plausible approach.

(And I see that while I've been writing this response, you've created a PR implementing this. Looks like a good solution to me, go for it (although there's a text/bytes problem I've commented on in the PR).

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 3, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 3, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants