Add tests for PdfReader with in-memory pdf #43

b4stien · 2015-12-12T00:30:31Z

PdfReader can't read in-memory pdf with Python 3 (at least Python 3.5). The added test should demonstrate it (Travis please!).

b4stien · 2015-12-12T00:55:47Z

Well, actually it can, I just misunderstood the interface.

As far as I understand now you want the users to be able to "load" data by 2 means:

with fname which could be a filename or a filelike object (implementing the read() interface)
with fdata supplying directly the "real" data

It works if you give it unicode strings (or a filelike object open in text mode, with 'latin-1' encoding).

pmaupin · 2015-12-12T02:45:53Z

Actually, I am not sure you did misunderstand the interface. I haven't looked at your tests yet, but see if the code snippets I put in issue #38 make your tests work. If so, perhaps you can add that snippet to your PR and we'll add the test and the fixed code at the same time :-)

Thanks,
Pat

b4stien · 2015-12-12T14:01:14Z

Here's my completed PR. We now have many ways to load data into a PdfReader!

I added six (see https://pypi.python.org/pypi/six) which is the "community standard" to tackle Python 2/3 differences. Most of your py23_diffs.py can be removed to use it instead. I can submit another PR for that if you wish.

pmaupin · 2015-12-12T18:42:39Z

Bastien:

I am very impressed with the way you have come up to speed on the codebase, not only getting it to do what you want, but also providing a well-coded and useful test. You have apparently developed an excellent understanding of how the system fits together in the face of very sparse documentation, and I would love for you to become a regular pdfrw contributor.

But...

I am not convinced that six is right for pdfrw.

I suppose my thoughts echo those of Armin Ronacher where he advises to "skip six."

Most of my reluctance to embrace six could be fairly characterized as aesthetic. For example, I have a visceral negative reaction to the thought of increasing the size of the codebase by over 25%, but I could probably overcome that if someone convinced me that the benefits were worthwhile.

As you point out, there are benefits to using six that can be substantial for some codebases and coding styles. It is a de facto community standard, well-understood by people who maintain libraries that support 2 and 3, and the new hybrid Python variant it creates allows a far more flexible coding style than the one I have chosen for the bulk of pdfrw.

But while six abstracts away some differences, it does not abstract away all of them. For example, you propose adding these lines to PdfReader():

        if six.PY3 and isinstance(fdata, six.binary_type):
            fdata = fdata.decode('Latin-1')

This uses tools in six that help to manage the differences between two and three, but it does not actually hide those differences. In fact, these lines expose a lot of knowledge about both the differences between Python 2 and 3 and knowledge about six itself (e.g., what is six.binary_type?) to a large function in a large file.

The current version of PdfReader() knows that there are required differences between data handling for loading files under 2 vs 3, but it doesn't know or care about the details of those differences, and I think that's a good design -- even if I were convinced that the technology in six was a good fit for pdfrw, I would probably still want to take those two lines and place them in a separate function in a separate file where I abstract away those differences so that the PdfReader() class doesn't know the details.

If I stick with that decision to abstract away the 2 vs. 3 file handling differences in a separate module, then it seems to me that the value proposition of six is greatly reduced, and on the cost side, if we include six in the package as you have proposed, then we risk running into issues like this and if we add a dependency for six as an external module, well, that would be the first external dependency, and that's a pretty high hurdle to clear.

So it seems unlikely you could convince me that six is the right answer for pdfrw as of now.

But did I mention I love your test? :-) (as well as some of the other changes such as updating for 3.5)

Seriously, I think your test along with the code snippet in issue #38 is what it takes to put that issue to bed, and I really appreciate it. If you update the PR to insure your test works with that code, that would be awesome, but if you're too busy, I'll understand -- I'm swamped at the moment as well, but I can probably get around to fixing this sometime next week.

Thanks,
Pat

b4stien · 2015-12-12T20:26:30Z

No problem, I understand.

To be totally honest I introduced it because I was kinda afraid with your py23_diffs. Maybe you'll be willing to consider a "small" refactor over there?

In any case I'll make the changes you suggest.

pmaupin · 2015-12-12T20:37:29Z

I'm certainly willing to consider refactoring, but can you describe what problems you see?

Thanks,
Pat

b4stien · 2015-12-12T20:42:10Z

The try/except with NameError and AttributeError is kinda disturbing at first, I was expecting a more straightforward detection of Python 3 (sys.version for instance).

pmaupin · 2015-12-12T20:46:45Z

Ah. I'm a big fan of duck typing :-)

pmaupin · 2015-12-12T20:59:36Z

I guess the style there isn't really duck typing, but is similar to the EAFP used in duck typing.

FWIW, I wasn't interested in porting to Python 3 until the features in 3.3 made it less painful. (I guess ignoring 3.0-3.2 also removes much of the impetus for six.) Also, AFAIK, most of my serious "customers" still run Python 2.7, because of the performance hit with Python 3 strings being bigger. I could fix most of the performance hit (by using bytes instead of strings), but that would make the code a lot uglier (with b'xxx' instead of 'xxx' all over the code), so I'm not real interested. The hardware will catch up :-)

b4stien · 2015-12-13T18:27:51Z

I'll rebase this one in a minute (following the merge of #44)

Make read of in-memory PDF work with 3.x - This closes issue #38 with code discussed there plus a regression test. - Also add Python version 3.5 to regression tests - Also add OSX filetypes to .gitignore

b4stien closed this Dec 12, 2015

pmaupin reopened this Dec 12, 2015

b4stien force-pushed the master branch from c385aaf to 09facb8 Compare December 12, 2015 13:31

b4stien added 8 commits December 13, 2015 19:32

Add tests for PdfReader with in-memory pdf

cb4761a

Add Python 3.5 to Travis

69019e8

Refactor a bit PdfReader's init process

c0d0f0e

Current package layout does not permit external deps, vendorize six

83d70c5

Rename tests

a0a081d

Add Python 3.5 to classifiers

9d61dec

Remove six, change convert load logic for Py3 bytes

4c6fcd8

Typo

c8ec1ba

b4stien force-pushed the master branch from 154191e to c8ec1ba Compare December 13, 2015 18:32

pmaupin merged commit 9e4aa55 into pmaupin:master Dec 13, 2015

pmaupin mentioned this pull request Dec 13, 2015

Reading a pdf from file like object or data not working in python 3 with bytesIO #38

Closed

Roguelazer mentioned this pull request Apr 6, 2016

Another release? #51

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add tests for PdfReader with in-memory pdf #43

Add tests for PdfReader with in-memory pdf #43

b4stien commented Dec 12, 2015

b4stien commented Dec 12, 2015

pmaupin commented Dec 12, 2015

b4stien commented Dec 12, 2015

pmaupin commented Dec 12, 2015

b4stien commented Dec 12, 2015

pmaupin commented Dec 12, 2015

b4stien commented Dec 12, 2015

pmaupin commented Dec 12, 2015

pmaupin commented Dec 12, 2015

b4stien commented Dec 13, 2015

Add tests for PdfReader with in-memory pdf #43

Add tests for PdfReader with in-memory pdf #43

Conversation

b4stien commented Dec 12, 2015

b4stien commented Dec 12, 2015

pmaupin commented Dec 12, 2015

b4stien commented Dec 12, 2015

pmaupin commented Dec 12, 2015

b4stien commented Dec 12, 2015

pmaupin commented Dec 12, 2015

b4stien commented Dec 12, 2015

pmaupin commented Dec 12, 2015

pmaupin commented Dec 12, 2015

b4stien commented Dec 13, 2015