Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for PdfReader with in-memory pdf #43

Merged
merged 8 commits into from
Dec 13, 2015
Merged

Conversation

b4stien
Copy link
Contributor

@b4stien b4stien commented Dec 12, 2015

PdfReader can't read in-memory pdf with Python 3 (at least Python 3.5). The added test should demonstrate it (Travis please!).

@b4stien
Copy link
Contributor Author

b4stien commented Dec 12, 2015

Well, actually it can, I just misunderstood the interface.

As far as I understand now you want the users to be able to "load" data by 2 means:

  • with fname which could be a filename or a filelike object (implementing the read() interface)
  • with fdata supplying directly the "real" data

It works if you give it unicode strings (or a filelike object open in text mode, with 'latin-1' encoding).

@b4stien b4stien closed this Dec 12, 2015
@pmaupin pmaupin reopened this Dec 12, 2015
@pmaupin
Copy link
Owner

pmaupin commented Dec 12, 2015

Actually, I am not sure you did misunderstand the interface. I haven't looked at your tests yet, but see if the code snippets I put in issue #38 make your tests work. If so, perhaps you can add that snippet to your PR and we'll add the test and the fixed code at the same time :-)

Thanks,
Pat

@b4stien
Copy link
Contributor Author

b4stien commented Dec 12, 2015

Here's my completed PR. We now have many ways to load data into a PdfReader!

I added six (see https://pypi.python.org/pypi/six) which is the "community standard" to tackle Python 2/3 differences. Most of your py23_diffs.py can be removed to use it instead. I can submit another PR for that if you wish.

@pmaupin
Copy link
Owner

pmaupin commented Dec 12, 2015

Bastien:

I am very impressed with the way you have come up to speed on the codebase, not only getting it to do what you want, but also providing a well-coded and useful test. You have apparently developed an excellent understanding of how the system fits together in the face of very sparse documentation, and I would love for you to become a regular pdfrw contributor.

But...

I am not convinced that six is right for pdfrw.

I suppose my thoughts echo those of Armin Ronacher where he advises to "skip six."

Most of my reluctance to embrace six could be fairly characterized as aesthetic. For example, I have a visceral negative reaction to the thought of increasing the size of the codebase by over 25%, but I could probably overcome that if someone convinced me that the benefits were worthwhile.

As you point out, there are benefits to using six that can be substantial for some codebases and coding styles. It is a de facto community standard, well-understood by people who maintain libraries that support 2 and 3, and the new hybrid Python variant it creates allows a far more flexible coding style than the one I have chosen for the bulk of pdfrw.

But while six abstracts away some differences, it does not abstract away all of them. For example, you propose adding these lines to PdfReader():

        if six.PY3 and isinstance(fdata, six.binary_type):
            fdata = fdata.decode('Latin-1')

This uses tools in six that help to manage the differences between two and three, but it does not actually hide those differences. In fact, these lines expose a lot of knowledge about both the differences between Python 2 and 3 and knowledge about six itself (e.g., what is six.binary_type?) to a large function in a large file.

The current version of PdfReader() knows that there are required differences between data handling for loading files under 2 vs 3, but it doesn't know or care about the details of those differences, and I think that's a good design -- even if I were convinced that the technology in six was a good fit for pdfrw, I would probably still want to take those two lines and place them in a separate function in a separate file where I abstract away those differences so that the PdfReader() class doesn't know the details.

If I stick with that decision to abstract away the 2 vs. 3 file handling differences in a separate module, then it seems to me that the value proposition of six is greatly reduced, and on the cost side, if we include six in the package as you have proposed, then we risk running into issues like this and if we add a dependency for six as an external module, well, that would be the first external dependency, and that's a pretty high hurdle to clear.

So it seems unlikely you could convince me that six is the right answer for pdfrw as of now.

But did I mention I love your test? :-) (as well as some of the other changes such as updating for 3.5)

Seriously, I think your test along with the code snippet in issue #38 is what it takes to put that issue to bed, and I really appreciate it. If you update the PR to insure your test works with that code, that would be awesome, but if you're too busy, I'll understand -- I'm swamped at the moment as well, but I can probably get around to fixing this sometime next week.

Thanks,
Pat

@b4stien
Copy link
Contributor Author

b4stien commented Dec 12, 2015

No problem, I understand.

To be totally honest I introduced it because I was kinda afraid with your py23_diffs. Maybe you'll be willing to consider a "small" refactor over there?

In any case I'll make the changes you suggest.

@pmaupin
Copy link
Owner

pmaupin commented Dec 12, 2015

I'm certainly willing to consider refactoring, but can you describe what problems you see?

Thanks,
Pat

@b4stien
Copy link
Contributor Author

b4stien commented Dec 12, 2015

The try/except with NameError and AttributeError is kinda disturbing at first, I was expecting a more straightforward detection of Python 3 (sys.version for instance).

@pmaupin
Copy link
Owner

pmaupin commented Dec 12, 2015

Ah. I'm a big fan of duck typing :-)

@pmaupin
Copy link
Owner

pmaupin commented Dec 12, 2015

I guess the style there isn't really duck typing, but is similar to the EAFP used in duck typing.

FWIW, I wasn't interested in porting to Python 3 until the features in 3.3 made it less painful. (I guess ignoring 3.0-3.2 also removes much of the impetus for six.) Also, AFAIK, most of my serious "customers" still run Python 2.7, because of the performance hit with Python 3 strings being bigger. I could fix most of the performance hit (by using bytes instead of strings), but that would make the code a lot uglier (with b'xxx' instead of 'xxx' all over the code), so I'm not real interested. The hardware will catch up :-)

@b4stien
Copy link
Contributor Author

b4stien commented Dec 13, 2015

I'll rebase this one in a minute (following the merge of #44)

pmaupin added a commit that referenced this pull request Dec 13, 2015
Make read of in-memory PDF work with 3.x

- This closes issue #38 with code discussed there plus a regression test.
- Also add Python version 3.5 to regression tests
- Also add OSX filetypes to .gitignore
@pmaupin pmaupin merged commit 9e4aa55 into pmaupin:master Dec 13, 2015
@Roguelazer Roguelazer mentioned this pull request Apr 6, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants