test failure: "AssertionError: 'Header,' not found" in test_text_content (test.py:509) #957
Comments
fwiw, i'm using PyPDF2 version 2.11.2-1 and Weasyprint 57.1 when i'm doing these tests, on debian unstable. When i look at the intermediate pdf output structure, it looks like there is no whitespace in the table caption between the string "Header" and the following comma. |
With the next release (planned for early next week) xml2rfc will be moving to I don't see this error happening with |
yes, i see the upcoming switch to |
Correct, I will accept pull requests with patches, though. So if anybody has a severe issue and upgrading is not an option, this is a possible route. I have no clue who maintains the packages in the Debian / Ubuntu repositories. If they are not recent enough, I would recommend to vendor

@dkg I'm not sure what you mean by the question about data representations. There was a bigger change which affects the PdfMerger ( py-pdf/pypdf#1371 ), but besides that we didn't have any bigger changes. Certainly no huge change that refactors how pypdf represents PDFs internally. |
@dkg I've found it a lot easier to set up a venv and run xml2rfc in that, letting it pull in the library versions it needs rather than trying to fight snoozing port maintainers. If you want to run the venv'ed xml2rfc from the command line, it's literally a one line shell script. |
@jrlevine, i'd prefer to ensure that the software i'm running is also in debian, as that ecosystem has a better history of long-term maintenance and reproducibility than any language-specific library ecosystem. I recognize that not everyone shares those values (or that experience), but it is something that i care about and i'd like to try to keep it working.

Furthermore, this points to some amount of brittleness in the toolchain that xml2rfc depends on -- it may be a brittleness in the tests, or it may be a brittleness in the pdf generation, but i'm trying to run it down to avoid problems in the future. Consider this pro-active maintenance work if that makes it seem more reasonable.

i've already asked the maintainer in debian to update to pypdf, and "vendoring" (embedding a copy) is generally discouraged in debian, because it leads to unmaintained copies. If we don't "vendor" then a single update will fix all copies of any given piece of code in the OS.

@MartinThoma, what i meant by the question about data representations is that i wonder whether it's possible for some version of PyPDF2 to inject whitespace between a word and its trailing comma in a table caption. The

We can replicate the necessary parts with:

    import PyPDF2
    print(PyPDF2.PdfReader(open('elements.pdf', 'rb'), strict=False).pages[9].extract_text())

For me, this shows:
Looking at the page, i don't see why there should be a space between "Header" and the comma after it. does |
fwiw, i've updated to PyPDF2 2.12.1 and i'm still seeing the same whitespace from |
If you can get the maintainers to update all the packages you need, great, but in my experience it is a losing battle. The reason Python has venvs is precisely because you can't count on the default versions of packages on whatever system you're using to be the ones you want. After a while, beating one's head against the wall becomes tedious. FWIW I do all my work on macOS and FreeBSD so I have no interest in Debian, as I suspect you have no interest in the platforms I use. |
@dkg If you don't want the software to change, you can also just pin the versions. Then it's completely under your control. |
After reading the rest of your comments I realized that you're talking about text extraction. Please read https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces

In short: pypdf does not "inject" whitespace, but it has to guess where whitespace belongs and how much of it to emit. Just like any PDF-to-text tool. It's an inherent flaw of the PDF format. |
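To make the "guessing" concrete, here is a toy sketch of how a text extractor might decide whether to emit a space between two positioned text runs. This is not pypdf's algorithm: the run coordinates, the `join_runs` helper, and the `gap_threshold` parameter are all made up for illustration. It only shows why the same page can yield "Header," under one heuristic and "Header ," under another.

```python
from typing import List, Optional, Tuple

# Hypothetical data model: (text, x_start, x_end) in arbitrary page units.
Run = Tuple[str, float, float]


def join_runs(runs: List[Run], gap_threshold: float) -> str:
    """Concatenate runs left to right, emitting a space whenever the
    horizontal gap between two runs exceeds gap_threshold."""
    pieces: List[str] = []
    prev_end: Optional[float] = None
    for text, x_start, x_end in sorted(runs, key=lambda r: r[1]):
        if prev_end is not None and x_start - prev_end > gap_threshold:
            pieces.append(" ")
        pieces.append(text)
        prev_end = x_end
    return "".join(pieces)


# Illustrative positions only: the comma sits 0.5 units after "Header".
runs = [("Header", 0.0, 30.0), (",", 30.5, 32.0), ("Row", 36.0, 50.0)]

print(join_runs(runs, gap_threshold=1.0))   # Header, Row
print(join_runs(runs, gap_threshold=0.25))  # Header , Row
```

The page content is identical in both calls; only the threshold differs. A small kerning gap before the comma is enough to flip the result, which is consistent with the extra space seen in this issue.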
@MartinThoma wrote:
Thanks for pointing this out! It does suggest that the test in xml2rfc is inherently flawed, then, since the test appears to assume that whitespace will be present on the same boundaries from the source document to the PDF document. @kesara, can you suggest a way to improve the test to take into account this ambiguity? |
Maybe the normalization function in https://github.com/ietf-tools/xml2rfc/blob/main/test.py#L503 can do more to fix this case. FWIW I've tested the

@MartinThoma Even though the PyPDF2 method guesses the whitespace, is it consistent? I mean, does it always guess the same amount of whitespace for a given PDF file? |
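As a rough sketch of what "doing more" in the normalization step could look like, the helper below collapses whitespace and drops any whitespace that appears directly before punctuation, so that "Header ," and "Header," compare equal. The name `normalize_extracted` and the punctuation set are assumptions for illustration; this is not the function currently in xml2rfc's test.py.

```python
import re


def normalize_extracted(text: str) -> str:
    """Normalize extracted PDF text so whitespace guesses do not affect comparisons."""
    # Collapse every run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove whitespace that an extractor may have guessed before punctuation.
    return re.sub(r"\s+([,.;:!?])", r"\1", text)


expected = "Table 1: Header, Caption"
extracted = "Table 1: Header ,\n Caption"

print(normalize_extracted(extracted) == normalize_extracted(expected))
```

Applying the same normalization to both the expected string and the extracted text keeps the comparison symmetric. The trade-off is that the test would no longer detect a real spurious space before punctuation in the rendered PDF.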
The However, we cannot make any guarantees that different versions of The problem is that fixing the whitespaces for one PDF might break it for many others. For this reason, we always need to test changes on many documents and we go with the changes that improve the situation for most PDF documents in our test dataset. |
Thanks @MartinThoma |
I used weasyprint 57.1-1 to generate elements.pdf. It renders without whitespace between the words and the commas in the caption, and tools like |
That
But annoyingly, With both |
@kesara, thanks for testing it out and confirming that there is whitespace in some variants! Now that we have confirmed that the test is brittle, it sounds like we might have two different issues here:
The first point ought to be something you can run down locally, since it's all on the same system. is the test testing the wrong codepath for pdf generation so that it doesn't align with the CLI? i'm not sure how to investigate the second point. are there any artifacts of the test process that i could generate on my own platform to help dig a little deeper into that comparison? |
Describe the issue
On debian, xml2rfc's self-tests are failing, as can be seen for example at: https://ci.debian.net/data/autopkgtest/testing/amd64/x/xml2rfc/30269516/log.gz
I've omitted the full target string that is being searched, but while it doesn't contain "Header," it does contain "Header ," (that is, with a space between the word and the trailing comma). The source of this test is tests/inputs/elements.xml, which uses "Header," (no space), so something in the way the bulk text is being generated is inserting the space.
To simplify the testing, i've written the following script to generate the expected output
and indeed its output contains the extra space: