Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test failure: "AssertionError: 'Header,' not found" in test_text_content (test.py:509) #957

Open
1 task done
dkg opened this issue Jan 12, 2023 · 18 comments
Open
1 task done
Labels
bug Something isn't working

Comments

@dkg
Copy link
Contributor

dkg commented Jan 12, 2023

Describe the issue

On debian, xml2rfc's self-tests are failing, as can be seen for example at: https://ci.debian.net/data/autopkgtest/testing/amd64/x/xml2rfc/30269516/log.gz

....F...........................................
======================================================================
FAIL: test_text_content (__main__.PdfWriterTests.test_text_content)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest-lxc.7s2j9xsb/downtmp/build.RBs/src/xxx/test.py", line 509, in test_text_content
    self.assertIn(t, text)
AssertionError: 'Header,' not found in […]

I've omitted the full target string that is being searched, but while it doesn't contain Header, it does contain Header , (that is, with a space between the word and the trailing comma). the source of this test is tests/inputs/elements.xml, which uses Header, (no space), so something in the way the bulk text is being generated is inserting the space.

To simplify the testing, i've written the following script to generate the expected output

#!/usr/bin/python3

import xml2rfc
from xml2rfc.walkpdf import xmldoc

from xml2rfc.writers.base import default_options
default_options.allow_local_file_access = True

parser = xml2rfc.XmlRfcParser('tests/input/elements.xml')
xmlrfc = parser.parse()
elements_writer = xml2rfc.writers.pdf.PdfWriter(xmlrfc)

elements_pdfdoc = elements_writer.pdf()
elements_pdfxml = xmldoc(None, bytes=elements_pdfdoc)

text = '\n'.join( p.text for p in elements_pdfxml.xpath('.//Page/text') )
print(text)

and indeed its output contains the extra space:

    Table 1 : Header , footer , colspan and rowspan

Code of Conduct

@dkg dkg added the bug Something isn't working label Jan 12, 2023
@dkg
Copy link
Contributor Author

dkg commented Jan 12, 2023

fwiw, i'm using PyPDF2 version 2.11.2-1 and Weasyprint 57.1 when i'm doing these tests, on debian unstable. When i look at the intermediate pdf output structure, it looks like there is no whitespace in the table caption between the string "Header" and the following comma.

@kesara
Copy link
Member

kesara commented Jan 12, 2023

With the next release (planned for early next week) xml2rfc will be moving to pypdf>=3.2.1. See #953

I don't see this error happening with pypdf==3.2.1.

@dkg
Copy link
Contributor Author

dkg commented Jan 12, 2023

yes, i see the upcoming switch to pypdf, but that' s not yet in debian, so transitioning to it in the context of debian or its derivatives is going to be a bit tricky in the short term. Do you think that the issue was resolved with with a version of PyPDF2 before it was transitioned to pypdf? It looks like 2.12.1 was the last version before the 3.0.0 cutover to pypdf. Perhaps @MartinThoma can comment on the data representations or the versions?

@MartinThoma
Copy link

Correct, PyPDF2==2.12.1 was the last 2.x version and we will only continue development on pypdf>=3.1.0.

I will accept pull requests with patches, though. So if anybody has a severe issue and upgrading is not an option, this is a possible route.

I have no clue who maintains the packages in Debian / Ubuntu repositories. If they are not recent enough, I would recommend to vendor pypdf in your package (essentially copying pypdf into your project).

@dkg I'm not sure what you mean with the question about data representations. There was a bigger change which affects the PdfMerger ( py-pdf/pypdf#1371 ), but besides that we didn't have any bigger changes. Certainly no huge change that refactors how pypdf represents PDFs internally.

@jrlevine
Copy link

@dkg I've found it a lot easier to set up a venv and run xml2rfc in that, letting it pull in the library versions it needs rather than trying to fight snoozing port maintainers. If you want to run the venv'ed xml2rfc from the command line, it's literally a one line shell script.

@dkg
Copy link
Contributor Author

dkg commented Jan 12, 2023

@jrlevine, i'd prefer to ensure that the software i'm running is also in debian, as that ecosystem has a better history of long-term maintenance and reproducibility than any language-specific library ecosystem. I recognize that not everyone shares those values (or that experience), but it is something that i care about and i'd like to try to keep it working.

Furthermore, this points to some amount of brittleness in the toolchain that xml2rfc depends on -- it may be a brittleness in the tests, or it may be a brittleness in the pdf generation, but i'm trying to run it down to avoid problems in the future. Consider this pro-active maintenance work if that makes it seem more reasonable.

i've already asked the maintainer in debian to update to pypdf, and "vendoring" (embedding a copy) is generally discouraged in debian, because it leads to unmaintained copies. If we don't "vendor" then a single update will fix all copies of any given piece of code in the OS.

@MartinThoma, what i meant by the question about data representations is that i wonder whether it's possible for some version of PyPDF2 to inject whitespace between a word and its trailing comma in a table caption.

The xmldoc from my sample above comes from xml2rfc/walkpdf.py, which at its core invokes pypdf2.PdfReader(pdffile, strict=False) on elements.pdf (which itself was generated by running xml2rfc --allow-local-file-access --pdf tests/input/elements.xml). Then it converts that function's output to a python dict (of dicts) using the walk() function in the same file.

We can replicate the necessary parts with:

import PyPDF2

print(PyPDF2.PdfReader(open('elements.pdf', 'rb'), strict=False).pages[9].extract_text())

For me, this shows:

6. Table  
A B C D
1 2 3 4
Long text in a single cell b1 c1 d1
Long text in a 3-cell column.
Quid igitur est? Inquit; audire
enim cupio, quid non probes
eius, a quo dissentias.Long text in a 3-cell row.
Quid enim est a Cyrenaicisque melius liberiusque
defenditur, tamen eius modi tempora incidunt, ut labore
et dolore disputandum putant.
Another 3-cell column. c3 d3
x
a5 c5 d5
aa bb cc dd
Table 1 : Header , footer , colspan and rowspan
Default alignment        +-----+-----+-----------+
      |     |     |           |
      +-----+-----+-----+-----+
      |     |     |     |     |
      +=====+=====+=====+=====+
      |     |     |     |     |
      +-----+-----+-----+-----+
      |     |                 |
      +     +-----+-----+-----+
      |     |     |     |     |
      +     +     +-----+-----+
      |     |     |           |
      +-----+     +-----+-----+
      |     |     |     |     |
      +=====+=====+=====+=====+
      |     |     |     |     |
      +-----+-----+-----+-----+Internet-Draft Xml2rfc Vocabulary V3 Elements July 2018
Author, et al. Expires 13 January 2019 Page 10

Looking at the page, i don't see why there should be a space between "Header" and the comma after it. does extract_text behave differently between PyPDF2 2.11.2 and some more recent version of pypdf ?

@dkg
Copy link
Contributor Author

dkg commented Jan 12, 2023

This is what the relevant page of the pdf looks like when i render it in evince, fwiw:

2023-01-12_17-55-14_window_610x768

@dkg
Copy link
Contributor Author

dkg commented Jan 13, 2023

fwiw, i've updated to PyPDF2 2.12.1 and i'm still seeing the same whitespace from PyPDF2.PdfReader(open('elements.pdf', 'rb'), strict=False).pages[9].extract_text()) -- i haven't tried the 3.x series yet

@jrlevine
Copy link

I'd prefer to ensure that the software i'm running is also in debian, as that ecosystem has a better history of long-term maintenance and reproducibility than any language-specific library ecosystem.

If you can get the maintainers to update all the packages you need, great, but in my experience it is a losing battle. The reason Python has venv's is precisely because you can't count on the default versions of packages on whatever system you're using to be the ones you want. After a while beating one's head against the wall becomes tedious.

FWIW I do all my work on MacOS and FreeBSD so I have no interest in Debian, as I suspect you have no interest in the platforms I use.

@MartinThoma
Copy link

@dkg If you don't want the software to change, you can also just pin the versions. Then it's completely under your control.

@MartinThoma
Copy link

i wonder whether it's possible for some version of PyPDF2 to inject whitespace between a word and its trailing comma in a table caption.

After reading the rest of your comments I realized that you're talking about text extraction. Please read https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces

In short: pypdf does not "inject" whitespaces, but has to guess where which amount of whitespaces need to go. Just like any PDF-to-text tool. It's an inherent flaw of the PDF format.

@dkg
Copy link
Contributor Author

dkg commented Jan 13, 2023

@MartinThoma wrote:

In short: pypdf does not "inject" whitespaces, but has to guess where which amount of whitespaces need to go. Just like any PDF-to-text tool. It's an inherent flaw of the PDF format.

Thanks for pointing this out! It does suggest that the test in xml2rfc is inherently flawed, then, since the test appears to assume that whitespace will be present on the same boundaries from the source document to the PDF document.

@kesara, can you suggest a way to improve the test to take into account this ambiguity?

@kesara
Copy link
Member

kesara commented Jan 13, 2023

Maybe the normalization function in https://github.com/ietf-tools/xml2rfc/blob/main/test.py#L503 can be do more to fix this case.

FWIW I've tested the xml2rfc==v3.15.3 with PyPDF2==2.11.2 and could not recreate this issue. (Not on Debian)

@MartinThoma Even though the PyPDF2 method guesses the whitespaces, Is it consistent? I mean does it always guess the same number of whitespace for a given PDF file?

@MartinThoma
Copy link

Even though the PyPDF2 method guesses the whitespaces, Is it consistent? I mean does it always guess the same number of whitespace for a given PDF file?

The text_extraction method of pypdf is deterministic. It doesn't have any random elements. That means for a specific version of pypdf and a specific PDF file, the text extraction will always give the same results.

However, we cannot make any guarantees that different versions of pypdf will always give the same result. We are improving pypdf all the time. Especially whitespaces are a known issue: https://github.com/py-pdf/pypdf/issues?q=label%3Awhitespace+

The problem is that fixing the whitespaces for one PDF might break it for many others. For this reason, we always need to test changes on many documents and we go with the changes that improve the situation for most PDF documents in our test dataset.

@kesara
Copy link
Member

kesara commented Jan 14, 2023

Thanks @MartinThoma
@dkg, I think that the issue you are having might be due to WeasyPrint rather than PyPDF2. What's the WeasyPrint version in your test environment?

@dkg
Copy link
Contributor Author

dkg commented Jan 15, 2023

I used weasyprint 57.1-1 to generate elements.pdf. It renders without whitespace between the words and the commas in the caption, and tools like pdf2txt (from pdfminer, debian package version 20220319+dfsg-1) don't insert whitespace.

@kesara
Copy link
Member

kesara commented Jan 17, 2023

That elements.pdf file does have extra spaces for me with both PyPDF2==2.11.2 and pypdf==3.2.1.

Table 1 : Header , footer , colspan and rowspan

But annoyingly, With both WeasyPrint==57.1 and WeasyPrint==57.2, tests don't fail for me.
If I create an elements.pdf file manually and read that with either PyPDF2==2.11.2 or pypdf==3.2.1, the text output contains extra spaces.

@dkg
Copy link
Contributor Author

dkg commented Jan 17, 2023

@kesara, thanks for testing it out and confirming that there is whitespace in some variants! Now that we have confirmed that the test is brittle, it sounds like we might have two different issues here:

  • the test is not generating the same underlying PDF bytestring as "manual" generation on your platform (can you be explicit about how you generated elements.pdf manually? i think i generated it with xml2rfc --allow-local-file-access --pdf tests/input/elements.xml -- is that it?)
  • the test is somehow differing on my platform compared to yours

The first point ought to be something you can run down locally, since it's all on the same system. is the test testing the wrong codepath for pdf generation so that it doesn't align with the CLI?

i'm not sure how to investigate the second point. are there any artifacts of the test process that i could generate on my own platform to help dig a little deeper into that comparison?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants