Parse error on Google sheets generated PDF #521

Closed
coppit opened this issue Oct 15, 2019 · 8 comments
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF


@coppit

coppit commented Oct 15, 2019

I created a PDF by "printing" from Google sheets. When I try to merge the page into a PDF, I get the stack trace below. It looks like the parser is incorrectly backing up into the %PDF-1.4 comment. If I export the document as PDF using Apple Preview, it gets converted to a new %PDF-1.3 document that parses correctly.

Traceback (most recent call last):
  File "/Users/dcoppit/documents/p4sw/sw/pvt/dcoppit/make_bom.py", line 257, in <module>
    make_pdf(LOOKUP[partner].get('doc', None), LOOKUP[partner]['sheet'])
  File "/Users/dcoppit/documents/p4sw/sw/pvt/dcoppit/make_bom.py", line 248, in make_pdf
    merge_pdfs(doc_temp_path, sheet_spares_temp_path)
  File "/Users/dcoppit/documents/p4sw/sw/pvt/dcoppit/make_bom.py", line 231, in merge_pdfs
    output_pdf_file.merge(insert_point, sheet_spares_temp_path.name, pages=(0,1))
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/merger.py", line 151, in merge
    outline = pdfr.getOutlines()
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1346, in getOutlines
    lines = catalog["/Outlines"]
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1599, in getObject
    idnum, generation = self.readObjectHeader(self.stream)
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1668, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'F-1.4'

Here's how the document starts:

%PDF-1.4
% âãÏÓ
4
0
obj
<<
/Type
/Catalog
/Names

There's a space after the % in the second line, and each word is on a separate line.

@coppit
Author

coppit commented Oct 15, 2019

I think the problem has to do with improper parsing of free objects in the xref table. Here's the table from a problematic PDF:

xref
0 12
0000000002 65535 f
0000000962 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
0000000160 00000 n
0000000287 00000 n
0000000453 00000 n
0000000819 00000 n
0000000728 00000 n
0000000747 00000 n
0000000767 00000 n

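You can see that the 5th line (the entry for object 2) is a free entry whose first field is 3. For a free entry that field is the object number of the next free object, not a byte offset. read() in pdf.py doesn't look at the entry type, so it stores this non-existent object just like any other, and a later lookup seeks to byte offset 3, which lies inside the %PDF-1.4 header.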
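This reading fits the error message: if the 3 from the free entry is used as a byte offset, the parser starts reading in the middle of the header comment. A minimal sketch of that, using only the header bytes quoted above; `read_token_at` is a hypothetical helper written for this illustration, not PyPDF2 code:

```python
import io

# First bytes of the file as quoted earlier in this issue
# (the binary comment line is left out; it does not affect offset 3).
HEADER = b"%PDF-1.4\n4\n0\nobj\n"


def read_token_at(stream, offset):
    """Hypothetical helper: seek to offset and read up to the next whitespace."""
    stream.seek(offset)
    token = b""
    while True:
        ch = stream.read(1)
        if not ch or ch.isspace():
            break
        token += ch
    return token


token = read_token_at(io.BytesIO(HEADER), 3)
print(token)

try:
    int(token)  # roughly what readObjectHeader does with the object number
except ValueError as exc:
    print(exc)
```

The token read at offset 3 is the tail of `%PDF-1.4`, which cannot be converted to an integer.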

I made the following change to read() in pdf.py, and it seems to work. I don't know if it's a proper fix though.

    # offset, generation = line[:16].split(b_(" "))
    offset, generation, kind = line[:18].split(b_(" "))
    # Ignore free objects
    if kind == b'f' and num > 0:
        cnt += 1
        num += 1
        continue
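The idea behind this change can be tried outside PyPDF2. The following is a standalone sketch, not the library's code and not the fix that was eventually merged: `parse_xref` is a hypothetical function that handles a single xref subsection and keeps only in-use (`n`) entries, using the table quoted above as input.

```python
# The xref table quoted above, as bytes.
XREF = b"""xref
0 12
0000000002 65535 f
0000000962 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
0000000160 00000 n
0000000287 00000 n
0000000453 00000 n
0000000819 00000 n
0000000728 00000 n
0000000747 00000 n
0000000767 00000 n
"""


def parse_xref(data):
    """Return {object number: (byte offset, generation)} for in-use entries.

    Hypothetical helper; assumes exactly one subsection after the
    'xref' keyword and newline-separated entries.
    """
    lines = data.split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("not an xref section")
    first, count = (int(x) for x in lines[1].split())
    table = {}
    for num, line in enumerate(lines[2:2 + count], start=first):
        field, generation, kind = line.split()
        if kind == b"f":
            # Free entry: the first field is the next free object number,
            # not a byte offset, so it must not be stored as one.
            continue
        table[num] = (int(field), int(generation))
    return table


table = parse_xref(XREF)
print(sorted(table))
print(table[4])
```

With the free entries skipped, objects 0, 2 and 3 never get an offset, so nothing can later seek to byte 3. Unlike the patch above, this sketch also skips object 0, which is always the head of the free list.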

@cadu-leite

I got the exact same error with my own PDF downloaded from Google Sheets.
It is attached here:
pdf_sample_googlesheet_pages_02.pdf

The code is in https://github.com/cadu-leite/merge2pdf and the error is part of the tests:
https://github.com/cadu-leite/merge2pdf/blob/76a0ace2a10ad81ec03ad9cdbbcc11af2c18eaf4/tests/test_merge2pdf.py#L52

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Apr 7, 2022
@MartinThoma
Member

Could somebody write a minimal snippet of Python code that shows the issue?

@cadu-leite

Could somebody write a minimal snippet of Python code that shows the issue?

Not sure, but I think that's exactly what was written right above:

the code is in https://github.com/cadu-leite/merge2pdf and the error is part of the tests.
https://github.com/cadu-leite/merge2pdf/blob/76a0ace2a10ad81ec03ad9cdbbcc11af2c18eaf4/tests/test_merge2pdf.py#L52

@MartinThoma
Member

@cadu-leite You're linking to a repository. Could you please paste the relevant part in here, creating an MCVE?

@MartinThoma
Member

Using https://github.com/mstamy2/PyPDF2/files/4981701/pdf_sample_googlesheet_pages_02.pdf from @cadu-leite :

>>> from PyPDF2 import PdfReader, PdfWriter, PdfMerger;reader = PdfReader("pdf_sample_googlesheet_pages_02.pdf")
>>> for page in reader.pages: page.extract_text()
... 
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.0000000000000000000000000000000-8716957'
Invalid FloatObject b'0.00000000000-45474735'
Invalid FloatObject b'0.00000000000-45474735'
' 1\nabr. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n20\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n01\nmai. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n20\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\njun. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n21\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n01\njul. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n23\n\n\n'
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.0000000000000000000000000000000-8716957'
Invalid FloatObject b'0.000000000000-14210855'
' 2\njul. 20\n\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\nago. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n21\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\n\n'
>>> import PyPDF2; PyPDF2.__version__
'2.4.1'

Looking at the document, that looks ok. So text extraction works.

@MartinThoma
Member

This issue was fixed by @Hatell via #1054. It will be part of PyPDF2>=2.4.2. I will make that release on PyPI probably this evening.
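To check whether an installed version already includes the fix, a comparison against 2.4.2 is enough. A small sketch, assuming purely numeric dotted version strings; `version_tuple` and `has_fix` are hypothetical helper names, not part of PyPDF2:

```python
def version_tuple(version):
    """Turn '2.4.1' into (2, 4, 1); assumes numeric dotted versions only."""
    return tuple(int(part) for part in version.split("."))


def has_fix(version):
    """True if the version is at least 2.4.2, the release named above."""
    return version_tuple(version) >= (2, 4, 2)


print(has_fix("2.4.1"))
print(has_fix("2.4.2"))
```

In practice the argument would be `PyPDF2.__version__`, as printed in the session above.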
