Parse error on Google sheets generated PDF #521

coppit · 2019-10-15T19:05:07Z

I created a PDF by "printing" from Google sheets. When I try to merge the page into a PDF, I get the stack trace below. It looks like the parser is incorrectly backing up into the %PDF-1.4 comment. If I export the document as PDF using Apple Preview, it gets converted to a new %PDF-1.3 document that parses correctly.

Traceback (most recent call last):
  File "/Users/dcoppit/documents/p4sw/sw/pvt/dcoppit/make_bom.py", line 257, in <module>
    make_pdf(LOOKUP[partner].get('doc', None), LOOKUP[partner]['sheet'])
  File "/Users/dcoppit/documents/p4sw/sw/pvt/dcoppit/make_bom.py", line 248, in make_pdf
    merge_pdfs(doc_temp_path, sheet_spares_temp_path)
  File "/Users/dcoppit/documents/p4sw/sw/pvt/dcoppit/make_bom.py", line 231, in merge_pdfs
    output_pdf_file.merge(insert_point, sheet_spares_temp_path.name, pages=(0,1))
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/merger.py", line 151, in merge
    outline = pdfr.getOutlines()
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1346, in getOutlines
    lines = catalog["/Outlines"]
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1599, in getObject
    idnum, generation = self.readObjectHeader(self.stream)
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1668, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'F-1.4'

Here's how the document starts:

%PDF-1.4
% âãÏÓ
4
0
obj
<<
/Type
/Catalog
/Names

There's a space after the % in the second line, and each word is on a separate line.

coppit · 2019-10-15T19:27:12Z

Here's a small PDF that demonstrates the problem.

coppit · 2019-10-15T20:56:52Z

I think the problem has to do with improper parsing of free objects in the xref table. Here's the table from a problematic PDF:

xref
0 12
0000000002 65535 f
0000000962 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
0000000160 00000 n
0000000287 00000 n
0000000453 00000 n
0000000819 00000 n
0000000728 00000 n
0000000747 00000 n
0000000767 00000 n

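You can see that the 5th line (the entry for object 2) is a free entry whose first field is 3. For a free entry that field is the number of the next free object, not a byte offset, but read() in pdf.py doesn't look at the entry type, so it records this non-existent object just like any other. Resolving it later seeks to byte 3 of the file, which lands inside the %PDF-1.4 header and produces the b'F-1.4' in the error above.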
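
To illustrate where b'F-1.4' comes from, here is a minimal sketch that treats the free entry's first field (3) as a byte offset into the start of the file and reads one whitespace-delimited token. The helper read_token is hypothetical and only approximates what PyPDF2's readObjectHeader does. The header bytes are an assumption reconstructed from the listing above, with the binary comment taken as the Latin-1 bytes of "âãÏÓ".

```python
import io

# First bytes of the file as listed in this issue (assumed reconstruction).
HEADER = b"%PDF-1.4\n% \xe2\xe3\xcf\xd3\n4\n0\nobj\n<<\n"


def read_token(stream):
    """Read bytes up to the next whitespace byte (hypothetical helper)."""
    token = b""
    while True:
        ch = stream.read(1)
        if not ch or ch.isspace():
            return token
        token += ch


stream = io.BytesIO(HEADER)
stream.seek(3)  # the "next free object" field of the free entry, misread as an offset
idnum = read_token(stream)
print(idnum)  # b'F-1.4'

try:
    int(idnum)
except ValueError as exc:
    print(exc)  # same message as the last line of the traceback
```

Seeking to offset 16 instead (the in-use entry for object 4) would land on the "4" that starts the first real object.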

I made the following change to read() in pdf.py, and it seems to work. I don't know if it's a proper fix though.

# offset, generation = line[:16].split(b_(" "))
offset, generation, kind = line[:18].split(b_(" "))
# Ignore free objects
if kind == b'f' and num > 0:
    cnt += 1
    num += 1
    continue
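
As a standalone illustration of the same idea, the sketch below parses the xref entries quoted above and keeps only the in-use ("n") ones. It is not PyPDF2's read() and not necessarily how the eventual fix works. The function name parse_xref_entries is made up for this example, and unlike the patch above it also skips object 0.

```python
# The twelve entries of the xref subsection "0 12" quoted in this issue.
XREF_ENTRIES = b"""\
0000000002 65535 f
0000000962 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
0000000160 00000 n
0000000287 00000 n
0000000453 00000 n
0000000819 00000 n
0000000728 00000 n
0000000747 00000 n
0000000767 00000 n
"""


def parse_xref_entries(lines, first_num=0):
    """Map object number -> (byte offset, generation) for in-use entries only."""
    in_use = {}
    for num, line in enumerate(lines, start=first_num):
        offset, generation, kind = line.split()
        if kind == b"f":
            # Free entry: the first field is the next free object number,
            # not a byte offset, so there is nothing to seek to.
            continue
        in_use[num] = (int(offset), int(generation))
    return in_use


table = parse_xref_entries(XREF_ENTRIES.splitlines())
print(sorted(set(range(12)) - set(table)))  # [0, 2, 3]
print(table[4])  # (16, 0)
```

With the free entries left out of the table, a reference to object 2 or 3 has no offset to seek to, so the caller has to decide how to handle the missing object instead of reading garbage from the header.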

cadu-leite · 2020-07-27T11:24:28Z

I got the exact same error with my own PDF downloaded from Google Sheets.
It is attached here:
pdf_sample_googlesheet_pages_02.pdf

The code is in https://github.com/cadu-leite/merge2pdf and the error is part of the tests.
https://github.com/cadu-leite/merge2pdf/blob/76a0ace2a10ad81ec03ad9cdbbcc11af2c18eaf4/tests/test_merge2pdf.py#L52

MartinThoma · 2022-04-07T16:29:00Z

Could somebody write a minimal snippet of Python code that shows the issue?

cadu-leite · 2022-04-10T12:49:49Z

> Could somebody write a minimal snippet of Python code that shows the issue?

Not sure, but I think that's exactly what was written right above:

> The code is in https://github.com/cadu-leite/merge2pdf and the error is part of the tests.
> https://github.com/cadu-leite/merge2pdf/blob/76a0ace2a10ad81ec03ad9cdbbcc11af2c18eaf4/tests/test_merge2pdf.py#L52

MartinThoma · 2022-04-10T13:06:56Z

@cadu-leite You're linking to a repository. Could you please paste the relevant part in here, creating an MVCE?

MartinThoma · 2022-06-30T13:08:34Z

Using https://github.com/mstamy2/PyPDF2/files/4981701/pdf_sample_googlesheet_pages_02.pdf from @cadu-leite :

>>> from PyPDF2 import PdfReader, PdfWriter, PdfMerger;reader = PdfReader("pdf_sample_googlesheet_pages_02.pdf")
>>> for page in reader.pages: page.extract_text()
... 
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.0000000000000000000000000000000-8716957'
Invalid FloatObject b'0.00000000000-45474735'
Invalid FloatObject b'0.00000000000-45474735'
' 1\nabr. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n20\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n01\nmai. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n20\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\njun. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n21\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n01\njul. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n23\n\n\n'
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.0000000000000000000000000000000-8716957'
Invalid FloatObject b'0.000000000000-14210855'
' 2\njul. 20\n\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\nago. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n21\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\n\n'
>>> import PyPDF2; PyPDF2.__version__
'2.4.1'

Looking at the document, that looks ok. So text extraction works.

MartinThoma · 2022-07-05T08:23:05Z

This issue was fixed by @Hatell via #1054. It will be part of PyPDF2>=2.4.2. I will make that release on PyPI probably this evening.

MartinThoma added the is-bug label Apr 7, 2022

MartinThoma mentioned this issue Jun 30, 2022

PDF from Google Sheet doesn't merge with PdfMerger when import_bookmarks is True #1034

Closed

Hatell mentioned this issue Jul 4, 2022

Resolve IndirectObject when it refers to a free entry. #1054

Merged

MartinThoma closed this as completed in 02c601c Jul 5, 2022

Demesmaeker mentioned this issue May 31, 2024

[FIX] pdf: avoid filling malformed PDF odoo/odoo#166901

Closed
