Parse error on Google sheets generated PDF #521
Comments
I think the problem has to do with improper parsing of free objects in the xref table. Here's the table from a problematic PDF:
You can see that the 5th line refers to a free object with an offset of 3. read() in pdf.py doesn't look at the object type, so it parses this non-existent object just like any other. I made the following change to read() in pdf.py, and it seems to work. I don't know if it's a proper fix though.
|
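For readers following along, here is a minimal sketch of the idea described in the comment above. It is not the commenter's actual change (which is not reproduced here) and not PyPDF2's code; the function name `parse_xref_table` and the sample table are hypothetical. It relies on the PDF rule that in a classic cross-reference table an entry marked `n` holds a byte offset, while an entry marked `f` holds the object number of the next free object, so a parser that treats a free entry's first field as an offset of 3 would seek into the `%PDF-1.4` header.

```python
import re

# One xref entry: 10-digit field, 5-digit generation, then 'n' (in use) or 'f' (free).
XREF_ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])")

# Hypothetical table for illustration only; not taken from the problematic PDF.
SAMPLE = (
    b"xref\n"
    b"0 5\n"
    b"0000000003 65535 f \n"
    b"0000000017 00000 n \n"
    b"0000000081 00000 n \n"
    b"0000000000 00001 f \n"
    b"0000000150 00000 n \n"
    b"trailer\n"
)


def parse_xref_table(data):
    """Return {(object number, generation): byte offset} for in-use entries only."""
    offsets = {}
    obj_num = None
    for raw in data.splitlines():
        line = raw.strip()
        if not line or line == b"xref":
            continue
        if line.startswith(b"trailer"):
            break
        parts = line.split()
        if len(parts) == 2:
            # Subsection header: first object number and entry count.
            obj_num = int(parts[0])
            continue
        match = XREF_ENTRY.fullmatch(line)
        if match is None or obj_num is None:
            raise ValueError(f"malformed xref line: {line!r}")
        field, generation, kind = match.groups()
        if kind == b"n":
            offsets[(obj_num, int(generation))] = int(field)
        # For 'f' entries the field is the next free object number, not an
        # offset, so nothing is recorded and nothing is read from the file.
        obj_num += 1
    return offsets


offsets = parse_xref_table(SAMPLE)
print(offsets)
```

With the sample table, objects 0 and 3 are free and are left out of the result, so their first field is never used as a file position.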
I got the exact same error with my own PDF downloaded from Google Sheets. The code is in https://github.com/cadu-leite/merge2pdf and the error is part of the tests. |
Could somebody write a minimal snippet of Python code that shows the issue? |
Not sure, but I think that's exactly what was written right above: the code is in https://github.com/cadu-leite/merge2pdf and the error is part of the tests. |
@cadu-leite You're linking to a repository. Could you please paste the relevant part in here, creating an MVCE? |
Using https://github.com/mstamy2/PyPDF2/files/4981701/pdf_sample_googlesheet_pages_02.pdf from @cadu-leite:
>>> from PyPDF2 import PdfReader, PdfWriter, PdfMerger
>>> reader = PdfReader("pdf_sample_googlesheet_pages_02.pdf")
>>> for page in reader.pages: page.extract_text()
...
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.0000000000000000000000000000000-8716957'
Invalid FloatObject b'0.00000000000-45474735'
Invalid FloatObject b'0.00000000000-45474735'
' 1\nabr. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n20\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n01\nmai. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n20\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\njun. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n21\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n01\njul. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n23\n\n\n'
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.0000000000000000000000000000000-8716957'
Invalid FloatObject b'0.000000000000-14210855'
' 2\njul. 20\n\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\nago. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n21\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\n\n'
>>> import PyPDF2; PyPDF2.__version__
'2.4.1'
Looking at the document, that looks OK. So text extraction works. |
I created a PDF by "printing" from Google sheets. When I try to merge the page into a PDF, I get the stack trace below. It looks like the parser is incorrectly backing up into the %PDF-1.4 comment. If I export the document as PDF using Apple Preview, it gets converted to a new %PDF-1.3 document that parses correctly.
Here's how the document starts:
There's a space after the % in the second line, and each word is on a separate line.