Handle duplicate Xrefs in documents (Fixes missing pages bugs) #533

PrinsFrank · 2022-05-05T19:04:51Z

Handle duplicate Xrefs in documents by keeping track of how many times we have seen an xref and adding an index to the objectId.

Fixes #471
Possibly also #473, #474
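
As a rough illustration of the idea described above (not the actual patch, which lives in the library's PHP code), the sketch below stores each parsed object under a key built from its xref ID plus a per-ID counter. The function name `index_objects` and the `(xref_id, payload)` input shape are assumptions made for this example only.

```python
from collections import defaultdict


def index_objects(raw_objects):
    """Store (xref_id, payload) pairs under '<xref_id>_<n>' keys.

    Hypothetical helper: n counts how often the same xref_id was seen before,
    so the first occurrence keeps the '_0' suffix.
    """
    seen = defaultdict(int)  # number of times each xref id has appeared so far
    objects = {}
    for xref_id, payload in raw_objects:
        key = f"{xref_id}_{seen[xref_id]}"
        objects[key] = payload
        seen[xref_id] += 1
    return objects


# Made-up input in which xref id "7" appears twice.
raw = [("7", "page 1"), ("8", "page 2"), ("7", "page 3")]
indexed = index_objects(raw)
print(indexed)  # {'7_0': 'page 1', '8_0': 'page 2', '7_1': 'page 3'}
```

With this scheme the second object with ID 7 is kept as `7_1` rather than replacing the first one.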

PrinsFrank · 2022-05-05T19:18:06Z

@k00ni I think this is a valid solution to the missing pages (and probably other missing elements) bugs. Let me know what you think! I tried running through all reported bugs to see which one this PR fixes, but it seems a lot of them are not relevant anymore or outdated.

k00ni · 2022-05-06T09:57:05Z

@PrinsFrank thank you for taking the time and working on this.

> I think this is a valid solution to the missing pages (and probably other missing elements) bugs. Let me know what you think!

Your patch is small but seems to effectively solve the mentioned issue(s). I don't have enough knowledge of the PDF specification to evaluate this solution. This means we may need either a bit of further research or someone who knows this part of the specification. Do you have this knowledge?

I have the following points:

- Does it matter what the index actually is (numbers, strings)? At least it must be ordered, doesn't it?
- Please provide at least one test case which covers your code changes.

> I tried running through all reported bugs to see which one this PR fixes, but it seems a lot of them are not relevant anymore or outdated.

I appreciate that. If you found some which might benefit from this patch, please list them here. Maybe their authors are still interested in a solution. Because this library was neglected in past years, it is hard to keep track of all the bugs.

PrinsFrank · 2022-05-06T17:31:50Z

@k00ni Sadly, I don't have much insight into the PDF specification. It seems to me that an xref should be unique within a document, but it isn't in this specific case. Two xref IDs are reused, for 3 and 2 other objects respectively, which explains the 'missing' pages: they are overwritten because the xref was used as a unique ID. That would also explain why other parsers, like the FPDI mentioned in the bug report, are unable to parse this file.

The references are used later on, so when one of these supposedly unique IDs is referenced, we don't know which object is actually meant. The way this works now is that the reference points to the first object, because that one still ends with _0, the same suffix that was previously hardcoded. I suspect that any nested objects with duplicate IDs now have invalid references, but I don't think there is a way to recover those. Objects in the root are no longer overwritten, and because of the way the array of pages is built, all pages are still in the correct order.
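
To make the behaviour described above concrete, here is a minimal sketch contrasting the previously hardcoded `_0` suffix with the indexed keys. The helpers `store_hardcoded` and `resolve` and the sample data are hypothetical and only mimic the behaviour discussed in this thread; they are not the library's API.

```python
def store_hardcoded(raw_objects):
    """Old behaviour as described in this thread: every key ends with '_0'."""
    objects = {}
    for xref_id, payload in raw_objects:
        objects[f"{xref_id}_0"] = payload  # a later duplicate overwrites the earlier object
    return objects


def resolve(objects, xref_id):
    """References carry no index, so they always look up the '_0' entry."""
    return objects.get(f"{xref_id}_0")


raw = [("7", "page 1"), ("8", "page 2"), ("7", "page 3")]

old = store_hardcoded(raw)
print(len(old))           # 2: one object was lost
print(resolve(old, "7"))  # page 3: the reference silently points at the last duplicate

# The same objects stored with a per-id index (written out literally here).
indexed = {"7_0": "page 1", "8_0": "page 2", "7_1": "page 3"}
print(len(indexed))           # 3: nothing is overwritten
print(resolve(indexed, "7"))  # page 1: the reference points at the first object
```

The sketch also shows the limitation mentioned above: nothing ever resolves to `7_1`, so a reference that was meant for the second object with ID 7 cannot be recovered this way.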

The example file is 47MB, which is not very nice to commit to the repository. We have seen this same issue with other files, but the only specific one I know of is an even larger file. I don't think there is an easy way to produce an xref collision by manually editing a document, as there are also checksums involved, so I have started a process to look through some 5000 PDFs we have available for simpler files with the same issue (checking for diffs between the old code and the new code). I will update this PR when I've found a smaller sample file.
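
A search like the one described above could be sketched as follows. This is only an outline under stated assumptions: `count_pages_old` and `count_pages_new` are hypothetical callables standing in for "parse with the old code" and "parse with the patched code", and the file names and page counts in the usage example are made up.

```python
def files_with_changed_page_count(paths, count_pages_old, count_pages_new):
    """Return (path, old_count, new_count) for every file where the two parsers disagree."""
    changed = []
    for path in paths:
        old = count_pages_old(path)
        new = count_pages_new(path)
        if old != new:
            changed.append((path, old, new))
    return changed


# Stand-in page counts; a real run would call the two parser versions instead.
old_counts = {"a.pdf": 10, "b.pdf": 3, "c.pdf": 42}
new_counts = {"a.pdf": 10, "b.pdf": 5, "c.pdf": 42}

affected = files_with_changed_page_count(
    sorted(old_counts), old_counts.get, new_counts.get
)
print(affected)  # [('b.pdf', 3, 5)]
```

Sorting the hits by file size afterwards would surface the smallest candidate for a test fixture first.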

k00ni · 2022-05-07T06:44:57Z

> I have started a process to look through some 5000 pdfs we have available to look for simpler files with the same issue.

Just a quick thought: it could be sufficient already if you construct a basic string which leads to this error. You could add an if-clause at the place of your code changes and check when an index is reused. This might help to construct that string.
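
The check suggested above might look roughly like this: count how often each xref ID is encountered and report the ones seen more than once. The function name `find_reused_ids` and the sample IDs are assumptions for illustration; in the library the check would sit next to the changed PHP code.

```python
from collections import Counter


def find_reused_ids(xref_ids):
    """Return {xref_id: occurrences} for every id that appears more than once."""
    counts = Counter(xref_ids)
    return {xref_id: n for xref_id, n in counts.items() if n > 1}


# Made-up sequence of ids in the order a parser might encounter them.
seen_ids = ["7", "8", "7", "9", "8", "7"]
reused = find_reused_ids(seen_ids)
print(reused)  # {'7': 3, '8': 2}
```

Logging the raw bytes around each reused ID at that point could help in constructing a small string that triggers the same condition.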

PrinsFrank · 2022-07-19T22:35:16Z

@k00ni I am cleaning up my forks and just came across this open PR, so here is an update on this issue.

A few months ago I ran through a couple hundred files, comparing the old number of pages with the new number. I did find a bunch of files with the same issue that this patch also fixed, but all of them were as big as or a lot bigger than the file initially uploaded by my colleague @zimonh in #471. So I tried generating a reproduction, as suggested, from some LaTeX files in a bunch of different PDF versions and with a random combination of conversion settings, but to no avail.

To try to understand why this PR fixes the issue, I took a deep dive into the spec and built a POC parser of my own. I still don't understand why and how duplicate xrefs can exist. As far as I can see when parsing the sample files, there are not actually any duplicate xrefs in the file; instead, there is an issue with how those xrefs are retrieved.

That is as far as I got before switching jobs. Currently I don't have any time to look into this further, as it doesn't have any priority right now and I'm not getting paid to fix this anymore. Besides, it's a gnarly issue, and the spec is too complex for some spare time in my evenings.

We can either merge this with a comment explaining that this is not a root-cause fix and with the 47MB file as a test case (which will impact developers checking out this repo), merge it without any test cases, or I can just close this PR. I'll leave this fork and branch open, as I know it's still used in production at my old job.

k00ni · 2022-08-12T06:50:06Z

Thank you for taking the time and providing detailed background information. Maybe someone will pick this up and continue your work in the future.

PrinsFrank added 3 commits May 5, 2022 20:52

Handle duplicate Xrefs in documents by keeping track of how many times we have seen an xref and adding an index to the objectId

fcbd13c

Handle duplicate Xrefs in documents by keeping track of how many times we have seen an xref and adding an index to the objectId

8de5aac

Handle duplicate Xrefs in documents by keeping track of how many times we have seen an xref and adding an index to the objectId

3b6f97e

Increment counter before using it as an index instead of after

2edf7e5

k00ni added fix tests required labels May 6, 2022

Reset xrefIndices when parsing a new file to prevent colissions between files

8676b26

k00ni added the stale needs decision label Jul 12, 2022

This was referenced Jul 12, 2022

Another instance of missing pages #473

Open

PDF only being partly read. #474

Closed

k00ni closed this Aug 12, 2022
