Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle duplicate Xrefs in documents (Fixes missing pages bugs) #533

Conversation

PrinsFrank
Copy link
Contributor

@PrinsFrank PrinsFrank commented May 5, 2022

Handle duplicate Xrefs in documents by keeping track of how many times we have seen an xref and adding an index to the objectId.

Fixes #471
Possibly also #473, #474

PrinsFrank added 3 commits May 5, 2022 20:52
…s we have seen an xref and adding an index to the objectId
…s we have seen an xref and adding an index to the objectId
…s we have seen an xref and adding an index to the objectId
@PrinsFrank
Copy link
Contributor Author

@k00ni I think this is a valid solution to the missing pages (and probably other missing elements) bugs. Let me know what you think! I tried running through all reported bugs to see which one this PR fixes, but it seems a lot of them are not relevant anymore or outdated.

@k00ni
Copy link
Collaborator

k00ni commented May 6, 2022

@PrinsFrank thank you for taking the time and working on this.

I think this is a valid solution to the missing pages (and probably other missing elements) bugs. Let me know what you think!

Your patch is small but seems effective to solve mentioned issue(s). I have not enough knowledge about PDF specification to appreciate this solution. This means, we may either need a bit further research or someone who knows this part of the specification. Do you have this knowledge?

I have the following points:

  1. Does it matter what the index actually is (numbers, strings)? At least it must be ordered, doesn't it?
  2. Please provide at least one test case which covers your code changes.

I tried running through all reported bugs to see which one this PR fixes, but it seems a lot of them are not relevant anymore or outdated.

I appreciate that. If you found some which might benefit from this patch, please list them here. Maybe their authors are still interested in a solution. Because this library got neglected in the past years it is hard to keep track of all the bugs.

@PrinsFrank
Copy link
Contributor Author

@k00ni Sadly I don't really have that much insight into pdf specifications. It seems to me that the xref should be unique within a document but isn't in this specific case. Two Xref Ids are reused for 3 and 2 other objects, which explains the 'missing' pages that are overwritten as the xref was used as a unique id. That would also explain why other parsers - like fpdi mentioned in the bug report - are unable to parse this file.

The references are used later on, so when a unique id is referenced we don't know what object is actually referenced. The way this works is now that the reference points to the first object because that is still ending with _0, the same id that was previously hardcoded. I suspect that any nested object with duplicate ids are now having invalid references, but I don't think there is a way te recover those. Objects in the root are now not overwritten anymore, and because of the way the array of pages is built all pages are still in the correct order.

The example file is 47MB, which is not very nice to commit in repository. We have seen this same issue with other files though, but I'm not sure what specific file besides an even larger file. As I don't think there is an easy way to generate a xref collision by manually editing a document as there are also checksums involved, I have started a process to look through some 5000 pdfs we have available to look for simpler files with the same issue. (Checking for diffs between the old code vs the new code). I will update this PR when I've found a smaller sample file.

@k00ni
Copy link
Collaborator

k00ni commented May 7, 2022

I have started a process to look through some 5000 pdfs we have available to look for simpler files with the same issue.

Just a quick thought: it could be sufficient already if you construct a basic string which leads to this error. You could add an if-clause at the place of your code changes and check when an index is reused. This might help to construct that string.

@k00ni k00ni added the stale needs decision label Jul 12, 2022
@PrinsFrank
Copy link
Contributor Author

@k00ni I am cleaning up my forks and just came across this open PR. I'll update you on this issue;

A few months ago I ran through a couple hundred files comparing the old number of pages with the new number. I did find a bunch of files with the same issues where the issue was also fixed with this patch, but all files were as big or a lot bigger than the file initially uploaded by my colleague @zimonh in #471. So I tried genereting a reproduction as suggested from some LaTeX files in a bunch of different PDF versions and with a random combination of conversion settings, but to no avail.

To try and understand why this PR fixes an issue, I took a deep dive into the spec and built a POC parser on my own. I still don't understand why and how duplicate xref can exist, and as far as I can see when parsing the sample files there are not actually any duplicate Xrefs in the file but there is an issue with getting those xrefs.

That is as far as I got before switching jobs. Currently I don't have any time to look into this further as it doesn't have any priority right now and I'm nog getting paid to fix this anymore. Besides, it's a gnarly issue, and too complex of a spec for some spare time in my evenings.

We can either merge this with a comment explaining that this is not a root cause fix and the 47MB file as a test case (which will impact developers checking out this repo), without any test cases or I can just close this PR. I'll leave this fork and branch open, as I know it's still used in production at my old job.

@k00ni
Copy link
Collaborator

k00ni commented Aug 12, 2022

Thank you for taking the time and providing detailed background information. Maybe someone will pick this up and continues your work in the future.

@k00ni k00ni closed this Aug 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing pages
2 participants