Handle duplicate Xrefs in documents (Fixes missing pages bugs) #533
@k00ni I think this is a valid solution to the missing pages (and probably other missing elements) bugs. Let me know what you think! I tried running through all reported bugs to see which ones this PR fixes, but it seems a lot of them are no longer relevant or are outdated.
@PrinsFrank thank you for taking the time and working on this.
Your patch is small but seems effective at solving the mentioned issue(s). I don't have enough knowledge of the PDF specification to evaluate this solution, which means we either need a bit of further research or someone who knows this part of the specification. Do you have this knowledge? I have the following points:
I appreciate that. If you found any bugs which might benefit from this patch, please list them here. Maybe their authors are still interested in a solution. Because this library was neglected in the past years, it is hard to keep track of all the bugs.
@k00ni Sadly I don't have much insight into the PDF specification. It seems to me that an xref should be unique within a document, but in this specific case it isn't: two xref ids are reused for 3 and 2 other objects, which explains the 'missing' pages; they were overwritten because the xref was used as a unique id. That would also explain why other parsers, like FPDI mentioned in the bug report, are unable to parse this file. The references are used later on, so when a duplicated id is referenced we don't know which object is actually meant. With this patch, a reference now resolves to the first object, because that one still ends with _0, the same id that was previously hardcoded. I suspect that any nested objects with duplicate ids now have invalid references, but I don't think there is a way to recover those. Objects in the root are no longer overwritten, and because of the way the array of pages is built, all pages are still in the correct order.

The example file is 47 MB, which is not very nice to commit to the repository. We have seen this same issue with other files, but the only specific one I can point to is an even larger file. As I don't think there is an easy way to create an xref collision by manually editing a document (there are also checksums involved), I have started a process to look through some 5000 PDFs we have available for simpler files with the same issue, checking for diffs between the old code and the new code. I will update this PR when I've found a smaller sample file.
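The dedup idea described above (the patch itself is PHP; this is only an illustrative Python sketch with hypothetical names) amounts to counting occurrences of each raw object id and suffixing the key with an index, so that existing references, which assume the `_0` key, still resolve to the first object:

```python
from collections import defaultdict

def index_xref_entries(xref_entries):
    """Assign each xref entry a unique key, even when object ids repeat.

    `xref_entries` is a hypothetical list of (object_id, offset) pairs as
    parsed from a cross-reference table; the real parser internals differ.
    """
    seen = defaultdict(int)  # how many times each raw object id has appeared
    table = {}
    for object_id, offset in xref_entries:
        # First occurrence gets suffix _0, so references that resolve to the
        # _0 key still point at the first object; later duplicates get
        # _1, _2, ... instead of overwriting it.
        table[f"{object_id}_{seen[object_id]}"] = offset
        seen[object_id] += 1
    return table
```

With a duplicated id `5`, the second entry no longer clobbers the first; both survive under distinct keys.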
Just a quick thought: it might already be sufficient to construct a basic string which leads to this error. You could add an if-clause at the place of your code changes and check when an index is reused. This might help to construct that string.
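The suggested if-clause could look something like this minimal sketch (Python for illustration only; the function name and the `(object_id, offset)` entry shape are assumptions, and a real check would live where the parser registers each xref entry):

```python
def duplicated_xref_ids(xref_entries):
    """Return object ids that occur more than once in an xref table."""
    seen = set()
    duplicates = []
    for object_id, _offset in xref_entries:
        if object_id in seen:  # the if-clause: this id was already registered
            duplicates.append(object_id)
        seen.add(object_id)
    return duplicates
```

Running it against a candidate test string's parsed entries would immediately show whether that string actually triggers the collision.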
@k00ni I am cleaning up my forks and just came across this open PR, so here's an update on this issue.

A few months ago I ran through a couple hundred files, comparing the old number of pages with the new number. I did find a bunch of files with the same issue, where it was also fixed by this patch, but all of them were as big as or a lot bigger than the file initially uploaded by my colleague @zimonh in #471. I then tried generating a reproduction as suggested, from some LaTeX files in a bunch of different PDF versions and with random combinations of conversion settings, but to no avail.

To try and understand why this PR fixes the issue, I took a deep dive into the spec and built a proof-of-concept parser on my own. I still don't understand why and how duplicate xrefs can exist; as far as I can see when parsing the sample files, there are not actually any duplicate xrefs in the file, but there is an issue with how those xrefs are read. That is as far as I got before switching jobs. Currently I don't have any time to look into this further: it has no priority for me right now, I'm not getting paid to fix this anymore, and it's a gnarly issue with too complex a spec for some spare time in my evenings.

We can either merge this with a comment explaining that it is not a root-cause fix and with the 47 MB file as a test case (which will impact developers checking out this repo), merge it without any test cases, or I can just close this PR. I'll leave this fork and branch open, as I know it's still used in production at my old job.
Thank you for taking the time and providing detailed background information. Maybe someone will pick this up and continue your work in the future.
Handle duplicate Xrefs in documents by keeping track of how many times we have seen an xref and adding an index to the objectId.
Fixes #471
Possibly also #473, #474