BUG: clean_forms function cause infinite looping if elt["/Resources"] have circular relation #2477
Thanks for the PR. Are you able to provide a corresponding test file privately to @MartinThoma or strip one of the files down to further use it inside the tests and thus ensure that this keeps being fixed and coverage does not drop?
Hi @stefan6419846, these PDF files are private, so I cannot share them with you. However, if I encounter other files with similar issues, I will send them to you later. I hope you can incorporate this code so that I can simply install PyPDF with
Apparently this "just" exceeds the recursion depth instead of blocking completely? At least this is what my experimental code to reproduce this issue indicates. (My code still breaks with this PR, but as it only exploits this with custom Python-based data structures and no real PDF file, I do not consider this essential for merging.)
Unless @pubpub-zz states that your fix might have negative impacts, I plan on merging it tomorrow. I would still appreciate a corresponding unit/integration test if you are able to provide one, even if it's in a separate PR later on.
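The recursion-depth behaviour mentioned above can be reproduced without a real PDF. The following is only a minimal sketch using plain Python dicts; `walk_resources` and `hits_recursion_limit` are hypothetical helpers and are not part of pypdf.

```python
def walk_resources(elt):
    """Naive recursive walk with no cycle protection (hypothetical helper)."""
    resources = elt.get("/Resources", {})
    for child in resources.get("/XObject", {}).values():
        walk_resources(child)


def hits_recursion_limit(elt):
    """Return True if walking elt exceeds Python's recursion limit."""
    try:
        walk_resources(elt)
    except RecursionError:
        return True
    return False


# A form whose resources point back to the form itself (circular relation).
form = {"/Subtype": "/Form", "/Resources": {"/XObject": {}}}
form["/Resources"]["/XObject"]["/Fm0"] = form

# The same shape without the back-reference, for comparison.
plain_form = {"/Subtype": "/Form", "/Resources": {"/XObject": {}}}
```

With this setup, the circular structure raises `RecursionError` rather than hanging, while the acyclic one is walked normally.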
If you look at the code, there is already a stack mechanism designed to prevent infinite loops. I therefore do not understand why we need to 'duplicate' it with a new list. This should help us to understand what is inside. Currently, I do not recommend merging.
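For readers unfamiliar with this kind of guard, here is a rough sketch of a stack-based cycle check on plain dicts. It is not the pypdf implementation: `walk_with_stack` is a hypothetical name, and the identity comparison (`is`) is an assumption of the sketch.

```python
def walk_with_stack(elt, stack=()):
    """Count visited elements, skipping any element already on the current path."""
    # Assumption: the same Python object is returned each time an element is reached.
    if any(elt is ancestor for ancestor in stack):
        return 0
    count = 1
    for child in elt.get("/Resources", {}).get("/XObject", {}).values():
        count += walk_with_stack(child, stack + (elt,))
    return count


# Two forms that reference each other through their resources.
form_a = {"/Resources": {"/XObject": {}}}
form_b = {"/Resources": {"/XObject": {}}}
form_a["/Resources"]["/XObject"]["/B"] = form_b
form_b["/Resources"]["/XObject"]["/A"] = form_a

visited_count = walk_with_stack(form_a)
```

A guard like this only terminates when revisiting an element yields an object that the check recognises as already being on the stack.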
I think I have found the root cause for this. I have printed to check if
Therefore, my proposed solution is to also include
Hi @pubpub-zz, could you help me review the screenshot from PDFBox?
Review in progress
a) First, I'm confused by your output: the infinite loop is within your PDF and is not normal.
Hi @pubpub-zz, here is the log from your code. I had to edit it a little to avoid an error when printing the ContentStream.
a) I have some trouble understanding your traces: your first line "elt : {'/Type': ..." should indicate an indirect_object.
b) The "elt ---" lines indicate that the objects are not indirect_objects, which means that they are newly generated. I do not understand why, but it explains the infinite loop.
Hi @pubpub-zz,
a) Here is the test code I am using:
b) Here is the log I got with the new code you provided:
I am starting to understand.
Thank you very much. Somehow, using
Can you confirm that it is only this last proposal that fixes the issue?
closes py-pdf#2474 analysis in py-pdf#2477
@syanng, it would be great to have a test file. Could you try to produce one by removing the text and replacing all images with a dummy image (looping through the images in each page and replacing each with a test image)?
Yes, the issue has been fixed. Thank you very much. For the test file, could you please guide me on how to replace the text and image content with dummy text and dummy images?
You might want to have a look at the
Fixed in #2505
I have several PDF files from customers, and using `remove_text` can cause infinite looping. Upon investigation, I discovered a corner case where `elt["/Resources"]` has a circular relation, which can result in calling `clean_forms(content, stack + [elt])` infinitely.

I proposed keeping a memory variable `visited_resources` to keep track of which `elt["/Resources"]` has been processed and to avoid infinite looping.

The corner case files are private and cannot be shared, but I believe many people would encounter the same problem.
Closes #2474
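
To illustrate the `visited_resources` idea from the description, here is a minimal sketch on plain dicts rather than real PDF objects. It is not the actual patch: `clean_forms_sketch` is a hypothetical function, and tracking resources by `id()` in a set is an assumption made for the example.

```python
def clean_forms_sketch(elt, visited_resources=None):
    """Return the XObject names reached from elt, visiting each resources dict once."""
    if visited_resources is None:
        visited_resources = set()
    resources = elt.get("/Resources")
    # Skip resources that were already processed, which breaks circular relations.
    if resources is None or id(resources) in visited_resources:
        return []
    visited_resources.add(id(resources))
    processed = []
    for name, child in resources.get("/XObject", {}).items():
        processed.append(name)
        processed.extend(clean_forms_sketch(child, visited_resources))
    return processed


# Two forms sharing one resources dict that refers back to both of them.
shared = {"/XObject": {}}
form_a = {"/Resources": shared}
form_b = {"/Resources": shared}
shared["/XObject"]["/A"] = form_a
shared["/XObject"]["/B"] = form_b

page = {"/Resources": {"/XObject": {"/Fm0": form_a}}}
names = clean_forms_sketch(page)
```

Here the walk terminates because the shared resources dict is processed only on first encounter, even though it is reachable from itself.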