Bookmarks/Links drop when combining PDFs #44
Comments
For context: I have one PDF file with a cover page and TOC, and a group of other "body" PDFs that are the documents referenced in the TOC. I'm trying to merge the cover/TOC PDF with the "body" PDFs, so that the links in the TOC point to the appropriate page in the final PDF and the final PDF has proper bookmarks. I can think of two approaches:
The issue stated above referred to approach #1: I can join the PDFs fine, but the links/bookmarks from the TOC/cover PDF are lost in the final PDF. |
Hi Andrew, Thank you for bringing these issues to my attention. Allow me to point out that CombinePDF creates a new PDF file, so I doubt that bookmarks are transferable, although that could be an interesting feature to have. I will try looking into bookmarks when done with the TOC. As for keeping named destinations intact, it's definitely on the roadmap. The issue at the moment is that ad hoc links within the PDF file (the links using the PDF's "name tree") are preserved, but TOC links (links based on the entries in the table of contents) break, since the original table of contents is discarded. I definitely want to merge the TOC of different PDF files as well, perhaps adding a new "root" for each file (since the TOC is a semi-complex binary tree where each parent must provide data about its children), but this requires me to dig deeper into the standard again and I'm in the middle of a few projects that take most of my time... I'm keeping this open. If you have any further ideas about implementing these features, let me know. I'll keep you posted when I get to this. Bo. |
With ToC, do you mean the outline 12.3.3 or named destinations 8.2.1? We're currently facing the problem that our outline went missing. |
When talking about the TOC, I was referring to the outline. Named destinations are used for links inside a page. The outline is used as an "external" navigation tool. I'm assuming the outline will need to be extended when merging outlines, so that a root tree node is created and that this root tree node expands each "branch" to include all the data in the original document and any imported document. On the other hand, this root node would need to recognize CombinePDF root nodes in order to merge with them rather than swallow them. |
Is there a workaround? At least to not lose the outline of the main document? |
There's no workaround at the moment, as preserving the data and moving it around isn't as simple as a single line of code. At the moment, the ToC is lost during the parsing stage, unlike the way the named destinations are preserved. After the parsing, the data will need to be saved to the PDF object. Then, the ToC would need to get updated during the import of new objects, a similar process happens with the names object and the form data. Also, the ToC might reference named destinations and it might use conflicting keywords... i.e. names are updated to prevent conflicts and the ToC might need to go through a similar process. The last step to write would be to get the ToC into the new catalog/PDF object, similar to the way the names object is added for the rendering. I had tons of stuff on my plate, so I didn't get to it, but it is definitely possible to code this... it's just more complex of an undertaking. |
Got kind of stuck.. or lost, somewhere in the outlines hash actual_value(@outlines).update actual_value(data.outlines_object), &self.class.method(:hash_merge_new_no_page) "seems not to work" as in SystemStackError - stack level too deep from calling I'd appreciate a little nudge in the right direction if you don't mind :) This is what the beginning of the outlines hash looks like: it starts with obvious and useful data, but deeper in there is stuff like fonts, annots, MediaBoxes and so on.
|
I don't think you're lost... yet :-) I think the issue is the recursive nature of the update function. The Outline hash references the pages which, in turn, reference the catalog, which brings us back to the outline... But this isn't the issue because the There must be another branch of data that causes recursive updating. It seems that each node in the outline tree references its parent node ( I'm guessing a manual update scheme should be implemented or a different hash update function should be used to resolve data conflicts (instead of |
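The recursion problem described above can be reproduced with plain Ruby hashes. The following is only a minimal sketch, not CombinePDF's code: `deep_merge_once` is a hypothetical helper, and the two tiny "outline" hashes are made up for illustration. It remembers which pairs of hashes were already merged (by object identity), so a back-reference such as `:Parent` cannot trigger endless recursion.

```ruby
require 'set'

# Hypothetical helper (not part of CombinePDF): recursively merges `src` into
# `dest` in place, skipping any pair of hashes that was already visited.
def deep_merge_once(dest, src, seen = Set.new)
  key = [dest.object_id, src.object_id]
  return dest if seen.include?(key)
  seen << key
  src.each do |k, v|
    if dest[k].is_a?(Hash) && v.is_a?(Hash)
      deep_merge_once(dest[k], v, seen)
    elsif !dest.key?(k)
      dest[k] = v # only add missing keys, never overwrite existing data
    end
  end
  dest
end

# Two tiny "outline" hashes whose child nodes point back at their parent
a = { Type: :Outlines, Count: 1 }
a[:First] = { Title: 'A1', Parent: a }
b = { Type: :Outlines, Count: 1 }
b[:First] = { Title: 'B1', Dest: :page_1, Parent: b }

deep_merge_once(a, b) # returns instead of raising SystemStackError
```

Without the `seen` set, following `:First` and then `:Parent` would bounce between the same two hashes until the stack overflows, which matches the SystemStackError reported above.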
Seems like that's the last piece missing.. How to merge the outlines |
I'm sorry it's taking me a while to answer, it's a busy week for me. To resolve this we need to read through the PDF specification. Section 12.3 of the specs deals with outlines and I think this is the part we should be looking at. When merging the data, we should make it fit the specs, either by:
At this point, named destinations (see 12.3.2.3 in the specs) need to be reviewed, to make sure we don't have naming conflicts. Similar code already exists and could probably be adapted for the outline. It seems to me that in order to combine the outline (which is a linked list), the code could "walk" the list, updating the The code will probably need to walk only the root of the outline (assuming branches of the outline remain unchanged). If the PDF is injected in the middle, the I have no idea how to handle the middle case, as I have no idea how to discover which "bookmarks" reference the page after which the PDF is injected. |
Well, I was able to append the second outline to the original one so far (here: ee14937). Also, I might be wrong here, but I am pretty sure that we don't need to do the same thing that has to be done for the named destinations, named destinations literally have a 'word' as link/anchor while the bookmarks use page links, in my tests so far, even merging 2 identical Pdfs didn't cause any problems with names or page links. Edit/Add: Depending on what I find in the specifications, I might try to add a parent node to the new outline, so it is easier to identify an inserted part. |
Hi Stefan, I've been working on a lot of re-writes and bug corrections for resolving lost form data, decryption issues and the like... It actually brought some in-depth changes, so things might be easier now. For instance, the reference resolution should be both faster and easier... but it is implemented differently, so (I'm sorry) some of the code you wrote will need to be revised. For instance, you have this line:
@objects << catalog
# [...]
add_referenced catalog[:Outlines], false
which will break under the new implementation (and isn't required, everything in the catalog is automatically processed). This includes a new recursion (or repetition) protection helper at I hope this will help to "walk" the outline. After reading the code for the I came up with this untested (and probably broken) destructive update pattern (updates the data in place).
def merge_outlines(old_data, new_data, position)
  old_data = actual_object(old_data)
  new_data = actual_object(new_data).dup
  if old_data.empty?
    # old_data is a reference to the actual object,
    # so if we update old_data, we're done, no need to take any further action
    old_data.update new_data
  else
    old_data[:Count] += new_data[:Count]
    # walk the Hash here ...
    # I'm just using the start / finish position for now...
    # FIXME to implement an insert in the middle of the file?
    prev = nil
    pos = first = actual_object(((position < 0) ? new_data : old_data)[:First])
    median = actual_object(((position < 0) ? old_data : new_data)[:First])
    last = actual_object(((position < 0) ? old_data : new_data)[:Last])
    first = old_data[:First] = {is_reference_only: true, referenced_object: first}
    last = old_data[:Last] = {is_reference_only: true, referenced_object: last}
    parent = {is_reference_only: true, referenced_object: old_data}
    # the walking
    while(pos)
      pos[:First] = first # already a reference
      pos[:Last] = last # already a reference
      pos[:Parent] = parent # already a reference
      if(prev)
        pos[:Prev] = {is_reference_only: true, referenced_object: prev}
      else
        pos.delete :Prev
      end
      # connect the two outlines
      if(pos[:Next].nil?)
        pos[:Next] = median
        median = nil
      end
      prev = pos
      pos = actual_object(pos[:Next])
    end
    # make sure the last object doesn't have the :Next property
    prev.delete :Next
  end
  # print_dat_outline(old_data)
  return nil # no need to return the data, the update had taken place destructively.
end
I wonder if you tried a similar approach...? I know you already have something working, but could you test it out and see if there's a performance difference? I know there are a few things missing:
The upside of this approach is that it minimizes temporary objects and function calls (which are relatively expensive in Ruby), editing the data "in place". I also think it's easier to maintain, but it might be because my mind is slow at computing function call jumps.
Named destinations?
I understand the PDF files you tested with all used page references as TOC destinations. This is great news. However, I think the standard states that named destinations (using either Name objects (Ruby symbols) or Strings) are allowed (see sec. 12.3.2.3). I'm not sure, but it could be that we're covered either way, we just need to check that the existing name conflict resolution logic will also be applied to the names in the TOC (which it probably would).
Edit/Add: At first I thought that would be cool, but then there is the question of naming that header position (we could probably use the On one hand, I'm not sure this is what I would want to see when merging a file. On the other hand, this could also be used to create an outline for files without an outline, so all inserted PDF files appear on the outline and the reader can jump directly to this file or the other. But this will probably only work if files are inserted at the beginning or the end. I don't see how this can be implemented for PDF files inserted in the middle of a page stream. In practice, I think this idea should be scratched in favor of practicality. |
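To check whether an outline uses named destinations at all, the outline can be walked and every `:Dest` that is a name (Ruby Symbol) or a String collected. This is a hedged sketch only: `named_dests` is a hypothetical helper, and the outline is modelled with plain nested hashes rather than CombinePDF's reference wrappers (`is_reference_only` / `referenced_object`). Destinations stored inside an action dictionary are, as far as I understand the spec, another possibility that this sketch ignores.

```ruby
# Hypothetical helper: walks an outline given as plain nested hashes
# (:First / :Next links) and returns every :Dest that is a name or a string
# rather than an explicit page array.
def named_dests(node, found = [])
  item = node[:First]
  while item
    dest = item[:Dest]
    found << dest if dest.is_a?(Symbol) || dest.is_a?(String)
    named_dests(item, found) # descend into sub-items, if any
    item = item[:Next]
  end
  found
end

page = { Type: :Page }
second = { Title: 'Appendix', Dest: 'appendix' }
first = { Title: 'Intro', Dest: [page, :Fit], Next: second,
          First: { Title: 'Scope', Dest: :scope } }
outline = { Type: :Outlines, First: first }

named_dests(outline) # => [:scope, "appendix"]
```

Only the names returned by such a walk would need to go through the name conflict resolution; explicit page arrays like the one on 'Intro' carry no name to collide.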
I appreciate it, I see where you are going with that and made it work with your updates the data in place approach.
I assume that you are talking about the ToC's subsections. The answer is no. As long as we only enable insertion at the start or end and not somewhere in the middle (within a subsection) it doesn't matter. I would have explained the
def merge_outlines(old_data, new_data, position)
  old_data = actual_object(old_data)
  new_data = actual_object(new_data).dup
  if old_data.empty?
    # old_data is a reference to the actual object,
    # so if we update old_data, we're done, no need to take any further action
    old_data.update new_data
  else
    old_data[:Count] += new_data[:Count]
    # walk the Hash here ...
    # I'm just using the start / finish position for now...
    # FIXME to implement an insert in the middle of the file?
    prev = nil
    pos = first = actual_object(((position < 0) ? new_data : old_data)[:First])
    last = actual_object(((position < 0) ? old_data : new_data)[:Last])
    median = {is_reference_only: true, referenced_object: actual_object(((position < 0) ? old_data : new_data)[:First])}
    old_data[:First] = {is_reference_only: true, referenced_object: first}
    old_data[:Last] = {is_reference_only: true, referenced_object: last}
    parent = {is_reference_only: true, referenced_object: old_data}
    # the walking
    while(pos)
      # pos[:First] = first # already a reference
      # pos[:Last] = last # already a reference
      pos[:Parent] = parent if pos[:Parent] # already a reference
      if(prev)
        pos[:Prev] = {is_reference_only: true, referenced_object: prev}
      else
        pos.delete :Prev
      end
      # connect the two outlines
      if(pos[:Next].nil?)
        pos[:Next] = median
        median = nil
      end
      prev = pos
      pos = actual_object(pos[:Next])
    end
    # make sure the last object doesn't have the :Next property
    prev.delete :Next
  end
  # print_dat_outline(old_data)
  return nil # no need to return the data, the update had taken place destructively.
end
its performance was the following:
While my version had this performance:
Merging 4 identical pdfs,
using the according call for the new and the old method testing (new being your approach)
I'd have to test more specifically, all I can say at this point is that everything looks fine in the result.
I think that's sufficient for now as well. It would be quite some effort to figure out where to put it, especially since there are multiple ways to define an outline node's destination, as you pointed out.
Yup, you're right, I'm looking into that now.
Alright, I see your point. Keep in mind that everything I wrote so far is from an unmerged upstream standpoint. |
Wow, that's a lot of work... nice. I'm super thankful to you for this... this feature should be named after you 👍 Two questions about performance and the merge function.
Benchmark.bm do |bm|
merge_outlines(@outlines, data.outlines_object, 0)
bm.report { 1000.times { merge_outlines(@outlines, @outlines, 0) } }
# actual_value(@outlines).update merge_outlines_old(actual_value(@outlines), actual_value(data.outlines_object))
# bm.report { actual_value(@outlines).update merge_outlines_old(actual_value(@outlines), @outlines) }
end
Merging with upstreamed changes
I'm sorry I made you work harder... But I hope no big changes will pop up. I think I might have messed up the The When I wrote it I assumed there is a If this isn't the reason it fails, I'm not sure what it might be, so let me explain the new logic, maybe it will help. The new reference collector looks at any existing objects and adds any references they hold into the Since this is done after the In other words, it should have worked automatically once you had this in the code in the
catalog_object = { Type: :Catalog,
  Pages: { referenced_object: pages_object, is_reference_only: true },
  Names: { referenced_object: @names, is_reference_only: true },
  Outlines: { referenced_object: @outlines, is_reference_only: true } }
However, for performance reasons, the reference collector ( Any tests performed after rebuilding the catalog will fail unless the reference collector is invoked. I'm hoping this is why it failed... If this doesn't work I might have a bug in my new reference collector... that would definitely suck, but it's possible. A simple workaround would be to add the |
Happy to help, also you did a lot by coming up with the base of how to update the outlines (: performance and the merge function.
I'm not sure if you are deleting
while(pos)
  pos[:Parent] = parent if pos[:Parent] # already a reference
  # connect the two outlines
  if(pos[:Next].nil?)
    pos[:Prev] = {is_reference_only: true, referenced_object: prev} if prev
    median[:referenced_object][:Prev] = {is_reference_only: true, referenced_object: prev} if median
    pos[:Next] = median
    median = nil
  end
  prev = pos
  pos = actual_object(pos[:Next])
end
this would work as well. It would kind of combine your 1.1 and 1.2, if the 2. Performance Merging with upstreamed changes I'm out of time for today, but next up is rebuilding the named destinations used in the outline. |
Cool 👍
1. The loop
I love the new loop and unified I think deleting the The only thing is, if we have a The first object probably shouldn't have the
Performance
Kudos for taking the time to create PDF files with huge ToCs... If you can post any of them (unless they contain sensitive data), I could use them for some of the tests I perform before releasing updates. I don't add the PDFs used for testing to the online repo, since some of them might contain sensitive data... and I should probably write automated tests, though... which I haven't gotten around to. I keep using the demo app for testing half the features (merge PDF files, text boxes, font importing, tables and page numbering) and then do some manual work for testing other features (e.g. page stamping)... this isn't very effective... ... I digress.
Upstream
I'm happy this is resolved and that the new reference collection works, as I think the new reference collection will make future updates that much easier.
Named destinations
If the names already exist in the name dictionary, this might be handled by the existing name resolution that is automatically applied whenever the object catalog is rebuilt. If this suspicion I have is correct, then this feature is practically done 👍 |
The Loop
"if we have a
I can definitely add a deletion of those keys after the loop, since those nodes are accessible without looping
Performance
report = CombinePDF.new
b_pdf = Prawn::Document.new
b_pdf.text("Hi prawn")
b_pdf.outline.page title: 'First pdf'
b_pdf.start_new_page
b_pdf.text("Heeeeelloh!")
b_pdf.outline.page title: '1st pdf page 2'
b_pdf.start_new_page
b_pdf.text("aloha!")
b_pdf.outline.define do
  100.times do |t|
    section "Chapter #{t}", closed: false do
      page :title => 'Page 1', destination: 1
      page :title => 'Page 2', destination: 2
      page :title => 'Page 3', destination: 1
      page :title => 'Page 4', destination: 2
    end
  end
end
report << CombinePDF.parse(b_pdf.render)
report << CombinePDF.parse(a_pdf.render) # a_pdf being a similarly created pdf
final_report = report.to_pdf
send_data final_report,
  filename: "test.pdf",
  type: 'application/pdf',
  disposition: 'inline'
Adding multipliers like this
Named destinations
Another thing I noticed, by the way, is that the way the outline hash gets built results, in some cases, in the outline inserting pages
CombinePDF.parse(b_pdf.render).pages.each_with_index do |page, index|
  if index < 2 # b_pdf only has 3 pages here
    report << page
  else
    report << CombinePDF.parse(a_pdf.render)
    report << page
  end
end
Resulting PDF contains only the outline from |
Hi Stefan,
Named destinations
First, please let me apologize, and hopefully make you happy at the same time. Last night an Issue was posted that exposed a vulnerability in the named destinations algorithm that was in place. I had to rewrite the whole thing... but it now works better (although it might be glitchy, as I put it together late at night). It could be that some of the issues you experienced will self-resolve (I doubt it, but I hope).
Duplicate objects / bloating
As to having the pages in the named destination hash, it's supposed to be this way. These are the same Ruby objects (same The objects should only print once, which is what the which is also where we get this funny issue...
"both the outlines linked to the same pages of the first PDF"
I think this is why you get this funny behavior where both the outlines point to pages of the first PDF instead of their own original PDF. Except for page objects, all objects that contain the same data are reduced to a single object. For example, when we add pages that use the same font over and over again, we don't want to have multiple copies of the font... so the reference is resolved to the first copy and the rest of the font objects are discarded. Pages, because of an issue with an older Adobe Acrobat Reader, must have unique references. For this reason, pages in the page catalog that contain the same data have a shallow copy made specifically for the catalog. Only the Page's dictionary is duplicated, but not any of the resources it uses (fonts etc. are still reduced to a single copy of each). The reason I think this might cause the issue where both the outlines point to pages of the first PDF is that, since the data in the pages is the same, the reference is forced to resolve to the original page (reduction takes place).
I don't think there's a way around it except (maybe) numbering the pages or editing them, because I can think of too many things that might go wrong if we don't enforce this behavior (like the references pointing to the pages in the other PDF object, and then formatting and managing the PDF can start doing weird things)... ...However, I'm not sure that's a contingency we need to explore. If people are just bloating their PDF files by duplicating their own data many times, I doubt the ToC will be their source of disappointment.
nil value
I'm sorry, I think you answered me, but I'm just making sure, because sometimes I write too much and then I'm not clear.
In the
Upstream changes...
I hope it's easy to merge the new named destination function... sorry for the extra work. Bo. |
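The reduction of identical objects described in this comment can be sketched with plain Ruby hashes. This is an illustration under assumptions, not CombinePDF's implementation: `reduce_objects` is a hypothetical helper, and the font and page dictionaries are made up. It relies on Ruby hashes with equal content being equal as Hash keys.

```ruby
# Hypothetical sketch: keep the first copy of every object with equal
# content, but give :Page dictionaries a shallow copy so each stays unique.
def reduce_objects(objects)
  canonical = {}
  objects.map do |obj|
    if obj[:Type] == :Page
      obj.dup # shallow copy: the page is new, its resources are shared
    else
      canonical[obj] ||= obj # equal content resolves to the first copy
    end
  end
end

font_a = { Type: :Font, BaseFont: :Helvetica }
font_b = { Type: :Font, BaseFont: :Helvetica } # equal content, other object
page_a = { Type: :Page, Resources: font_a }
page_b = { Type: :Page, Resources: font_b }

fonts = reduce_objects([font_a, font_b])
pages = reduce_objects([page_a, page_b])
fonts[0].equal?(fonts[1]) # => true, both entries are font_a
pages[0].equal?(pages[1]) # => false, although pages[0] == pages[1]
```

The last line hints at the effect discussed above: two pages with identical data compare as equal, so any lookup by content (rather than by identity) can land on the first PDF's page.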
Named Destinations
I just realized that my named destinations (generated with prawn) don't work anymore at all. I also tested without using the outline merge, so I assume that it is caused by your new name rebuilding method. |
I'm far from my computer at the moment (hurray for smartphones),
|
Sure, here are some: It doesn't work with any combination of those, actually not even using a single one without merging, would keep working named destination links. |
I found my two mistakes:
I think it should be working now... could you see if you're getting better results? I think the outlines in the test files didn't all work (the ones not under a chapter didn't seem to work even on the originals). There isn't a need to unpack the page stream and edit it, since the named destinations don't make it into the page stream. It seems that named destinations are a type of annotation that is overlaid over the page data. I did find a dictionary we might be missing when supporting these page url jumps, and this is the PageLabels data in the catalog (section 12.4.2 of the standard). However, I think we should finish this one (and celebrate it) before tackling a new feature extension. |
The test files The problem with renaming So, to sum this up, before your latest change, none of the named destinations links, found in the files I uploaded above, worked (nothing happened when clicking the link). Now, after your latest changes, the links work if it's only 1 PDF, and if it's 2 PDFs that are being merged, the links in the second PDF work, but the ones in the first PDF don't. What I noticed: I suspect that the problem with the links that don't work is that the links and anchors don't match at the moment, maybe you are updating all the anchors (to contain a 'CombinePDF_0000006' string), but the links of the first PDF get "forgotten"? |
I'll try and work out an answer and review the code at the same time, this way if I skip something in the logic you can point it out. The logic is this:
def _parse_
  #[...]
  ##########################################
  ## parse a Hex String
  ##########################################
  elsif str = @scanner.scan(/<[0-9a-fA-F]+>/)
    # warn "Found a hex string"
    out << unify_string([str[1..-2]].pack('H*').force_encoding(Encoding::ASCII_8BIT))
  # [...]
  ##########################################
  ## parse a Literal String
  ##########################################
  elsif @scanner.scan(/\(/)
    # [...]
    out << unify_string(str.pack('C*').force_encoding(Encoding::ASCII_8BIT))
  # [...]
end
# [...]
def unify_string(str)
  @strings_dictionary[str] ||= str
end

a = "test"
b = a
a << " this"
b == "test this" # => true

dic << (pos[i * 2].clear << base.next!)
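Put together, the snippets above suggest why an in-place rename can update links and anchors at once: if the parser hands out one shared String object per distinct value, mutating that object changes it for every holder. The following is a minimal standalone sketch; `strings_dictionary`, `unify`, `link` and `anchor` are hypothetical stand-ins, not the parser's actual state.

```ruby
# Hypothetical stand-ins for the parser's string dictionary
strings_dictionary = {}
unify = ->(str) { strings_dictionary[str] ||= str }

# The parser meets the same name twice: once in a link, once in the name tree
link   = { Dest: unify.call(+'chapter1') }
anchor = [unify.call(+'chapter1'), { Type: :Page }]

link[:Dest].equal?(anchor[0]) # => true, one shared String object

# Renaming in place (clear, then append) updates every holder of the object
base = +'CombinePDF_0000000'
anchor[0].clear << base.next!
link[:Dest] # => "CombinePDF_0000001"
```

The flip side is that a string which bypassed the dictionary (or was copied instead of shared) keeps its old value, which would leave a link pointing at a name that no longer exists.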
Caveats and possible issues:
I'm very happy I asked you to test this also on your system, because I can't seem to duplicate the issue. Here is the code I used for testing, the list is a list with all the files you sent me and their paths on my machine, I use the list to:
lists = %w{./Ruby/test\ pdfs/outlines/big_toc.pdf ./Ruby/test\ pdfs/outlines/bigger_toc.pdf ./Ruby/test\ pdfs/outlines/named_dest_no_toc.pdf ./Ruby/test\ pdfs/outlines/named_dest_no_toc2.pdf ./Ruby/test\ pdfs/outlines/named_dest.pdf ./Ruby/test\ pdfs/outlines/named_dest2.pdf};
i = 0
lists.each{|n| CombinePDF.load(n).save("07_#{(i+=1).to_s}_#{n.split('/')[-1]}"); (CombinePDF.load(n) << CombinePDF.load(n)).save("07_#{(i).to_s}x2_#{n.split('/')[-1]}") }
pdf = CombinePDF.new
lists.each{|n| pdf << CombinePDF.load(n) }
pdf.save("07_named destinations.pdf")
In my test files all the links work... Can you post your test code, so I can try and replicate the issue? |
Stefan, Happy Weekend 🎉 I hope I'll be able to replicate and find the issue and release a fixed version this weekend. I'm thankful for all you've done. B. |
Hi Stefan, I think @Kagetsuki exposed the source of the issue - see issue #71 . |
Nope.. probably not the issue... this shouldn't affect page links, only outlines. |
Hi Bo, If it happens again I'll put together a more detailed report about how it happened and how I'm testing for it. As for the previous examples, the difference between our tests is that I used the data generated by prawn with |
@sLe1tner Did you update your gems? @boazsegev released like two patch versions in the last day 🤘🏼. If you updated your gems/bundle the patch probably got in there. You could peg the gem at .21 and then .23 and see if you get the error again - in which case your issue was also fixed in .25. |
🎉 ...I think this means we have a (probably) stable outline merging feature. I'm very happy about this 👍 I'll close this issue, but if anything pops up or you find something, please open an issue and we'll work on it. B. |
Ah, that might have had something to do with it, I'll check it out, thanks @Kagetsuki
Yea that sounds great :) |
I'm attempting to prepend a cover page to my pdf which has a table of contents. I would like the table of contents to remain unchanged. Currently the TOC is unchanged but links are dropped. Is this expected to work and is there anything I can try to fix the links? Thanks for all your work on this. |
@daymun How are the links formed? Are they to named anchors? Can you provide sources/samples? |
@Kagetsuki the TOC links are generated with wkhtmltopdf. Here are some sample files: https://drive.google.com/drive/folders/0B0vK1fMzDbarMndpdExVUFlCbE0 As you can see, links work in |
@daymun, Thank you for opening this issue. I've started looking into this and it might have something to do with the fact that This was a bigger problem than I thought at first glance, because Anyway... I've uploaded an updated version to GitHub, to deal with This update might have a negative impact on performance in some cases... but this might be the price we pay when we need to write workarounds. Also, I didn't finish testing this version for any negative impact on the actual PDF data... I don't promise this change will make it to the release version until I test it some more. Could you possibly test that this fix actually helps you before I push this through? |
@boazsegev, many thanks for your quick resolution of this issue. Everything seems to be working with the fix you pushed yesterday and I haven't noticed any negative impact on PDF data. |
same issue, isn't working here. |
Could you open a new issue, explaining exactly what isn't working and adding a short example that would allow me to replicate any issue you might be experiencing? Kindly, P.S., I'm mostly offline for the next two-three weeks. I might be slower than usual. |
Combining two PDF's with appropriate bookmarks and links to named destinations creates a pdf where
I see issue #31, but that shouldn't apply here because the PDF has different named destinations (resolved with GUIDs)