Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_drawings() does not see show_pdf_page's layers #2539

Closed
snoyer opened this issue Jul 13, 2023 · 15 comments
Closed

get_drawings() does not see show_pdf_page's layers #2539

snoyer opened this issue Jul 13, 2023 · 15 comments

Comments

@snoyer
Copy link

snoyer commented Jul 13, 2023

Describe the bug

PyMuPDF's get_drawings() does not grab the layer information from pages generated using show_pdf_page with an oc arguments.

To Reproduce

  1. Generate a PDF document containing layers using doc.show_pdf_page(..., oc="blah")
from pathlib import Path

import fitz

colors = [
    ("red", (1, 0, 0)),
    ("green", (0, 1, 0)),
    ("blue", (0, 0, 1)),
    ("grey", (0.5, 0.5, 0.5)),
]

doc = fitz.open()
bounds = fitz.Rect(0, 0, 205, 55)
page = doc.new_page()
page.set_mediabox(bounds)

for i, (name, color) in enumerate(colors):
    tmp_doc = fitz.open()
    tmp_page = tmp_doc.new_page()
    tmp_page.set_mediabox(bounds)
    shape = tmp_page.new_shape()
    x, y = 5 + i * 50, 5
    shape.draw_rect((x, y, x + 45, y + 45))
    shape.finish(fill=color)
    shape.commit()

    oc = doc.add_ocg(name)
    page.show_pdf_page(bounds, tmp_doc, 0, oc=oc)


doc.save(Path(__file__).with_suffix(".pdf"))

print(f'{set(ocg["name"] for ocg in doc.get_ocgs().values()) = }')
print(f'{set(d.get("layer") for d in doc[0].get_drawings()) = }')
  1. Open PDF in a viewer and observe layers are present and functional
pymupdf-layers-evince.webm
  1. use doc.get_ocgs() and observe PyMuPDF does finds layers in document

set(ocg["name"] for ocg in doc.get_ocgs().values()) = {'blue', 'green', 'grey', 'red'}

  1. use doc[0].get_drawings() and observe PyMuPDF does find the drawings but they all have 'layer': ''

set(d.get("layer") for d in doc[0].get_drawings()) = {''}

Expected behavior

The drawing items should have the appropriate OC name as their layer attribute.

Your configuration

sys.version = '3.10.12 (main, Jun  8 2023, 00:00:00) [GCC 13.1.1 20230511 (Red Hat 13.1.1-2)]'
sys.platform = 'linux'
fitz.__doc__ = '\nPyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.\nVersion date: 2023-06-21 00:00:01.\nBuilt for Python 3.10 on linux (64-bit).\n'

Additional context

Does get_drawings infer layer only based on BDC/EMC tags in streams, while show_pdf_page marks the OC as an attribute on objects? In which case this may be a MuPDF bug rather than PyMuPDF? Would it be an acceptable workaround to have show_pdf_page add [superfluous] BDC/EMC commands to make MuPDF happy?

@JorjMcKie
Copy link
Collaborator

JorjMcKie commented Jul 13, 2023

This is a duplicate. The fix for this bug will roll out with the next version. This example file
text-oc.pdf
will correctly show this result:

In [1]: import fitz
In [2]: doc = fitz.open("text-oc.pdf")
In [3]: page = doc[0]
In [4]: page.get_drawings()
Out[4]:
[{'items': [('re', Rect(45.0, 45.0, 405.0, 305.0), 1)],
  'type': 'fs',
  'even_odd': False,
  'fill_opacity': 1.0,
  'fill': (0.800000011920929, 0.800000011920929, 0.800000011920929),
  'rect': Rect(45.0, 45.0, 405.0, 305.0),
  'seqno': 0,
  'layer': 'graphic',
  'stroke_opacity': 1.0,
  'color': (0.0, 0.0, 1.0),
  'width': 1.0,
  'lineCap': (0, 0, 0),
  'lineJoin': 0.0,
  'closePath': False,
  'dashes': '[ 3 1 ] 0'}]

@snoyer
Copy link
Author

snoyer commented Jul 13, 2023

This is a duplicate. The fix for this bug will roll out with the next version.

Great, thank you for the super quick answer :)

Out of curiosity, can you share a link to the original bug/fix?

@JorjMcKie
Copy link
Collaborator

This was the closed issue #2462.

Why closed:
There was an error in the base library which returned an arbitrary invalid pointer and not the one to the layer character string. That occasionally caused an interpreter crash, because PyMuPDF was assuming a valid pointer at that point in time.
After more cautiously using this pointer (and returning an empty string where necessary), no more crashes are occurring since v1.22.5.
However, the necessary base library change to return the actual layer name was not implemented at that point in time. It is now.

@snoyer
Copy link
Author

snoyer commented Jul 13, 2023

Thanks again for the detail :)

At the risk of sounding argumentative I'm not convinced it's the same bug...
I have cloned master and according to the printf I added in jm_lineart_begin_layer that function is called when reading your text-oc.pdf but is not when reading the PDF I generate with the show_pdf_page commands.
So, regardless of invalid pointers, the optional content generated by show_pdf_page are not recognized the same way as marked sections in streams.


However, the necessary base library change to return the actual layer name was not implemented at that point in time. It is now.

For what it's worth, I still get randomly garbled strings (but no crashes) on text-oc.pdf with the package installed with pip install git+https://github.com/pymupdf/PyMuPDF.git:

  • set(d.get("layer") for d in doc[0].get_drawings()) = {'àÃTa\x05'}
  • set(d.get("layer") for d in doc[0].get_drawings()) = {"ÿ'¦b\x05"}
  • set(d.get("layer") for d in doc[0].get_drawings()) = {''}
  • set(d.get("layer") for d in doc[0].get_drawings()) = {'\x1b\x88\x05X\x05'}
  • ...

This seem to be fixable by copying the name string in jm_lineart_begin_layer instead of just assigning it:

static void
jm_lineart_begin_layer(fz_context *ctx, fz_device *dev_, const char *name)
{	
	// layer_name = name;
	layer_name = realloc(layer_name, strlen(name) + 1);
	strcpy(layer_name, name);
}
  • set(d.get("layer") for d in doc[0].get_drawings()) = {'graphic'}

@JorjMcKie
Copy link
Collaborator

At the risk of sounding argumentative I'm not convinced it's the same bug...

Ah, of course you are right: this is your bug, not my bug 😂!
This modification of your script does it right:

from pathlib import Path

import fitz

colors = [
    ("red", (1, 0, 0)),
    ("green", (0, 1, 0)),
    ("blue", (0, 0, 1)),
    ("grey", (0.5, 0.5, 0.5)),
]

doc = fitz.open()
bounds = fitz.Rect(0, 0, 205, 55)
page = doc.new_page()
page.set_mediabox(bounds)

for i, (name, color) in enumerate(colors):
    tmp_doc = fitz.open()
    oc = tmp_doc.add_ocg(name)
    tmp_page = tmp_doc.new_page()
    tmp_page.set_mediabox(bounds)
    shape = tmp_page.new_shape()
    x, y = 5 + i * 50, 5
    shape.draw_rect((x, y, x + 45, y + 45))
    shape.finish(fill=color, oc=oc)
    shape.commit()

    page.show_pdf_page(bounds, tmp_doc, 0)


doc.save(Path(__file__).with_suffix(".pdf"))

print(f'{set(ocg["name"] for ocg in doc.get_ocgs().values()) = }')
print(f'{set(d.get("layer") for d in doc[0].get_drawings()) = }')

@JorjMcKie
Copy link
Collaborator

JorjMcKie commented Jul 13, 2023

You were assigning the OCG to the XObject representing the page insert by show_pdf_page() and not to the drawing itself.
One somewhat surprising problem remains however:
If objects within a page imported by show_pdf_page() point to an OCG, then this OCG will become part of the target PDF, but it will not make it to the /OCProperties dictionary of the target PDF. So the target PDF may remain not supporting OC at all, even if there are objects with an assigned OCG.

This is unfortunate but inevitable is like that.
So I am taking back my before sarcaasm.

@snoyer
Copy link
Author

snoyer commented Jul 13, 2023

Ah, I didn't read the docs hard enough to find out about the oc param one shape.finish, nice one :D

So basically that makes your version an implementation of the workaround I was wondering about in my original post, as it wraps the shape in marked content tags:

q
/OC /MC0 BDC
55 5 45 45 re
h
0 G 0 1 0 rg B
EMC
Q

But where does that leave us on show_pdf_page(..., oc=xref) generating optional content that does works in Evince but isn't recognized as layers by get_drawings()? Maybe it's not your bug, not my bug, but their bug (because it doesn't trigger _begin_layer)? Or is it not considered a bug at all?

@JorjMcKie
Copy link
Collaborator

But where does that leave us on show_pdf_page(..., oc=xref) generating optional content that does works in Evince but isn't recognized as layers by get_drawings()? Maybe it's not your bug, not my bug, but their bug (because it doesn't trigger _begin_layer)? Or is it not considered a bug at all?

As I wrote above: you and I are not guilty here.
If you import pages via show_pddf_page() you can assign them an OCG - which will be reflected in the target PDF. But any source page object with an OCG is being ignored as having this and will be shown unconditionally.

@JorjMcKie
Copy link
Collaborator

Method get_drawings() however cannot know anything about the above, will determine whether there is an attached OCG and report it.
It will not / cannot check whether the PDF actually has entries for that OCG in its /OCProperties.

@JorjMcKie
Copy link
Collaborator

JorjMcKie commented Jul 13, 2023

All this brings up the question about whether the layer name is of much value at all ... in get_drawings().

@snoyer
Copy link
Author

snoyer commented Jul 13, 2023

I understand the situation from a PDF spec point-of-view. And I think the layer name does have value.

However, the confusing part is that, as a user of the API and not an expert of the details of the PDF spec, I would expect OC to behave consistently just like it does in viewers (that is, without having to find out about stream tags vs xobjects attributes)

Again, I understand now that this is not a PyMuPDF issue, it would fix itself if MuPDF would call _begin_layer and _end_layer when processing an xobject that has an OC attribute (but maybe they have a good reason not to, would be worth asking them...).

@JorjMcKie
Copy link
Collaborator

JorjMcKie commented Jul 13, 2023

They do this begin-end business upon encountering /OC /MC0 BDC up to EMC. And that's it.
No syntax or other issue anywhere. Every viewer behaves in exactly the same way.
The fact that the OCG xref behind /MC0 is not part of any mentioning in the PDF catalog sub-dictionary /OCProperties is also legal. So why bother from MuPDF's point of view.
The PDF spec clearly says that OCG objects not mentioned in /OCProperties/OCGs must be ignored.

Also notable:

  1. page.get_drawings() works the same way for all supported document types (XPS, EPUB, MOBI,...). They obviously have their own technical way to implement optional content (if supported at all). So inheriting a layer name into code that is outside the page appearance source code (as we have it in PDF) would be document type-specific possibly be bound to fail anyway.
  2. Doing page.show_pdf_page() creates up to two Form XOBjects:
    • number 1 is small and referenced by the target page. This one has the /OC attribute.
    • number 2 is the large one containing the source page's contents.
  3. This construction makes it impossible (and potentially contradictory), to inherit the OC value from 1 to 2 and then down the the contents of 2.
  4. The value of this design behind show_pdf_page() is that the (potentially large) source page is only imported once. Multiple target pages can show it (with their own OC) without duplicating it.

@snoyer
Copy link
Author

snoyer commented Jul 13, 2023

Well I'm no PDF expert so I cannot tell who/what is right or wrong here, I'm not even sure what should be reasonably expected anymore...
But here are my far less technical observations, to recap:

my original example, example1.pdf:

  • has /OCProperties<</OCGs
  • opens as expected in Evince and Acrobat, layers show in side panels, are toggleable
  • set(ocg["name"] for ocg in doc.get_ocgs().values()) = {'green', 'red', 'blue', 'grey'}
  • uses <</Type/XObject/Subtype/Form ... /OC 10 0 R>>
  • does not trigger _begin_layer/_end_layer
  • set(d.get("layer") for d in doc[0].get_drawings()) = {''}

your modified example, example2.pdf:

  • does not have OCProperties
  • opens in Evince/Acrobat without layers in side panel
  • set(ocg["name"] for ocg in doc.get_ocgs().values()) = set()
  • uses BDC
  • does trigger _begin_layer/_end_layer
  • set(d.get("layer") for d in doc[0].get_drawings()) = {'grey', 'blue', 'red', 'green'}

@JorjMcKie
Copy link
Collaborator

JorjMcKie commented Jul 13, 2023

@snoyer yes, I have produced the same results, and in fact my previous post was based on them.

So then let me wrap up on where we are, what can be expected and what is impossible to achieve.

If a souce page is imported into a target page via target_page.show_pdf_page():

  1. Any object of the source page under control of an Optional Content object in the source PDF will abandon such dependency when it becomes part of the target page. These source page objects will become unconditionally visible on the target - except in situations as explained in the next point. This may be confusing, because certain functions may still show an attached OCG - which however never is a member of the target PDF's Optional Content configuration (there even may not exist any).
  2. When imported, the source page may have attached an OCG to it that is defined in the target PDF. The visibility of the whole source page inside the target page is then fully under control of the target PDF's OC configuration. This situation is however not (and cannot be) detectable when looking at images, text or drawings inside the import source page.

Behaviors explained under the two points above are inevitable and cannot be influenced for principle reasons.

So the remaining part of your post remains the above referenced other issue.

Obviously, all this shouldbe documented to avoid any such confusion going forward.

@JorjMcKie
Copy link
Collaborator

Documentation now corectly comments on this situation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants