`get_drawings()` does not see `show_pdf_page`'s layers #2539

snoyer · 2023-07-13T07:25:33Z

Describe the bug

PyMuPDF's get_drawings() does not grab the layer information from pages generated using show_pdf_page with an oc arguments.

To Reproduce

Generate a PDF document containing layers using doc.show_pdf_page(..., oc="blah")

from pathlib import Path

import fitz

colors = [
    ("red", (1, 0, 0)),
    ("green", (0, 1, 0)),
    ("blue", (0, 0, 1)),
    ("grey", (0.5, 0.5, 0.5)),
]

doc = fitz.open()
bounds = fitz.Rect(0, 0, 205, 55)
page = doc.new_page()
page.set_mediabox(bounds)

for i, (name, color) in enumerate(colors):
    tmp_doc = fitz.open()
    tmp_page = tmp_doc.new_page()
    tmp_page.set_mediabox(bounds)
    shape = tmp_page.new_shape()
    x, y = 5 + i * 50, 5
    shape.draw_rect((x, y, x + 45, y + 45))
    shape.finish(fill=color)
    shape.commit()

    oc = doc.add_ocg(name)
    page.show_pdf_page(bounds, tmp_doc, 0, oc=oc)


doc.save(Path(__file__).with_suffix(".pdf"))

print(f'{set(ocg["name"] for ocg in doc.get_ocgs().values()) = }')
print(f'{set(d.get("layer") for d in doc[0].get_drawings()) = }')

Open PDF in a viewer and observe layers are present and functional

pymupdf-layers-evince.webm

use doc.get_ocgs() and observe PyMuPDF does finds layers in document

set(ocg["name"] for ocg in doc.get_ocgs().values()) = {'blue', 'green', 'grey', 'red'}

use doc[0].get_drawings() and observe PyMuPDF does find the drawings but they all have 'layer': ''

set(d.get("layer") for d in doc[0].get_drawings()) = {''}

Expected behavior

The drawing items should have the appropriate OC name as their layer attribute.

Your configuration

sys.version = '3.10.12 (main, Jun  8 2023, 00:00:00) [GCC 13.1.1 20230511 (Red Hat 13.1.1-2)]'
sys.platform = 'linux'
fitz.__doc__ = '\nPyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.\nVersion date: 2023-06-21 00:00:01.\nBuilt for Python 3.10 on linux (64-bit).\n'

Additional context

Does get_drawings infer layer only based on BDC/EMC tags in streams, while show_pdf_page marks the OC as an attribute on objects? In which case this may be a MuPDF bug rather than PyMuPDF? Would it be an acceptable workaround to have show_pdf_page add [superfluous] BDC/EMC commands to make MuPDF happy?

The text was updated successfully, but these errors were encountered:

JorjMcKie · 2023-07-13T07:50:36Z

This is a duplicate. The fix for this bug will roll out with the next version. This example file
text-oc.pdf
will correctly show this result:

In [1]: import fitz
In [2]: doc = fitz.open("text-oc.pdf")
In [3]: page = doc[0]
In [4]: page.get_drawings()
Out[4]:
[{'items': [('re', Rect(45.0, 45.0, 405.0, 305.0), 1)],
  'type': 'fs',
  'even_odd': False,
  'fill_opacity': 1.0,
  'fill': (0.800000011920929, 0.800000011920929, 0.800000011920929),
  'rect': Rect(45.0, 45.0, 405.0, 305.0),
  'seqno': 0,
  'layer': 'graphic',
  'stroke_opacity': 1.0,
  'color': (0.0, 0.0, 1.0),
  'width': 1.0,
  'lineCap': (0, 0, 0),
  'lineJoin': 0.0,
  'closePath': False,
  'dashes': '[ 3 1 ] 0'}]

snoyer · 2023-07-13T08:07:34Z

This is a duplicate. The fix for this bug will roll out with the next version.

Great, thank you for the super quick answer :)

Out of curiosity, can you share a link to the original bug/fix?

JorjMcKie · 2023-07-13T11:57:48Z

This was the closed issue #2462.

Why closed:
There was an error in the base library which returned an arbitrary invalid pointer and not the one to the layer character string. That occasionally caused an interpreter crash, because PyMuPDF was assuming a valid pointer at that point in time.
After more cautiously using this pointer (and returning an empty string where necessary), no more crashes are occurring since v1.22.5.
However, the necessary base library change to return the actual layer name was not implemented at that point in time. It is now.

snoyer · 2023-07-13T12:58:21Z

Thanks again for the detail :)

At the risk of sounding argumentative I'm not convinced it's the same bug...
I have cloned master and according to the printf I added in jm_lineart_begin_layer that function is called when reading your text-oc.pdf but is not when reading the PDF I generate with the show_pdf_page commands.
So, regardless of invalid pointers, the optional content generated by show_pdf_page are not recognized the same way as marked sections in streams.

However, the necessary base library change to return the actual layer name was not implemented at that point in time. It is now.

For what it's worth, I still get randomly garbled strings (but no crashes) on text-oc.pdf with the package installed with pip install git+https://github.com/pymupdf/PyMuPDF.git:

set(d.get("layer") for d in doc[0].get_drawings()) = {'àÃTa\x05'}
set(d.get("layer") for d in doc[0].get_drawings()) = {"ÿ'¦b\x05"}
set(d.get("layer") for d in doc[0].get_drawings()) = {''}
set(d.get("layer") for d in doc[0].get_drawings()) = {'\x1b\x88\x05X\x05'}
...

This seem to be fixable by copying the name string in jm_lineart_begin_layer instead of just assigning it:

static void
jm_lineart_begin_layer(fz_context *ctx, fz_device *dev_, const char *name)
{	
	// layer_name = name;
	layer_name = realloc(layer_name, strlen(name) + 1);
	strcpy(layer_name, name);
}

set(d.get("layer") for d in doc[0].get_drawings()) = {'graphic'}

JorjMcKie · 2023-07-13T13:14:57Z

At the risk of sounding argumentative I'm not convinced it's the same bug...

Ah, of course you are right: this is your bug, not my bug 😂!
This modification of your script does it right:

from pathlib import Path

import fitz

colors = [
    ("red", (1, 0, 0)),
    ("green", (0, 1, 0)),
    ("blue", (0, 0, 1)),
    ("grey", (0.5, 0.5, 0.5)),
]

doc = fitz.open()
bounds = fitz.Rect(0, 0, 205, 55)
page = doc.new_page()
page.set_mediabox(bounds)

for i, (name, color) in enumerate(colors):
    tmp_doc = fitz.open()
    oc = tmp_doc.add_ocg(name)
    tmp_page = tmp_doc.new_page()
    tmp_page.set_mediabox(bounds)
    shape = tmp_page.new_shape()
    x, y = 5 + i * 50, 5
    shape.draw_rect((x, y, x + 45, y + 45))
    shape.finish(fill=color, oc=oc)
    shape.commit()

    page.show_pdf_page(bounds, tmp_doc, 0)


doc.save(Path(__file__).with_suffix(".pdf"))

print(f'{set(ocg["name"] for ocg in doc.get_ocgs().values()) = }')
print(f'{set(d.get("layer") for d in doc[0].get_drawings()) = }')

JorjMcKie · 2023-07-13T13:21:43Z

You were assigning the OCG to the XObject representing the page insert by show_pdf_page() and not to the drawing itself.
One somewhat surprising problem remains however:
If objects within a page imported by show_pdf_page() point to an OCG, then this OCG will become part of the target PDF, but it will not make it to the /OCProperties dictionary of the target PDF. So the target PDF may remain not supporting OC at all, even if there are objects with an assigned OCG.

This is unfortunate but inevitable is like that.
So I am taking back my before sarcaasm.

snoyer · 2023-07-13T13:34:15Z

Ah, I didn't read the docs hard enough to find out about the oc param one shape.finish, nice one :D

So basically that makes your version an implementation of the workaround I was wondering about in my original post, as it wraps the shape in marked content tags:

q
/OC /MC0 BDC
55 5 45 45 re
h
0 G 0 1 0 rg B
EMC
Q

But where does that leave us on show_pdf_page(..., oc=xref) generating optional content that does works in Evince but isn't recognized as layers by get_drawings()? Maybe it's not your bug, not my bug, but their bug (because it doesn't trigger _begin_layer)? Or is it not considered a bug at all?

JorjMcKie · 2023-07-13T13:37:04Z

But where does that leave us on show_pdf_page(..., oc=xref) generating optional content that does works in Evince but isn't recognized as layers by get_drawings()? Maybe it's not your bug, not my bug, but their bug (because it doesn't trigger _begin_layer)? Or is it not considered a bug at all?

As I wrote above: you and I are not guilty here.
If you import pages via show_pddf_page() you can assign them an OCG - which will be reflected in the target PDF. But any source page object with an OCG is being ignored as having this and will be shown unconditionally.

JorjMcKie · 2023-07-13T13:39:31Z

Method get_drawings() however cannot know anything about the above, will determine whether there is an attached OCG and report it.
It will not / cannot check whether the PDF actually has entries for that OCG in its /OCProperties.

JorjMcKie · 2023-07-13T13:43:55Z

All this brings up the question about whether the layer name is of much value at all ... in get_drawings().

snoyer · 2023-07-13T13:55:06Z

I understand the situation from a PDF spec point-of-view. And I think the layer name does have value.

However, the confusing part is that, as a user of the API and not an expert of the details of the PDF spec, I would expect OC to behave consistently just like it does in viewers (that is, without having to find out about stream tags vs xobjects attributes)

Again, I understand now that this is not a PyMuPDF issue, it would fix itself if MuPDF would call _begin_layer and _end_layer when processing an xobject that has an OC attribute (but maybe they have a good reason not to, would be worth asking them...).

JorjMcKie · 2023-07-13T14:03:19Z

They do this begin-end business upon encountering /OC /MC0 BDC up to EMC. And that's it.
No syntax or other issue anywhere. Every viewer behaves in exactly the same way.
The fact that the OCG xref behind /MC0 is not part of any mentioning in the PDF catalog sub-dictionary /OCProperties is also legal. So why bother from MuPDF's point of view.
The PDF spec clearly says that OCG objects not mentioned in /OCProperties/OCGs must be ignored.

Also notable:

page.get_drawings() works the same way for all supported document types (XPS, EPUB, MOBI,...). They obviously have their own technical way to implement optional content (if supported at all). So inheriting a layer name into code that is outside the page appearance source code (as we have it in PDF) would be document type-specific possibly be bound to fail anyway.
Doing page.show_pdf_page() creates up to two Form XOBjects:
- number 1 is small and referenced by the target page. This one has the /OC attribute.
- number 2 is the large one containing the source page's contents.
This construction makes it impossible (and potentially contradictory), to inherit the OC value from 1 to 2 and then down the the contents of 2.
The value of this design behind show_pdf_page() is that the (potentially large) source page is only imported once. Multiple target pages can show it (with their own OC) without duplicating it.

snoyer · 2023-07-13T16:43:16Z

Well I'm no PDF expert so I cannot tell who/what is right or wrong here, I'm not even sure what should be reasonably expected anymore...
But here are my far less technical observations, to recap:

my original example, example1.pdf:

has /OCProperties<</OCGs
opens as expected in Evince and Acrobat, layers show in side panels, are toggleable
set(ocg["name"] for ocg in doc.get_ocgs().values()) = {'green', 'red', 'blue', 'grey'}
uses <</Type/XObject/Subtype/Form ... /OC 10 0 R>>
does not trigger _begin_layer/_end_layer
set(d.get("layer") for d in doc[0].get_drawings()) = {''}

your modified example, example2.pdf:

does not have OCProperties
opens in Evince/Acrobat without layers in side panel
set(ocg["name"] for ocg in doc.get_ocgs().values()) = set()
uses BDC
does trigger _begin_layer/_end_layer
set(d.get("layer") for d in doc[0].get_drawings()) = {'grey', 'blue', 'red', 'green'}

JorjMcKie · 2023-07-13T17:53:59Z

@snoyer yes, I have produced the same results, and in fact my previous post was based on them.

So then let me wrap up on where we are, what can be expected and what is impossible to achieve.

If a souce page is imported into a target page via target_page.show_pdf_page():

Any object of the source page under control of an Optional Content object in the source PDF will abandon such dependency when it becomes part of the target page. These source page objects will become unconditionally visible on the target - except in situations as explained in the next point. This may be confusing, because certain functions may still show an attached OCG - which however never is a member of the target PDF's Optional Content configuration (there even may not exist any).
When imported, the source page may have attached an OCG to it that is defined in the target PDF. The visibility of the whole source page inside the target page is then fully under control of the target PDF's OC configuration. This situation is however not (and cannot be) detectable when looking at images, text or drawings inside the import source page.

Behaviors explained under the two points above are inevitable and cannot be influenced for principle reasons.

So the remaining part of your post remains the above referenced other issue.

Obviously, all this shouldbe documented to avoid any such confusion going forward.

For an example file, see: pymupdf/PyMuPDF#2539 (comment)

JorjMcKie · 2023-10-01T11:25:22Z

Documentation now corectly comments on this situation.

JorjMcKie added bug duplicate labels Jul 13, 2023

JorjMcKie mentioned this issue Jul 26, 2023

Segmentation fault at caling get_cdrawings(extended=True) #2556

Closed

chris-liddell pushed a commit to ArtifexSoftware/mupdf that referenced this issue Aug 31, 2023

Emit begin/end_layer for OCG XObjects.

a4b306f

For an example file, see: pymupdf/PyMuPDF#2539 (comment)

JorjMcKie added documentation issue and removed bug duplicate labels Sep 27, 2023

JorjMcKie closed this as completed Oct 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`get_drawings()` does not see `show_pdf_page`'s layers #2539

`get_drawings()` does not see `show_pdf_page`'s layers #2539

snoyer commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023 •

edited

Loading

snoyer commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023

snoyer commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023 •

edited

Loading

snoyer commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023 •

edited

Loading

snoyer commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023 •

edited

Loading

snoyer commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023 •

edited

Loading

JorjMcKie commented Oct 1, 2023

get_drawings() does not see show_pdf_page's layers #2539

get_drawings() does not see show_pdf_page's layers #2539

Comments

snoyer commented Jul 13, 2023

Describe the bug

To Reproduce

Expected behavior

Your configuration

Additional context

JorjMcKie commented Jul 13, 2023 • edited Loading

snoyer commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023

snoyer commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023 • edited Loading

snoyer commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023 • edited Loading

snoyer commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023 • edited Loading

snoyer commented Jul 13, 2023

JorjMcKie commented Jul 13, 2023 • edited Loading

JorjMcKie commented Oct 1, 2023

`get_drawings()` does not see `show_pdf_page`'s layers #2539

`get_drawings()` does not see `show_pdf_page`'s layers #2539

JorjMcKie commented Jul 13, 2023 •

edited

Loading

JorjMcKie commented Jul 13, 2023 •

edited

Loading

JorjMcKie commented Jul 13, 2023 •

edited

Loading

JorjMcKie commented Jul 13, 2023 •

edited

Loading

JorjMcKie commented Jul 13, 2023 •

edited

Loading