Inline images lost when conversion from ePub to HTML #10395

someth2say · 2024-11-20T08:47:34Z

Explain the problem.
I am converting an ePub file to HTML (for later conversion to PDF).
The command line is simple: pandoc file.epub --from=epub --to=html -o file.html

The ePub file is generated from AsciiDoc (via AsciiDoctor-epub3), and includes an inline image (to be precise, a callout).

Exploding the ePub file, the offending portion is (redacted for clarity):

<figure class="listing">
        <pre>Some text <img src="1.svg" class="inline conum"/></pre>
</figure>

The HTML resulting from the conversion is missing the image:

<figure class="listing">
<pre><code>Some text </code></pre>
</figure>

From my (very light) investigation, it seems that img tags inside pre are not retained, even though img is a valid child of pre.

Pandoc version?
pandoc 3.5
Features: +server +lua
Scripting engine: Lua 5.4
OS: Linux (Fedora 40)

someth2say · 2024-11-20T10:22:09Z

Seems that the problem is not with the transformation, but with the parsing itself.
The generated AST does not include the inline image:

                , Figure
                    ( "" , [ "listing" ] , [] )
                    (Caption Nothing [])
                    [ CodeBlock
                        ( "" , [] , [] )
                        "Some text  "
                    ]
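
One way to confirm this without reading the native output by eye is to walk the JSON form of the AST and count node types. The following is only a minimal sketch: the `sample` document is a hand-written stand-in for the Figure above, not real pandoc output, and `count_nodes` is a hypothetical helper name.

```python
import json

def count_nodes(node, wanted):
    """Recursively count AST nodes whose "t" tag equals `wanted`."""
    if isinstance(node, dict):
        hit = 1 if node.get("t") == wanted else 0
        return hit + sum(count_nodes(v, wanted) for v in node.values())
    if isinstance(node, list):
        return sum(count_nodes(v, wanted) for v in node)
    return 0  # strings, numbers and null carry no nodes

# Hand-written stand-in (an assumption) for the JSON form of the Figure above.
sample = json.loads('''
{"blocks": [
  {"t": "Figure",
   "c": [["", ["listing"], []],
         [null, []],
         [{"t": "CodeBlock", "c": [["", [], []], "Some text  "]}]]}
]}
''')

print(count_nodes(sample, "Image"))      # 0
print(count_nodes(sample, "CodeBlock"))  # 1
```

For a real file, the JSON could be produced with `pandoc file.epub --from=epub --to=json` and loaded with `json.load` in place of the hand-written sample.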

someth2say · 2024-11-20T10:32:17Z

Not surprising, but worth noting: I see the same behaviour in the AST generated for an (X)HTML document:

<div class="listingblock">
<div class="content">
<pre>Some text <img src="1.svg" alt="1"></pre>
</div>
</div>

generates a similar AST node:

, Div
    ( "" , [ "listingblock" ] , [] )
    [ Div
        ( "" , [ "content" ] , [] )
        [ CodeBlock
            ( "" , [] , [] )
            "\\include::{gls_topic_assets_dir}/assets/my_asset.txt[] "
        ]
    ]

jgm · 2024-11-20T17:34:11Z

That's right. We just extract text from pre/code elements, ignoring any other elements. The reason is that pandoc's internal representation of CodeBlock is just a string value -- it doesn't allow formatted inline content in this context.
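
To illustrate the effect of text-only extraction, here is a rough Python sketch. It is not pandoc's reader (which is written in Haskell); it only mimics the described behaviour with the standard-library `html.parser`, and `PreTextExtractor` and `pre_text` are hypothetical names.

```python
from html.parser import HTMLParser

class PreTextExtractor(HTMLParser):
    """Collect only character data found inside <pre>, dropping child tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth of currently open <pre> elements
        self.chunks = []  # text fragments seen while inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1
        # Any other tag (e.g. <img>) is ignored, so it leaves no trace.

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def pre_text(html):
    """Return the concatenated text content of all <pre> elements."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

print(repr(pre_text('<pre>Some text <img src="1.svg" class="inline conum"/></pre>')))
# 'Some text '
```

Since the result is a plain string, there is simply nowhere to put the image once the text has been extracted.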

jgm · 2024-11-20T17:35:58Z

I think this can be closed as out of scope.

someth2say · 2024-11-21T06:57:14Z

I think the root issue is the mapping pre <-> CodeBlock.
In HTML (and other languages), the semantics of pre blocks allow formatting and other child elements, but CodeBlocks are just plain text blocks, so they do not accept children. So it is not a perfect match.

AFAIK, there is no HTML tag that only accepts text, so some assumptions have to be made during conversion.

The only solution I can think of is updating the mapping to pre <-> Div with some extra attribute (e.g. 'pre').
I understand that this can be a breaking change, so it could be opt-in, via a config option or something like that.

jgm · 2024-11-21T19:15:12Z

Pandoc conversions are often lossy, and that's okay given that different formats don't match precisely. E.g., if you're converting an HTML code block to markdown, you're not going to be able to put emphasis inside it, and generally the desirable behavior is to just leave that out. If we used a Div, we'd get output that wasn't a code block, which is undesirable.
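
For anyone who needs to keep the callout numbers, one possible workaround is to preprocess the (X)HTML before it reaches pandoc, replacing each img inside a pre with a plain-text marker, so the result is still a code block. The sketch below is only an illustration under several assumptions: the `(alt)` marker format is made up, images without an alt attribute are dropped, and regular expressions are a fragile way to handle HTML (nested or unusual markup will not be handled). `flatten_pre_images` is a hypothetical helper, not part of pandoc.

```python
import re

PRE_RE = re.compile(r"(<pre\b[^>]*>)(.*?)(</pre>)", re.DOTALL | re.IGNORECASE)
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
ALT_RE = re.compile(r'\balt="([^"]*)"', re.IGNORECASE)

def img_to_marker(match):
    """Turn one <img> tag into "(alt text)", or drop it if it has no alt."""
    alt = ALT_RE.search(match.group(0))
    return f"({alt.group(1)})" if alt else ""

def flatten_pre_images(html):
    """Replace <img> tags with text markers, but only inside <pre> blocks."""
    def fix_pre(m):
        return m.group(1) + IMG_RE.sub(img_to_marker, m.group(2)) + m.group(3)
    return PRE_RE.sub(fix_pre, html)

src = '<pre>Some text <img src="1.svg" alt="1"/></pre>'
print(flatten_pre_images(src))  # <pre>Some text (1)</pre>
```

Images outside pre are left untouched, so ordinary figures still go through pandoc as usual.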

someth2say added the bug label Nov 20, 2024
