Inline images lost when conversion from ePub to HTML #10395

Open
someth2say opened this issue Nov 20, 2024 · 6 comments
@someth2say

Explain the problem.
I am converting an ePub file to HTML (for later conversion to PDF).
The command line is simple: pandoc file.epub --from=epub --to=html -o file.html

The ePub file is generated from AsciiDoc (via AsciiDoctor-epub3), and includes an inline image (to be precise, a callout).

After unpacking the ePub file, the offending portion is (redacted for clarity):

<figure class="listing">
        <pre>Some text <img src="1.svg" class="inline conum"/></pre>
</figure>

The HTML resulting from the conversion is missing the image:

<figure class="listing">
<pre><code>Some text </code></pre>
</figure>

From my (very light) investigation, it seems that img tags inside pre are not retained, even though img is a valid child of pre.
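
To check how many places in a document are affected, the unpacked (X)HTML can be scanned for img elements nested inside pre. The following is a minimal sketch using only the Python standard library; the class and function names are made up for illustration and are not part of pandoc or Asciidoctor.

```python
from html.parser import HTMLParser


class ImgInPreFinder(HTMLParser):
    """Record the src of every <img> that appears inside a <pre>."""

    def __init__(self):
        super().__init__()
        self.pre_depth = 0   # number of <pre> elements currently open
        self.sources = []    # src attribute of each nested image

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> also arrive here.
        if tag == "pre":
            self.pre_depth += 1
        elif tag == "img" and self.pre_depth > 0:
            self.sources.append(dict(attrs).get("src"))

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth > 0:
            self.pre_depth -= 1


def images_inside_pre(markup):
    """Return the src values of all images nested in <pre> elements."""
    finder = ImgInPreFinder()
    finder.feed(markup)
    finder.close()
    return finder.sources


sample = """<figure class="listing">
        <pre>Some text <img src="1.svg" class="inline conum"/></pre>
</figure>
<p><img src="outside.svg"/></p>"""

print(images_inside_pre(sample))  # ['1.svg']
```

Only the image inside pre is reported; the one in the paragraph is left out because it is not affected by the problem described here.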

Pandoc version?
pandoc 3.5
Features: +server +lua
Scripting engine: Lua 5.4
OS: Linux (Fedora 40)

@someth2say someth2say added the bug label Nov 20, 2024
@someth2say
Author

Seems that the problem is not with the transformation, but with the parsing itself.
The generated AST does not include the inline image:

                , Figure
                    ( "" , [ "listing" ] , [] )
                    (Caption Nothing [])
                    [ CodeBlock
                        ( "" , [] , [] )
                        "Some text  "
                    ]

@someth2say
Author

Not surprising, but worth noting: I see the same behaviour when inspecting the AST generated for an (X)HTML document:

<div class="listingblock">
<div class="content">
<pre>Some text <img src="1.svg" alt="1"></pre>
</div>
</div>

generates a similar AST node:

, Div
    ( "" , [ "listingblock" ] , [] )
    [ Div
        ( "" , [ "content" ] , [] )
        [ CodeBlock
            ( "" , [] , [] )
            "\\include::{gls_topic_assets_dir}/assets/my_asset.txt[] "
        ]
    ]

@jgm
Owner

jgm commented Nov 20, 2024

That's right. We just extract text from pre/code elements, ignoring any other elements. The reason is that pandoc's internal representation of CodeBlock is just a string value -- it doesn't allow formatted inline content in this context.
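
As an illustration of the behaviour described above, here is a small Python sketch that keeps only the character data inside pre and silently drops every nested element. It is not pandoc's actual implementation (pandoc's reader is written in Haskell); the names are hypothetical and it only mimics the effect on the example from this issue.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect only the character data found inside <pre> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # number of <pre> elements currently open
        self.blocks = []    # one plain string per outermost <pre>
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            if self.depth == 0:
                self._buf = []
            self.depth += 1
        # Any other tag (img, em, span, ...) is ignored: it has no
        # place in a value that is just a string.

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self._buf))

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)


def code_block_strings(html):
    """Return the text of each <pre>, analogous to a CodeBlock string."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = (
    '<figure class="listing">'
    '<pre>Some text <img src="1.svg" class="inline conum"/></pre>'
    "</figure>"
)
print(code_block_strings(sample))  # ['Some text ']
```

Because the result is a plain string, there is nowhere to store the image, which is why it disappears at parse time rather than in the writer.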

@jgm
Owner

jgm commented Nov 20, 2024

I think this can be closed as out of scope.

@someth2say
Author

I think the root issue is the mapping pre <-> CodeBlock.
In HTML (and other languages), the semantics of pre blocks allow formatting and other child elements, but CodeBlocks are just plain text blocks, so they do not accept children. So it is not a perfect match.

AFAIK, there is no HTML tag that only accepts text, so I see that assumptions need to be made during conversion.

The only solution I can think of is updating the mapping to pre <-> Div with some extra attributes (e.g. 'pre').
I understand that this can be a breaking change, so it could be opt-in, via a config option or something like that.
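
Until something like that exists, the same idea can be approximated outside pandoc by rewriting the HTML before it is read. The sketch below turns every pre into a div carrying a marker attribute, so that nested elements such as img are no longer inside a pre. The attribute name data-was-pre is invented for this example, the regular expressions only cope with simple, well-formed markup, and a div does not preserve whitespace the way pre does by default, so this is a trade-off rather than a fix.

```python
import re

# Match "<pre>" or "<pre ...attributes...>", but not tags such as "<preface>".
_PRE_OPEN = re.compile(r"<pre(\s[^>]*)?>", re.IGNORECASE)
_PRE_CLOSE = re.compile(r"</pre\s*>", re.IGNORECASE)


def pre_to_div(html):
    """Rewrite <pre> elements as <div data-was-pre="1">, keeping children."""

    def open_repl(match):
        attrs = match.group(1) or ""   # original attributes, if any
        # "data-was-pre" is a hypothetical marker, not a pandoc convention.
        return '<div data-was-pre="1"' + attrs + ">"

    html = _PRE_OPEN.sub(open_repl, html)
    return _PRE_CLOSE.sub("</div>", html)


sample = '<pre>Some text <img src="1.svg" class="inline conum"/></pre>'
print(pre_to_div(sample))
```

The rewritten markup could then be passed to pandoc in place of the original file; whether the result is acceptable depends on how much the target format relies on the block being treated as code.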

@jgm
Owner

jgm commented Nov 21, 2024

Pandoc conversions are often lossy, and that's okay given that different formats don't match precisely. E.g., if you're converting an HTML code block to markdown, you're not going to be able to put emphasis inside it, and generally the desirable behavior is to just leave that out. If we used a Div, we'd get output that wasn't a code block, which is undesirable.
