HTML and EPUB documents are not produced with pipeline #152

Open
anetader opened this issue Nov 29, 2024 · 10 comments

Comments

@anetader
Contributor

After setting epub-output: true or html-output: true, the "Produce HTML/EPUB document" pipelines fail.

See failed pipelines: https://github.com/istqborg/istqb_product_base/actions/runs/12083600897

@anetader changed the title from "HTML and EPUB documents are not produced" to "HTML and EPUB documents are not produced with pipeline" on Nov 29, 2024
@Witiko
Contributor

Witiko commented Dec 2, 2024

There are a number of issues preventing the conversion of the example document to EPUB/HTML:

  • The TeX Live package dvisvgm, which is used to convert vector graphics to the SVG format, isn't installed. This must have been a problem ever since we switched from a full TeX Live Docker image to a minimal Docker image with manually listed dependencies in Create a single DOCX file for every TeX document and optimize pipeline #84 this July. Apparently, no one tried to compile to EPUB/HTML since then.

    The solution is to add dvisvgm to file DEPENDS.txt in this repository.

    The command tlmgr path add should also be added after all occurrences of the command tlmgr install in .github/workflows/compile.yml, so that symlinks to the system /bin directories are automatically added for packages like dvisvgm, which install new scripts. This will allow PRs and non-main branches that add such packages to DEPENDS.txt to work correctly, since we only build a Docker image for branch main and reuse the Docker image ghcr.io/istqborg/istqb_product_base otherwise.

  • There appears to be an issue with using the LaTeX packages babel and nicematrix in EPUB and HTML output. The failures are about as informative as the average LaTeX log file (not much). Further investigation seems necessary. Regardless, this seems potentially much more annoying to fix, especially the part with tables.
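The workflow change proposed in the first bullet (every `tlmgr install` followed by `tlmgr path add`) could be guarded by a small check. The following is only a sketch: the function name `installs_without_path_add` is hypothetical, and the rule "a `tlmgr path add` must appear somewhere before the next `tlmgr install`" is a simplifying assumption about how the commands are laid out in .github/workflows/compile.yml.

```python
import re

# Patterns for the two tlmgr commands named in the comment above.
INSTALL = re.compile(r"\btlmgr\s+install\b")
PATH_ADD = re.compile(r"\btlmgr\s+path\s+add\b")


def installs_without_path_add(workflow_text):
    """Return 1-based line numbers of `tlmgr install` commands that are not
    followed by `tlmgr path add` before the next `tlmgr install` (heuristic).
    Only the first `tlmgr install` on a line is considered."""
    lines = workflow_text.splitlines()
    install_lines = [i for i, line in enumerate(lines) if INSTALL.search(line)]
    missing = []
    for pos, start in enumerate(install_lines):
        # The search window ends at the next install command or at end of file.
        end = install_lines[pos + 1] if pos + 1 < len(install_lines) else len(lines)
        # Include the remainder of the install line itself, which covers
        # one-liners such as `tlmgr install foo && tlmgr path add`.
        tail = lines[start][INSTALL.search(lines[start]).end():]
        window = [tail] + lines[start + 1:end]
        if not any(PATH_ADD.search(text) for text in window):
            missing.append(start + 1)
    return missing


if __name__ == "__main__":
    # Hypothetical workflow fragment, not copied from the repository.
    example = (
        "run: |\n"
        "  tlmgr install dvisvgm\n"
        "  tlmgr path add\n"
        "- run: tlmgr install babel\n"
    )
    print(installs_without_path_add(example))
```

Run against the real workflow file, an empty result would suggest that every install step also refreshes the symlinks; any reported line numbers would be places to add `tlmgr path add`.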

@Witiko
Contributor

Witiko commented Dec 2, 2024

@danopolan @anetader: I estimate 2–4 hours of effort. When do you need this fixed?

@danopolan
Contributor

Thanks for the analysis. It's not a priority now, so I will unassign you from this issue, and we will plan its implementation when it is needed.

@Witiko
Contributor

Witiko commented Dec 19, 2024

@danopolan: Our current method for generating HTML and EPUB documents depends on the TeX4ht system. TeX4ht works by patching existing LaTeX packages to produce correct HTML output. However, this approach often lags behind active package development and can be unreliable, especially when dealing with modern, actively maintained LaTeX packages.

Over the past several years, the LaTeX team has been enhancing the LaTeX kernel to support the creation of PDF 2.0 documents. These documents are designed to be fully accessible, complying with both the PDF/UA-2 standard and the Well-Tagged PDF (WTPDF) specification. Accessible PDFs are particularly useful because the WTPDF specification provides a well-defined general algorithm for extracting HTML (and EPUB) content directly from PDF documents.

As with TeX4ht, many LaTeX packages are currently incompatible with accessible PDFs. However, the effort to make LaTeX packages compatible with accessible PDFs is decentralized, and the implementations are more likely to be stable and sustainably maintained by the package authors themselves. Given this, transitioning to accessible PDFs could be a better long-term alternative to TeX4ht. Moreover, future legal requirements might require ISTQB to produce materials as accessible PDFs in certain jurisdictions.

Creating accessible PDFs requires many features of the LuaTeX engine. While pdfTeX is technically compatible with accessible PDFs, it may lack support for some features required by this process. Additionally, we need LuaTeX for other purposes, as discussed in issue #51 and issue #145 on our GitHub repository. As a result, the switch to accessible PDFs would also necessitate migrating our TeX codebase from pdfTeX to LuaTeX.

In the meantime, we still need to maintain and work with the current code that relies on TeX4ht. To address this, I have reached out to the primary developer of TeX4ht for assistance in patching the LaTeX packages that are causing issues. We could start addressing these problems as early as December or January, depending on your preference.

@Witiko
Contributor

Witiko commented Jan 3, 2025

> To address this, I have reached out to the primary developer of TeX4ht for assistance in patching the LaTeX packages that are causing issues. We could start addressing these problems as early as December or January, depending on your preference.

@danopolan: Please let me know if this is something to look into. The primary developer of TeX4ht won't always be available, so now may be a good time to get the export to HTML and EPUB working and set up automated tests to prevent future regressions.

@danopolan
Contributor

@Witiko if I understood correctly, there are two main things discussed.

  1. Migrating our codebase from pdfTeX to LuaTeX
  2. Fixing TeX4ht packages in order to generate HTML and EPUB from our codebase.

Regarding 1), I am not sure about the required effort and timeline. But yes, accessible PDFs would be a priority for ISTQB.
On the other hand, regarding 2), having HTML and EPUB exports is not a priority for ISTQB now and may not become one until the end of 2025.

If this assumption is right, we do not need to reach out to the TeX4ht developer now.

@Witiko
Contributor

Witiko commented Jan 6, 2025

> Accessible PDFs would be a priority for ISTQB.

We need neither to migrate from pdfTeX to LuaTeX nor to fix the issues with TeX4ht to start making our documents more accessible. In theory, just adding the following line as the first line of our .tex files should make the output PDFs fully accessible, complying with the PDF/UA-2 standard:

\DocumentMetadata{pdfversion=2.0, pdfstandard=ua-2, testphase={phase-III, title, table, math, firstaid}}

In practice, various LaTeX packages have varying degrees of support for PDF tagging. Furthermore, in the future, the LaTeX team is expected to only support some features related to PDF tagging in LuaTeX, not pdfTeX. Therefore, some parts of the documents, such as tables, may be incorrectly tagged. However, unlike with TeX4ht, these are "soft" failures: We can detect them by parsing the logs for warnings and by checking the produced PDFs, but they shouldn't crash the compilation.
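As a rough illustration of "parsing the logs for warnings", the sketch below collects lines in the standard LaTeX `Package <name> Warning: <text>` format from a .log file. The function names are hypothetical, and filtering on the `tagpdf` package is an assumption about where tagging-related warnings would come from; the actual set of packages to watch would need to be confirmed against real logs.

```python
import re

# Standard first line of a LaTeX package warning, e.g.
# "Package somepackage Warning: some text".
PACKAGE_WARNING = re.compile(r"^Package (\S+) Warning: (.*)$")


def package_warnings(log_text):
    """Return (package, message) pairs for each package warning in the log.

    Only the first line of each warning is captured; continuation lines
    such as "(somepackage)   more text" are ignored in this sketch.
    """
    found = []
    for line in log_text.splitlines():
        match = PACKAGE_WARNING.match(line)
        if match:
            found.append((match.group(1), match.group(2).strip()))
    return found


def warnings_from(log_text, packages):
    """Keep only the warnings emitted by the given package names."""
    return [w for w in package_warnings(log_text) if w[0] in packages]


if __name__ == "__main__":
    # Made-up log excerpt for illustration only, not real compiler output.
    sample = (
        "Package tagpdf Warning: example message\n"
        "(tagpdf)                continuation line\n"
        "Package hyperref Warning: another example\n"
    )
    for package, message in warnings_from(sample, {"tagpdf"}):
        print(f"{package}: {message}")
```

A pipeline step could fail, or merely annotate the build, when `warnings_from` returns a non-empty list, which keeps the failure "soft" in the sense described above.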

@danopolan
Contributor

Ok, so I see two actions here:

  1. make PDFs fully accessible based on the PDF/UA-2 standard
  2. plan migration from pdfTeX to LuaTeX and catch errors so we can act on them

I would not proceed with either task now, since we are right before the release of CTAL TA in March and there is plenty of work going on based on the Beta review. But I would report both so we can start working on them when the time is right.

@Witiko
Contributor

Witiko commented Jan 7, 2025

Absolutely, I just wanted to put these ideas forward to help with long-term planning, as they'll need our attention eventually. Another important long-term task to consider (though I realize this is off-topic for this ticket) is enhancing the validation of input documents.

@danopolan
Contributor

I will most probably schedule a call for next month to discuss long-term planning.
