Markdown/AsciiDoc --> PDF tool #39

Open
robogeek opened this issue Dec 4, 2024 · 28 comments

@robogeek

robogeek commented Dec 4, 2024

A long-standing question for AkashaCMS is how to produce PDF from content, for example from an EPUB. The current work was prompted by a protocol specification that is being converted to Markdown, for which the standards org needs to produce a PDF.

https://github.com/alanshaw/markdown-pdf -- Uses the Remarkable toolchain, does a conversion of Markdown to HTML, then uses PhantomJS to print that to PDF.

With AkashaCMS - set up a project with templates etc - convert the Markdown or AsciiDoc to HTML as is typical for AkashaCMS. Then the HTML files need to be combined into a single HTML file, with the content of each file in place and a page break between each block of HTML. Probably - <section id="file-1">...</section><section id="file-2">...</section>.... Then, as is done in markdown-pdf, use PhantomJS to print it to PDF.
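
The combining step could be sketched roughly as below. This is only an illustration: combineHtml is a hypothetical helper, not part of AkashaCMS, and it assumes the per-file HTML bodies are already available as strings in reading order. The title is inserted without HTML escaping.

```javascript
// Hypothetical helper: merge rendered HTML bodies into one printable document.
// `fragments` is an array of body-level HTML strings, already in reading order.
function combineHtml(fragments, title = 'Combined document') {
    const sections = fragments.map((html, i) => {
        // Every section after the first starts on a new page when printed.
        const style = i === 0 ? '' : ' style="page-break-before: always;"';
        return `<section id="file-${i + 1}"${style}>\n${html}\n</section>`;
    });
    return [
        '<!DOCTYPE html>',
        '<html>',
        `<head><meta charset="utf-8"><title>${title}</title></head>`,
        '<body>',
        ...sections,
        '</body>',
        '</html>'
    ].join('\n');
}

const combined = combineHtml(['<h1>Intro</h1>', '<h1>Details</h1>']);
console.log(combined);
```

The resulting string can be written to a single .html file and handed to whichever headless browser does the printing.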

This gives the option of fancy layout and styling that makes its way into the PDF.

If the input were an EPUB document tree, then you'd have the TOC to guide the ordering of HTML files.

I notice that the markdown-pdf repository has several issues requesting that the project switch away from PhantomJS to something else because Phantom is no longer supported.

https://www.npmjs.com/package/mdpdf is a very similar package that uses Puppeteer under the covers.

@robogeek

robogeek commented Dec 4, 2024

Alternate to using PhantomJS is - https://github.com/dompdf/dompdf - this is the PHP implementation of PDF rendering.

@robogeek

robogeek commented Dec 4, 2024

Discussion: https://superuser.com/questions/689056/how-can-i-convert-github-flavored-markdown-to-a-pdf

Listing of several tools: https://gist.github.com/justincbagley/ec0a6334cc86e854715e459349ab1446

The two links above are resources for tool recommendations. One thing I gathered is that Pandoc relies on LaTeX under the hood for PDF output - I know about LaTeX from having used it in college many years ago - but it is a rather large and heavy-weight thing.

https://github.com/alanshaw/markdown-pdf - This tool converts Markdown to HTML, then uses PhantomJS to print that to PDF. PhantomJS is a headless browser built on the WebKit engine that was most often used for automated UI testing. It is a "browser" that you can drive from software, and can therefore be used to generate PDF.

https://github.com/dompdf/dompdf - is an implementation of high quality PDF rendering in PHP.

https://gist.github.com/justincbagley/ec0a6334cc86e854715e459349ab1446?permalink_comment_id=5281580#gistcomment-5281580 -- Is an interesting comment in the Gist above. It relies on a Visual Studio Code extension for printing PDF from Markdown.

@robogeek

robogeek commented Dec 9, 2024

Trying to use Paged.js -- https://pagedjs.org/

The getting started page has instructions for Node.js usage at - https://pagedjs.org/documentation/2-getting-started-with-paged.js/#starting-paged.js

Commands:

npm install -g pagedjs-cli pagedjs
pagedjs-cli index.html -o result.pdf

The resulting installation:

$ npm ls pagedjs-cli pagedjs
[email protected] /home/david/Projects/openadr/docx2pdf
├─┬ [email protected]
│ └── [email protected] deduped
└── [email protected]

The primary tool is: https://www.npmjs.com/package/pagedjs-cli

I have constructed an AkashaCMS website project with a single Markdown file. This file is generated using Pandoc from a DOCX file created with Word by a standards agency. The DOCX file is in good shape. The Markdown file was not very good, but usable and could be cleaned up.

Output:

$ npx pagedjs-cli -i out/Definition.html -o def-paged.pdf
◷ Loading: out/Definition.htmlError: Failed to launch the browser process!
[1251282:1251282:1209/225626.894306:FATAL:zygote_host_impl_linux.cc(127)] No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.


TROUBLESHOOTING: https://pptr.dev/troubleshooting

    at ChildProcess.onClose (file:///home/david/Projects/openadr/docx2pdf/node_modules/@puppeteer/browsers/lib/esm/launch.js:268:24)
    at ChildProcess.emit (node:events:530:35)
    at ChildProcess._handle.onexit (node:internal/child_process:293:12)

The web page referenced in the error discusses installing chrome-devel-sandbox but that doesn't seem to exist in the Ubuntu package system.

@robogeek

robogeek commented Dec 9, 2024

For the testing so far - I had a DOCX file which was created in MS Word by a standards organization. The plan is:

  • Produce a Markdown file from the DOCX
  • The standards organization will then be able to edit (via GitHub) the standard in Markdown format
  • Develop a process for creating a good-looking PDF from the Markdown file which the standards organization can release to the public
  • Producing a DOCX as well is an optional, much lower priority result

For the first step, DOCX->Markdown, I tried two programs.

  1. Mammoth - https://www.npmjs.com/package/mammoth - Node.js tool. The command line is on the NPM page.
  2. Pandoc - https://pandoc.org/ - This is a general purpose tool for document format conversion.

Pandoc Markdown-HTML conversion:

pandoc test1.md -f markdown -t html -s -o test1.html

Pandoc DOCX to Markdown conversion

pandoc 3.1.0/Definition.docx -t markdown -o def.md

Replacing -t html with -t pdf, and giving the output file name a .pdf extension, is supposed to convert to PDF. But this failed, saying that PDFLaTeX must be installed. I was unable to find out how to install it. This tool is not listed under that name in the Ubuntu packages findable with "apt-get".

This meant two tools for DOCX-Markdown conversion. Neither did a terribly good job. The least bad result was with Pandoc.

@robogeek

robogeek commented Dec 9, 2024

markdown-pdf is a Node.js tool that converts directly from Markdown to PDF. In theory this is less desirable since there are fewer opportunities to customize the output with good font and color choices or other fancy formatting. But it is worth exploring. Browsing the source code, it does a Markdown-HTML conversion, then loads the HTML into a headless browser instance (PhantomJS, per the --phantom-path option in the usage text), and tells that instance to print to PDF.

https://www.npmjs.com/package/markdown-pdf

The command is:

$ npx markdown-pdf -o def-md-pdf.pdf documents/Definition.html.md 

But, this fails with the following errors:

A4 portrait 2cm 0 10000
Auto configuration failed
131987413665600:error:25066067:DSO support routines:DLFCN_LOAD:could not load the shared library:dso_dlfcn.c:185:filename(libproviders.so): libproviders.so: cannot open shared object file: No such file or directory
131987413665600:error:25070067:DSO support routines:DSO_load:could not load the shared library:dso_lib.c:244:
131987413665600:error:0E07506E:configuration file routines:MODULE_LOAD_DSO:error loading dso:conf_mod.c:285:module=providers, path=providers
131987413665600:error:0E076071:configuration file routines:MODULE_RUN:unknown module name:conf_mod.c:222:module=providers

After some head scratching and searching this is found: https://forums.gentoo.org/viewtopic-p-8793806.html?sid=a7a4227e46aa82bb934fc71187a89cb9

After a lot of discussion those Gentoo guys came up with this solution:

export OPENSSL_CONF=/etc/ssl

That seemingly random suggestion actually works great. It produces a PDF that - considering the state of the Markdown file - is not bad.

The USAGE information hints at some interesting options for customizing the output using CSS

Usage: markdown-pdf [options] <markdown-file-path>

Options:
  -V, --version                            output the version number
  <markdown-file-path>                     Path of the markdown file to convert
  -c, --cwd [path]                         Current working directory
  -p, --phantom-path [path]                Path to phantom binary
  -h, --runnings-path [path]               Path to runnings (header, footer)
  -s, --css-path [path]                    Path to custom CSS file
  -z, --highlight-css-path [path]          Path to custom highlight-CSS file
  -m, --remarkable-options [json-options]  Options to pass to remarkable
  -f, --paper-format [format]              "A3", "A4", "A5", "Legal", "Letter" or "Tabloid"
  -r, --paper-orientation [orientation]    "portrait" or "landscape"
  -b, --paper-border [measurement]         Supported dimension units are: "mm", "cm", "in", "px"
  -d, --render-delay [millis]              Delay before rendering the PDF
  -t, --load-timeout [millis]              Timeout before the page is rendered in case `page.onLoadFinished` isn't fired
  -o, --out [path]                         Path of where to save the PDF
  -h, --help                               output usage information
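
As a rough sketch of how those options might be assembled from a build script, the helper below turns a plain options object into an argument list. markdownPdfArgs is a hypothetical name, print.css is a made-up stylesheet, and only long flags that appear in the usage text above are used.

```javascript
// Build an argument list for the markdown-pdf CLI from a plain options object.
// Only flags listed in the usage text are emitted; unset options are skipped.
function markdownPdfArgs({ input, out, cssPath, paperFormat, paperOrientation, paperBorder }) {
    const args = [];
    if (cssPath) args.push('--css-path', cssPath);
    if (paperFormat) args.push('--paper-format', paperFormat);
    if (paperOrientation) args.push('--paper-orientation', paperOrientation);
    if (paperBorder) args.push('--paper-border', paperBorder);
    if (out) args.push('--out', out);
    // The Markdown file path comes last, after all options.
    args.push(input);
    return args;
}

const args = markdownPdfArgs({
    input: 'documents/Definition.html.md',
    out: 'def-md-pdf.pdf',
    cssPath: 'print.css',        // hypothetical stylesheet name
    paperFormat: 'Letter'
});
console.log(['npx', 'markdown-pdf', ...args].join(' '));
```

When spawning the command from Node.js, the OPENSSL_CONF workaround would presumably need to be passed along in the child process environment as well.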

@robogeek

Puppeteer - https://pptr.dev/ -

Puppeteer is a JavaScript library which provides a high-level API to control Chrome or Firefox over the DevTools Protocol or WebDriver BiDi. Puppeteer runs in the headless (no visible UI) by default

The model here is to generate HTML from Markdown using AkashaCMS, then do what's necessary to use Puppeteer to PrintToPDF

Docs: https://pptr.dev/api/puppeteer.page.pdf

Docs: https://pptr.dev/guides/pdf-generation

Docs: https://pptr.dev/api/puppeteer.pdfoptions

Tutorial: https://blog.risingstack.com/pdf-from-html-node-js-puppeteer/

Tutorial: https://apitemplate.io/blog/tips-for-generating-pdfs-with-puppeteer/

Tutorial: https://www.bannerbear.com/blog/how-to-make-a-pdf-from-html-with-node-js-and-puppeteer/

Tutorial: https://doppio.sh/blog/how-to-generate-pdfs-with-node-js-and-puppeteer

Tutorial: https://medium.com/@fmoessle/use-html-and-puppeteer-to-create-pdfs-in-node-js-566dbaf9d9ca

Tutorial: https://www.webshare.io/academy-article/puppeteer-html-to-pdf

Implementation details: https://stackoverflow.com/questions/51458286/how-does-header-and-footer-printing-work-in-puppeters-page-pdf-api

Tutorial: https://medium.com/@chrisgordon256/how-to-convert-html-to-pdf-with-puppeteer-7c543e69a3c2

Tutorial: https://dev.to/gabrielqueirozdev/generate-pdf-with-puppeteer-handlebars-355h

Tutorial: https://raphaelstaebler.info/en/blog/advanced-pdf-generation-for-node-js-using-puppeteer/

Tutorial: https://medium.com/@damcossetfreelance/generate-a-pdf-from-html-with-puppeteer-dc1dd17b499f

Tool: https://www.npmjs.com/package/pdf-puppeteer -- A tool for generating PDF from HTML

Tutorial: https://medium.com/@bikramkawan/how-to-generate-high-quality-pdf-using-puppeteer-for-node-js-on-apple-m1-fa666cf72eab

Tutorial: https://codestax.medium.com/multi-page-pdf-with-distinct-layout-using-puppeteer-ee8d45c7594b -- Multi-Page - Page Break - Distinct styling per page

Gist: https://gist.github.com/glenhallworthreadify/d447e9d6b1fc9cb807b46f952236d4bc

Tutorial: https://ironpdf.com/blog/pdf-tools/html-to-pdf-node-js-puppeteer/

Tutorial: https://www.browserless.io/blog/puppeteer-pdf-generator

@robogeek

Using DOMPDF. I first tried using Composer to set up DOMPDF as per some of the tutorials. But this gave an error that is unclear how to resolve:

PHP Warning:  The use statement with non-compound name 'Dompdf' has no effect in /home/david/Projects/openadr/docx2pdf/t/cvt.php on line 7
PHP Fatal error:  Uncaught Error: Class "Dompdf" not found in /home/david/Projects/openadr/docx2pdf/t/cvt.php:10
Stack trace:
#0 {main}
  thrown in /home/david/Projects/openadr/docx2pdf/t/cvt.php on line 10

The script source is:

<?php

// The Composer autoloader
require_once 'vendor/autoload.php';

// Reference the Dompdf namespace
use Dompdf\Dompdf as Dompdf;

// Instantiate and use the dompdf class
$dompdf = new Dompdf();

// Load HTML content to generate a PDF
// $dompdf->loadHtml('<h1 style="color:blue;">AllPHPTricks.com</h1>');

// Load PDF content from an HTML file
$html_file = file_get_contents("../../out/Definition.html");

$dompdf->loadHtml($html_file);

// (Optional) Setup the paper size and orientation
$dompdf->setPaper('A4', 'landscape');

// Render the HTML as PDF
$dompdf->render();

// Returns the PDF file as a string.
$pdf_string = $dompdf->output();

// PDF file name and location to store file
$pdf_file_loc = 'Definition.pdf';

// Save generated PDF to the desired location with custom name
file_put_contents($pdf_file_loc, $pdf_string);

?>

This is from one of the tutorials, modified lightly to directly load HTML from a local file.

Instead, the DOMPDF GitHub repository (https://github.com/dompdf/dompdf) has a Releases section. I downloaded the most recent release, dompdf.3.0.1.zip. The releases page says these .zip files have everything required, but they also recommend using Composer for managing dependencies. Right.

Notice that the script loads vendor/autoload which appears to be how PHP files can find classes.

Notice that installing Composer as a native thing on my laptop resulted in installation of a zillion PHP things.

$ php cvt.php 
PHP Fatal error:  Uncaught Error: Class "DOMImplementation" not found in /home/david/Projects/openadr/docx2pdf/tt/dompdf/vendor/masterminds/html5/src/HTML5/Parser/DOMTreeBuilder.php:172
Stack trace:
#0 /home/david/Projects/openadr/docx2pdf/tt/dompdf/vendor/masterminds/html5/src/HTML5.php(157): Masterminds\HTML5\Parser\DOMTreeBuilder->__construct()
#1 /home/david/Projects/openadr/docx2pdf/tt/dompdf/vendor/masterminds/html5/src/HTML5.php(89): Masterminds\HTML5->parse()
#2 /home/david/Projects/openadr/docx2pdf/tt/dompdf/vendor/dompdf/dompdf/src/Dompdf.php(514): Masterminds\HTML5->loadHTML()
#3 /home/david/Projects/openadr/docx2pdf/tt/dompdf/cvt.php(18): Dompdf\Dompdf->loadHtml()
#4 {main}
  thrown in /home/david/Projects/openadr/docx2pdf/tt/dompdf/vendor/masterminds/html5/src/HTML5/Parser/DOMTreeBuilder.php on line 172

Fixing that required the following:

$ sudo apt-cache search php-dom
[sudo] password for david: 
php8.3-xml - DOM, SimpleXML, XML, and XSL module for PHP
$ sudo apt-get install php8.3-xml

After which:

$ php cvt.php 

No output means success. Indeed, Definition.pdf was in the directory. Using a PDF viewer showed a PDF that matched the Markdown.

@robogeek

Using Puppeteer, I developed the following script, which renders the Markdown to HTML and then uses Puppeteer to render the HTML to PDF.

import fs from 'node:fs';
import { default as config } from './config.mjs';
import puppeteer from 'puppeteer';

const akasha = config.akasha;
await akasha.setup(config);

// await data.removeAll();
await config.copyAssets();
let results = await akasha.render(config);

// Initialization comes from 
// https://apitemplate.io/blog/tips-for-generating-pdfs-with-puppeteer/
const browser = await puppeteer.launch({
    headless: true,
    userDataDir: './tmp',
    args: [   '--disable-features=IsolateOrigins',
              '--disable-site-isolation-trials',
              '--autoplay-policy=user-gesture-required',
              '--disable-background-networking',
              '--disable-background-timer-throttling',
              '--disable-backgrounding-occluded-windows',
              '--disable-breakpad',
              '--disable-client-side-phishing-detection',
              '--disable-component-update',
              '--disable-default-apps',
              '--disable-dev-shm-usage',
              '--disable-domain-reliability',
              '--disable-extensions',
              '--disable-features=AudioServiceOutOfProcess',
              '--disable-hang-monitor',
              '--disable-ipc-flooding-protection',
              '--disable-notifications',
              '--disable-offer-store-unmasked-wallet-cards',
              '--disable-popup-blocking',
              '--disable-print-preview',
              '--disable-prompt-on-repost',
              '--disable-renderer-backgrounding',
              '--disable-setuid-sandbox',
              '--disable-speech-api',
              '--disable-sync',
              '--hide-scrollbars',
              '--ignore-gpu-blacklist',
              '--metrics-recording-only',
              '--mute-audio',
              '--no-default-browser-check',
              '--no-first-run',
              '--no-pings',
              '--no-sandbox',
              '--no-zygote',
              '--password-store=basic',
              '--use-gl=swiftshader',
              '--use-mock-keychain']
});

const page = await browser.newPage();
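// Resolve the rendered file relative to this script; __dirname is not defined in plain Node.js ES modules
await page.goto(new URL('./out/Definition.html', import.meta.url).href, { waitUntil: 'networkidle0' });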

// Generate PDF at default resolution
const pdf = await page.pdf({format: 'A4'});

// Write PDF to file
fs.writeFileSync('default.pdf', pdf);

await browser.close();
await akasha.closeCaches();

Timing for just building using the traditional build process:

 time npm run build

> [email protected] build
> npm-run-all build:copy build:render


> [email protected] build:copy
> akasharender copy-assets config.mjs


> [email protected] build:render
> akasharender render --quiet config.mjs

Could not find the language '{=html}', did you forget to load/include a language module?

real	0m3.579s
user	0m4.205s
sys	0m0.491s

Then building using the above script:

$ time npx zx build.mjs 
Could not find the language '{=html}', did you forget to load/include a language module?

real	0m5.252s
user	0m4.260s
sys	0m0.892s

@robogeek

In Puppeteer the page.pdf method takes an options object that supports setting page margins, page format, and headers and footers. I haven't been able to get the headers and footers working, but the other parts do work.

// Generate PDF with margins, plus header and footer templates
const pdf2 = await page2.pdf({
    format: 'A4',
    margin: { top: '20mm', right: '20mm', bottom: '20mm', left: '20mm' },
    displayHeaderFooter: true,
    headerTemplate: '<div class="title">TITLE GOES HERE</div>',
    footerTemplate: '<div>Page <span class="pageNumber"></span> of <span class="totalPages"></span></div>'
});

@robogeek

For this specification there is a table of information that can be pulled in from JSON schema files.

I implemented a Mahafunc to do so, and render through a Nunjucks template.

@robogeek

Adding an auto-generated Table of Contents:

import { default as MarkdownItAnchor } from 'markdown-it-anchor';
import { default as MarkdownItTOC } from 'markdown-it-table-of-contents';

config.findRendererName('.html.md')
    // ...
    .use(MarkdownItAnchor)
    .use(MarkdownItTOC)

These two work together to generate a TOC from the header tags. Simply place [[toc]] in the content at the desired location.

https://www.npmjs.com/package/markdown-it-hierarchy -- looked like a useful extension to this. It would automatically generate section numbers based on the header hierarchy. While it adds section numbers to the rendered Hn tags, the ToC does not include the section numbers.

It would be better to manually insert section numbers.

@robogeek

Old-school support for images was solely the <img> tag. The modern way is <figure><img><figcaption>..</figcaption></figure>. While AkashaCMS has its own way to handle this, <img figure href=.. caption=..>, there is also a Markdown-IT plugin for this purpose.

https://www.npmjs.com/package/markdown-it-image-figures

Example:

![OAuth2 client credential flow](./img/defs-fig-16-oath-client-credential-flow.png "Figure 16: OAuth2 client credential flow")

@robogeek

Using draw.io to create diagrams is very easy. That service is especially geared towards creating software diagrams. You can use it online via the domain name, or there is a cross-platform application that's even available for Linux. This app lets one directly save files to the disk rather than saving to Google Drive. Hence that's good for personal privacy, since Google has no business knowing what diagrams I'm drawing.

In the application, Export As the drawing, and make sure to tick the checkbox that says to include the diagram in the image. The diagram can then be edited later by opening it again in draw.io. The app does allow saving a .drawio file that contains the diagram, but that file cannot be viewed as an image. Hence it is preferred to save the image in PNG with the embedded diagram rather than managing two files.

An alternate way to embed diagrams is using PlantUML. The PlantUML extension is installed in the configuration. A PlantUML diagram description is simple text, but the syntax is not well documented. It is recommended to configure the plugin for SVG output. The resulting diagram does render into PDF.

Problem 1 - A diagram description that made sense according to the documentation inscrutably gave an error.

Problem 2 - After rendering one image, subsequent images had a banner with a QR code, which presumably requires the payment of a usage fee to remove. If I were using PlantUML regularly that might be worthwhile. However, carefully reading the website, there is nothing about any fee, and everything is GPL.

Problem 3 - the most righteous way of using PlantUML is with a local server. But how to manage that in a Node.js build environment? There are some Markdown-IT PlantUML plugins that work with a local plantuml.jar - I didn't try any of them but maybe that's all which is required and the plugin takes care of it.

Problem 4 - The diagrams used in the Definitions document are not supported in PlantUML but were easy to draw using draw.io.

@robogeek

Doing some research into alternatives - jsPDF looks good. There's a simple example that would work something like this:

import { promises as fsp } from 'node:fs';
import { jsPDF } from 'jspdf/dist/jspdf.node.js'; // will automatically load the node version

const defs = await fsp.readFile('out/Definition.html', 'utf-8');

const doc = new jsPDF();

// doc.text("Hello world!", 10, 10);
// doc.save("a4.pdf");

doc.html(defs, {
    callback: (doc) => {
        doc.save('defs.pdf');
    }
});

But it fails with this message:

ReferenceError: document is not defined

The stack trace points to the createElement function, and it is expecting document.createElement to work, but document is undefined.

The following issue explains what's going on.
parallax/jsPDF#2805 (comment)

Namely - to convert HTML to something the jsPDF library can use, a full DOM implementation is required, which in turn requires a web browser. Their recommendation is using Puppeteer.

Yes, I follow that a DOM implementation is required.

@robogeek

Getting header/footer to work requires setting enough CSS styles.

Some discussion said that Puppeteer header/footer sections have zero CSS - puppeteer/puppeteer#1853 and puppeteer/puppeteer#10024

    headerTemplate: `
        <div class="text-center title" style="margin-left: auto; margin-right: auto; font-size: 12px;">TITLE GOES HERE</div>
    `,
    // headerTemplate: `
    //     <div class="text-center title" style="margin: 0 15mm 5mm; font-size: 12px;">TITLE GOES HERE</div>
    // `,
    footerTemplate: `
        <div class="text-left"  style="margin: 0 auto 0 20mm; text-align: left; font-size: 12px;">Copyright © OpenADR Alliance (2023-24). All Rights Reserved</div>
        <div class="text-right" style="margin: 0 20mm 0 auto; text-align: right; font-size: 12px;">Page <span class="pageNumber"></span> of <span class="totalPages"></span></div>
    `,
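
A possible way to keep these templates manageable is to build the whole page.pdf options object in one helper. The sketch below is an assumption-laden illustration: buildPdfOptions is a hypothetical name, and the flex-based footer layout is a swapped-in alternative to the margin-based divs above, not something confirmed in the linked issues. All styling is inline because, as the discussions say, the header and footer sections start with no CSS.

```javascript
// Build the options object for Puppeteer's page.pdf().
// The wrapper <div> layout here is an assumption, not taken from the thread.
function buildPdfOptions({ title, copyright, marginMm = 20, fontSizePx = 12 }) {
    const margin = `${marginMm}mm`;
    // Shared inline style: full width, side padding matching the page margin,
    // and an explicit font size so the text is large enough to read.
    const base = `box-sizing: border-box; width: 100%; padding: 0 ${margin}; font-size: ${fontSizePx}px;`;
    return {
        format: 'A4',
        margin: { top: margin, right: margin, bottom: margin, left: margin },
        displayHeaderFooter: true,
        headerTemplate: `<div style="${base} text-align: center;">${title}</div>`,
        footerTemplate:
            `<div style="${base} display: flex; justify-content: space-between;">` +
            `<span>${copyright}</span>` +
            '<span>Page <span class="pageNumber"></span> of <span class="totalPages"></span></span>' +
            '</div>'
    };
}

const pdfOptions = buildPdfOptions({
    title: 'TITLE GOES HERE',
    copyright: 'Copyright © OpenADR Alliance (2023-24). All Rights Reserved'
});
console.log(pdfOptions.footerTemplate);
// With a Puppeteer page already open: const pdf = await page.pdf(pdfOptions);
```

The page margins matter here too, since the header and footer are drawn inside the top and bottom margin areas.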

@robogeek

Implementing the page break before/after the Table of Contents. That is, the title page should be on its own page, then the table of contents on a new page, and the content start on a new page.

It was necessary to carefully read: https://www.w3schools.com/cssref/pr_print_pagebb.php

page-break-before and page-break-after mean what they say: there is a page break before the element and another after it.

Hence, the section for Table of Contents is:

<div style="page-break-before: always; page-break-after: always;"> <!-- Your page Content -->

# CONTENTS { #contents .page_break }

[[toc]]

</div>

That's one solution - to throw a <div> around it with inline CSS styles.

With the current Mahabhuta-IT configuration, the structure is:

<section id="contents" class="page_break" tabindex="-1">
<h1 class="page_break" tabindex="-1">CONTENTS</h1>
... ToC
</section>

The <section> is autogenerated. The ID #contents is attached to the H1 declaration with { #contents .page_break }.

Hence, this CSS can be used - and therefore avoid putting a <div> around it.

section#contents {
    page-break-after:  always;
    page-break-before: always;
}

/* https://stackoverflow.com/questions/22746958/dompdf-adding-a-new-page-to-pdf */

/*
div.page_break {
    page-break-before: always;
}
*/

Hence, the generated <section> has page break before and after, which is how it functions in the PDF.

@robogeek

robogeek commented Dec 14, 2024

As cool as the Paged.js project sounds, and as comprehensive as their approach is, is all we really need a good stylesheet for printing?

https://printedcss.com/ -- This is one. Simply add it to the stylesheets and the content looks much better. Plus the stylesheet can be tweaked as desired.

NOTE - While this made some visual improvement - the page break behavior is strange. There is a page break between the cover page and the CONTENTS section, and there is a page break after the CONTENTS section before the content. But, there is now a page break between the CONTENTS header and the <div><ul></ul></div> for the table of contents. Exploring the HTML and looking at the style sheet doesn't make it clear why this spurious page break occurs.

One possibility is this:

	a, blockquote, canvas, details, dl, figure, img, ol, picture, svg, table, ul {
		break-inside: avoid;
		page-break-inside: avoid; }

Since the table of contents is a big <ul></ul> which is longer than one page, the browser may have taken this to move the ToC to its own page.

Yes - commenting out "ul" from this did eliminate the spurious page break.

That was found with the query "css style printing" which turns up lots of tutorials. "css stylesheet printing" is a little better.

https://gist.github.com/davidhund/0cf7dd437402c5a1dcb7bd701141a4a7

https://iangmcdowell.com/blog/posts/laying-out-a-book-with-css/

https://ebooks.stackexchange.com/questions/6742/any-opensource-or-free-css-stylesheets-for-books

http://bbebooksthailand.com/bb-CSS-boilerplate.html

https://ideatrash.net/2012/11/my-updated-css-stylesheet-for-ebook.html

https://datatracker.ietf.org/doc/rfc7993/ -- Guidance for CSS stylesheets used with IETF RFCs

https://design-system.w3.org/ -- The w3c publishes lots of standards. This describes the tools for doing so (CSS and JavaScript anyway)

https://www.w3.org/Guide/manual-of-style/ W3C Manual of Style

@robogeek

DO NOT USE markdown-it-hierarchy. It seems like a perfectly wonderful plugin -- automatically generating a numerical hierarchy for H tags in a document. But, if your application processes multiple documents, the numbering will keep increasing for each document.

The numbering MUST reset to 1 for each document.

shytikov/markdown-it-hierarchy#4
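
The reset requirement can be illustrated with a small sketch in which the counters live in a closure created once per document. createHeadingNumberer is a hypothetical name and is not how markdown-it-hierarchy is implemented; it only shows the behavior wanted here. A heading that skips a level (H1 followed directly by H3) would get a 0 in the skipped position.

```javascript
// Returns a function that assigns hierarchical numbers ("1", "1.1", "1.2", "2", ...)
// to headings in document order. Each call to createHeadingNumberer() starts
// from fresh counters, so numbering restarts for every document.
function createHeadingNumberer(maxLevel = 6) {
    const counters = new Array(maxLevel).fill(0);
    return function next(level) {
        counters[level - 1] += 1;
        // Entering a heading resets the counters of all deeper levels.
        for (let i = level; i < maxLevel; i++) counters[i] = 0;
        return counters.slice(0, level).join('.');
    };
}

// One numberer per document: the second document starts again at 1.
const docA = createHeadingNumberer();
const numbersA = [1, 2, 2, 1, 2].map((level) => docA(level));
const docB = createHeadingNumberer();
const numbersB = [1, 2].map((level) => docB(level));
console.log(numbersA.join(' '));  // 1 1.1 1.2 2 2.1
console.log(numbersB.join(' '));  // 1 1.1
```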

@robogeek

I found an immensely crazy solution for Hn tag numbering that also handles the Table of Contents. It works purely in the CSS so does not require any Markdown or Mahabhuta extension. Perhaps, though, the concept can be turned into some code, because it would be useful for rendered HTML to have these numbers rather than it being generated upon display.

https://gist.github.com/rodolfoap/6cd714a65a891c6fe699ab91f0d22384

The concept appears to rely on an implicit tree traversal...?

There is a failing - the numbering in the ToC is different from the numbering in the content. This is because the H1 for CONTENTS has .header-title which the CSS explicitly detects and skips numbering, but there's no facility in the ToC.

There's an implicit failure here. The numbering should be computed in the Hn headers, and have a CSS class to indicate Hn headers that should not be numbered (.skip-numbering). Then, using ID values (?) copy the numbering over to the matching ToC entry.

A more serious failure is that the numbering in the text is completely wrong. Every H2/H3/etc tag has numbering that starts at 0.

It's way too Zen for me at the moment.

body {
    counter-reset: h1
}

h1 {
    counter-reset: h2
}

h2 {
    counter-reset: h3
}

h3 {
    counter-reset: h4
}

h1:not(.header-title)::before {
    counter-increment: h1;
    content: counter(h1) ". "
}

h2:before {
    counter-increment: h2;
    content: counter(h1) "." counter(h2) ". "
}

h3:before {
    counter-increment: h3;
    content: counter(h1) "." counter(h2) "." counter(h3) ". "
}

h4:before {
    counter-increment: h4;
    content: counter(h1) "." counter(h2) "." counter(h3) "." counter(h4) ". "
}

ul {
  counter-reset: section;
  list-style-type: none;
}

ul li {
  position: relative;
}

ul li::before {
  counter-increment: section;
  content: counters(section, ".") ". ";
}

ul ul li::before {
  content: counters(section, ".") ". ";
}

ul ul {
  counter-reset: section;
}

@robogeek

robogeek commented Dec 15, 2024

How to do Citations to Other Technical Documents in a similar way to how IETF or W3C standards do it?

That is a section in the document has several things like this - [RFC-nnnn] Title string for RFCnnnn - and in the text you'll find [RFC-nnnn].

For scientific paper references, there's this: https://medium.com/@oleksandr.kosovan/how-to-cite-a-paper-using-github-markdown-syntax-5a46268c4ff -- Namely, there are multiple formats for Scientific Paper Citations, but they boil down to having a link to https://doi.org. So.. for that link, use a Markdown link, and then use the citation format which is preferred.

For references like are used in Standards Documents: https://stackoverflow.com/questions/26587527/cite-a-paper-using-github-markdown-syntax

In the References section, have a list like this:

## References
<a id="1">[1]</a> : Dijkstra, E. W. (1968). Go to statement considered harmful. Communications of the ACM, 11(3), 147-148.

Then to refer to it within the document use [[1]](#1). This is a standard Markdown link where the anchor text is [1] and it has a link to #1 which is an internal document link.

If the -attrs extension is installed the reference list entry could be

## References
[1]{#1}: Dijkstra, E. W. (1968). Go to statement considered harmful. Communications of the ACM, 11(3), 147-148.

After experimenting with several formats, none worked. For example, this entry:

<span id=ISO8601">[ISO 8601]</span> ISO date and time format. https://www.iso.org/iso-8601-date-and-time-format.html

Resulted in this output:

<span id=ISO8601">[ISO 8601] ISO date and time format. https://www.iso.org/iso-8601-date-and-time-format.html: https://www.iso.org/iso-8601-date-and-time-format.html

That's just wrong. Later in the document a reference to this was underlined but not an active link.

Inspecting the HTML, the <span id=ISO8601"> is not an HTML element but HTML-encoded text.
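One thing worth checking, offered as a guess rather than a confirmed diagnosis: the span as typed above has an unbalanced quote (id=ISO8601" rather than id="ISO8601"). Under the CommonMark grammar for inline HTML, a tag with a stray quote is not recognized as an HTML tag, so it is emitted as escaped text, which would match what the inspection showed. Separately, Markdown-IT only passes raw HTML through when its html option is enabled. The sketch below is a simplified regex written for this note, modeled on the CommonMark open-tag rule. It is not the code Markdown-IT actually runs.

```javascript
// Simplified model of the CommonMark "open tag" grammar (an assumption
// for illustration; the real parser handles more cases).
const attrName = '[A-Za-z_:][A-Za-z0-9_.:-]*';
const unquoted = '[^\\s"\'=<>`]+';          // no quotes allowed inside
const singleQuoted = "'[^']*'";
const doubleQuoted = '"[^"]*"';
const attrValue = `(?:${unquoted}|${singleQuoted}|${doubleQuoted})`;
const attribute = `(?:\\s+${attrName}(?:\\s*=\\s*${attrValue})?)`;
const openTag = new RegExp(`^<[A-Za-z][A-Za-z0-9-]*${attribute}*\\s*/?>$`);

// True when the string would qualify as an inline HTML open tag.
function looksLikeOpenTag(text) {
  return openTag.test(text);
}

console.log(looksLikeOpenTag('<span id="ISO8601">')); // balanced quotes
console.log(looksLikeOpenTag('<span id=ISO8601">'));  // as typed in the entry
```

If this is the cause, balancing the quote (and confirming the html option is on) would be the first thing to retry before giving up on the span approach.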

@robogeek
Contributor Author

Created a Mahabhuta function for a) numbering Hn headers in the document, and b) generating a Table of Contents from those Hn headers, nesting them as appropriate.

The auto-numbering technique is similar to the CSS above, but in a Mahabhuta function.
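The counter logic can be sketched independently of Mahabhuta. This is a minimal sketch, assuming the headers have already been collected as { level, text } objects in document order. The function names numberHeadings, buildToc and renderToc are hypothetical, and the actual Mahabhuta function works on the DOM rather than on plain arrays.

```javascript
// Assign "1.2.3"-style numbers, mirroring CSS counter-reset/counter-increment.
function numberHeadings(headings) {
  const counters = [];
  return headings.map(({ level, text }) => {
    // Going deeper starts new counters at 0; coming back up discards
    // the deeper counters, which is the "reset" step.
    while (counters.length < level) counters.push(0);
    counters.length = level;
    counters[level - 1] += 1;
    return { level, text, number: counters.join('.') };
  });
}

// Nest the numbered headings into a tree according to their levels.
function buildToc(numbered) {
  const root = { children: [] };
  const stack = [{ level: 0, node: root }];
  for (const h of numbered) {
    while (stack[stack.length - 1].level >= h.level) stack.pop();
    const node = { title: `${h.number} ${h.text}`, children: [] };
    stack[stack.length - 1].node.children.push(node);
    stack.push({ level: h.level, node });
  }
  return root.children;
}

// Render the tree as a nested UL/LI list (titles are not HTML-escaped here).
function renderToc(nodes) {
  if (nodes.length === 0) return '';
  const items = nodes.map((n) => `<li>${n.title}${renderToc(n.children)}</li>`);
  return `<ul>${items.join('')}</ul>`;
}

const numbered = numberHeadings([
  { level: 1, text: 'Intro' },
  { level: 2, text: 'Scope' },
  { level: 2, text: 'Terms' },
  { level: 1, text: 'Protocol' },
]);
console.log(renderToc(buildToc(numbered)));
```

Note that a document skipping a level (H1 straight to H3) would get a 0 in the number, such as 1.0.1, so a real implementation may want to handle that case explicitly.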

@robogeek
Contributor Author

For footnotes -- the document had Pandoc-generated footnotes where [^1] is a reference to Footnote 1. Elsewhere in the text the footnote is defined as so:

[^1]: Footnote text

The implementation comes from: https://www.npmjs.com/package/markdown-it-footnote

It does not rely on the AkashaCMS Footnotes plugin. In fact, that plugin should be deprecated in favor of https://www.npmjs.com/package/markdown-it-footnote

The presentation was helped by a custom rule as described in the package documentation. To add a custom rule, the Markdown Renderer in the Renderers package had to be changed to expose the md.renderer.rules object.
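A sketch of what such a custom rule can look like. Going by the markdown-it-footnote README, the rule that opens the footnote block is named footnote_block_open; that name should be verified against the installed version. The helper name useFootnoteHeading and the heading markup are assumptions for illustration, and the rules argument stands in for the md.renderer.rules object once the Renderers package exposes it.

```javascript
// Replace the default opening of the footnote block (a separator line)
// with a heading. The <section> and <ol> must still be opened here,
// because the matching close rule emits their closing tags.
function useFootnoteHeading(rules, headingText) {
  rules.footnote_block_open = () =>
    `<h2 class="footnotes-heading">${headingText}</h2>\n` +
    '<section class="footnotes">\n' +
    '<ol class="footnotes-list">\n';
  return rules;
}

// Stand-in for md.renderer.rules, so the sketch runs without Markdown-IT.
const exampleRules = useFootnoteHeading({}, 'Footnotes');
console.log(exampleRules.footnote_block_open());
```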

@robogeek
Contributor Author

The Table of Contents is a nested UL/LI/UL/LI list. For good semantic HTML, it's preferable to surround it with a <nav> tag. But when that tag was used, PrintToPDF made the ToC invisible. Ergo, the template for rendering the ToC wraps the list in a <span> instead.

It was noted that a nested UL had some blank space above it. It was determined the blank space came from this definition in print.css:

address, blockquote, details, dl, figure, ol, p, pre, ruby, table, ul {
	margin: 1em 0; }

So -- margin-top is 1em, which is approximately the size of the observed blank space. WHY?

In style.css the following CSS was added:

/* This is for the Table of Contents, which is a
 * nested ul/li/ul/li list.  The nested elements
 * had blank space.  In print.css there is a
 * declaration where ul has margin 1em 0, which means
 * margin-top is 1em.  This overrides that for lists
 * with class list-no-margin.
 *
 * UPDATE: It was discovered that every nested UL/LI
 * list had this issue.  So, we simply apply the
 * override to every <ul>
 */
ul.list-no-margin, ul {
    margin-top: 0em;
}

And, in the Table of Contents Mahabhuta function, the template has list-no-margin in UL elements. That caused the blank space to disappear.

Elsewhere in the text is another nested UL/LI/UL/LI structure. It also had the blank space.

I first experimented with applying this to all UL elements. But that did not fix the problem.

Instead, adding { .list-no-margin } to the UL list in question gave that UL the class and therefore the margin-top became 0em.

Because the Markdown-it-Attrs plugin is installed the method is:

* list item
* list item
    * sub-list-item
    * sub-list-item
    { .list-no-margin }
* list item
* list item
    * sub-list-item
    * sub-list-item
    { .list-no-margin }

In other words, the { .list-no-margin } marker has to be added to each sub-list.

@robogeek
Contributor Author

robogeek commented Dec 22, 2024

It's not possible to add a caption to a code block. All through the OpenADR spec we see this:

\`\`\`json
{
  "title": "Not Found",
  "status": 404,
  "detail": "Unrecognized URL"
}
\`\`\`

**Figure 3. Problem Example**

That is, there's a normal code block, following which is a Figure n. label. It would be correct to tie the two together.

But, there is not a Markdown-IT extension for this.

This article discusses the problem and says there has been discussion in the CommonMark community, but no agreed-upon syntax: https://thesynack.com/posts/codeblock-labels-markdown/

In <table> there is the <caption> tag. For images, there is <figure>...<figcaption>.

According to MDN the <figure> tag can be used with other things, but it does not explain using it with code blocks.

This does say it can be used with <code> blocks https://webdesign.tutsplus.com/quick-tip-consider-wrapping-your-code-with-a-figure-element--cms-21646t

Hence, this is handled somewhat okay:

<figure>
<pre><code class="json">
{
  "title": "Not Found",
  "status": 404,
  "detail": "Unrecognized URL"
}
</code></pre>
<figcaption>
<strong>Figure 3. Problem Example</strong>
</figcaption>
</figure>

But the code block is not properly highlighted, meaning highlight.js doesn't see it, and the code block is centered rather than left-aligned. On the positive side, the caption is part of the figure.

Here's a screen capture:

image

Further, why should a Markdown document have to switch to HTML for something so important?

The markdown-it-attrs package https://www.npmjs.com/package/markdown-it-attrs allows this:

\`\`\`python {data=asdf}
nums = [x for x in range(10)]
\`\`\`

Which outputs as this:

<pre><code data="asdf" class="language-python">
nums = [x for x in range(10)]
</code></pre>

But it appears this doesn't allow a caption. Trying this as the opening line of the code block in the document:

json {data=asdf caption="Figure 3. Problem Example"}

Results in this output:

<pre><code data="asdf" caption="Figure 3. Problem Example" class="language-json">

Hence, the output follows the pattern of the example. But the caption attribute does not show up anywhere in the rendered page, and further, highlight.js no longer recognizes that code block for highlighting.

image
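One direction that might be worth exploring is a custom fence renderer that reads the caption from the fence info string and emits the <figure>/<figcaption> wrapper itself. The sketch below covers only the string handling, assuming a caption="..." attribute as in the attempt above. The function names parseFenceInfo and renderCodeFigure are hypothetical, and wiring this into Markdown-IT's fence rule and the highlighter is not shown.

```javascript
// Escape text for safe inclusion in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Split a fence info string such as
//   json {data=asdf caption="Figure 3. Problem Example"}
// into the language name and an optional caption.
function parseFenceInfo(info) {
  const lang = (info.trim().match(/^[^\s{]+/) || [''])[0];
  const caption = info.match(/caption="([^"]*)"/);
  return { lang, caption: caption ? caption[1] : null };
}

// Render the code block, wrapping it in <figure> only when a caption exists.
function renderCodeFigure(info, code) {
  const { lang, caption } = parseFenceInfo(info);
  const cls = lang ? ` class="language-${escapeHtml(lang)}"` : '';
  const block = `<pre><code${cls}>${escapeHtml(code)}</code></pre>`;
  if (caption === null) return block;
  return `<figure>${block}<figcaption>${escapeHtml(caption)}</figcaption></figure>`;
}

console.log(renderCodeFigure(
  'json {data=asdf caption="Figure 3. Problem Example"}',
  '{ "status": 404 }'
));
```

Keeping the class as language-json, and leaving the caption out of the <code> attributes, is meant to keep the block recognizable to the highlighter. Whether highlight.js then picks it up in this pipeline is untested.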

@robogeek
Contributor Author

robogeek commented Dec 23, 2024

There are multiple choices for embedding UML or other diagrams in Markdown documents. One choice is using Draw.io to draw the diagram, export it as a PNG, then embed the PNG.

PlantUML has the advantage of a textual description in the Markdown, and embedding the actual diagram as an SVG which can then be manipulated using CSS or JavaScript at run time.

https://mermaid.js.org/ - Mermaid is similar to PlantUML, a textual description of diagrams. The documentation for Mermaid is far better than PlantUML's. But it is not clear how to implement Mermaid in the same way; it may be that Mermaid can only be rendered by browser-side code. If so, a page using Mermaid must have JavaScript at the bottom to render the Mermaid drawings in the browser, and the document can be printed to PDF only after the drawings are rendered.

There are multiple Mermaid plugins for Markdown-IT. This one appears to be the most up-to-date: https://www.npmjs.com/package/@liradb2000/markdown-it-mermaid

https://www.npmjs.com/package/@mermaid-js/mermaid-cli is a command-line tool for rendering Mermaid documents to SVG. Ergo, a directory tree of Mermaid documents could exist, with this tool used to render each to SVG, which can then be included in a Markdown document using an <img> tag. But I see Puppeteer in the dependencies, and therefore behind the scenes Puppeteer is used for this rendering.

https://www.npmjs.com/package/markdown-it-textual-uml This package supports PlantUML, Mermaid, and two other packages for rendering drawings.

Mermaid can be used on GitHub

sequenceDiagram
    Alice->>Bob: Hello Bob, how are you ?
    Bob->>Alice: Fine, thank you. And you?
    create participant Carl
    Alice->>Carl: Hi Carl!
    create actor D as Donald
    Carl->>D: Hi!
    destroy Carl
    Alice-xCarl: We are too many
    destroy Bob
    Bob->>Alice: I agree

One way that might work to support Mermaid is to ignore the Markdown-IT plugin. Instead, write a custom tag <use-mermaid> that can be added to any document, or else add something to the frontmatter. The goal is for Mermaid to be included only on pages that contain Mermaid diagrams. Or, instead of a flag, search for <pre> or <code> blocks that reference Mermaid.

The Mermaid code should not execute on every page, but only on pages where it is needed. Therefore it would be incorrect for an AkashaCMS project to use addFooterJavascript to add Mermaid because that would happen on every page.
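The search-based variant could be sketched as follows. This assumes diagrams appear either as fenced blocks rendered with class language-mermaid (Markdown-IT's default prefix) or as elements with class mermaid (the class Mermaid looks for by default). The function names and the script path /vendor/mermaid-init.js are hypothetical placeholders, not existing AkashaCMS features.

```javascript
// Return true when the rendered HTML contains a Mermaid diagram block.
function pageNeedsMermaid(html) {
  const tagRe = /<(?:pre|code|div)\b[^>]*\bclass="([^"]*)"/gi;
  for (const match of html.matchAll(tagRe)) {
    const classes = match[1].split(/\s+/);
    if (classes.includes('mermaid') || classes.includes('language-mermaid')) {
      return true;
    }
  }
  return false;
}

// Add the script tag just before </body>, but only on pages that need it.
function addMermaidIfNeeded(html, scriptTag) {
  if (!pageNeedsMermaid(html)) return html;
  const idx = html.lastIndexOf('</body>');
  return idx === -1
    ? html + scriptTag
    : html.slice(0, idx) + scriptTag + html.slice(idx);
}

// Hypothetical script location, for illustration only.
const mermaidScript = '<script src="/vendor/mermaid-init.js"></script>';
const page = '<body><pre><code class="language-mermaid">graph TD; A-->B;</code></pre></body>';
console.log(addMermaidIfNeeded(page, mermaidScript));
```

Scanning the rendered HTML avoids the need for authors to remember a flag, at the cost of one extra pass over each page.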

@robogeek
Contributor Author

Prism is an alternative to Highlight.js for code highlighting.

https://prismjs.com/ -- Package home page

https://www.npmjs.com/package/markdown-it-prism - Markdown-IT plugin

While it appears Prism is meant for browser-side highlighting, it also supports server-side use on Node.js.

@robogeek
Contributor Author

Issues found on techsparx.com ...

First - Home page thumbnails were gigantic. This turned out to be because LESS to CSS compilation didn't work right. This had been observed with the PDF production but not recognized. This is now fixed.

Second - Directly calling getRandomProduct in plugin-affiliate did not return a product, even though inserted console.log statements showed that a product was selected. The cause: getRandomProduct is now async and cannot be changed, but Nunjucks (NJK) renders templates synchronously and does not work when calling an async function. This necessitates using a custom element.

The fix is to change this existing element to call getRandomProduct:

<affiliate-product-link
        random="yes"
        href="/lifestyle/geek-gear.html">
</affiliate-product-link>

Adding random=yes causes it to select a random product.
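The underlying problem can be reproduced without any template engine. In this sketch the async getRandomProduct and the renderSync function are simplified stand-ins invented for illustration, not the real plugin-affiliate or Nunjucks code.

```javascript
// Stand-in for the plugin function: being async, it returns a Promise.
async function getRandomProduct() {
  return { name: 'Example Product', href: '/lifestyle/geek-gear.html' };
}

// Stand-in for a synchronous template: it reads the fields immediately.
function renderSync(context) {
  return `<a href="${context.product.href}">${context.product.name}</a>`;
}

// Calling the async function directly hands the renderer a pending
// Promise, so the product fields are undefined at render time.
const broken = renderSync({ product: getRandomProduct() });
console.log(broken);

// Resolving the Promise first, as a custom element handled in an async
// processing step can do, gives the renderer the actual product.
async function renderWithProduct() {
  const product = await getRandomProduct();
  return renderSync({ product });
}

renderWithProduct().then((html) => console.log(html));
```

This is why the console.log inside the function showed a product while the template output showed none: the selection happens, but only after the synchronous render has already finished.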
