[Bug] Conversion of HTML manual pages to markdown fails for HTML figure code #4864

Open
4 tasks
neteler opened this issue Dec 20, 2024 · 7 comments
Assignees: neteler
Labels: bug (Something isn't working), docs, HTML (Related code is in HTML), manual (Documentation related issues), markdown
Milestone: 8.5.0

Comments

@neteler
Member

neteler commented Dec 20, 2024

Describe the bug

I am working on the mass conversion of all HTML manual pages to markdown. To convert all HTML files to markdown, I have written a pandoc-based converter script (see #4620), which already does most of the job.

A showstopper in the conversion of HTML manual pages to markdown is the figures: the related HTML snippets vary from manual page to manual page, even though there is a style recommendation.

For an easier discussion, I have moved the figure issue here to separate it out from #4748.

Many figures look ugly after MD conversion (the resulting MD code is partially garbage):

  • v.fill.holes.html figures
  • v.to.rast3.html figure
  • ... many more
  • often the figure captions are not properly detected: mkdocs/site/raster3dintro.html

I have written a Lua filter for pandoc (not yet submitted), but it can only convert one specific variant of the HTML figure code. With so many HTML variants, I have no idea how to handle them all.
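
One way to cope with many figure variants could be to ignore the wrapper markup and only pick out the image and whatever text surrounds it. The following is a minimal Python sketch of that idea, not the unsubmitted Lua filter: the class `FigureExtractor`, the function `figure_to_markdown`, and both sample snippets (including their file names) are hypothetical and only illustrate two plausible variants.

```python
from html.parser import HTMLParser


class FigureExtractor(HTMLParser):
    """Collect the first <img> and all visible text of one figure snippet."""

    def __init__(self):
        super().__init__()
        self.src = None
        self.alt = ""
        self.caption_parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first image is used; wrapper tags (div, center, ...) are ignored.
        if tag == "img" and self.src is None:
            attributes = dict(attrs)
            self.src = attributes.get("src")
            self.alt = attributes.get("alt") or ""

    def handle_data(self, data):
        # Collapse whitespace and keep any non-empty text as caption material.
        text = " ".join(data.split())
        if text:
            self.caption_parts.append(text)


def figure_to_markdown(snippet):
    """Return a markdown image plus italic caption, or None if no <img> is found."""
    parser = FigureExtractor()
    parser.feed(snippet)
    parser.close()
    if parser.src is None:
        return None
    caption = " ".join(parser.caption_parts)
    # Fall back to the caption when the image has no alt text.
    lines = [f"![{parser.alt or caption}]({parser.src})"]
    if caption:
        lines.extend(["", f"*{caption}*"])
    return "\n".join(lines)


# Hypothetical snippets, not copied from any manual page.
VARIANT_A = (
    '<div align="center" style="margin: 10px">\n'
    '<img src="example_a.png" alt="Example A" border="0"><br>\n'
    "<i>Figure: Example output</i>\n"
    "</div>"
)
VARIANT_B = '<center><img src="example_b.png"><br>Figure: Example output</center>'

if __name__ == "__main__":
    for variant in (VARIANT_A, VARIANT_B):
        print(figure_to_markdown(variant))
        print()
```

Because the result contains no inline HTML, it would not trigger MD033, at the cost of losing alignment and size attributes. Captions that mix several inline tags would need more careful text joining than this sketch does.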

To reproduce

I tried to submit the converted MD files for community review but I get stuck in the pre-commit stage:

From my terminal:

markdownlint-fix.........................................................Failed
- hook id: markdownlint-fix
- exit code: 1
- files were modified by this hook
display/d.rast/d.rast.md:14:1 MD033/no-inline-html Inline HTML [Element: div]
display/d.rast/d.rast.md:16:1 MD033/no-inline-html Inline HTML [Element: img]
display/d.rast/d.rast.md:29:1 MD033/no-inline-html Inline HTML [Element: div]
display/d.rast/d.rast.md:31:1 MD033/no-inline-html Inline HTML [Element: img]
display/d.rast/d.rast.md:43:1 MD033/no-inline-html Inline HTML [Element: div]
display/d.rast/d.rast.md:45:1 MD033/no-inline-html Inline HTML [Element: img]
gui/wxpython/docs/wxGUI.toolboxes.md:180:1 MD033/no-inline-html Inline HTML [Element: img]
gui/wxpython/timeline/g.gui.timeline.md:14:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.li/r.li.cwed/r.li.cwed.md:12:6 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:12:26 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:14:6 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:14:26 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:21:1 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.mpa/r.li.mpa.md:10:6 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.mpa/r.li.mpa.md:10:26 MD033/no-inline-html Inline HTML [Element: span]
raster/r.path/r.path.md:122:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.path/r.path.md:124:2 MD033/no-inline-html Inline HTML [Element: img]
raster/r.path/r.path.md:176:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.path/r.path.md:178:2 MD033/no-inline-html Inline HTML [Element: img]
raster/r.resamp.filter/r.resamp.filter.md:98:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.resamp.filter/r.resamp.filter.md:100:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.sim/r.sim.water/r.sim.water.md:30:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.sim/r.sim.water/r.sim.water.md:32:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.sim/r.sim.water/r.sim.water.md:154:81 MD013/line-length Line length [Expected: 80; Actual: 147]
raster/r.sim/r.sim.water/r.sim.water.md:168:81 MD013/line-length Line length [Expected: 80; Actual: 95]
raster/r.sim/r.sim.water/r.sim.water.md:175:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.sim/r.sim.water/r.sim.water.md:177:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.sunmask/r.sunmask.md:89:81 MD013/line-length Line length [Expected: 80; Actual: 96]
raster/r.univar/r.univar.md:59:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.univar/r.univar.md:61:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.univar/r.univar.md:187:1 MD033/no-inline-html Inline HTML [Element: div]
...
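
To see whether the failures follow a pattern, the hook output above can be tallied by rule and by HTML element. A minimal Python sketch, assuming the log format shown here; the function name `summarize` and the regular expressions are illustrative, and the sample consists of four lines copied from the output above.

```python
import re
from collections import Counter

# Matches lines such as:
# display/d.rast/d.rast.md:14:1 MD033/no-inline-html Inline HTML [Element: div]
LINE_RE = re.compile(
    r"^(?P<path>\S+\.md):(?P<line>\d+)(?::(?P<col>\d+))?\s+"
    r"(?P<rule>MD\d+)/(?P<alias>[\w-]+)\s+(?P<detail>.*)$"
)
ELEMENT_RE = re.compile(r"\[Element: (\w+)\]")


def summarize(log_text):
    """Count findings per rule and, for MD033, per HTML element."""
    rules = Counter()
    elements = Counter()
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip hook headers, "..." and blank lines
        rules[match["rule"]] += 1
        if match["rule"] == "MD033":
            found = ELEMENT_RE.search(match["detail"])
            if found:
                elements[found.group(1)] += 1
    return rules, elements


SAMPLE = """\
display/d.rast/d.rast.md:14:1 MD033/no-inline-html Inline HTML [Element: div]
display/d.rast/d.rast.md:16:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.li/r.li.cwed/r.li.cwed.md:12:6 MD033/no-inline-html Inline HTML [Element: span]
raster/r.sunmask/r.sunmask.md:89:81 MD013/line-length Line length [Expected: 80; Actual: 96]
"""

if __name__ == "__main__":
    rules, elements = summarize(SAMPLE)
    print(dict(rules))
    print(dict(elements))
```

In the excerpt above, almost every finding is MD033 on a div, img, or span element, with a few MD013 line-length findings in between.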

Expected behavior

I wonder whether we have to edit the ~170 HTML files manually to streamline their HTML figure code, so that a single pandoc Lua filter can eventually be developed.

Support welcome!

@neteler neteler added bug Something isn't working manual Documentation related issues HTML Related code is in HTML docs markdown labels Dec 20, 2024
@neteler neteler added this to the 8.5.0 milestone Dec 20, 2024
@neteler neteler self-assigned this Dec 20, 2024
@echoix
Member

echoix commented Dec 20, 2024

Or, if we want to keep things moving, add an exclusion for now. Is there a pattern that could be used, or would it be impossible?

It's ok to not have them perfect on the first try.

neteler added a commit to neteler/grass that referenced this issue Dec 20, 2024
Test submission of conversion of all HTML manual pages to markdown using the `pandoc` based converter script (see OSGeo#4620).

For figure code conversion issues, see OSGeo#4864
@neteler
Member Author

neteler commented Dec 20, 2024

For easier inspection, converted MD files submitted in #4865.

@ninsbl
Member

ninsbl commented Dec 23, 2024

Maybe this python library by Microsoft could be worth a try: https://github.com/microsoft/markitdown ?

@echoix
Member

echoix commented Dec 23, 2024

> Maybe this python library by Microsoft could be worth a try: https://github.com/microsoft/markitdown ?

I didn't know about this one :)

@ninsbl
Member

ninsbl commented Dec 23, 2024

I just tried the markitdown tool on v.fill.holes.html, and the result looks quite OK. Images are bigger compared to the pandoc conversion. However, pymarkdownlnt and markdownlint-cli, for example, complain about line length and missing blank lines (amongst others)... Also, code blocks are not automatically marked as shell... So some post-processing would be needed there too...
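
As an illustration of what such post-processing might look like for the code-block point, here is a minimal Python sketch that adds a language tag to opening fences that have none. It assumes the converter emits fenced code blocks and that every untagged block is a shell snippet; neither is confirmed in this thread. The function `tag_bare_fences` and the sample text are hypothetical.

```python
FENCE = "`" * 3  # built this way so the sketch can itself sit in a fenced block


def tag_bare_fences(markdown, language="sh"):
    """Add `language` to opening code fences that carry no language tag."""
    out = []
    inside = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # Only an opening fence without any info string gets tagged.
            if not inside and stripped == FENCE:
                indent = line[: len(line) - len(line.lstrip())]
                line = indent + FENCE + language
            inside = not inside
        out.append(line)
    return "\n".join(out)


# Hypothetical converter output with one untagged code block.
SAMPLE = "\n".join(
    [
        "Example:",
        "",
        FENCE,
        "g.region -p",
        FENCE,
    ]
)

if __name__ == "__main__":
    print(tag_bare_fences(SAMPLE))
```

Running the function twice gives the same result as running it once, so it could be applied repeatedly to already processed files. Line length and blank-line findings would need separate handling.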

@neteler
Member Author

neteler commented Dec 27, 2024

I tried it as well, but had no success with, e.g., this file:

cd raster3d/r3.to.rast/
cat r3.to.rast.html | markitdown  
Traceback (most recent call last):
  File "/home/mneteler/.local/bin/markitdown", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mneteler/.local/lib/python3.12/site-packages/markitdown/__main__.py", line 38, in main
    result = markitdown.convert_stream(sys.stdin.buffer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mneteler/.local/lib/python3.12/site-packages/markitdown/_markitdown.py", line 1142, in convert_stream
    result = self._convert(temp_path, extensions, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mneteler/.local/lib/python3.12/site-packages/markitdown/_markitdown.py", line 1260, in _convert
    raise UnsupportedFormatException(
markitdown._markitdown.UnsupportedFormatException: Could not convert '/home/mneteler/tmp/tmplrsg96v_' to Markdown. The formats [] are not supported.

What's the trick, @ninsbl ?

@neteler
Member Author

neteler commented Jan 2, 2025

> I just tried the markitdown tool on v.fill.holes.html And the result looks quite OK.

@ninsbl would you mind sharing the command you used?

Projects
Development

No branches or pull requests

3 participants