Don't emit unnecessary classes in HTML tables #9325

Closed
ThomasSoeiro opened this issue Jan 10, 2024 · 15 comments

Comments

@ThomasSoeiro
Contributor

ThomasSoeiro commented Jan 10, 2024

It is currently not possible to prevent pandoc from adding attributes to the HTML output from a Markdown input (e.g. .header, .odd, .even in the ReprEx below). It is only possible to drop attributes using filters.

Since neither the CommonMark nor the GitHub Flavored Markdown spec mentions default attributes in HTML output, shouldn't this be opt-in? Or at least possible to opt out?

ReprEx

Using e.g. this input:

| foo | bar |
| --- | --- |
| baz | bim |
| baz | bim |

And converting to HTML using pandoc --from gfm --to html5, we get:

<table>
<thead>
<tr class="header">
<th>foo</th>
<th>bar</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>baz</td>
<td>bim</td>
</tr>
<tr class="even">
<td>baz</td>
<td>bim</td>
</tr>
</tbody>
</table>
@jgm
Owner

jgm commented Jan 10, 2024

These are harmless; just don't add CSS rules that do things with them.

I don't think adding the option to avoid these is worth the increase in complexity.

@jgm
Owner

jgm commented Jan 10, 2024

Further note: the commonmark spec says

Note that not every feature of the HTML samples is mandated by the spec. For example, the spec says what counts as a link destination, but it doesn’t mandate that non-ASCII characters in the URL be percent-encoded. To use the automatic tests, implementers will need to provide a renderer that conforms to the expectations of the spec examples (percent-encoding non-ASCII characters in URLs). But a conforming implementation can use a different renderer and may choose not to percent-encode non-ASCII characters in URLs.

The spec is not about HTML output, it's about specifying how the commonmark document should be parsed into a structured document.

@ThomasSoeiro
Contributor Author

> These are harmless; just don't add CSS rules that do things with them.

I agree this is not a big deal. However, these are common class names that are likely to be used elsewhere in a project. We would have to either strip them in order to reuse the names, or fall back on less meaningful class names.

@jgm
Owner

jgm commented Jan 10, 2024

Agreed -- we could use something like table-header, even-row, odd-row.
Of course, it would be a backwards-incompatible change, so I'm not sure it's a good idea.

@ThomasSoeiro
Contributor Author

My concern was about dropping "unnecessary" classes to prevent name clashes, since we can easily select .header using thead and .odd/.even using variations of tbody tr:nth-child(2n).
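
As a rough check of that redundancy, here is a minimal Python sketch using only the standard library. It walks the ReprEx output and compares each row's class with the class implied by its position alone. The names `RowClassChecker` and `positional_class` are hypothetical helpers made up for this illustration; they are not part of pandoc.

```python
from html.parser import HTMLParser


class RowClassChecker(HTMLParser):
    """Record (section, 1-based position, class) for every <tr> start tag."""

    def __init__(self):
        super().__init__()
        self.section = None   # "thead" or "tbody" once inside a table
        self.position = 0     # row counter within the current section
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag in ("thead", "tbody"):
            self.section = tag
            self.position = 0
        elif tag == "tr":
            self.position += 1
            self.rows.append((self.section, self.position, dict(attrs).get("class")))


def positional_class(section, position):
    """Class implied by position: thead row -> header, tbody rows alternate."""
    if section == "thead":
        return "header"
    # Odd positions match tr:nth-child(2n+1), even positions tr:nth-child(2n)
    return "odd" if position % 2 == 1 else "even"


# The ReprEx output from the first comment
HTML = """<table>
<thead>
<tr class="header"><th>foo</th><th>bar</th></tr>
</thead>
<tbody>
<tr class="odd"><td>baz</td><td>bim</td></tr>
<tr class="even"><td>baz</td><td>bim</td></tr>
</tbody>
</table>"""

checker = RowClassChecker()
checker.feed(HTML)
redundant = all(cls == positional_class(sec, pos) for sec, pos, cls in checker.rows)
print(redundant)  # prints True
```

For this table every emitted class can be recovered from structure alone, which is what thead tr and tbody tr:nth-child(...) rely on in CSS.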

Anyway, if you think these classes are useful, you can close the issue as is.

Thanks a lot for your work on Pandoc!

@jgm
Owner

jgm commented Jan 11, 2024

> My concern was for the sake of dropping "unnecessary" classes to prevent name clashes since we can easily select .header using thead and .odd/even using variations of tbody tr:nth-child(2n).

This is true, though it wasn't in earlier versions of pandoc (when nth-child wasn't supported in CSS and we didn't put the header in thead!).

I think it might be worth stopping using these classes, and using the alternative you suggest instead.

jgm changed the title from "Option to prevent pandoc from adding attributes to the HTML output from a Markdown input" to "Don't emit unnecessary classes in HTML tables" on Jan 11, 2024
@tarleb
Collaborator

tarleb commented Jan 19, 2024

Could this be a "good first issue"?

@jgm
Owner

jgm commented Jan 19, 2024

Yes, it would be an easy one -- just have to change the HTML writer, the styles.html template, and some tests I think.

@ThomasSoeiro
Contributor Author

Could you point me to the HTML writer please?
(I'll have a look, but I know neither Pandoc internals nor Haskell...)

@jgm
Owner

jgm commented Jan 19, 2024

I think that if you can't find the HTML writer yourself, you're unlikely to be able to fix this issue, so I'll leave that as an exercise to the reader. :)

@gregdan3

gregdan3 commented Jun 7, 2024

For what it's worth, I just ran into this issue- I was using the classname header and I didn't want or expect the class to be on every table's first tr*. I would've used a lua filter to omit it, but I can't find a way to remove classes via a filter- I suspect that's related to #684?

But uh, my use case is incredibly silly.
I'm building my site with Pandoc, and I want it to be able to render readably on the Nintendo DS Browser. That browser throws out CSS rules for elements it doesn't recognize, and it doesn't recognize most semantic html elements including header- it does recognize CSS rules for classes though, so I (reasonably I thought) assigned a class header to the element header, and then moved all my header style rules to that class. And that worked great, until I spotted every first tr* with its content squashed to the left. Anyway, I'll just rename my header for now.

Funny enough, my silly usecase is exactly the one where I'd still want the odd/even/header classes to style, since I explicitly want to target older browsers that lack the CSS.

I'm only familiar with pandoc as a user, not a dev, but why not shunt these classes into an extension rather than remove them entirely?

*: corrected

@bpj

bpj commented Jun 7, 2024

Can you give an example of your (markdown?) source? If I understand you correctly you are adding a class .header to your heading elements like

## Heading {.header}

which conflicts with the header class which Pandoc adds automatically to <thead> elements in HTML?

It does indeed seem like your options are

  1. to use another custom class name like .heading [1]
  2. to post-process your HTML removing the header class from <thead> elements.

Obviously 1 is the easier option if possible.

Footnotes

  1. A class .heading has the advantage of being terminologically correct: tables have headers but sections have headings. Pandoc calling its heading class Header is a misnomer (which it is too late to change!)

@bpj

bpj commented Jun 7, 2024

This minimal Lua filter will change the class on all "Header" elements.

-- Predicate function to filter out 'header' class
local is_not_header_class = function(x)
  return 'header' ~= x
end

-- Global function to process heading elements
Header = function(head)
  -- -- Restrict to heading levels 2 through 4 (<h2>, <h3>, <h4>)
  -- if 2 > head.level or 4 < head.level then
  --   return nil -- leave unchanged
  -- end
  if head.classes:includes('header') then
    -- Make sure not to remove other classes!
    head.classes = head.classes:filter(is_not_header_class)
    head.classes:insert(1, 'heading')
    return head
  end
  -- Else leave unchanged
  return nil
end

@gregdan3

gregdan3 commented Jun 7, 2024

> If I understand you correctly you are adding a class .header to your heading elements

Ah, sorry, I'm adding a class to my header element, which is not one that pandoc emits. I'm substituting my generated markdown into an HTML template to make the base of every page, since there's a fair amount of web specific stuff that markdown doesn't want or need to do.

My template, trimmed some:

<!DOCTYPE HTML>
<html lang="en-US">
  <head>
    <!-- metadata -->
  </head>
  <div class="body">
    <div class="-header"> 
        <!-- initial page content -->
    </div>

    <div class="article">
      $for(include-before)$ $include-before$ $endfor$ $body$
      $for(include-after)$ $include-after$ $endfor$
    </div>

    <div class="footer">
        <!-- end of page content -->
    </div>

  </div>
</html>

Example page:

---
title: test page!
date: 2024-05-30
author: gregdan3
description: a secret test page for all my formatting
---

# Tables

|     center aligned     | left aligned  | right aligned | default alignment |
| :--------------------: | :------------ | ------------: | ----------------- |
|        Item1.1         | Item2.1       |       Item3.1 | Item4.1           |
| **_bold italic item_** | Item2.2       |       Item3.2 | `mono item`       |
|        Item1.3         | **bold item** |       Item3.3 | Item4.3           |
|        Item1.4         | Item2.4       |       Item3.4 | Item4.4           |

Gluing these together:

cat pages/test.md | pandoc --lua-filter=pandoc/filters.lua --from=markdown+yaml_metadata_block+wikilinks_title_after_pipe-definition_lists-smart \
        --template=templates/default.html \
        --metadata="directory:test.md" \
        -o build/test.html

And the result:

<!DOCTYPE html>
<html lang="en-US">
  <head>
    <!-- metadata -->
  </head>
  <div class="body">
    <div class="-header">
      <!-- initial page content -->
    </div>

    <div class="article">
       <h1 id="tables">Tables</h1>
<table>
<thead>
<tr class="header">
<th style="text-align: center;">center aligned</th>
<th style="text-align: left;">left aligned</th>
<th style="text-align: right;">right aligned</th>
<th>default alignment</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">Item1.1</td>
<td style="text-align: left;">Item2.1</td>
<td style="text-align: right;">Item3.1</td>
<td>Item4.1</td>
</tr>
<tr class="even">
<td style="text-align: center;"><strong><em>bold italic
item</em></strong></td>
<td style="text-align: left;">Item2.2</td>
<td style="text-align: right;">Item3.2</td>
<td><code>mono item</code></td>
</tr>
<tr class="odd">
<td style="text-align: center;">Item1.3</td>
<td style="text-align: left;"><strong>bold item</strong></td>
<td style="text-align: right;">Item3.3</td>
<td>Item4.3</td>
</tr>
<tr class="even">
<td style="text-align: center;">Item1.4</td>
<td style="text-align: left;">Item2.4</td>
<td style="text-align: right;">Item3.4</td>
<td>Item4.4</td>
</tr>
</tbody>
</table>
      
    </div>

    <div class="footer">
      <!-- end of page content -->
    </div>
  </div>
</html>

Also, I was mistaken before; it was the first tr being given the class header, not thead.

@bpj

bpj commented Jun 7, 2024

I think you need to post-process the HTML.

I would do it with either Mojo::DOM (Perl) or Beautiful Soup (Python). The two have pretty similar interfaces:

Perl code:

use 5.016;
use utf8;
use strict;
use warnings;
use warnings FATAL => 'utf8';
use autodie;

use Path::Tiny qw[path];
use Mojo::DOM;

my $file = path 'test.html';

my $html = $file->slurp_utf8;

my $dom = Mojo::DOM->new($html);

my $fix_classes = sub {
  my($elem) = @_;
  if ( 'header' eq $elem->{class} ) {
    delete $elem->{class};
  }
  else {
    $elem->{class} =~ s!\bheader\b!!;
  }
};

$dom->find('tr.header')->each($fix_classes);

$file->spew_utf8($dom);

Python code:

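from bs4 import BeautifulSoup

# Read and parse the generated HTML
with open('test.html', mode='r', encoding='UTF-8') as fh:
  soup = BeautifulSoup(fh.read(), 'html.parser')

for tr in soup.select('tr.header'):
  if 1 == len(tr['class']):
    # 'header' is the only class: drop the attribute entirely
    del tr['class']
  else:
    # Make sure not to remove other classes!
    tr['class'] = [c for c in tr['class'] if 'header' != c]

# Write the result back; note that prettify() re-indents the whole document
with open('test.html', mode='w', encoding='UTF-8') as fh:
  fh.write(soup.prettify())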
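
If installing an HTML parsing library is not an option, a rougher sketch using only the Python standard library is shown below. It assumes pandoc writes the row tags exactly as in the output earlier in this thread (`<tr class="header">`, `<tr class="odd">`, `<tr class="even">`, with no other attributes), and it strips all three classes rather than only header. A regular expression is not a general HTML parser, so treat this as a quick workaround under that assumption; `strip_row_classes` is a hypothetical helper name.

```python
import re

# Assumption: pandoc emits these row tags in exactly this form,
# with "class" as the only attribute and a single class value.
ROW_CLASS = re.compile(r'<tr class="(?:header|odd|even)">')


def strip_row_classes(html: str) -> str:
    """Replace pandoc's classed <tr> start tags with bare <tr> tags."""
    return ROW_CLASS.sub("<tr>", html)


sample = '<thead>\n<tr class="header">\n<th>foo</th>\n</tr>\n</thead>'
print(strip_row_classes(sample))
```

Rows carrying any other class, or more than one class, are left untouched, so classes you add yourself survive.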

5 participants