`pre` tag keep format #10

ricleal · 2019-03-31T15:18:40Z

I'm wondering whether there's an option to preserve formatting (e.g. spaces, tabs, etc.) inside the `pre` tag.
Thanks!

matthiask · 2019-03-31T17:57:20Z

Currently there isn't. Whitespace normalization is the first thing that happens, before anything else, before even parsing the HTML fragment:

html-sanitizer/html_sanitizer/sanitizer.py

Line 194 in c914481

html = normalize_overall_whitespace(html)

That being said, now that we have a real test suite (feincms-cleanse didn't have good coverage), I wouldn't be against normalizing whitespace selectively, as long as nothing else changes, that is, as long as the only differences are whitespace changes with no visible effect. This would probably mean normalizing elem.text and elem.tail of all elements except for a few where normalization would be skipped.
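For readers following along, here is a minimal sketch of that idea. It is not html-sanitizer's code: it uses the standard library's `xml.etree.ElementTree` instead of the parser the package uses, and the names `PRESERVE_WHITESPACE`, `collapse` and `normalize_tree` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical set of tags whose whitespace is left untouched (an assumption).
PRESERVE_WHITESPACE = {"pre", "textarea"}


def collapse(text):
    """Collapse runs of whitespace into one space; None and '' pass through."""
    return re.sub(r"\s+", " ", text) if text else text


def normalize_tree(elem, preserve=False):
    """Normalize elem.text and child tails, skipping whitespace-preserving subtrees."""
    inside = preserve or elem.tag in PRESERVE_WHITESPACE
    if not inside:
        elem.text = collapse(elem.text)
    for child in elem:
        normalize_tree(child, inside)
        # A child's tail is rendered inside the parent, so the parent's
        # status decides whether the tail gets normalized.
        if not inside:
            child.tail = collapse(child.tail)


root = ET.fromstring(
    "<div><p>a   b\n c</p>  <pre>x\n\ty  <b>z  z</b>  w</pre>  tail</div>"
)
normalize_tree(root)
print(ET.tostring(root, encoding="unicode"))
```

Note that the tail of `<pre>` itself is still normalized, because that text sits outside the element; only text and tails inside it are skipped.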

ricleal · 2019-04-01T11:18:32Z

Thanks a lot for the reply. It looks like it's not implemented yet but there's a way forward. 👍

mirukana · 2019-04-21T08:43:41Z

html-sanitizer/html_sanitizer/sanitizer.py

Line 194 in c914481

html = normalize_overall_whitespace(html)

Commenting out this line doesn't seem to break any tests at first glance, but I see this function removes more kinds of whitespace than normalize_whitespace_in_text_or_tail.
Would handling these additional kinds of whitespace in that function be enough to no longer need normalize_overall_whitespace, as a first step?
Thanks for the great package by the way! This is the last issue I've had with it.

matthiask · 2019-04-21T08:54:02Z

I think that the line does some things which are worthwhile, such as normalizing various forms of whitespace. It could do this without collapsing whitespace though; that decision could be left to normalize_whitespace_in_text_or_tail. This in turn could run only if we aren't inside <pre> at the moment (or, more generally, inside any of a set of whitespace-preserving tags). Note that HTML elements are processed from the end to the beginning and from the inside out, so you'd have to peek ahead in the backlog deque to find out whether we are inside such an element or not.
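A rough sketch of the split described above: one step that unifies the various whitespace representations without collapsing them, and a separate collapsing step that a caller could skip inside whitespace-preserving tags. The mapping and both function names are hypothetical and are not taken from html-sanitizer; in particular, mapping carriage returns to `\n` rather than to a space is an assumption made here so that line breaks survive.

```python
import re

# Hypothetical one-to-one mapping: every exotic form becomes exactly one
# plain character, so the amount of whitespace is preserved.
REPLACEMENTS = {
    "&nbsp;": " ", "&#160;": " ", "&#xa0;": " ", "\xa0": " ",
    "&#10;": "\n", "&#xa;": "\n",
    "\r\n": "\n", "&#13;": "\n", "&#xd;": "\n", "\r": "\n",
}

# Longest keys first so that "\r\n" is matched before a lone "\r".
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True)),
    re.IGNORECASE,
)


def unify_whitespace(html):
    """Replace exotic whitespace forms without collapsing runs."""
    return _PATTERN.sub(lambda match: REPLACEMENTS[match.group(0).lower()], html)


def collapse_whitespace(text):
    """The collapsing step, to be applied only outside whitespace-preserving tags."""
    return re.sub(r"\s+", " ", text)


sample = "a&nbsp;&nbsp;b\r\n\tc\xa0d"
unified = unify_whitespace(sample)
print(repr(unified))
print(repr(collapse_whitespace(unified)))
```

The sketch does not cover how to decide whether the current element is inside `<pre>`; that depends on the sanitizer's own traversal order.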

I'm having a hard time constructing a non-artificial test case which fails without normalize_overall_whitespace when copy-pasting content from various sources. This may be because modern rich text widgets for the web and/or browsers' contenteditable implementations are smarter than they were 10 years ago and generally don't produce that ugly HTML anymore. I'm still reluctant to just remove all upfront whitespace normalization though. This is a piece of badly tested code, but code with a long legacy... in fact, part of it has been in use for almost 10 years now (feincms/feincms@0186b47)

matthiask · 2019-04-21T08:55:51Z

Oh, reading your comment again: I think it might work to just move the functionality inside normalize_whitespace_in_text_or_tail but there may be an interaction with only_whitespace_re where some elements might not be dropped anymore -- but that's just conjecture, I haven't taken a close look at the code.

matthiask added the enhancement label Apr 21, 2019
