Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RubyConf] Create scrubber for replacing double breakpoints into paragraph nodes #284

Merged

Conversation

josecolella
Copy link
Contributor

@josecolella josecolella commented May 8, 2024

Why?

  • Users would like to have functionality within loofah that provides the ability to replace double breakpoints into paragraph nodes

What?

  • Creating a new scrubber (:double_breakpoint) that replaces double breakpoints into paragraph nodes (thank you @flavorjones )

How did we test?

  • Created a new integration test case.

Important

There is a failing test right now where the expectation and actual result match except for newline characters. In discussing with @flavorjones, this might be related to minitest and how it formats html

(ruby) doc.xpath("/html/body").inner_html
"<p>Some text here in a logical paragraph.</p><p>Some more text, apparently a second paragraph.</p><p>Et cetera...</p>"
(ruby) BREAKPOINT_RESULT
"<p>Some text here in a logical paragraph.</p><p>Some more text, apparently a second paragraph.</p><p>Et cetera...</p>"
(ruby) BREAKPOINT_RESULT == doc.xpath("/html/body").inner_html

References #279

@josecolella josecolella changed the title Jc th add breakpoint scrubber [RubyConf] Create scrubber for replacing double breakpoints into paragraph nodes May 8, 2024
@josecolella josecolella marked this pull request as ready for review May 8, 2024 19:08
@flavorjones
Copy link
Owner

@torihuang @josecolella Thanks again for this PR!

Looking at the failing tests, it seems like the HTML4 and the HTML5 parser handle newlines differently, and that's causing a failure one way or the other depending on whether there are newlines in the expected result or not.

Demonstration:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "loofah"
end

input = "<html><body><h1>Hello</h1><div>World</div></body></html>"

doc = Nokogiri::HTML4::Document.parse(input)
doc.at_css("h1").add_next_sibling("<b>there</b>")
doc.at_css("body").inner_html
# => "<h1>Hello</h1>\n" + "<b>there</b><div>World</div>"

doc = Nokogiri::HTML4::Document.parse(input)
doc.at_css("h1").add_next_sibling("<p>there</p>")
doc.at_css("body").inner_html
# => "<h1>Hello</h1>\n" + "<p>there</p>\n" + "<div>World</div>"

doc = Nokogiri::HTML5::Document.parse(input)
doc.at_css("h1").add_next_sibling("<b>there</b>")
doc.at_css("body").inner_html
# => "<h1>Hello</h1><b>there</b><div>World</div>"

doc = Nokogiri::HTML5::Document.parse(input)
doc.at_css("h1").add_next_sibling("<p>there</p>")
doc.at_css("body").inner_html
# => "<h1>Hello</h1><p>there</p><div>World</div>"

In any case, now that I understand what's going on, I'll wrap this up by tomorrow!

@josecolella
Copy link
Contributor Author

Sounds good. Thanks @flavorjones

@josecolella
Copy link
Contributor Author

@flavorjones were you still interested in getting this merged?

@flavorjones
Copy link
Owner

@josecolella Totally! I've been really distracted the last few weeks, but I will absolutely circle back on this.

@josecolella
Copy link
Contributor Author

@flavorjones Any update here?

@flavorjones
Copy link
Owner

@josecolella Really sorry for the delay. This was harder than I expected to wrap up (at least in a way that I didn't think was gross). I will be spending a few weeks (at my new job!) on the sanitizer stack starting in late October and will do my best to get this merged then.

@flavorjones flavorjones force-pushed the jc-th-add-breakpoint-scrubber branch from 59aac52 to 5ac043d Compare December 31, 2024 21:35
josecolella and others added 5 commits December 31, 2024 17:05
because the html4 and html5 parsers just handle tags and insert
newlines differently, and their presence/absence is orthogonal to
wrapping in a `p` tag.
@flavorjones flavorjones force-pushed the jc-th-add-breakpoint-scrubber branch from 5ac043d to 4d94183 Compare December 31, 2024 22:10
@flavorjones
Copy link
Owner

@torihuang and @josecolella -- Sorry this took so long to get back to! Thank you so much for the work, and for your patience.

Merging, will be in v2.24.0 shortly.

@flavorjones flavorjones merged commit 2abdafc into flavorjones:main Jan 1, 2025
13 checks passed
@josecolella josecolella deleted the jc-th-add-breakpoint-scrubber branch January 1, 2025 18:26
@josecolella
Copy link
Contributor Author

Thank you @flavorjones for your time in getting this merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants