Create decoder for HTML entities #2563

rgmz · 2024-03-10T15:55:51Z

Description:

This creates a decoder to handle HTML entities. Tests pass, but the implementation may not be the most efficient.

This fixes #2231.

Checklist:

Tests passing (make test-community)?
Lint passing (make lint; this requires golangci-lint)?

dustin-decker · 2024-03-20T15:51:29Z

pkg/decoders/html_entity.go

+
+	if matched {
+		decodableChunk := &DecodableChunk{
+			DecoderType: detectorspb.DecoderType_ESCAPED_UNICODE,


Should this be a new decoder type?

dustin-decker · 2024-03-20T15:51:36Z

I think we've reached the point where we should consider adding a --enabled/disabled-decoders flag, similar to what we have for detectors. This one seems pretty impactful on performance in its current state, and many data sources might not benefit much from it.

One potential improvement might be to implement this as a handler and do identification of the whole file before decoding and chunking it out.

rgmz · 2024-03-21T03:11:36Z

I do worry about the impact of having too many decoders. At a minimum, having something like ahocorasick might be more efficient than checking regexp.Match() against each chunk.
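To illustrate the cost concern, here is a minimal sketch of a cheap prefilter that skips the regular expression for chunks that cannot contain an entity. It is a simpler stand-in for the Aho-Corasick idea, not an Aho-Corasick implementation, and not the code in this PR. The names `entityPat`, `mayContainEntity`, and `hasEntity` and the pattern itself are assumptions made for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// entityPat is an illustrative pattern (assumption), not the one used in this PR.
// It matches numeric (&#38; or &#x26;) and named (&amp;) entities.
var entityPat = regexp.MustCompile(`&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);`)

// mayContainEntity is a hypothetical prefilter: an entity needs an '&'
// followed later by a ';', so chunks without that pair can be skipped.
func mayContainEntity(chunk []byte) bool {
	amp := bytes.IndexByte(chunk, '&')
	if amp < 0 {
		return false
	}
	// chunk[amp] is '&', so a positive index means a ';' appears after it.
	return bytes.IndexByte(chunk[amp:], ';') > 0
}

// hasEntity only runs the regular expression when the prefilter passes.
func hasEntity(chunk []byte) bool {
	return mayContainEntity(chunk) && entityPat.Match(chunk)
}

func main() {
	fmt.Println(hasEntity([]byte("a=1&amp;b=2")))
	fmt.Println(hasEntity([]byte("no entities here")))
}
```

Whether this helps in practice would depend on how often chunks contain a stray `&`, so it would need benchmarking against real sources.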

> One potential improvement might be to implement this as a handler and do identification of the whole file before decoding and chunking it out.

While I think identifying the mimetype of a file would be a great addition (and make way for other enhancements), I'm not sure how much it would help in this case. HTML, Markdown, and AsciiDoc files are obviously sources that would benefit, but HTML-encoded content can show up in weird places like config files, .txt files, or source code.

This decoder was actually inspired by #1550; I found several live connection strings that were not detected by TruffleHog because they contained the encoded &amp; instead of a literal &.

mongodb://dave:password@localhost:27017/?authMechanism=DEFAULT&amp;authSource=db&amp;ssl=true&quot;
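For illustration, a minimal sketch of what decoding that string looks like with Go's standard `html.UnescapeString`. The `decodeEntities` helper is hypothetical and is not the decoder added in this PR, which may work differently.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// decodeEntities is a hypothetical helper (assumption), not the PR's decoder.
// It returns the unescaped text and whether decoding changed anything.
func decodeEntities(s string) (string, bool) {
	// Every HTML entity starts with '&', so skip strings without one.
	if !strings.Contains(s, "&") {
		return s, false
	}
	decoded := html.UnescapeString(s)
	return decoded, decoded != s
}

func main() {
	in := "mongodb://dave:password@localhost:27017/?authMechanism=DEFAULT&amp;authSource=db&amp;ssl=true&quot;"
	out, changed := decodeEntities(in)
	fmt.Println(changed)
	fmt.Println(out)
}
```

After decoding, the query parameters are separated by a literal `&`, which is the form a connection-string detector would be expected to match.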

rgmz requested a review from a team as a code owner March 10, 2024 15:55

dustin-decker reviewed Mar 20, 2024

rgmz marked this pull request as draft April 14, 2024 14:21

rgmz force-pushed the feat/html-decoder branch from 79050b1 to 2bb2410 Compare June 5, 2024 00:41

rgmz force-pushed the feat/html-decoder branch 2 times, most recently from 721ba1d to 0512f94 Compare June 21, 2024 02:54

rgmz force-pushed the feat/html-decoder branch 2 times, most recently from 6a98dcc to 1180b27 Compare July 1, 2024 18:38

rgmz force-pushed the feat/html-decoder branch 3 times, most recently from 729714d to d612f5b Compare November 8, 2024 14:01

rgmz force-pushed the feat/html-decoder branch from d612f5b to 4df5b0e Compare November 11, 2024 19:22

rgmz force-pushed the feat/html-decoder branch 3 times, most recently from ca46f5d to 45eb1ed Compare December 2, 2024 14:01

rgmz marked this pull request as ready for review December 2, 2024 14:02

rgmz requested review from a team as code owners December 2, 2024 14:02

rgmz force-pushed the feat/html-decoder branch 2 times, most recently from 3676e9b to cb4c962 Compare December 21, 2024 16:06

rgmz force-pushed the feat/html-decoder branch 2 times, most recently from 6083804 to ac02868 Compare December 31, 2024 14:57

rgmz force-pushed the feat/html-decoder branch from ac02868 to fd211ad Compare January 11, 2025 19:01

rgmz force-pushed the feat/html-decoder branch from fd211ad to 607c49c Compare January 20, 2025 14:56

feat(decoders): HTML entities

1af25ee

rgmz force-pushed the feat/html-decoder branch from 607c49c to 1af25ee Compare January 27, 2025 14:11
