[bugfix] Fix unicode-unaware word boundary check in hashtags #1049

illfygli · 2022-11-14T18:11:01Z

Go's `\b` does not care for Unicode, and without lookahead in the regex engine, the workarounds got very ugly. So I replaced the regex with a parser that uses Go's Unicode functions directly.

The parser runs in O(n) time and performance should be comparable.

Might even be faster since I think this is less work, but I don't know how to benchmark. :)

I'm sure it's all sorts of unidiomatic but maybe it can be salvaged!

Hashtags that didn't work before were those ending in non-ASCII characters, like `#blå`, since `\b` would consider `l` followed by `å` to be a word boundary.
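For anyone who wants to see the failure mode in isolation, here is a minimal sketch. The pattern `hypotheticalTagRe` and the helper `firstTag` are made up for illustration and are not the regex this PR removes; the only thing relied on is that `\b` in Go's `regexp` package is an ASCII word boundary.

```go
package main

import (
	"fmt"
	"regexp"
)

// hypotheticalTagRe is an illustrative stand-in, NOT the regex removed by this PR.
// \pL and \pN match Unicode letters and numbers, but \b only treats
// ASCII [0-9A-Za-z_] as word characters.
var hypotheticalTagRe = regexp.MustCompile(`#[\pL\pN_]+\b`)

// firstTag returns the first match of hypotheticalTagRe in s, or "" if none.
func firstTag(s string) string {
	return hypotheticalTagRe.FindString(s)
}

func main() {
	for _, s := range []string{"#blue", "#blåbär", "#blå"} {
		fmt.Printf("%q -> %q\n", s, firstTag(s))
	}
}
```

With this pattern `#blue` and `#blåbär` match in full, because they end in an ASCII letter, while `#blå` is cut down to `#bl`: the engine backtracks to the last position it regards as a word boundary, which is between `l` and `å`.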

illfygli · 2022-11-14T18:18:40Z

Whoopz

Go `\b` does not care for Unicode, and without lookahead, the workarounds got very ugly. So I replaced the regex with a parser. The parser runs in O(n) time and performance should not be affected.
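The performance claim can be checked rather than guessed. Below is a rough sketch of one way to benchmark two approaches using `testing.Benchmark` from the standard library. Both `countTagsRegex` and `countTagsScan` are hypothetical stand-ins for "a regex" and "a single-pass scan"; neither is the code from this PR, so the numbers only say something about these two toy functions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"testing"
	"unicode"
)

// Hypothetical stand-in for a regex-based approach; not this PR's regex.
var tagRe = regexp.MustCompile(`#[\pL\pN]+`)

// countTagsRegex counts each '#' that is followed by at least one letter or number.
func countTagsRegex(s string) int {
	return len(tagRe.FindAllStringIndex(s, -1))
}

// countTagsScan produces the same count in one pass over the runes.
func countTagsScan(s string) int {
	n, afterHash := 0, false
	for _, r := range s {
		if afterHash && (unicode.IsLetter(r) || unicode.IsNumber(r)) {
			n++
		}
		afterHash = r == '#'
	}
	return n
}

func main() {
	input := strings.Repeat("hello #blå and #world ", 100)
	impls := []struct {
		name string
		fn   func(string) int
	}{
		{"regex", countTagsRegex},
		{"scan", countTagsScan},
	}
	for _, impl := range impls {
		fn := impl.fn
		// testing.Benchmark chooses b.N itself and reports time per operation.
		res := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				fn(input)
			}
		})
		fmt.Println(impl.name, res)
	}
}
```

In a repository the more usual form is a `func BenchmarkXxx(b *testing.B)` in a `_test.go` file, run with `go test -bench=.`; the standalone form above is just convenient for a quick comparison.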

NyaaaWhatsUpDoc · 2022-11-14T21:54:14Z

Unidiomatic? This is miles better than so much Go code I see, it's great!

internal/util/statustools.go

tsmethurst · 2022-11-14T21:59:32Z

Yeah what kim said. I'll read through this properly tomorrow but on a first read through it looks really good.

NyaaaWhatsUpDoc · 2022-11-14T22:00:08Z

Other than the one comment, I'm otherwise happy with this :), once it's sorted it'll be good to go imo

internal/util/statustools.go

NyaaaWhatsUpDoc · 2022-11-14T22:42:28Z

Alright then this looks good to me :). @tsmethurst you will probably want to catch up on the above conversations as well for some further context on things when you get to reviewing this

tsmethurst

Sick, I love it ❤️

Just a couple little comments :)

tsmethurst · 2022-11-15T09:54:15Z

internal/util/statustools.go

+	return r == '#' ||
+		unicode.IsSpace(r) ||
+		unicode.IsControl(r) ||
+		('&' != r && '/' != r && unicode.Is(unicode.Categories["Po"], r))


i think it's worth putting a comment here to explain this

illfygli · 2022-11-15

Added comments and made it a bit more lenient so e.g. `(#chungus)` will work.
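To illustrate why `(#chungus)` is affected: `(` is in Unicode category `Ps` (open punctuation), not `Po`, so the predicate in the hunk above does not treat it as a boundary. The sketch below copies that predicate as `quotedBoundary` and compares it with `lenientBoundary`, a hypothetical wider variant that accepts any punctuation category. The lenient version is an assumption made for illustration and may differ from what was actually committed.

```go
package main

import (
	"fmt"
	"unicode"
)

// quotedBoundary mirrors the predicate shown in the review hunk above.
func quotedBoundary(r rune) bool {
	return r == '#' ||
		unicode.IsSpace(r) ||
		unicode.IsControl(r) ||
		('&' != r && '/' != r && unicode.Is(unicode.Categories["Po"], r))
}

// lenientBoundary is a HYPOTHETICAL wider variant, not necessarily the PR's
// final code: it accepts every punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po)
// while keeping the '&' and '/' exclusions from the original.
func lenientBoundary(r rune) bool {
	return unicode.IsSpace(r) ||
		unicode.IsControl(r) ||
		('&' != r && '/' != r && unicode.IsPunct(r))
}

func main() {
	for _, r := range []rune{'(', '!', '/', ' ', 'å'} {
		fmt.Printf("%q quoted=%v lenient=%v\n", r, quotedBoundary(r), lenientBoundary(r))
	}
}
```

One thing to watch with `unicode.IsPunct` is that `_` is punctuation too (category `Pc`), so a parser that allows underscores inside tags would need to handle that case separately.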

tsmethurst · 2022-11-15T09:55:05Z

internal/util/statustools.go

+	inTag := false
+
+	for i, r := range text {
+		if r == '#' && isHashtagBoundary(prev) {


worth replacing these with a switch statement or nah? just a style thing

illfygli · 2022-11-15

Cool, I didn't know about `switch /* nothing here */ { ... }`!
I tried it, but then I couldn't do assignment in a case, so I left it as is instead of duplicating that bit. :)
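For readers unfamiliar with the tagless `switch` being discussed, here is a small sketch of a single-pass hashtag scanner written in that style. It is not the code from this PR: `findTags`, `isBoundary`, and `isTagRune` are hypothetical names, and the rules (letters and numbers only, no maximum length) are simplifying assumptions.

```go
package main

import (
	"fmt"
	"unicode"
)

// isBoundary and isTagRune are hypothetical helpers for this sketch only.
func isBoundary(r rune) bool {
	return unicode.IsSpace(r) || unicode.IsControl(r) || unicode.IsPunct(r)
}

func isTagRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsNumber(r)
}

// findTags returns the hashtag names (without '#') found in text,
// visiting each rune exactly once.
func findTags(text string) []string {
	var tags []string
	start := -1 // byte offset where the current tag's name begins, or -1
	prev := ' ' // treat the start of the text as a boundary
	for i, r := range text {
		switch {
		case r == '#' && isBoundary(prev):
			// '#' is one byte, so the name starts at the next offset.
			// Any tag still open here is empty, because prev is not a tag rune.
			start = i + 1
		case start >= 0 && !isTagRune(r):
			// The current tag just ended; keep it only if it is non-empty.
			if i > start {
				tags = append(tags, text[start:i])
			}
			start = -1
		}
		prev = r
	}
	if start >= 0 && len(text) > start {
		tags = append(tags, text[start:]) // tag runs to the end of the text
	}
	return tags
}

func main() {
	fmt.Println(findTags("(#chungus) and #blå, but not foo#bar"))
}
```

On the assignment point: each `case` takes only an expression, so there is no equivalent of `if v := f(); v > 0` inside a case. A `switch` can carry one init statement before the opening brace (`switch v := f(); { ... }`), but that applies to the whole statement rather than to individual cases.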

tsmethurst · 2022-11-15T15:03:29Z

Looks great to me now, good work! If you're happy with it i'll merge it :)

illfygli · 2022-11-15T15:04:24Z

Yes :D

tsmethurst · 2022-11-15T15:05:42Z

Boom! thank you!!

illfygli force-pushed the main branch 2 times, most recently from e8f6d58 to d2be1a0 Compare November 14, 2022 18:31

[bugfix] Fix unicode-unaware word boundary check in hashtag regex

3990f24

Go `\b` does not care for Unicode, and without lookahead, the workarounds got very ugly. So I replaced the regex with a parser. The parser runs in O(n) time and performance should not be affected.

illfygli force-pushed the main branch from d2be1a0 to 3990f24 Compare November 14, 2022 18:46

NyaaaWhatsUpDoc reviewed Nov 14, 2022

internal/util/statustools.go (resolved)

NyaaaWhatsUpDoc reviewed Nov 14, 2022

internal/util/statustools.go (outdated, resolved)

illfygli force-pushed the main branch 3 times, most recently from 9783c85 to a3f928c Compare November 14, 2022 22:39

illfygli force-pushed the main branch from a3f928c to cabc3c3 Compare November 14, 2022 22:52

tsmethurst reviewed Nov 15, 2022

[bugfix] Add back hashtag max length and add tests for it

3221613

illfygli force-pushed the main branch from cabc3c3 to 3221613 Compare November 15, 2022 14:44

tsmethurst merged commit 5210977 into superseriousbusiness:main Nov 15, 2022
