Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bugfix] Fix unicode-unaware word boundary check in hashtags #1049

Merged
merged 2 commits into from
Nov 15, 2022

Conversation

illfygli
Copy link
Contributor

@illfygli illfygli commented Nov 14, 2022

Go \b does not care for Unicode, and without lookahead in the regex engine, the workarounds got very ugly. So I replaced the regex with a parser that uses Go's Unicode functions directly.

The parser runs in O(n) time and performance should be comparable.

Might even be faster since I think this is less work, but I don't know how to benchmark. :)

I'm sure it's all sorts of unidiomatic but maybe it can be salvaged!

Hashtags that wouldn't work before were those ending in non-ASCII characters, like #blå, since \b would consider l followed by å to be a word boundary.

@illfygli
Copy link
Contributor Author

Whoopz

@illfygli illfygli force-pushed the main branch 2 times, most recently from e8f6d58 to d2be1a0 Compare November 14, 2022 18:31
Go `\b` does not care for Unicode, and without lookahead, the workarounds got
very ugly. So I replaced the regex with a parser.

The parser runs in O(n) time and performance should not be affected.
@NyaaaWhatsUpDoc
Copy link
Member

Unidiomatic? This is miles better than so much Go code I see, it's great!

@tsmethurst
Copy link
Contributor

Yeah what kim said. I'll read through this properly tomorrow but on a first read through it looks really good.

@NyaaaWhatsUpDoc
Copy link
Member

Other than the one comment, I'm otherwise happy with this :), once it's sorted it'll be good to go imo

@illfygli illfygli force-pushed the main branch 3 times, most recently from 9783c85 to a3f928c Compare November 14, 2022 22:39
@NyaaaWhatsUpDoc
Copy link
Member

Alright then this looks good to me :). @tsmethurst you will probably want to catch up on the above conversations as well for some further context on things when you get to reviewing this

Copy link
Contributor

@tsmethurst tsmethurst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sick, I love it ❤️

Just a couple little comments :)

return r == '#' ||
unicode.IsSpace(r) ||
unicode.IsControl(r) ||
('&' != r && '/' != r && unicode.Is(unicode.Categories["Po"], r))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's worth putting a comment here to explain this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments and made it a bit more lenient so e.g. (#chungus) will work.

inTag := false

for i, r := range text {
if r == '#' && isHashtagBoundary(prev) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth replacing these with a switch statement or nah? just a style thing

Copy link
Contributor Author

@illfygli illfygli Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool I didn't know about switch /* nothing here */ { ... }!
I tried it but then I cuoldn't do assignment in a case, so I left it like that instead of duplicating that bit. :)

@tsmethurst
Copy link
Contributor

Looks great to me now, good work! If you're happy with it i'll merge it :)

@illfygli
Copy link
Contributor Author

Yes :D

@tsmethurst tsmethurst merged commit 5210977 into superseriousbusiness:main Nov 15, 2022
@tsmethurst
Copy link
Contributor

Boom! thank you!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants