-
-
Notifications
You must be signed in to change notification settings - Fork 349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bugfix] Fix unicode-unaware word boundary check in hashtags #1049
Conversation
Whoopz |
e8f6d58
to
d2be1a0
Compare
Go `\b` does not care for Unicode, and without lookahead, the workarounds got very ugly. So I replaced the regex with a parser. The parser runs in O(n) time and performance should not be affected.
Unidiomatic? This is miles better than so much Go code I see, it's great! |
Yeah what kim said. I'll read through this properly tomorrow but on a first read through it looks really good. |
Other than the one comment, I'm otherwise happy with this :), once it's sorted it'll be good to go imo |
9783c85
to
a3f928c
Compare
Alright then this looks good to me :). @tsmethurst you will probably want to catch up on the above conversations as well for some further context on things when you get to reviewing this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sick, I love it ❤️
Just a couple little comments :)
internal/util/statustools.go
Outdated
return r == '#' || | ||
unicode.IsSpace(r) || | ||
unicode.IsControl(r) || | ||
('&' != r && '/' != r && unicode.Is(unicode.Categories["Po"], r)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's worth putting a comment here to explain this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments and made it a bit more lenient so e.g. (#chungus)
will work.
inTag := false | ||
|
||
for i, r := range text { | ||
if r == '#' && isHashtagBoundary(prev) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worth replacing these with a switch statement or nah? just a style thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool I didn't know about switch /* nothing here */ { ... }
!
I tried it but then I cuoldn't do assignment in a case, so I left it like that instead of duplicating that bit. :)
Looks great to me now, good work! If you're happy with it i'll merge it :) |
Yes :D |
Boom! thank you!! |
Go
\b
does not care for Unicode, and without lookahead in the regex engine, the workarounds got very ugly. So I replaced the regex with a parser that uses Go's Unicode functions directly.The parser runs in O(n) time and performance should be comparable.
Might even be faster since I think this is less work, but I don't know how to benchmark. :)
I'm sure it's all sorts of unidiomatic but maybe it can be salvaged!
Hashtags that wouldn't work before were those ending in non-ASCII characters, like
#blå
, since\b
would considerl
followed byå
to be a word boundary.