Hashtag regex doesn't catch Unicode hashtags #4398

Closed
mojodna opened this issue Oct 5, 2017 · 3 comments
Labels
bug A bug - let's fix this!

Comments

@mojodna
Collaborator

mojodna commented Oct 5, 2017

After 9719a31, Unicode characters (including emoji ;-) in hashtags are no longer included (they're now treated as word separators):

> "#g1b".match(/#[\w-]+/g);
[ '#g1b' ]
> "#g1ü".match(/#[\w-]+/g);
[ '#g1' ]

I've been using /(#[^\d][^#\s,;:]*)/g as my current regex of choice (with the dubious assumption that hashtags shouldn't start with numbers).
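For illustration, here is a minimal sketch of how that regex behaves. The sample string and variable names are hypothetical and not taken from iD; only the regex itself comes from the comment above.

```javascript
// Regex quoted above: '#', then one non-digit character, then anything
// up to the next '#', whitespace, ',', ';' or ':'.
const looseHashtagRegex = /(#[^\d][^#\s,;:]*)/g;

// Hypothetical sample text ("\u00fc" is "ü").
const sample = '#g1\u00fc #2017 #mapathon, #tag;';

// '#2017' is skipped because the character right after '#' is a digit.
const looseMatches = sample.match(looseHashtagRegex);
console.log(looseMatches);
```

Note that `[^\d]` accepts any non-digit, including a space, so an input like `# foo` would also match as a whole; that may or may not be desirable.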

@bhousel bhousel added the bug A bug - let's fix this! label Oct 5, 2017
@bhousel
Member

bhousel commented Oct 5, 2017

Whoa that is interesting and kind of surprising to me. Lots of info here:
https://stackoverflow.com/questions/280712/javascript-unicode-regexes
https://mathiasbynens.be/notes/es6-unicode-regex

I think this is fixed in ES6 by adding the /u regex flag. Unfortunately, this means it won't work in IE11.

@mojodna, can you submit a PR with a fix that uses one of the ES5-safe options?

@mojodna
Collaborator Author

mojodna commented Oct 5, 2017

/u doesn't expand the word class, unfortunately. I saw some Stack Overflow posts with suggestions for Unicode ranges that roughly equate to "word", but they seem excessively complex. Inverting punctuation seems promising (potentially with no need for /u support, since the ranges are written as pairs of 4-digit \uXXXX escapes?). This is from https://stackoverflow.com/a/25575009:

/(#[^\u2000-\u206F\u2E00-\u2E7F\s\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]*)/g

> "#÷¿øü?,#tag".match(/(#[^\u2000-\u206F\u2E00-\u2E7F\s\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]*)/g)
[ '#÷¿øü', '#tag' ]

Thoughts?
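The point about /u not expanding the word class can be checked with a quick sketch. It assumes a runtime that supports the ES6 `u` flag; the variable names are made up for this example.

```javascript
// \w stays ASCII-only ([A-Za-z0-9_]) with or without the ES6 `u` flag.
const input = '#g1\u00fc'; // "#g1ü", as in the original report

const withoutU = input.match(/#[\w-]+/g);
const withU = input.match(/#[\w-]+/gu);

// Both calls stop before "ü", so the two results are identical.
console.log(withoutU);
console.log(withU);
```

The inverted-punctuation regex avoids the question entirely, because its `\uXXXX` escapes are plain ES5 syntax.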

@bhousel
Member

bhousel commented Oct 6, 2017

@mojodna That seems fine. I'm ok with a big ugly regex, as long as the code is readable and commented, something like:

// match unicode range and non punctuation, see #4398
var hashtagRegex = /biguglyregex/;   
whatever.match(hashtagRegex);
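Filled in with the regex proposed earlier in the thread, that pattern might look like the sketch below. The `extractHashtags` helper and the filtering of lone `#` matches are assumptions made for illustration; they are not the code that was committed to iD.

```javascript
// match '#' followed by anything that is not whitespace or punctuation
// (ASCII punctuation plus the U+2000-206F and U+2E00-2E7F ranges), see #4398
var hashtagRegex = /(#[^\u2000-\u206F\u2E00-\u2E7F\s\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]*)/g;

// Hypothetical helper, not part of iD.
function extractHashtags(text) {
    // String.prototype.match returns null when nothing matches.
    var matches = text.match(hashtagRegex) || [];
    // The `*` quantifier also matches a lone '#', so drop those (assumption).
    return matches.filter(function (tag) { return tag.length > 1; });
}

console.log(extractHashtags('#g1\u00fc, #tag and a lone # sign'));
```

Because `-` and `_` are in the excluded set, this class treats them as delimiters, unlike the earlier `[\w-]+` pattern, which allowed both inside a hashtag.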

mojodna added a commit to mojodna/iD that referenced this issue Oct 9, 2017
Unicode ranges for punctuation are simpler than creating a Unicode-aware
word class, so delimit on non-words.

Fixes openstreetmap#4398