None of the regexes match emoji, and only emoji #174

Open
robintown opened this issue Jun 5, 2024 · 5 comments

@robintown

A regex that matches emoji would be a really useful thing to have in the JS ecosystem! Unfortunately, between Emojibase and emoji-regex, I still haven't seen a package that actually does this. In the case of Emojibase:

  • emojibase-regex matches some textual characters such as '↔'.
  • emojibase-regex/emoji doesn't match emoji without U+FE0F, such as '✨'.
  • emojibase-regex/emoji-loose matches some textual characters without U+FE0E, such as '↔'.
  • And the rest of the provided regexes are obviously not intended to be used for matching emoji.

What's missing is a regex that matches exactly those character sequences that are presented to users as emoji. Some characters are defined in Unicode to default to emoji presentation (see the Emoji_Presentation section), while others require U+FE0F to change their presentation mode. A correct implementation would account for both of these facts, and use a negative lookahead to avoid matching characters with U+FE0E.
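As a rough illustration of the rule described above, here is a minimal sketch using Unicode property escapes in `u` mode. It only covers single code points (no flags, keycaps, skin-tone modifiers, or ZWJ sequences), and the name `SINGLE_EMOJI` is made up for this example; it is not something Emojibase or emoji-regex provides.

```javascript
// Hypothetical name, for illustration only.
// Branch 1: a code point that defaults to emoji presentation,
//           unless it is followed by U+FE0E (text variation selector).
// Branch 2: any Emoji code point explicitly switched to emoji
//           presentation by U+FE0F (emoji variation selector).
const SINGLE_EMOJI = /\p{Emoji_Presentation}(?!\uFE0E)|\p{Emoji}\uFE0F/u;

const samples = [
  '\u2728',       // ✨ defaults to emoji presentation
  '\u2728\uFE0E', // ✨ forced to text presentation
  '\u2194',       // ↔ defaults to text presentation
  '\u2194\uFE0F', // ↔ forced to emoji presentation
  'a',            // not an emoji at all
];
const singleEmojiResults = samples.map((s) => SINGLE_EMOJI.test(s));
// [true, false, false, true, false]
```

One caveat with this sketch: `\p{Emoji}` also covers the digits, `#` and `*`, so a digit followed by U+FE0F would match the second branch even though it is not a complete keycap sequence.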

milesj self-assigned this Jun 5, 2024
robintown added a commit to robintown/matrix-react-sdk that referenced this issue Jun 6, 2024
We were using emojibase-regex to match emoji within messages. However, the docs (https://emojibase.dev/docs/regex/) state that this regex matches both emoji and text presentation characters. This is not what we want, and will result in false positives for characters like '↔' that could turn into an emoji if paired with a variation selector. Unfortunately, none of the other regexes provided by Emojibase do what we want either (milesj/emojibase#174). In the meantime, browser support for the RGI_Emoji character sequence class has made it feasible to write an emoji regex by hand, so that's what I've done.
@milesj
Owner

milesj commented Jun 7, 2024

I'll be honest, it's been so long since I've worked on this emoji stuff that I've forgotten a lot of how it works. I always have to re-learn the codebase each time I update it. So I'm sure there are bugs everywhere.

With that said, I am tinkering with the regexes here: #175

@milesj
Owner

milesj commented Jun 9, 2024

So after looking at this post and the code again, your description of how these regexes behave is correct. It's by design.

  • emojibase-regex matches some textual characters such as '↔'.
  • emojibase-regex/emoji doesn't match emoji without U+FE0F, such as '✨'.
  • emojibase-regex/emoji-loose matches some textual characters without U+FE0E, such as '↔'.
  • And the rest of the provided regexes are obviously not intended to be used for matching emoji.

I also use regexgen (https://github.com/devongovett/regexgen) to generate the regex pattern, and it does not support negative lookaheads. I'm not aware of another library to handle this and I'm definitely not going to write it from scratch.

There is a regex using Unicode properties, but I haven't tested it in years: https://emojibase.dev/docs/regex#unicode-property-support

@milesj
Owner

milesj commented Jun 9, 2024

Been thinking about this more, and I think we could solve this by using functions, like isEmojiPresentation and isTextPresentation, instead of relying purely on RegExp instances. With functions we could run the necessary checks to ensure it's exactly what you want.
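To make the function-based idea concrete, here is a minimal sketch. Only the names `isEmojiPresentation` and `isTextPresentation` come from the comment above; the signatures, the `splitSelector` helper, and the restriction to a single code point plus an optional variation selector are all assumptions made for this illustration, not an existing Emojibase API.

```javascript
const TEXT_SELECTOR = '\uFE0E';  // requests text presentation
const EMOJI_SELECTOR = '\uFE0F'; // requests emoji presentation

// Hypothetical helper: split input into a base code point and an
// optional variation selector. Returns null for anything longer.
function splitSelector(input) {
  const [base, selector, ...rest] = Array.from(input);
  if (base === undefined || rest.length > 0) return null;
  if (selector !== undefined && selector !== TEXT_SELECTOR && selector !== EMOJI_SELECTOR) {
    return null;
  }
  return { base, selector };
}

// True when the input is an Emoji code point that is shown as emoji,
// either by default or because U+FE0F follows it.
function isEmojiPresentation(input) {
  const parts = splitSelector(input);
  if (parts === null || !/^\p{Emoji}$/u.test(parts.base)) return false;
  if (parts.selector === EMOJI_SELECTOR) return true;
  if (parts.selector === TEXT_SELECTOR) return false;
  return /^\p{Emoji_Presentation}$/u.test(parts.base);
}

// True when the input is an Emoji code point that is shown as text,
// either by default or because U+FE0E follows it.
function isTextPresentation(input) {
  const parts = splitSelector(input);
  if (parts === null || !/^\p{Emoji}$/u.test(parts.base)) return false;
  return !isEmojiPresentation(input);
}
```

With these definitions, `isEmojiPresentation('✨')` and `isTextPresentation('↔')` both return `true`, while a plain letter such as `'a'` is neither. Handling flags, keycaps, and ZWJ sequences would need more than this sketch covers.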

@robintown
Author

Re: the Unicode properties approach, I was happy to discover that the new RegExp v mode makes writing an emoji regex by hand pretty easy, and this is what I've ended up going for.

/\p{RGI_Emoji}(?!\uFE0E)(?:(?<!\uFE0F)\uFE0F)?/v

All major browsers support it, though only as of late 2023. You can get a version that kinda sorta works while only using u mode if you replace \p{RGI_Emoji} with this regex, but it's not going to do well with flags and ZWJ sequences unless you teach it exactly what the valid sequences are.
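Because `v` mode is recent, code that has to run on older engines may want to feature-detect it. The sketch below is one possible way to do that and is not taken from any of the linked commits: `buildEmojiRegex` is a hypothetical helper, and the `u`-mode fallback is a simplified single-code-point approximation rather than the linked replacement regex.

```javascript
// Hypothetical helper, for illustration only.
// Patterns are passed as strings so that an engine without the v flag
// fails at construction time (catchable) instead of at parse time.
function buildEmojiRegex(extraFlags = '') {
  try {
    // The v-mode pattern from the comment above.
    return new RegExp(
      '\\p{RGI_Emoji}(?!\\uFE0E)(?:(?<!\\uFE0F)\\uFE0F)?',
      'v' + extraFlags,
    );
  } catch {
    // Approximate fallback: handles single code points only, so flags
    // and ZWJ sequences are not matched as whole units.
    return new RegExp(
      '\\p{Emoji_Presentation}(?!\\uFE0E)|\\p{Emoji}\\uFE0F',
      'u' + extraFlags,
    );
  }
}

const emojiRegex = buildEmojiRegex();
const containsEmoji = (text) => emojiRegex.test(text);
```

Pass `'g'` as `extraFlags` to collect every match with `String.prototype.matchAll`. The two branches agree on simple cases such as '✨' and '↔', but only the `v`-mode branch treats a flag or a ZWJ sequence as one match.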

robintown added a commit to robintown/matrix-react-sdk that referenced this issue Jun 12, 2024
We were using emojibase-regex to match emoji within messages. However, the docs (https://emojibase.dev/docs/regex/) state that this regex matches both emoji and text presentation characters. This is not what we want, and will result in false positives for characters like '↔' that could turn into an emoji if paired with a variation selector. Unfortunately, none of the other regexes provided by Emojibase do what we want either (milesj/emojibase#174). In the meantime, browser support for the RGI_Emoji character sequence class has made it feasible to write an emoji regex by hand, so that's what I've done.
@milesj
Owner

milesj commented Jun 12, 2024

Nice, good to know! Been waiting years for all those to become available.

robintown added a commit to robintown/matrix-react-sdk that referenced this issue Jul 4, 2024
We were using emojibase-regex to match emoji within messages. However, the docs (https://emojibase.dev/docs/regex/) state that this regex matches both emoji and text presentation characters. This is not what we want, and will result in false positives for characters like '↔' that could turn into an emoji if paired with a variation selector. Unfortunately, none of the other regexes provided by Emojibase do what we want either (milesj/emojibase#174). In the meantime, browser support for the RGI_Emoji character sequence class has made it feasible to write an emoji regex by hand, so that's what I've done.
github-merge-queue bot pushed a commit to matrix-org/matrix-react-sdk that referenced this issue Jul 4, 2024
* Don't consider textual characters to be emoji

We were using emojibase-regex to match emoji within messages. However, the docs (https://emojibase.dev/docs/regex/) state that this regex matches both emoji and text presentation characters. This is not what we want, and will result in false positives for characters like '↔' that could turn into an emoji if paired with a variation selector. Unfortunately, none of the other regexes provided by Emojibase do what we want either (milesj/emojibase#174). In the meantime, browser support for the RGI_Emoji character sequence class has made it feasible to write an emoji regex by hand, so that's what I've done.

* Add a fallback for BIGEMOJI_REGEX as well