How do I find all occurrences of emojis in my codebase? #1623

ssbarnea · 2020-06-20T16:02:58Z

ssbarnea
Jun 20, 2020

I would like to use ripgrep to identify all usages of emoji inside a repository, but I was not able to find the right command to perform such a search.

Still, I was able to find a mention of emoji in ripgrep/crates/core/app.rs (line 2187 in 1b2c1dc):

* A large array of classes like '\\p{Emoji}' are available.

Any hints?

Answered by BurntSushi


blankname · 2020-06-20T16:35:37Z

blankname
Jun 20, 2020

I would expect rg '\p{Emoji}' to work, but it seems to be matching digits as well as emoji.

2 replies

ssbarnea Jun 21, 2020
Author

Yep, that is what I tried, and when I saw the number of digit matches, I decided it does not work.

BurntSushi Jun 21, 2020
Maintainer

This is correct and intended behavior. The Emoji Unicode property is defined to include ASCII digits according to Unicode. See emoji-data.txt as part of the UCD.

BurntSushi · 2020-06-20T16:56:09Z

BurntSushi
Jun 20, 2020
Maintainer

Could you provide a test case so that I can play with it? Just to make sure we're on the same page about what specifically you're looking for.

4 replies

ssbarnea Jun 21, 2020
Author

I do expect the above character class to perform a match similar to the one at https://www.regextester.com/106421 -- in fact, the Unicode standard is very clear regarding which character ranges fall under the emoji range(s).

BurntSushi Jun 21, 2020
Maintainer

Could you please link/cite the section of the Unicode standard that you're referring to?

BurntSushi Jun 21, 2020
Maintainer

And where did you get that regex from? I don't see how it can be correct with respect to UAX#51.

BurntSushi Jun 21, 2020
Maintainer

It does not, for example, match 0️⃣. (that is, U+0030 U+FE0F U+20E3).

BurntSushi · 2020-06-21T11:07:04Z

BurntSushi
Jun 21, 2020
Maintainer

I think you might be a bit confused about what \p{Emoji} actually is. \p{Emoji} is a single Unicode property that matches exactly one codepoint. But Emoji can of course be multiple codepoints, and there is no fixed length as to how long they can be. Your regex, for example, only matches emoji that are at most two codepoints. This leaves out any emoji that are encoded with more than two codepoints. You can find lots of examples in Unicode's full list of emoji: https://unicode.org/emoji/charts/full-emoji-list.html

One such example is 0️⃣ (the keycap for the digit zero). [Image of the rendered emoji omitted.]

This particular emoji is made up of three codepoints: U+0030 U+FE0F U+20E3. Notice that U+0030 is the ASCII digit 0. This is why the \p{Emoji} class matches digits (along with # and *). Namely, \p{Emoji} is only part of the full definition of Emojis according to Unicode.
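To make the codepoint breakdown above concrete, here is a minimal Python sketch that lists the codepoints of 0️⃣ using only the standard library. The helper name `describe` is made up for illustration; it is not part of ripgrep or any Unicode tooling.

```python
import unicodedata

# The keycap emoji discussed above, written as its three codepoints:
# ASCII digit 0, then U+FE0F, then U+20E3.
keycap_zero = "0\ufe0f\u20e3"


def describe(text):
    """Return (codepoint, Unicode name) pairs for every codepoint in text.

    `describe` is a hypothetical helper written for this example only.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for codepoint, name in describe(keycap_zero):
    print(codepoint, name)
```

Since the first codepoint is a plain ASCII digit, a class that matches one codepoint at a time has to accept that digit if it is going to match this emoji at all.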

UAX#51 is the canonical definition of emoji with respect to Unicode. In particular, its EBNF and regex section provides an easy-to-use pattern that will match a superset of emoji. Apparently, one needs to verify the results of that scan in accordance with more complex rules (listed in UAX#51), but maybe the regex is good enough for your use cases. Performing these validity checks is probably not something you can do with ripgrep alone. But you can at least run the regex it provides (most grep tools and even regex engines do not have sufficient Unicode support to do even this):

$ rg '\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\x{FE0F}\x{20E3}?|[\x{E0020}-\x{E007E}]+\x{E007F})?(\x{200D}\p{Emoji}(\p{EMod}|\x{FE0F}\x{20E3}?|[\x{E0020}-\x{E007E}]+\x{E007F})?)*'

This will still match decimal digits, along with * and #. As far as I can tell, this is correct and expected. I don't have the time to check whether decimal digits alone are technically considered proper emoji yet (if I did, I would check the validity rules defined in UAX#51), but even if they aren't, it's still correct because UAX#51 specifically says that the above regex matches a superset of emojis. That is, it may yield false positives.

If you do not want decimal numbers or other such things appearing in your results, then there is a simple work-around: just subtract the ASCII subset of codepoints from \p{Emoji}. ripgrep supports character class set operations. So, use [\p{Emoji}--\p{Ascii}] instead of \p{Emoji} in the regex above. However, this will cause the regex to miss 0️⃣, along with other emojis whose base is an ASCII codepoint.
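The substitution described above is easy to get wrong by hand, since \p{Emoji} appears twice in the pattern. The following Python sketch only builds the modified pattern string and prints a shell-quoted rg command; it does not run ripgrep. The function name `exclude_ascii` is an assumption made for this example.

```python
import shlex

# The superset regex from the rg command above, split across lines for
# readability. Raw strings keep the backslashes intact.
UAX51_PATTERN = (
    r"\p{RI}\p{RI}"
    r"|\p{Emoji}(\p{EMod}|\x{FE0F}\x{20E3}?|[\x{E0020}-\x{E007E}]+\x{E007F})?"
    r"(\x{200D}\p{Emoji}(\p{EMod}|\x{FE0F}\x{20E3}?|[\x{E0020}-\x{E007E}]+\x{E007F})?)*"
)


def exclude_ascii(pattern):
    """Swap every \\p{Emoji} for the set difference [\\p{Emoji}--\\p{Ascii}].

    Hypothetical helper: it is plain string replacement, nothing more.
    """
    return pattern.replace(r"\p{Emoji}", r"[\p{Emoji}--\p{Ascii}]")


no_ascii_pattern = exclude_ascii(UAX51_PATTERN)

# Print a command line that can be pasted into a POSIX shell.
print("rg", shlex.quote(no_ascii_pattern))
```

As noted above, the resulting pattern trades false positives on digits, # and * for false negatives on emoji whose base is an ASCII codepoint.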

3 replies

jacobhq Jun 27, 2024

This is great, thanks! I'm working with pretty big datasets though (~3gb), so any suggestions to improve performance?

BurntSushi Jun 27, 2024
Maintainer

The regex is quite big, and if you're searching lots of non-ASCII data, there's an outside chance that it's filling up ripgrep's lazy DFA cache. So you could try setting --dfa-size-limit 999999999 or something like that. Otherwise, there aren't any literals in the regex, so this basically has to just use one giant state machine.

The only way I can think of to make this go faster is to write a bespoke program to do it. But even then, it would require some kind of data dependent heuristic to beat ripgrep (probably).
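For anyone scripting this, here is a rough Python sketch that wraps the --dfa-size-limit suggestion above in a small helper. The function names, the default path "." and the choice to pass the pattern with -e are assumptions made for illustration; whether the larger cache helps at all depends on the data.

```python
import shutil
import subprocess


def build_rg_command(pattern, path=".", dfa_size_limit="999999999"):
    """Build an rg argument list with a raised lazy DFA cache limit.

    Hypothetical helper. The limit value is the one suggested above;
    -e passes the pattern explicitly so it is never parsed as a flag.
    """
    return ["rg", "--dfa-size-limit", dfa_size_limit, "-e", pattern, path]


def run_search(pattern, path="."):
    """Run the search if rg is on PATH; return None when it is not."""
    if shutil.which("rg") is None:
        return None
    # rg exits with status 1 when nothing matches, so do not raise on it.
    return subprocess.run(
        build_rg_command(pattern, path),
        capture_output=True,
        text=True,
        check=False,
    )


print(build_rg_command(r"\p{Emoji}"))
```

Passing the arguments as a list avoids shell quoting issues with the backslashes and braces in the pattern.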

jacobhq Jun 28, 2024

Thank you!

ssbarnea · 2020-06-22T09:10:57Z

ssbarnea
Jun 22, 2020
Author

In short, at the moment there is no way to use ripgrep to identify all emojis. Hopefully someone will add support for this. PS. Getting number matches is a deal-breaker, for obvious reasons.

1 reply

BurntSushi Jun 22, 2020
Maintainer

Did you see my comment? ripgrep can identify all emojis. It will just include some false positives. I showed how to remove numbers from your matches.
