How do I find all occurrences of emojis in my codebase? #1623
I would like to use ripgrep to identify all usages of emoji inside a repository, but I was not able to find the right command to perform such a search. Still, I was able to find a mention of something about emoji at Line 2187 in 1b2c1dc. Any hints?
Replies: 4 comments 10 replies
I would expect |
Could you provide a test case so that I can play with it? Just to make sure we're on the same page about what specifically you're looking for.
I think you might be a bit confused about what \p{Emoji} actually is. \p{Emoji} is a single Unicode property that matches exactly one codepoint, but an emoji can be made up of multiple codepoints, and there is no fixed length as to how long it can be. One such example is 0️⃣. This particular emoji is made up of three codepoints: U+0030 U+FE0F U+20E3.

UAX#51 is the canonical definition of emoji with respect to Unicode. In particular, its EBNF and regex section provides an easy-to-use pattern that will match a superset of emoji. Apparently, one needs to verify the results of that scan against more complex rules (listed in UAX#51), but maybe the regex is good enough for your use case. Performing these validity checks is probably not something you can do with ripgrep alone. But you can at least run the regex it provides (most grep tools and even regex engines do not have sufficient Unicode support to do even this).
This will still match decimal digits, among other things. If you do not want decimal numbers or other such things appearing in your results, then there is a simple workaround: just subtract the ASCII subset of codepoints from \p{Emoji}.
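The point about multi-codepoint emoji, and why plain digits end up in the results, can be checked outside of ripgrep. The following is only a minimal Python sketch using the standard library; the helper name `describe` and the variable `keycap_zero` are made up for illustration and are not part of ripgrep or UAX#51. It splits the keycap emoji from the example into its individual codepoints:

```python
import unicodedata


def describe(text):
    """Return a (codepoint, Unicode name) pair for every codepoint in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The keycap emoji from the example above, written with escapes so that
# this snippet does not depend on how the emoji renders.
keycap_zero = "\u0030\ufe0f\u20e3"

# One user-visible emoji, but three codepoints.
print(len(keycap_zero))
for codepoint, name in describe(keycap_zero):
    print(codepoint, name)

# The base codepoint is an ordinary ASCII character, which is why a
# pattern that accepts it as an emoji base also hits plain digits.
print(keycap_zero[0].isascii())
```

The first codepoint is U+0030, the ordinary ASCII digit zero, so any single-codepoint class that accepts it as an emoji base will also match digits in regular text. Removing ASCII from that class avoids those hits, presumably at the cost of no longer matching keycap sequences like this one.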
In short, at the moment there is no way to use ripgrep to identify all emojis. Hopefully someone will add support for this. PS: Getting matches on numbers is a deal-breaker, for obvious reasons.
I think you might be a bit confused about what \p{Emoji} actually is. \p{Emoji} is a single Unicode property that matches exactly one codepoint. But emoji can of course be multiple codepoints, and there is no fixed length as to how long they can be. Your regex, for example, only matches emoji that are at most two codepoints. This leaves out any emoji that are encoded with more than two codepoints. You can find lots of examples in Unicode's full list of emoji: https://unicode.org/emoji/charts/full-emoji-list.html

One such example is 0️⃣. This particular emoji is made up of three codepoints: U+0030 U+FE0F U+20E3. Notice that U+0030
…