Emojis split up unexpectedly (e.g. https://emojipedia.org/ninja-cat/) #29

Open
ClemensSchneider opened this issue Oct 30, 2019 · 5 comments

Comments

@ClemensSchneider

Hi there,

first of all, thanks a lot for this library and the efforts you put in!

I've got a scenario where some emojis seem to be split up incorrectly.

When splitting up the following emoji-sequence:
πŸ±β€πŸ’»πŸ±β€πŸš€πŸ±β€πŸ‘€

I get the following string-tokens (notice the first two matching and the ninja-cat being split into two):
[image: screenshot of the resulting string tokens]

Is there an easy explanation for the behavior or is there a general guideline on which emojis are supported and which aren't?

I'm on Windows 10.

Thanks!

@jasonsbarr

See #28

@ClemensSchneider
Author

This seems to explain it, yes.
So is there an easy way to identify, or disallow the input of, non-recommended Unicode emojis that are not meant to be supported cross-platform?

@shmibbles

shmibbles commented May 4, 2020

I'm also getting unexpected splitting: 👨🏿‍🦰 splits into [ "👨🏿‍", "🦰" ], and this emoji is officially part of Unicode, check it out. It might have something to do with zero width joiners not being recognized correctly?

Edit: I just did some more testing, and 👨🏿‍🦰 is split after the ZWJ (\u200d), so the split 👨🏿‍🦰 looks like this:
[ "\ud83d\udc68\ud83c\udfff\u200d", "\ud83e\uddb0" ]. Maybe the red hair component (\ud83e\uddb0) and other hair components aren't recognized after the ZWJ? Hope this helps.
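Until the splitting itself is fixed, one possible stopgap is to post-process the tokens and glue any token that ends with a ZWJ back onto the token that follows it. The sketch below assumes the tokens arrive as a plain array of strings; `rejoinZwjSplits` is a hypothetical helper name, not something the library provides. Note that it would also merge a stray trailing ZWJ with whatever unrelated text follows it.

```javascript
// Hypothetical helper, not part of the library discussed in this thread.
const ZWJ = "\u200d";

// Merge every token that ends with a ZWJ into the token that follows it.
function rejoinZwjSplits(tokens) {
  const out = [];
  let pending = "";
  for (const token of tokens) {
    const joined = pending + token;
    if (joined.endsWith(ZWJ)) {
      pending = joined; // sequence is still incomplete, keep accumulating
    } else {
      out.push(joined);
      pending = "";
    }
  }
  if (pending) out.push(pending); // trailing ZWJ with nothing after it
  return out;
}

// The two tokens reported above for the man with dark skin tone and red hair.
const tokens = ["\ud83d\udc68\ud83c\udfff\u200d", "\ud83e\uddb0"];
const merged = rejoinZwjSplits(tokens);
console.log(merged.length); // 1
console.log(merged[0] === "\ud83d\udc68\ud83c\udfff\u200d\ud83e\uddb0"); // true
```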

@rodrigobutta

Hi! Same here: this emoji 👨‍🦳 gets split into [ '👨‍', '🦳' ], which are these code points: [ 128104, 8205 ], [ 129459 ]. It also happens with other similar emojis. I haven't found a pattern yet that would let me fix it.
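For anyone who wants to reproduce these numbers, a minimal sketch for listing the code points of a string follows. `codePoints` is a hypothetical helper name; `Array.from` iterates a string by code point, so surrogate pairs stay together.

```javascript
// Hypothetical helper: list the Unicode code points of a string.
function codePoints(str) {
  return Array.from(str, (ch) => ch.codePointAt(0));
}

// Man + ZWJ + white hair component, written with escapes to avoid encoding issues.
const whiteHairMan = "\u{1F468}\u200D\u{1F9B3}";

console.log(codePoints(whiteHairMan));
// [ 128104, 8205, 129459 ]

console.log(codePoints(whiteHairMan).map((cp) => "U+" + cp.toString(16).toUpperCase()));
// [ 'U+1F468', 'U+200D', 'U+1F9B3' ]
```

The middle value, 8205 (U+200D), is the zero width joiner, which matches the observation that the split happens right after it.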

@Lemmingh

@shmibbles

I think you're right. The hair components were introduced in Unicode 11.0, so this depends on #24.
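As a point of comparison, a runtime's built-in `Intl.Segmenter` can show how a segmenter with newer Unicode data treats the same sequence. This is only a sketch: `countGraphemes` is a hypothetical helper, it assumes the runtime ships `Intl.Segmenter`, and the result depends on the Unicode version bundled with that runtime. With data at Unicode 11.0 or newer, the sequence below would be expected to count as a single grapheme.

```javascript
// Hypothetical helper: count user-perceived characters with the runtime's own segmenter.
// Returns null when Intl.Segmenter is not available.
function countGraphemes(str) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return null;
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(str)).length;
}

// Man + dark skin tone + ZWJ + red hair component.
const redHairMan = "\u{1F468}\u{1F3FF}\u200D\u{1F9B0}";
console.log(countGraphemes(redHairMan));
```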
