Docs: UTF-8 character can take up 4 bytes #27060

Closed
ChrisGreenaway opened this issue Oct 20, 2017 · 4 comments
Labels
>docs General docs changes

Comments

@ChrisGreenaway

Elasticsearch version (bin/elasticsearch --version): master

Plugins installed: N/A

JVM version (java -version): N/A

OS version (uname -a if on a Unix-like system): N/A

Description of the problem including expected versus actual behavior:

ignore-above.asciidoc says "If you use UTF-8 text with many non-ASCII characters, you may want to set the limit to 32766 / 3 = 10922 since UTF-8 characters may occupy at most 3 bytes." However, a UTF-8 character can take up to 4 bytes.
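
A minimal sketch of the arithmetic behind this report, assuming one "character" means one Unicode code point and that the 32766 figure quoted above is a byte budget. The helper names `utf8_len` and `safe_char_limit` are hypothetical and exist only for this illustration; they are not part of Elasticsearch.

```python
# Byte budget taken from the docs sentence quoted above.
MAX_TERM_BYTES = 32766


def utf8_len(ch: str) -> int:
    """Return how many bytes `ch` occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def safe_char_limit(max_bytes: int, bytes_per_char: int) -> int:
    """Largest character count that fits in max_bytes in the worst case."""
    return max_bytes // bytes_per_char


# One sample code point per UTF-8 encoded length.
samples = {
    "a": "ASCII letter",
    "\u00e9": "e with acute accent",
    "\u20ac": "euro sign",
    "\U0001F600": "emoji (outside the Basic Multilingual Plane)",
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_len(ch)} byte(s)")

print("limit assuming 3 bytes per character:", safe_char_limit(MAX_TERM_BYTES, 3))
print("limit assuming 4 bytes per character:", safe_char_limit(MAX_TERM_BYTES, 4))
```

Under these assumptions the worst-case limit drops from 10922 to 8191 characters once 4-byte code points such as emoji are taken into account.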

Steps to reproduce:

Look at ignore-above.asciidoc

Provide logs (if relevant):

@nik9000 nik9000 added the >docs General docs changes label Oct 20, 2017
@nik9000
Member

nik9000 commented Oct 20, 2017

@clintongormley git blame says you wrote this. @ChrisGreenaway is right, 4 bytes is possible, but not common. It looks like 4 bytes is fairly rare unless you are dealing with fairly special text. Or emoji.

@DaveCTurner
Contributor

I agree - pretty common to see emoji these days. I opened #27083.

@DaveCTurner
Contributor

Thanks for the report, @ChrisGreenaway, much appreciated.

@ChrisGreenaway
Author

You are welcome. Thanks for fixing it.

3 participants