Update charwidths for Unicode 8 #46
Comments
Perhaps the easiest thing to do is to define the character widths on a per-block basis. For example, all the Anatolian Hieroglyphs could be assigned a width of 2 based on inspecting the representative glyphs in the code charts. http://www.unicode.org/charts/PDF/Unicode-8.0/U80-14400.pdf |
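For illustration, a per-block assignment like the one suggested above could be sketched roughly as follows. This is only a minimal sketch: `block_width_t`, `block_widths`, and `block_width` are hypothetical names and not part of utf8proc, and the width of 2 for Anatolian Hieroglyphs (U+14400..U+1467F) is the value proposed in the comment, not something verified against a font.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block override table; not part of utf8proc. */
typedef struct {
    uint32_t first;  /* first codepoint of the block (inclusive) */
    uint32_t last;   /* last codepoint of the block (inclusive) */
    int width;       /* column width assigned to every codepoint in the block */
} block_width_t;

static const block_width_t block_widths[] = {
    /* Anatolian Hieroglyphs; width 2 is the value proposed in the comment */
    { 0x14400, 0x1467F, 2 },
};

/* Returns the block-level width for c, or -1 if no block override applies. */
int block_width(uint32_t c) {
    size_t n = sizeof block_widths / sizeof block_widths[0];
    for (size_t i = 0; i < n; i++) {
        if (c >= block_widths[i].first && c <= block_widths[i].last)
            return block_widths[i].width;
    }
    return -1;
}
```

A caller would presumably consult a table like this only when neither Unifont nor UAX 11 supplies a width for the codepoint.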
I was just thinking of something simpler:

#############################################################################
# Use a default width of 1 for all character categories that are
# letter/symbol/number-like. This can be overridden by Unifont or UAX 11
# below, but provides a useful nonzero fallback for new codepoints when
# a new Unicode version has been released but Unifont hasn't been updated yet.
zerowidth = Set{Int}() # categories that may contain zero-width chars
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CN)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_MN)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_MC)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ME)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_SK)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ZS)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ZL)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ZP)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CC)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CF)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CS)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CO)
for c in 0x0000:0x110000
    if catcode(c) ∉ zerowidth
        CharWidths[c] = 1
    end
end

Already with this I noticed something odd in our current charwidth data:
Since https://codepoints.net/U+1D56C is from Unicode 3.1 and clearly has nonzero width, why are we reporting zero width? Does Unifont not have this codepoint? |
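The same fallback rule could be expressed on the C side roughly like this. It is a sketch under assumptions: the `CAT_*` constants are local stand-ins assumed to match the category numbering in utf8proc.h, `fallback_width` is a hypothetical helper, and a negative `known_width` is an invented convention meaning "no width from Unifont or UAX 11".

```c
#include <stdint.h>

/* Stand-in category codes, assumed to match the numbering in utf8proc.h. */
enum {
    CAT_CN = 0,  CAT_MN = 6,  CAT_MC = 7,  CAT_ME = 8,
    CAT_SK = 21, CAT_ZS = 23, CAT_ZL = 24, CAT_ZP = 25,
    CAT_CC = 26, CAT_CF = 27, CAT_CS = 28, CAT_CO = 29
};

/* Bitmask of the categories that may contain zero-width characters. */
static const uint32_t ZEROWIDTH_MASK =
    (1u << CAT_CN) | (1u << CAT_MN) | (1u << CAT_MC) | (1u << CAT_ME) |
    (1u << CAT_SK) | (1u << CAT_ZS) | (1u << CAT_ZL) | (1u << CAT_ZP) |
    (1u << CAT_CC) | (1u << CAT_CF) | (1u << CAT_CS) | (1u << CAT_CO);

/* Hypothetical helper: known_width < 0 means no explicit width is available. */
int fallback_width(int category, int known_width) {
    if (known_width >= 0)
        return known_width;  /* explicit data overrides the default */
    if (category < 0 || category > 29)
        return 0;            /* unknown category code: stay at zero */
    return ((ZEROWIDTH_MASK >> category) & 1u) ? 0 : 1;
}
```

With this rule a letter-category codepoint that has no explicit entry gets width 1 rather than 0, while combining marks and control characters stay at 0 unless the data says otherwise.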
shouldn't that be |
It doesn't matter, because invalid codepoints return category Cn. |
Yep, it's just my OCD acting up! |
Adding

if (w == 0 &&
    ((cat >= UTF8PROC_CATEGORY_LU && cat <= UTF8PROC_CATEGORY_LO) ||
     (cat >= UTF8PROC_CATEGORY_ND && cat <= UTF8PROC_CATEGORY_SC) ||
     (cat >= UTF8PROC_CATEGORY_SO && cat <= UTF8PROC_CATEGORY_ZS))) {
    fprintf(stderr, "zero width for symbol-like char %x\n", c);
    error = 1;
}

to the loop in |
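To try the check outside the real test program, a standalone version might look like the sketch below. The `CAT_*` values are local stand-ins assumed to match the enum ordering in utf8proc.h, which the range comparisons depend on. `entry_t` and `count_zero_width_symbols` are hypothetical and only stand in for whatever data the actual loop iterates over.

```c
#include <stddef.h>
#include <stdio.h>

/* Range endpoints, assumed to match the category numbering in utf8proc.h. */
enum { CAT_LU = 1, CAT_LO = 5, CAT_ND = 9, CAT_SC = 20, CAT_SO = 22, CAT_ZS = 23 };

/* Nonzero if cat falls in one of the three ranges tested by the check. */
int is_symbol_like(int cat) {
    return (cat >= CAT_LU && cat <= CAT_LO) ||
           (cat >= CAT_ND && cat <= CAT_SC) ||
           (cat >= CAT_SO && cat <= CAT_ZS);
}

/* Hypothetical table row: codepoint, category code, reported width. */
typedef struct {
    unsigned c;
    int cat;
    int w;
} entry_t;

/* Returns how many rows report zero width despite a symbol-like category. */
int count_zero_width_symbols(const entry_t *rows, size_t n) {
    int bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (rows[i].w == 0 && is_symbol_like(rows[i].cat)) {
            fprintf(stderr, "zero width for symbol-like char %x\n", rows[i].c);
            bad++;
        }
    }
    return bad;
}
```

Counting the offenders instead of setting a single error flag is just a convenience for the sketch; the original snippet only records that at least one was found.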
Just filed a Unifont bug: https://savannah.gnu.org/bugs/index.php?45395 |
Nice finding the bugs even in other people's software! |
We updated the data tables in #45, but I'm guessing it doesn't have charwidths for many of the new codepoints. Probably we need to wait for a new version of GNU Unifont for up-to-date charwidths, but at the very least we probably shouldn't default to zero for codepoints in letter-like categories.
e.g. this doesn't seem good: