Update charwidths for Unicode 8 #46

Closed
stevengj opened this issue Jun 24, 2015 · 8 comments

Comments

@stevengj
Member

We updated the data tables in #45, but I'm guessing they don't have charwidths for many of the new codepoints. We probably need to wait for a new version of GNU Unifont to get up-to-date charwidths, but at the very least we shouldn't default to zero for codepoints in letter-like categories.

e.g. this doesn't seem good:

$ test/printproperty 0x14400
U+0x14400:
  category = Lo
  combining_class = 0
  bidi_class = 1
  decomp_type = 0
  uppercase_mapping = ffffffff
  lowercase_mapping = ffffffff
  titlecase_mapping = ffffffff
  comb1st_index = -1
  comb2nd_index = -1
  bidi_mirrored = 0
  comp_exclusion = 0
  ignorable = 0
  control_boundary = 0
  boundclass = 1
  charwidth = 0

@jiahao
Collaborator

jiahao commented Jun 24, 2015

Perhaps the easiest thing to do is to define the character widths on a per-block basis.

For example, all the Anatolian Hieroglyphs could be assigned a width of 2 based on inspecting the representative glyphs in the code charts. http://www.unicode.org/charts/PDF/Unicode-8.0/U80-14400.pdf
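
A minimal sketch of what a per-block width table could look like in C. The `block_width` struct, the `block_widths` table, and the `block_charwidth` function are hypothetical names made up for illustration and are not part of utf8proc. The only entry is the Anatolian Hieroglyphs block (U+14400 to U+1467F) with the width of 2 suggested above; any other codepoint falls through to whatever fallback the caller supplies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block width entry; not part of utf8proc. */
typedef struct {
    uint32_t first; /* first codepoint of the block (inclusive) */
    uint32_t last;  /* last codepoint of the block (inclusive) */
    int width;      /* width assigned to every codepoint in the block */
} block_width;

static const block_width block_widths[] = {
    /* Anatolian Hieroglyphs: width 2, judged from the code chart glyphs */
    { 0x14400, 0x1467F, 2 },
};

/* Return the per-block width for cp, or fallback if no block covers it. */
int block_charwidth(uint32_t cp, int fallback)
{
    size_t n = sizeof block_widths / sizeof block_widths[0];
    assert(cp <= 0x10FFFF); /* only valid codepoints are expected here */
    for (size_t i = 0; i < n; i++) {
        if (cp >= block_widths[i].first && cp <= block_widths[i].last)
            return block_widths[i].width;
    }
    return fallback;
}
```

With this table, `block_charwidth(0x14400, 0)` yields 2, while a codepoint outside every listed block, such as `block_charwidth(0x41, 1)`, yields the fallback. A linear scan is enough for a handful of blocks; a larger table would probably want a sorted array and binary search.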

@stevengj
Member Author

I was just thinking of something simpler:

#############################################################################   
# Use a default width of 1 for all character categories that are                
# letter/symbol/number-like.  This can be overridden by Unifont or UAX 11
# below, but provides a useful nonzero fallback for new codepoints when         
# a new Unicode version has been released but Unifont hasn't been updated yet.  

zerowidth = Set{Int}() # categories that may contain zero-width chars           
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CN)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_MN)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_MC)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ME)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_SK)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ZS)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ZL)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ZP)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CC)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CF)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CS)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CO)
for c in 0x0000:0x110000
    if catcode(c) ∉ zerowidth
        CharWidths[c] = 1
    end
end

Already with this I noticed something odd in our current charwidth data:

$ test/printproperty 0x1D56C
U+0x1D56C:
  category = Lu
  combining_class = 0
  bidi_class = 1
  decomp_type = 1
  uppercase_mapping = ffffffff
  lowercase_mapping = ffffffff
  titlecase_mapping = ffffffff
  comb1st_index = -1
  comb2nd_index = -1
  bidi_mirrored = 0
  comp_exclusion = 0
  ignorable = 0
  control_boundary = 0
  boundclass = 1
  charwidth = 0

Since https://codepoints.net/U+1D56C is from Unicode 3.1 and clearly has nonzero width, why are we reporting zero width? Does Unifont not have this codepoint?

@ScottPJones
Contributor

Shouldn't that be for c in 0x0:0x10ffff?

@stevengj
Member Author

It doesn't matter, because invalid codepoints such as 0x110000 return category Cn, which is in zerowidth.

@ScottPJones
Contributor

Yep, it's just my OCD acting up!

@stevengj
Member Author

Adding

    if (w == 0 &&
        ((cat >= UTF8PROC_CATEGORY_LU && cat <= UTF8PROC_CATEGORY_LO) ||
         (cat >= UTF8PROC_CATEGORY_ND && cat <= UTF8PROC_CATEGORY_SC) ||
         (cat >= UTF8PROC_CATEGORY_SO && cat <= UTF8PROC_CATEGORY_ZS))) {
        fprintf(stderr, "zero width for symbol-like char %x\n", c);
        error = 1;
    }

to the loop in test/charwidth.c turns up about 5700 "letter-like" codepoints, including a number from Unicode 7, for which we currently report zero width but which seem like they must have nonzero width in any font that actually supports them.
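
The range comparisons in that check only work because of the order of the category constants. As a self-contained sketch, the same condition can be written as a predicate over a local enum. The `CAT_*` names and both function names are made up for illustration, and the enum order is an assumption meant to mirror the order of utf8proc's `UTF8PROC_CATEGORY_*` constants; it should be verified against utf8proc.h before relying on it.

```c
#include <stdbool.h>

/* Local stand-in for the category constants. ASSUMPTION: this order
 * mirrors utf8proc's UTF8PROC_CATEGORY_* enum; check utf8proc.h. */
enum category {
    CAT_CN,
    CAT_LU, CAT_LL, CAT_LT, CAT_LM, CAT_LO,
    CAT_MN, CAT_MC, CAT_ME,
    CAT_ND, CAT_NL, CAT_NO,
    CAT_PC, CAT_PD, CAT_PS, CAT_PE, CAT_PI, CAT_PF, CAT_PO,
    CAT_SM, CAT_SC, CAT_SK, CAT_SO,
    CAT_ZS, CAT_ZL, CAT_ZP,
    CAT_CC, CAT_CF, CAT_CS, CAT_CO
};

/* True for the three category ranges tested in the check above:
 * letters (Lu..Lo), numbers through currency symbols (Nd..Sc),
 * and So..Zs. Marks, Sk, Zl, Zp, and all C* categories are excluded. */
bool expects_nonzero_width(enum category cat)
{
    return (cat >= CAT_LU && cat <= CAT_LO) ||
           (cat >= CAT_ND && cat <= CAT_SC) ||
           (cat >= CAT_SO && cat <= CAT_ZS);
}

/* True when a width of zero is reported for such a category. */
bool is_suspicious_zero_width(enum category cat, int width)
{
    return width == 0 && expects_nonzero_width(cat);
}
```

Under that assumed order, a zero width is flagged for `CAT_LO` (the category of U+14400 above) but not for `CAT_MN` or `CAT_CN`, where zero width is expected.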

@stevengj
Member Author

Just filed a Unifont bug: https://savannah.gnu.org/bugs/index.php?45395

@ScottPJones
Contributor

Nice finding the bugs even in other people's software!
