Update charwidths for Unicode 8 #46

stevengj · 2015-06-24T15:56:02Z

We updated the data tables in #45, but I'm guessing it doesn't have charwidths for many of the new codepoints. Probably we need to wait for a new version of GNU Unifont for up-to-date charwidths, but at the very least we probably shouldn't default to zero for codepoints in letter-like categories.

e.g. this doesn't seem good:

$ test/printproperty 0x14400
U+0x14400:
  category = Lo
  combining_class = 0
  bidi_class = 1
  decomp_type = 0
  uppercase_mapping = ffffffff
  lowercase_mapping = ffffffff
  titlecase_mapping = ffffffff
  comb1st_index = -1
  comb2nd_index = -1
  bidi_mirrored = 0
  comp_exclusion = 0
  ignorable = 0
  control_boundary = 0
  boundclass = 1
  charwidth = 0

jiahao · 2015-06-24T16:08:56Z

Perhaps the easiest thing to do is to define the character widths on a per-block basis.

For example, all the Anatolian Hieroglyphs could be assigned a width of 2 based on inspecting the representative glyphs in the code charts. http://www.unicode.org/charts/PDF/Unicode-8.0/U80-14400.pdf
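
As a rough illustration of the per-block idea, here is a minimal C sketch. The `block_width` struct, the `block_widths` table, and `block_charwidth` are hypothetical names invented for this example and are not part of utf8proc. The only block listed is Anatolian Hieroglyphs (U+14400..U+1467F), with the width of 2 suggested above; a real table would need an entry for each block that was inspected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block override entry; not a utf8proc type. */
typedef struct {
    uint32_t first;  /* first codepoint of the block (inclusive) */
    uint32_t last;   /* last codepoint of the block (inclusive) */
    int width;       /* width assigned to every codepoint in the block */
} block_width;

static const block_width block_widths[] = {
    /* Anatolian Hieroglyphs; width 2 is the value proposed in this comment */
    { 0x14400, 0x1467F, 2 },
};

/* Returns the block-level width for c, or -1 if no block entry covers c. */
int block_charwidth(uint32_t c) {
    size_t n = sizeof(block_widths) / sizeof(block_widths[0]);
    for (size_t i = 0; i < n; i++) {
        if (c >= block_widths[i].first && c <= block_widths[i].last)
            return block_widths[i].width;
    }
    return -1;
}
```

A caller would consult this table only when the font-derived data has no width for the codepoint, and treat -1 as "no opinion".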

stevengj · 2015-06-24T16:28:54Z

I was just thinking of something simpler:

#############################################################################
# Use a default width of 1 for all character categories that are
# letter/symbol/number-like.  This can be overridden by Unifont or UAX 11
# below, but provides a useful nonzero fallback for new codepoints when
# a new Unicode version has been released but Unifont hasn't been updated yet.

zerowidth = Set{Int}() # categories that may contain zero-width chars
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CN)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_MN)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_MC)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ME)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_SK)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ZS)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ZL)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_ZP)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CC)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CF)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CS)
push!(zerowidth, UTF8proc.UTF8PROC_CATEGORY_CO)
for c in 0x0000:0x110000
    if catcode(c) ∉ zerowidth
        CharWidths[c] = 1
    end
end
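
The layering described in the comment block (a category-based fallback that font data may later override) could look roughly like this in C. This is a sketch under assumptions: the `category` enum is a local stand-in rather than utf8proc's own constants, the three function names are hypothetical, and it assumes the width is 0 when nothing else sets it.

```c
#include <stdbool.h>

/* Local stand-in for the Unicode general categories (assumption for this
   sketch; these are not the UTF8PROC_CATEGORY_* constants). */
typedef enum {
    CAT_CN, CAT_LU, CAT_LL, CAT_LT, CAT_LM, CAT_LO,
    CAT_MN, CAT_MC, CAT_ME,
    CAT_ND, CAT_NL, CAT_NO,
    CAT_PC, CAT_PD, CAT_PS, CAT_PE, CAT_PI, CAT_PF, CAT_PO,
    CAT_SM, CAT_SC, CAT_SK, CAT_SO,
    CAT_ZS, CAT_ZL, CAT_ZP,
    CAT_CC, CAT_CF, CAT_CS, CAT_CO
} category;

/* True for the twelve categories pushed into the zerowidth set above. */
bool may_be_zero_width(category cat) {
    switch (cat) {
    case CAT_CN: case CAT_MN: case CAT_MC: case CAT_ME:
    case CAT_SK: case CAT_ZS: case CAT_ZL: case CAT_ZP:
    case CAT_CC: case CAT_CF: case CAT_CS: case CAT_CO:
        return true;
    default:
        return false;
    }
}

/* Category-based fallback: 1 for letter/symbol/number-like categories,
   0 for categories that may contain zero-width characters. */
int fallback_width(category cat) {
    return may_be_zero_width(cat) ? 0 : 1;
}

/* A width taken from font data, when available, overrides the fallback. */
int resolve_width(category cat, bool has_font_width, int font_width) {
    return has_font_width ? font_width : fallback_width(cat);
}
```

With this shape, a codepoint in category Lo that the font data does not cover gets width 1 instead of 0, while any width the font data does supply still wins.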

Already with this I noticed something odd in our current charwidth data:

$ test/printproperty 0x1D56C
U+0x1D56C:
  category = Lu
  combining_class = 0
  bidi_class = 1
  decomp_type = 1
  uppercase_mapping = ffffffff
  lowercase_mapping = ffffffff
  titlecase_mapping = ffffffff
  comb1st_index = -1
  comb2nd_index = -1
  bidi_mirrored = 0
  comp_exclusion = 0
  ignorable = 0
  control_boundary = 0
  boundclass = 1
  charwidth = 0

Since https://codepoints.net/U+1D56C is from Unicode 3.1 and clearly has nonzero width, why are we reporting zero width? Does Unifont not have this codepoint?

ScottPJones · 2015-06-24T16:31:07Z

Shouldn't that be for c in 0x0:0x10ffff?

stevengj · 2015-06-24T16:32:13Z

It doesn't matter, because invalid codepoints return category Cn.

ScottPJones · 2015-06-24T16:42:54Z

Yep, it's just my OCD acting up!

stevengj · 2015-06-24T16:53:23Z

Adding

    if (w == 0 &&
        ((cat >= UTF8PROC_CATEGORY_LU && cat <= UTF8PROC_CATEGORY_LO) ||
         (cat >= UTF8PROC_CATEGORY_ND && cat <= UTF8PROC_CATEGORY_SC) ||
         (cat >= UTF8PROC_CATEGORY_SO && cat <= UTF8PROC_CATEGORY_ZS))) {
        fprintf(stderr, "zero width for symbol-like char %x\n", c);
        error = 1;
    }

to the loop in test/charwidth.c turns up a lot (~5700) of "letter-like" codepoints, including a number from Unicode 7, for which we currently report zero width even though they must have nonzero width in any font that actually supports them.

stevengj · 2015-06-24T17:58:32Z

Just filed a Unifont bug: https://savannah.gnu.org/bugs/index.php?45395

ScottPJones · 2015-06-24T18:13:27Z

Nice finding the bugs even in other people's software!

stevengj mentioned this issue Jun 24, 2015

fix #46 (make sure symbol-like codepoints have nonzero width) #47

Merged

stevengj closed this as completed in 6a7f92d Jun 26, 2015
