`split_by_number` is really nice for ensuring models learn to treat each digit separately, but the current implementation only applies to Western Arabic and Full-Width digits:
0123456789
０１２３４５６７８９
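It'd be nice to include anything in the Unicode General Category Nd (Number, decimal digit), so that it also covers Eastern Arabic numerals and the decimal digits of other scripts. Chinese numerals would be nice to cover too, though as far as I can tell they fall outside Nd (most are category Lo, and 〇 is Nl), so they'd need separate handling: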
۰۱۲۳۴۵۶۷۸۹
零一二三四五六七八九十百千萬億
〇一二三四五六七八九十百千万亿
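For reference, a quick way to see which of the characters above actually land in Nd is Python's standard `unicodedata` module. This is only a sanity check, not how the library classifies characters; the `samples` mapping and the `general_category` helper are illustrative names.

```python
import unicodedata


def general_category(ch: str) -> str:
    """Return the two-letter Unicode General Category of one character."""
    return unicodedata.category(ch)


# Illustrative samples taken from the lists above (labels are hypothetical).
samples = {
    "western": "5",         # U+0035
    "fullwidth": "５",      # U+FF15
    "eastern_arabic": "۵",  # U+06F5
    "cjk_five": "五",       # U+4E94
    "cjk_zero": "〇",       # U+3007
}

for label, ch in samples.items():
    print(f"{label}: U+{ord(ch):04X} -> {general_category(ch)}")
```

On the Python builds I'm aware of, the first three report Nd, while 五 reports Lo and 〇 reports Nl, so an Nd-only rule would not pick up the Chinese numerals.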
We could also cover the whole category N, which would also include non-digit numeric characters like ¹ (SUPERSCRIPT ONE) or Ⅹ (ROMAN NUMERAL TEN, not the letter X), but I'm not sure how intuitive it is to include those, or how often it even matters.
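To make the Nd versus N trade-off concrete, here is a minimal sketch of a splitter with a pluggable predicate. It is not the library's implementation; `split_numeric_chars`, `is_decimal_digit`, and `is_any_number` are hypothetical names, and the only real API used is `unicodedata.category`.

```python
import unicodedata
from typing import Callable, List


def is_decimal_digit(ch: str) -> bool:
    # Narrow rule: General Category Nd only.
    return unicodedata.category(ch) == "Nd"


def is_any_number(ch: str) -> bool:
    # Wide rule: the whole N category (Nd, Nl, No).
    return unicodedata.category(ch).startswith("N")


def split_numeric_chars(
    text: str, is_digit: Callable[[str], bool] = is_decimal_digit
) -> List[str]:
    """Emit every character matched by `is_digit` as its own piece;
    runs of other characters stay together."""
    pieces: List[str] = []
    run: List[str] = []
    for ch in text:
        if is_digit(ch):
            if run:
                pieces.append("".join(run))
                run = []
            pieces.append(ch)
        else:
            run.append(ch)
    if run:
        pieces.append("".join(run))
    return pieces


print(split_numeric_chars("id۱۲x"))
print(split_numeric_chars("x¹Ⅹ"))
print(split_numeric_chars("x¹Ⅹ", is_any_number))
```

With the Nd predicate, `¹` and `Ⅹ` stay attached to the surrounding text; with the N predicate they each become their own piece, which is the behavior that may or may not be intuitive.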