BERT wordpiece tokenizer differs from official HF implementation #5496

Closed
cebtenzzre opened this issue Feb 14, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@cebtenzzre
Collaborator

cebtenzzre commented Feb 14, 2024

Our wordpiece tokenizer has issues with Unicode. One of the problems is incomplete NFD normalization, which causes many accented characters to be dropped entirely during tokenization. Examples include Kantō -> Kant and lǜshi -> lshi.
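To make the two examples concrete, here is a minimal sketch (not taken from llama.cpp or tokenizers) that uses Python's standard `unicodedata` module to list the code points each word decomposes into under NFD. The helper name `describe` is made up for illustration, and the results depend on the Unicode database bundled with the Python build.

```python
import unicodedata

def describe(text):
    """Return (char, codepoint, general category) for each code point after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    return [(ch, f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in decomposed]

# Escapes are used so the precomposed forms survive copy/paste:
# U+014D is "o with macron", U+01DC is "u with diaeresis and grave".
for word in ["Kant\u014d", "l\u01dcshi"]:
    print(word)
    for ch, codepoint, category in describe(word):
        print(f"  {codepoint} {category}")
```

Each accented letter is a single code point before normalization, so a tokenizer that cannot decompose it and simply discards it loses the base letter along with the accent. After NFD, the base letter is a separate code point and only the combining marks (category `Mn`) need to be removed.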

Here is the diff for nomic-embed-text-v1 on wikitext.test.raw:

Diff
--- good_tokens.txt	2024-02-14 14:57:16.501519622 -0500
+++ lcpp_tokens.txt	2024-02-14 14:57:16.832522224 -0500
@@ -3665,8 +3665,8 @@
 1006: (
 100: [UNK]
 1790: 史
-11895: shi
-11895: shi
+14021: sh
+14021: sh
 1007: )
 1012: .
 1996: the
@@ -3848,7 +3848,7 @@
 1006: (
 100: [UNK]
 100: [UNK]
-11895: shi
+14021: sh
 25981: sheng
 1007: )
 1010: ,
@@ -4490,8 +4490,8 @@
 2124: known
 2005: for
 2010: his
-16299: lush
-2072: ##i
+1048: l
+6182: ##shi
 1010: ,
 1037: a
 2828: type
@@ -4538,8 +4538,8 @@
 1012: .
 2010: his
 2190: best
-16299: lush
-2072: ##i
+1048: l
+6182: ##shi
 2224: use
 1996: the
 5903: parallel
@@ -5255,8 +5255,8 @@
 1999: in
 17903: transforming
 1996: the
-16299: lush
-2072: ##i
+1048: l
+6182: ##shi
 2013: from
 8210: mere
 2773: word
@@ -5400,7 +5400,6 @@
 22281: mats
 19098: ##uo
 24234: bash
-2080: ##o
 1010: ,
 1996: the
 2200: very
@@ -5483,10 +5482,9 @@
 2004: as
 25277: bunk
 2050: ##a
-18454: shu
+14021: sh
 2890: ##re
 4509: ##ish
-2226: ##u
 1999: in
 1996: the
 6280: 9th
@@ -5570,11 +5568,11 @@
 18952: sai
 6806: ##ho
 22332: ##kus
-6979: ##hu
+2232: ##h
 1012: .
 2010: his
 3076: student
-14684: chu
+10381: ch
 5289: ##gan
 25540: eng
 8454: ##ets
@@ -5598,14 +5596,14 @@
 18443: preface
 2015: ##s
 1012: .
-14684: chu
+10381: ch
 5289: ##gan
 1005: '
 1055: s
 3076: student
 21025: gi
-3527: ##do
-18454: shu
+2094: ##d
+14021: sh
 17426: ##shin
 2018: had
 2485: close
@@ -5637,7 +5635,7 @@
 2028: one
 2154: day
 9152: ni
-5558: ##jo
+3501: ##j
 10930: yo
 6182: ##shi
 15319: ##moto
@@ -5661,7 +5659,7 @@
 1010: ,
 2356: asked
 21025: gi
-3527: ##do
+2094: ##d
 1010: ,
 1000: "
 2323: should
@@ -5678,7 +5676,7 @@
 1029: ?
 1000: "
 21025: gi
-3527: ##do
+2094: ##d
 15048: dared
 2000: to
 7514: reply
@@ -5772,7 +5770,6 @@
 2386: ##man
 1010: ,
 24234: bash
-2080: ##o
 1010: ,
 1998: and
 18454: shu
@@ -5817,8 +5814,8 @@
 11865: fu
 1005: '
 1055: s
-16299: lush
-2072: ##i
+1048: l
+6182: ##shi
 1006: (
 100: [UNK]
 100: [UNK]
@@ -5847,7 +5844,7 @@
 14483: ##cian
 5784: scholars
 1998: and
-16480: cho
+10381: ch
 11483: ##nin
 1006: (
 27938: townspeople
@@ -5892,12 +5889,12 @@
 4261: 37
 1997: of
 11721: ga
-6806: ##ho
+2232: ##h
 21122: bun
-14235: ##shu
+4095: ##sh
 2008: that
 1062: z
-14428: ##ime
+2213: ##m
 2072: ##i
 1031: [
 4241: du
@@ -5945,7 +5942,6 @@
 22281: mats
 19098: ##uo
 24234: bash
-2080: ##o
 1010: ,
 1996: the
 4602: greatest
@@ -6151,8 +6147,8 @@
 7893: verse
 1010: ,
 2030: or
-16299: lush
-2072: ##i
+1048: l
+6182: ##shi
 1007: )
 1010: ,
 1998: and
@@ -9195,7 +9191,7 @@
 1996: the
 11003: preceding
 11865: fu
-6499: ##so
+2015: ##s
 2465: class
 1010: ,
 2027: they
@@ -9216,7 +9212,6 @@
 1996: the
 2307: great
 26044: kant
-2080: ##o
 8372: earthquake
 1999: in
 4927: 1923
@@ -9490,7 +9485,7 @@
 1997: of
 1996: the
 11865: fu
-6499: ##so
+2015: ##s
 1030: @
 1011: -
 1030: @
@@ -9614,10 +9609,10 @@
 2142: united
 2163: states
 1012: .
-20251: sato
+2938: sat
 8915: te
 10422: ##tsu
-28160: ##taro
+7559: ##tar
 1010: ,
 1037: a
 2887: japanese
@@ -9676,7 +9671,7 @@
 2023: this
 6463: ratio
 1010: ,
-20251: sato
+2938: sat
 14833: theo
 18425: ##rized
 1010: ,
@@ -10020,7 +10015,7 @@
 1997: of
 1996: the
 11865: fu
-6499: ##so
+2015: ##s
 2465: class
 2020: were
 4821: ultimately
@@ -10032,7 +10027,7 @@
 2093: three
 2062: more
 11865: fu
-6499: ##so
+2015: ##s
 1030: @
 1011: -
 1030: @
@@ -10048,7 +10043,7 @@
 1010: ,
 1998: and
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 1007: )
 2020: were
@@ -10089,7 +10084,7 @@
 2063: ##e
 1998: and
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 2127: until
 1996: the
@@ -10115,7 +10110,7 @@
 5082: progress
 1997: of
 11865: fu
-6499: ##so
+2015: ##s
 1005: '
 1055: s
 2810: construction
@@ -10145,7 +10140,7 @@
 4757: ##ss
 1996: the
 11865: fu
-6499: ##so
+2015: ##s
 1030: @
 1011: -
 1030: @
@@ -10251,7 +10246,7 @@
 1999: in
 1996: the
 11865: fu
-6499: ##so
+2015: ##s
 2465: class
 1998: and
 3041: earlier
@@ -10442,7 +10437,7 @@
 1999: in
 1996: the
 11865: fu
-6499: ##so
+2015: ##s
 2465: class
 1012: .
 2023: this
@@ -10514,7 +10509,7 @@
 1997: of
 1996: the
 11865: fu
-6499: ##so
+2015: ##s
 2465: class
 2008: that
 2009: it
@@ -10953,7 +10948,7 @@
 1007: )
 1006: (
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 1998: and
 2003: is
@@ -10969,7 +10964,7 @@
 27829: kam
 26029: ##pon
 20996: ro
-2175: go
+1043: g
 2300: water
 1030: @
 1011: -
@@ -11082,7 +11077,7 @@
 1007: )
 1998: and
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 14872: exceeded
 2008: that
@@ -11224,7 +11219,7 @@
 2063: ##e
 1998: and
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 2018: had
 2093: three
@@ -13815,7 +13810,7 @@
 7584: conversion
 2138: because
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 2018: had
 4265: suffered
@@ -13869,7 +13864,8 @@
 1012: .
 1996: the
 11865: fu
-17063: ##sos
+2015: ##s
+2015: ##s
 2020: were
 5115: scheduled
 2000: to
@@ -15034,7 +15030,7 @@
 4170: fleet
 1012: .
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 2018: had
 2019: an
@@ -15148,7 +15144,6 @@
 4927: 1923
 2307: great
 26044: kant
-2080: ##o
 8372: earthquake
 4930: struck
 1010: ,
@@ -15206,7 +15201,7 @@
 4739: 1931
 1998: and
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 1005: '
 1055: s
@@ -15269,7 +15264,7 @@
 2257: august
 4347: 1937
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 10768: fe
 22155: ##rrie
@@ -15405,8 +15400,8 @@
 1996: the
 2422: light
 6839: carrier
-7570: ho
-22231: ##sho
+1044: h
+4095: ##sh
 2004: as
 6802: distant
 3104: cover
@@ -15431,7 +15426,7 @@
 2063: ##e
 1998: and
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 4066: sort
 6340: ##ied
@@ -15503,7 +15498,7 @@
 3282: gun
 1997: of
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 1005: '
 1055: s
@@ -15615,7 +15610,7 @@
 1030: @
 5902: admiral
 11895: shi
-3217: ##ro
+2099: ##r
 27006: tak
 3022: ##as
 2226: ##u
@@ -15732,7 +15727,7 @@
 3826: 1943
 1998: and
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 2012: at
 21871: sas
@@ -15797,7 +15792,7 @@
 4397: newly
 2949: completed
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 1996: the
 2206: following
@@ -16025,7 +16020,8 @@
 5902: admiral
 10147: ji
 3736: ##sa
-23670: ##buro
+8569: ##bu
+2099: ##r
 11472: oz
 10830: ##awa
 1998: and
@@ -16304,7 +16300,7 @@
 1016: 2
 1012: .
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 2001: was
 8217: lightly
@@ -16495,7 +16491,7 @@
 2886: attack
 1012: .
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 2001: was
 11551: unsuccessfully
@@ -16583,8 +16579,7 @@
 2005: for
 25933: ama
 4328: ##mi
-9808: os
-16369: ##hima
+24772: ##shima
 1012: .
 2043: when
 2027: they
@@ -16598,7 +16593,7 @@
 4015: transferred
 2000: to
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 1998: and
 27269: hoisted
@@ -16703,7 +16698,7 @@
 27053: indochina
 1998: and
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 2150: became
 10565: flagship
@@ -16739,7 +16734,6 @@
 1996: the
 2422: light
 10844: cruiser
-1051: o
 7677: ##yo
 3527: ##do
 2006: on
@@ -16870,7 +16864,6 @@
 2020: were
 13127: escorted
 2011: by
-1051: o
 7677: ##yo
 3527: ##do
 1998: and
@@ -16984,7 +16977,7 @@
 5388: 58
 1998: and
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 2001: was
 2718: hit
@@ -17126,7 +17119,7 @@
 14107: pumping
 1012: .
 1044: h
-10513: ##yu
+2100: ##y
 3654: ##ga
 2001: was
 1037: a
@@ -43604,7 +43597,6 @@
 1996: the
 2887: japanese
 4290: kong
-2080: ##o
 1998: and
 7632: hi
 7416: ##ei
@@ -47216,7 +47208,7 @@
 3212: navy
 1012: .
 1996: the
-12849: ko
+1047: k
 22513: ##tet
 6342: ##su
 1006: (
@@ -53282,7 +53274,6 @@
 2013: from
 3306: greek
 1174: τ
-29723: ##ε
 29728: ##μ
 16177: ##ν
 29723: ##ε
@@ -53302,7 +53293,6 @@
 1998: and
 1173: σ
 29731: ##π
-29730: ##ο
 16177: ##ν
 29722: ##δ
 29735: ##υ
@@ -100436,7 +100426,6 @@
 1999: in
 1007: )
 1999: in
-1051: o
 6590: ##ita
 7498: prefecture
 1010: ,
@@ -101378,7 +101367,8 @@
 2001: was
 5409: worst
 1999: in
-27603: kochi
+1047: k
+5428: ##chi
 1998: and
 2000: to
 24917: ##kushima
@@ -101796,7 +101786,7 @@
 19808: mina
 4328: ##mi
 21351: ##dai
-3406: ##to
+2102: ##t
 1010: ,
 15052: okinawa
 1012: .
@@ -103992,9 +103982,9 @@
 2345: final
 21042: landfall
 2379: near
-11503: cam
+1039: c
+2213: ##m
 6887: ph
-2050: ##a
 1010: ,
 5148: vietnam
 2006: on
@@ -105809,8 +105799,7 @@
 1998: and
 25933: ama
 4328: ##mi
-9808: os
-16369: ##hima
+24772: ##shima
 2006: on
 2244: september
 2539: 19
@@ -105976,7 +105965,7 @@
 1997: of
 5292: ha
 5428: ##chi
-5558: ##jo
+3501: ##j
 1030: @
 1011: -
 1030: @
@@ -106013,7 +106002,7 @@
 5601: mph
 1007: )
 2012: at
-16480: cho
+10381: ch
 6182: ##shi
 1010: ,
 27368: chiba
@@ -106092,8 +106081,7 @@
 1999: in
 25933: ama
 4328: ##mi
-9808: os
-16369: ##hima
+24772: ##shima
 1010: ,
 1996: the
 4040: storm
@@ -106111,7 +106099,7 @@
 2006: on
 5292: ha
 5428: ##chi
-5558: ##jo
+3501: ##j
 1010: ,
 3612: wind
 26903: gust
@@ -108909,7 +108897,8 @@
 1997: of
 10101: rainfall
 1999: in
-27603: kochi
+1047: k
+5428: ##chi
 1010: ,
 2096: while
 2844: strong
@@ -132935,12 +132924,11 @@
 4351: designated
 1062: z
 2072: ##i
-29731: ##π
 2349: due
 2000: to
 1996: the
 1000: "
-1170: π
+100: [UNK]
 1000: "
 19587: topology
 1012: .
@@ -132953,7 +132941,7 @@
 1000: "
 2030: or
 1000: "
-1170: π
+100: [UNK]
 1000: "
 2930: section
 2003: is
@@ -133054,7 +133042,6 @@
 3372: ##nt
 1062: z
 2072: ##i
-29731: ##π
 1012: .
 2045: there
 2024: are
@@ -133716,7 +133703,7 @@
 1047: k
 1027: =
 1015: 1
-1179: ω
+100: [UNK]
 1012: .
 2023: this
 2003: is
@@ -133882,17 +133869,15 @@
 2003: is
 2170: called
 1037: a
-1170: π
+100: [UNK]
 2930: section
 1012: .
 2073: where
 1062: z
 2072: ##i
-29731: ##π
 5344: faces
 1062: z
 2072: ##i
-29731: ##π
 1996: the
 2930: section
 2061: so
@@ -147284,17 +147269,11 @@
 1693: ア
 30221: ##イ
 30257: ##ラ
-30246: ##フ
-30240: ##ト
 30241: ##ナ
 30259: ##ル
-30240: ##ト
-30235: ##タ
 30237: ##ッ
 30228: ##ク
-1702: ク
 30259: ##ル
-30232: ##シ
 30219: ##ア
 1909: 王
 1671: の
@@ -147314,10 +147293,10 @@
 2226: ##u
 11972: guru
 26541: ##jia
-1051: o
+100: [UNK]
 2053: no
 7632: hi
-6806: ##ho
+2232: ##h
 1007: )
 1010: ,
 2003: is
@@ -188846,7 +188825,8 @@
 1025: ;
 3763: latin
 1024: :
-19212: nero
+11265: ne
+2099: ##r
 25017: claudius
 11604: caesar
 11668: augustus
@@ -200556,7 +200536,9 @@
 29869: ##र
 29879: ##ो
 29863: ##न
+100: [UNK]
 1317: ग
+100: [UNK]
 1000: "
 2029: which
 2003: is
@@ -200633,7 +200615,11 @@
 29836: ##و
 29817: ##ت
 25573: ##ا
-100: [UNK]
+1282: س
+23673: ##ل
+29836: ##و
+15394: ##د
+29836: ##و
 23856: kota
 16183: sal
 6784: ##ud
@@ -226008,27 +225994,29 @@
 9973: pinyin
 1024: :
 2568: mind
-20391: ##ulu
+2140: ##l
 1025: ;
 21877: pe
+100: [UNK]
 1044: h
 1030: @
 1011: -
 1030: @
-1051: o
 2063: ##e
 1030: @
 1011: -
 1030: @
-10147: ji
+1046: j
 1024: :
 8026: bin
 1030: @
 1011: -
 1030: @
 2000: to
+100: [UNK]
 1011: -
 8840: lo
+100: [UNK]
 1007: )
 2003: is
 1037: a
@@ -233749,7 +233737,8 @@
 16107: ##nko
 10882: fi
 15000: ##lip
-9142: ##ovic
+4492: ##ov
+2072: ##i
 2165: took
 2058: over
 2004: as
@@ -233792,7 +233781,7 @@
 1997: of
 2175: go
 13102: ##sp
-2594: ##ic
+2072: ##i
 1998: and
 2379: near
 22889: sl
@@ -233869,7 +233858,7 @@
 21590: ##sko
 6819: vi
 6460: ##je
-3401: ##ce
+2063: ##e
 27885: ob
 18053: ##rane
 1516: –
@@ -234461,7 +234450,8 @@
 16107: ##nko
 10882: fi
 15000: ##lip
-9142: ##ovic
+4492: ##ov
+2072: ##i
 1010: ,
 10655: likewise
 1037: a
@@ -234636,7 +234626,8 @@
 1010: ,
 10882: fi
 15000: ##lip
-9142: ##ovic
+4492: ##ov
+2072: ##i
 2165: took
 2058: over
 3094: command
@@ -234694,7 +234685,7 @@
 2000: to
 2175: go
 13102: ##sp
-2594: ##ic
+2072: ##i
 1010: ,
 2073: where
 2009: it
@@ -234706,7 +234697,7 @@
 2491: control
 2175: go
 13102: ##sp
-2594: ##ic
+2072: ##i
 2114: against
 1996: the
 1046: j
@@ -234719,19 +234710,20 @@
 4123: battalion
 4110: captured
 22827: kan
-21335: ##iza
+2072: ##i
+2050: ##a
 10492: barracks
 1999: in
 2175: go
 13102: ##sp
-2594: ##ic
+2072: ##i
 1012: .
 2076: during
 4337: combat
 1999: in
 2175: go
 13102: ##sp
-2594: ##ic
+2072: ##i
 1010: ,
 2382: 30
 3629: troops
@@ -234744,8 +234736,8 @@
 1010: ,
 7197: assisted
 2011: by
-6735: luck
-2080: ##o
+11320: lu
+3683: ##ko
 11867: sp
 2226: ##u
 1010: ,
@@ -234756,7 +234748,7 @@
 2236: general
 19817: tr
 13006: ##aj
-3401: ##ce
+2063: ##e
 1047: k
 12096: ##rst
 6777: ##ev
@@ -234782,7 +234774,8 @@
 7333: deployed
 2000: to
 2777: met
-14733: ##kovic
+7724: ##kov
+2072: ##i
 2006: on
 2654: 28
 2255: october
@@ -234805,7 +234798,7 @@
 2000: to
 2175: go
 13102: ##sp
-2594: ##ic
+2072: ##i
 1010: ,
 1037: a
 2112: part
@@ -234887,7 +234880,7 @@
 14713: ##ija
 1058: v
 2721: ##la
-19053: ##cic
+2072: ##i
 4123: battalion
 2241: based
 1999: in
@@ -234948,7 +234941,7 @@
 21590: ##sko
 6819: vi
 6460: ##je
-3401: ##ce
+2063: ##e
 27885: ob
 18053: ##rane
 1516: –
@@ -235004,7 +234997,7 @@
 1996: the
 2181: area
 1997: of
-24053: ska
+2912: ##ka
 19892: ##br
 2078: ##n
 3900: ##ja
@@ -235067,20 +235060,20 @@
 7221: ban
 15333: je
 2721: ##la
-19053: ##cic
+2072: ##i
 4123: battalion
 1010: ,
 13523: mat
 14713: ##ija
 1058: v
 2721: ##la
-19053: ##cic
+2072: ##i
 4123: battalion
 1010: ,
 10768: fe
 20683: ##rdo
 10514: su
-19053: ##cic
+2072: ##i
 4123: battalion
 1998: and
 2112: part
@@ -254648,7 +254641,6 @@
 3747: influence
 1997: of
 2332: king
-1097: æ
 10760: ##the
 20850: ##lb
 19058: ##ald
@@ -257962,7 +257954,6 @@
 1999: in
 15522: cyrillic
 1024: :
-1194: п
 2080: ##o
 29742: ##д
 25529: ##в
cebtenzzre added the bug label on Feb 14, 2024
@hiepxanh

hiepxanh commented Feb 20, 2024

@cebtenzzre this looks like a duplicate of #3502?

There is a PR here: #4868
Another one, still waiting for a PR, is being drafted here: #5613 (comment)

@superchargez

So, are BERT-based models supported now?

@bobqianic
Contributor

We likely need to move all the tokenization-related code from llama.cpp to a separate file. Otherwise, llama.cpp will become too messy.

@cebtenzzre
Collaborator Author

@cebtenzzre this looks like a duplicate of #3502? #4868, waiting for PR #4868

Possibly related, but keep in mind that BERT uses an entirely separate tokenizer implementation (wordpiece "WPM") from all other models (SentencePiece "SPM" or GPT-2 "BPE").
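For readers unfamiliar with wordpiece, here is a minimal sketch of greedy longest-match-first splitting, the scheme BERT's wordpiece tokenizer is known for. It is not the llama.cpp or tokenizers code; the function name `wordpiece` and the tiny `TOY_VOCAB` are made up for illustration, with entries borrowed from tokens that appear in the diff above.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one pre-normalized word into pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real vocab has roughly 30k entries.
TOY_VOCAB = {"lush", "##i", "l", "##shi", "kant", "##o"}

print(wordpiece("lushi", TOY_VOCAB))  # word after correct accent stripping
print(wordpiece("lshi", TOY_VOCAB))   # word after the accented letter was dropped
```

Because the split is greedy, a single missing letter upstream changes every piece that follows it in the word, which is consistent with `lush` + `##i` turning into `l` + `##shi` in the diff.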

@iamlemec
Collaborator

Is the SPM preprocessor also replacing accented characters? Seems like we should be able to reuse bits from that. Btw, in case it's useful for folks, I made a little Python function that prints out a color-coded token diff between our results and those from Huggingface (it goes through llama-cpp-python):

https://gist.github.com/iamlemec/52eaa4961762efb9c064b871a67f6cc6

The biggest instance I'm finding there is with dash variants like emdash. But basically still a case of replacing certain complex characters with their base forms.
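For anyone who wants the same kind of comparison without the gist, a rough sketch using only the standard library is below. It is not the gist's code: it assumes the two token lists were already produced elsewhere (for example by Huggingface and by llama-cpp-python), it skips the color coding, and the name `token_diff` is made up for illustration.

```python
import difflib

def token_diff(reference, candidate):
    """Return (tag, reference_slice, candidate_slice) for every differing span."""
    matcher = difflib.SequenceMatcher(a=reference, b=candidate, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, reference[i1:i2], candidate[j1:j2]))
    return changes

# Token strings copied from hunks of the diff in the issue description.
reference = ["his", "lush", "##i", ",", "kant", "##o", "earthquake"]
candidate = ["his", "l", "##shi", ",", "kant", "earthquake"]

for tag, ref_span, cand_span in token_diff(reference, candidate):
    print(f"{tag}: {ref_span} -> {cand_span}")
```

Comparing token strings rather than raw text keeps the output aligned with the vocabulary, so a dropped accent shows up as a small `replace` or `delete` span instead of a shifted tail.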

@cebtenzzre
Collaborator Author

A comment regarding this issue from @apage43:

The tokenizers BERT normalizer's accent stripping is Unicode "NFD" normalization, which transforms any accented characters into their "canonical decomposition" (this is where the lookup table comes in; for tokenizers the table comes from here), i.e. the base char + accent codepoint form instead of the single-codepoint form, followed by stripping any accent ("non-spacing mark") characters (another table).
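The two steps described in that comment can be sketched in a few lines of Python. This is an illustration, not the tokenizers or llama.cpp implementation: it leans on the tables built into Python's `unicodedata` module (which may differ in Unicode version from the ones tokenizers uses), it leaves out the other steps a BERT normalizer typically performs such as lowercasing, and `strip_accents` is just an illustrative name.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """NFD-decompose, then drop every non-spacing mark (general category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)  # table 1: canonical decomposition
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn"          # table 2: general category
    )

# Words from the issue, written with escapes for the accented letters:
# U+014D (o with macron), U+01DC (u with diaeresis and grave), U+0107 (c with acute).
for word in ["Kant\u014d", "l\u01dcshi", "Filipovi\u0107"]:
    print(word, "->", strip_accents(word))
```

The base letters survive because NFD splits them out before anything is removed. A C++ port needs equivalents of both tables, since the standard library does not provide them.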

@iamlemec
Collaborator

That's very helpful @cebtenzzre! Opening a PR with this in a minute.

@hiepxanh

@cebtenzzre can you take a look at the new deploy to see the improvement? #5740

5 participants