
Chinese word split Error #104

Open
mxazz123 opened this issue Nov 1, 2024 · 1 comment

mxazz123 commented Nov 1, 2024

For a Chinese PDF, word splitting fails: the whole line is recognized as a single word by the highlight and word-copy buttons.
It would be great to upgrade the word-split sub-module to support Chinese word segmentation.

Owner

mkucej commented Nov 3, 2024

I agree. Unfortunately, we use Poppler's pdftotext utility to process PDFs, and it does not split CJK text into words properly. For instance, pdftotext -bbox produces this:

<doc>
  <page width="595.300000" height="841.900000">
    <word xMin="215.759000" yMin="82.460000" xMax="384.238700" yMax="124.250000">《论语》</word>
    <word xMin="277.440000" yMin="136.949750" xMax="322.549400" yMax="147.447000">作者:孔子</word>
    <word xMin="265.680000" yMin="168.149750" xMax="286.789400" yMax="178.647000">论语</word>
    <word xMin="292.080000" yMin="168.149750" xMax="334.308200" yMax="178.647000">论语序说</word>
    <word xMin="90.000000" yMin="204.443500" xMax="186.500000" yMax="220.463000">史记世家曰:</word>
    <word xMin="178.320000" yMin="204.443500" xMax="531.379700" yMax="220.463000">“孔子名丘,字仲尼。其先宋人。父叔梁纥,母颜</word>
    <word xMin="68.880000" yMin="235.643500" xMax="531.378200" yMax="251.663000">氏。以鲁襄公二十二年,庚戌之岁,十一月庚子,生孔子于鲁昌平</word>
    <word xMin="68.880000" yMin="266.843500" xMax="535.220000" yMax="282.863000">乡陬邑。为儿嬉戏,常陈俎豆,设礼容。及长,为委吏,料量平;</word>
    <word xMin="68.880000" yMin="298.043500" xMax="531.379700" yMax="314.063000">为司职吏,畜蕃息。适周,问礼于老子,既反,而弟子益进。昭公</word>
    <word xMin="68.880000" yMin="329.243500" xMax="531.380000" yMax="345.263000">二十五年甲申,孔子年三十五,而昭公奔齐,鲁乱。于是适齐,为</word>
    <word xMin="68.880000" yMin="360.443500" xMax="539.300000" yMax="376.463000">高昭子家臣,以通乎景公。公欲封以尼溪之田,晏婴不可,公惑之。</word>
    <word xMin="68.880000" yMin="391.643500" xMax="535.220000" yMax="407.663000">孔子遂行,反乎鲁。定公元年壬辰,孔子年四十三,而季氏强僭,</word>
    <word xMin="68.880000" yMin="422.843500" xMax="531.139400" yMax="438.863000">其臣阳虎作乱专政。故孔子不仕,而退修诗、书、礼、乐,弟子弥</word>
    <word xMin="68.880000" yMin="454.043500" xMax="531.379400" yMax="470.063000">众。九年庚子,孔子年五十一。公山不狃以费畔季氏,召,孔子欲</word>
    <word xMin="68.880000" yMin="485.243500" xMax="539.300000" yMax="501.263000">往,而卒不行。定公以孔子为中都宰,一年,四方则之,遂为司空,</word>
    <word xMin="68.880000" yMin="516.443500" xMax="531.380000" yMax="532.463000">又为大司寇。十年辛丑,相定公会齐侯于夹谷,齐人归鲁侵地。十</word>
    <word xMin="68.880000" yMin="547.643500" xMax="535.220000" yMax="563.663000">二年癸卯,使仲由为季氏宰,堕三都,收其甲兵。孟氏不肯堕成,</word>
    <word xMin="68.880000" yMin="578.843500" xMax="531.380000" yMax="594.863000">围之不克。十四年乙巳,孔子年五十六,摄行相事,诛少正卯,与</word>
    <word xMin="68.880000" yMin="610.043500" xMax="531.379700" yMax="626.063000">闻国政。三月,鲁国大治。齐人归女乐以沮之,季桓子受之。郊又</word>
    <word xMin="68.880000" yMin="641.243500" xMax="531.128950" yMax="657.263000">不致?俎于大夫,孔子行。适卫,主于子路妻兄颜浊邹家。适陈,</word>
    <word xMin="68.880000" yMin="672.443500" xMax="535.220000" yMax="688.463000">过匡,匡人以为阳虎而拘之。既解,还卫,主蘧伯玉家,见南子。</word>
    <word xMin="68.880000" yMin="703.643500" xMax="531.371900" yMax="719.663000">去适宋,司马桓?欲杀之。又去,适陈,主司城贞子家。居三岁而</word>
    <word xMin="68.880000" yMin="734.843500" xMax="531.375500" yMax="750.863000">反于卫,灵公不能用。晋赵氏家臣佛?以中牟畔,召孔子,孔子欲</word>
  </page>
</doc>

We use these coordinates to generate annotation boxes, which is why an annotation often covers the whole line for CJK text. This issue will have to be solved upstream, or a different library will have to be used.
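
For illustration only, here is a rough sketch of what a post-processing workaround could look like. It is not what the project currently does. It reads the word elements from output shaped like the fragment above and, for any box whose text contains CJK characters, divides the box evenly into one box per character. The function names (is_cjk, split_cjk_word, split_bbox_xml) are hypothetical, and the equal-width split is an assumption that only roughly holds for full-width glyphs.

```python
import xml.etree.ElementTree as ET


def is_cjk(ch):
    """Return True for characters in a few common CJK Unicode blocks."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3000 <= cp <= 0x303F   # CJK Symbols and Punctuation
        or 0xFF00 <= cp <= 0xFFEF   # Halfwidth and Fullwidth Forms
    )


def split_cjk_word(text, x_min, y_min, x_max, y_max):
    """Split one box into equal-width per-character boxes (an approximation).

    Boxes without any CJK character are returned unchanged.
    """
    if not text or not any(is_cjk(ch) for ch in text):
        return [(text, x_min, y_min, x_max, y_max)]
    step = (x_max - x_min) / len(text)
    return [
        (ch, x_min + i * step, y_min, x_min + (i + 1) * step, y_max)
        for i, ch in enumerate(text)
    ]


def split_bbox_xml(xml_text):
    """Collect (text, xMin, yMin, xMax, yMax) tuples from bbox-style XML."""
    root = ET.fromstring(xml_text)
    boxes = []
    for el in root.iter():
        # Compare the local tag name so a namespaced <word> also matches.
        if el.tag.rsplit("}", 1)[-1] != "word":
            continue
        boxes.extend(
            split_cjk_word(
                el.text or "",
                float(el.get("xMin")),
                float(el.get("yMin")),
                float(el.get("xMax")),
                float(el.get("yMax")),
            )
        )
    return boxes


if __name__ == "__main__":
    sample = (
        '<doc><page width="595.3" height="841.9">'
        '<word xMin="265.68" yMin="168.15" xMax="286.79" yMax="178.65">论语</word>'
        "</page></doc>"
    )
    for box in split_bbox_xml(sample):
        print(box)
```

This yields per-character boxes, not real words: grouping characters into words would still need a Chinese word segmenter, and the boxes will drift wherever a line mixes full-width glyphs with narrower punctuation or Latin text.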

@mkucej mkucej self-assigned this Nov 3, 2024
@mkucej mkucej added the enhancement New feature or request label Nov 3, 2024