
Resource and Tool for Writing System Identification -- LREC 2024

cisnlp/GlotScript

GlotScript

  • GlotScript-Resource: a resource listing the writing systems used for each language.

  • GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.

Resource

What writing system is each language written in?

Example:

Language          CORE  AUXILIARY
Turkish (tur)     Latn  Arab, Cyrl, Grek
Thai (tha)        Thai  Latn
Vietnamese (vie)  Latn  Hani

See the metadata folder for more languages.
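
One way to use this table is to check whether a script found in a text is expected for its language. The sketch below hard-codes only the three example rows shown above. The dictionary layout and the `script_status` helper are illustrative assumptions, not part of GlotScript, and the files in the metadata folder may be organised differently.

```python
# Illustrative only: the three example rows from the table above.
# EXAMPLE_SCRIPTS and script_status are hypothetical names, not GlotScript API.
EXAMPLE_SCRIPTS = {
    "tur": {"core": {"Latn"}, "auxiliary": {"Arab", "Cyrl", "Grek"}},
    "tha": {"core": {"Thai"}, "auxiliary": {"Latn"}},
    "vie": {"core": {"Latn"}, "auxiliary": {"Hani"}},
}


def script_status(lang: str, script: str) -> str:
    """Return 'core', 'auxiliary', or 'unexpected' for a language/script pair."""
    entry = EXAMPLE_SCRIPTS.get(lang)
    if entry is None:
        raise KeyError(f"no script information for language {lang!r}")
    if script in entry["core"]:
        return "core"
    if script in entry["auxiliary"]:
        return "auxiliary"
    return "unexpected"


print(script_status("tur", "Latn"))  # core
print(script_status("vie", "Hani"))  # auxiliary
print(script_status("tha", "Cyrl"))  # unexpected
```

Keeping core and auxiliary scripts in separate sets lets a caller decide how strict to be, for example accepting only core scripts when filtering a corpus.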

Tool

GlotScript-Tool is a Python library that detects the script (writing system) of a text, reported as an ISO 15924 code.

Special codes

  • Zinh is the Unicode script property value ("Inherited") of characters that may be used with multiple scripts and that inherit their script from the preceding base character. In some cases, we opted to integrate parts of the Zinh range (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.
  • Zyyy is the Unicode script property value for "Common" characters.
  • Zzzz is the code for characters whose script is "uncoded".
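
To make the Zinh example concrete, the sketch below inspects the characters named above with Python's standard `unicodedata` module. That module does not expose the Unicode script property, so this only shows their general category; it illustrates that they are nonspacing marks attached to a preceding base character, and says nothing about how GlotScript assigns them internally.

```python
import unicodedata

# The characters named in the Zinh bullet, looked up by their Unicode names.
first = unicodedata.lookup("ARABIC FATHATAN")
last = unicodedata.lookup("ARABIC HAMZA BELOW")
superscript_alef = unicodedata.lookup("ARABIC LETTER SUPERSCRIPT ALEF")

# The contiguous range FATHATAN..HAMZA BELOW plus the single extra character.
marks = [chr(cp) for cp in range(ord(first), ord(last) + 1)] + [superscript_alef]

for ch in marks:
    # General category 'Mn' means "Mark, nonspacing".
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} {unicodedata.category(ch)}")

all_nonspacing = all(unicodedata.category(ch) == "Mn" for ch in marks)
print(all_nonspacing)
```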

Install

From pip

pip3 install GlotScript

From git

pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript

Usage

Script Detection

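from GlotScript import sp

# sp returns a tuple: (main script, share of that script, details)
sp('これは日本人です')
# >> ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})

# Slice off the details to keep only the script and its share
sp('This is Latin')[:1]
# >> ('Latn', 1.0)

# Index 0 is the main script code
sp('මේක සිංහල')[0]
# >> 'Sinh'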
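
A common follow-up is to decide whether a text is in an acceptable script. The sketch below works on a result tuple copied from the first example above, so it runs without GlotScript installed; in practice the tuple would come from `sp(text)`. The helpers `accept_script` and `allowed_share` are hypothetical, and the code assumes that `tie` marks a tie for first place.

```python
# Sample result copied from the example above; normally this is sp(text).
# accept_script and allowed_share are hypothetical helpers, not GlotScript API.
sample = ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375},
                          'tie': False, 'interval': 0.25})


def accept_script(result, allowed, min_share=0.5):
    """True if the top script is allowed, not tied, and frequent enough."""
    script, share, info = result
    return script in allowed and share >= min_share and not info['tie']


def allowed_share(result, allowed):
    """Total share of the text covered by any of the allowed scripts."""
    details = result[2]['details']
    return sum(share for script, share in details.items() if script in allowed)


print(accept_script(sample, {'Hira', 'Kana', 'Hani'}))  # True
print(accept_script(sample, {'Latn'}))                  # False
print(accept_script(sample, {'Hira'}, min_share=0.9))   # False
print(allowed_share(sample, {'Hira', 'Hani'}))          # 1.0
```

Summing shares over a set of scripts is useful for languages such as Japanese, where one text routinely mixes several scripts.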

Script Separation

from GlotScript import sc

# sc splits a mixed-script string into one string per detected script
sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
sc(sent)
# >> {
#    "Latn":"Hello Salut     ",
#    "Hebr":"     שלום ",
#    "Arab":"  سلام    مرحبا",
#    "Hani":"   你好   ",
#    "Hira":"    こんにちは  "
# }
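
In the output above, each per-script string still contains the spaces of the original sentence. If only the words are needed, the whitespace can be collapsed afterwards. The sketch below uses the dictionary copied from the example, so it runs without GlotScript installed; `tidy` is a hypothetical helper, not part of the library.

```python
# Sample output copied from the example above; normally this is sc(sent).
separated = {
    "Latn": "Hello Salut     ",
    "Hebr": "     שלום ",
    "Arab": "  سلام    مرحبا",
    "Hani": "   你好   ",
    "Hira": "    こんにちは  ",
}


def tidy(parts):
    """Collapse runs of whitespace and trim both ends of each script's text."""
    return {script: " ".join(text.split()) for script, text in parts.items()}


cleaned = tidy(separated)
for script, text in cleaned.items():
    print(script, repr(text))
```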

Exploring Unicode Blocks: Related Sources


Citation

If you use any part of this resource or tool in your research, please cite it using the following BibTeX entry.

@inproceedings{kargaran-etal-2024-glotscript-resource,
    title = "{G}lot{S}cript: A Resource and Tool for Low Resource Writing System Identification",
    author = {Kargaran, Amir Hossein  and
      Yvon, Fran{\c{c}}ois  and
      Sch{\"u}tze, Hinrich},
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.687",
    pages = "7774--7784"
}