A novel search engine for historical newspapers that uses Elasticsearch and machine learning methods. Code for the paper https://arxiv.org/abs/2305.19392
The purpose of this research is to build a proof-of-concept search engine that addresses two issues: OCR errors and the orthographic variation caused by language reforms in Bulgarian from the 1850s until 1945.
This is a PoC version and can be used for collections of digitised historical documents from the same period. The tool uses dictionaries for Bulgarian, but it can be easily adapted to other languages.
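The two issues described above can be illustrated with a minimal sketch. It is not the code from this repository or the paper: the letter map, the `normalize` and `build_query` helpers, and the `content` field name are all hypothetical, and the mapping of ѣ to е is a deliberate simplification. The sketch generates a modern-spelling variant of a historical query term and wraps both spellings in an Elasticsearch `match` query with `fuzziness` set to `AUTO`, which tolerates a small number of character-level OCR errors.

```python
# Hypothetical letter map for pre-1945 Bulgarian spelling (an assumption for
# illustration, not the project's dictionary). Mapping "ѣ" to "е" is a
# simplification: in some words the modern equivalent is "я".
LETTER_MAP = {"ѣ": "е", "ѫ": "ъ", "Ѣ": "Е", "Ѫ": "Ъ"}


def normalize(word):
    """Return a rough modern-spelling variant of a historical word."""
    word = "".join(LETTER_MAP.get(ch, ch) for ch in word)
    # Pre-reform spelling wrote a silent hard sign at the end of many words.
    if len(word) > 1 and word[-1] in "ъЪ":
        word = word[:-1]
    return word


def build_query(text, field="content"):
    """Build an Elasticsearch query body matching either spelling, fuzzily.

    `field` is a hypothetical index field name.
    """
    modern = " ".join(normalize(w) for w in text.split())
    variants = sorted({text, modern})
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: {"query": v, "fuzziness": "AUTO"}}}
                    for v in variants
                ],
                "minimum_should_match": 1,
            }
        }
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_query("вѣстникъ"), ensure_ascii=False, indent=2))
```

The resulting dictionary can be passed as the request body of an Elasticsearch search call. Adapting the sketch to another language would mean replacing the letter map and the word-final rule with rules for that language.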
This research should be useful to anyone interested in search tools for collections of historical documents or newspapers that contain errors and/or linguistic variation. The target user of the engine is a library in Bulgaria, but the engine can be adapted and used by external users as well.