feature_request(books): detect incorrect and poor quality text #62

Kristinita · 2020-06-08T10:17:27Z

1. Summary

It would be nice if ripgrep-all showed a warning when the text in a book is incorrect or of poor quality.

2. Problem

2.1. Summary

Some books have a bad OCR layer, so it is impossible to search them for normal words. It would be nice if ripgrep-all detected these books.

2.2. Details

The searchable text in a book may be of poor quality. Possible reasons:

The user who added the OCR layer selected the wrong OCR language. For example, the user may have added an English OCR layer to a Russian text, as in my example 4.2.
The scan itself is of poor quality, and/or the tool used to add the OCR layer performed poorly. See my example 4.3.

I couldn't find a way to detect these books in my book list automatically. Currently, I have to check the OCR layer quality of every book manually, which takes a lot of time.

3. Compact Language Detector

Compact Language Detector could possibly solve this problem.

I installed cld2-cffi (yes, CLD3 exists, but I had problems installing it on Windows) → I ran this code in my Python interpreter:

>>> import cld2
>>> isReliable, textBytesFound, details = cld2.detect("Here text from examples 4.1—4.3")
>>> print('  details: %s' % str(details))

It may be possible to get similar behavior with Rust tools. For example, see Whatlang and CLD3 langdetect.

4. Example texts

4.1. Normal Russian text

Например, название Полтавы связано с названием речки Лтавы (так раньше называлась Ворскла) и означает, соответственно, «город на Лтаве». Название города Ужгород также образовано от названия реки Уж. Винница обязана своим названием речке Винничке, которая протекает через город. Название реки, в свою очередь, происходит от слова «венок»: когда-то молодые девушки собирались на ее берегу и пускали на воду венки, чтобы узнать о своем будущем. Луганск назван в честь речки Луганки.

cld2-cffi output:

details: (Detection(language_name='RUSSIAN', language_code='ru', percent=99, score=709.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))

4.2. English OCR language for Russian text

IIepBhle HeCKOJI:bKO COT MHJIJIHOHOB JIeT 6h1JIH nOHCTHHe KOWMapHhlMH ,n;JIH nJIaHeThI: OHa HenpephlBHO COTpHC8.JIaC:b no,n; y,n;apaMH KpynHhlx MeTeopHTOB, ChlnaBWHXCH Ha Hee H3 KOCMoca. IIoBepxHOCT:b COBpeMeHHOH JIYHhI, nOKpLITaSi MeTeopHTHhlMH KpaTepaMH, n03BOJIHeT HaM npe,n;CTaBHT:b, KaK MOrJIa BhlrJIH,n;eT:b 3eMJIH npHMepHO 4 MJIp,n; JIeT Ha3�. OqeH:b CKOpO BHyrpH HaweH nJIaHeThl3apa60T8.JI tTenJIOBOH ABHraTeJI:b., rOplOqHM ,n;JIH KOToporo CJIymHJI pacn� p�HoaKTHBHhlX SJIeMeHTOB. B He,n;pax 3eMJIH HaqaJIOCh Me,n;JIeHHOe ,n;BHmeHHe Be�eCTBa, HarpeThle CTPYH KOToporo nOAHHM8.JIHC:b BBepx, a XOJIO.D;Hhle onYCK8.JIHCh BHH3. IIJIaHeTa CT8.JIa noxoma Ha CneJIhlH nepCHK.

details: (Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))

Note: I removed the Information Separator One gremlin characters (U+001F) from this text before passing it to cld2-cffi; otherwise I get this traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python38\lib\site-packages\cld2\__init__.py", line 393, in detect
    raise ValueError("input contains invalid UTF-8 around byte " +
ValueError: input contains invalid UTF-8 around byte 348 (of 792534779)
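
The manual cleanup mentioned in the note could be automated along these lines. This is only a sketch: `strip_control_chars` is a hypothetical helper, and it assumes that dropping Unicode control characters (general category `Cc`, which includes U+001F) is enough to avoid the traceback above. Other characters that cld2-cffi rejects are not handled.

```python
import unicodedata


def strip_control_chars(text: str) -> str:
    """Drop Unicode control characters (category Cc), keeping newlines and tabs."""
    keep = {"\n", "\t"}
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


# A short fragment of the 4.2 text with a U+001F character put back in.
sample = "IIepBhle\x1fHeCKOJI:bKO\nCOT"
cleaned = strip_control_chars(sample)
print(repr(cleaned))
```

The cleaned string can then be passed to `cld2.detect` as in section 3.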

4.3. Bad OCR

1(этрин ска3ала' что знакома с книгой его )кены 3леоноры 8ирек по лекарстве[1пым расте|{ия!| аляски. ёа мой в3дох по поводу тогц что у нас в библиотеке тодько од'!а книга на эту фамилию, (этрин пообещала прислать книгу о лекарстве!тных расте]|иях аляски. !! действительно' не прошло и месяца' как у меня на столе появилась небольшая по объему эффектвого дизайва книга <а|-а5'(а'5 ш||овпшп$5 мвр1с]ш85> с изображе1|и₠ м ца обложке такого 3вакомого ка}<дому х{ителю нашей о6ласти ольховника. правда, в книге он значился 11од другим видовым названием' чем у ;1ас

details: (Detection(language_name='RUSSIAN', language_code='ru', percent=59, score=503.0), Detection(language_name='SERBIAN', language_code='sr', percent=40, score=468.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))

5. Example of expected behavior

ripgrep-all adapters extract the text from books.
CLD (or a similar tool) checks 2 random pages of every book (4 may be better).

If the percent value is 95 or more (another value may work better; practical tests are needed) → do nothing. If it is below 95 → the ripgrep-all user gets a warning. Example warning text:

WARNING! The text in {Filename} may not be written in a natural language. The cause may be an incorrect or poor-quality OCR layer. Please check {Filename}.
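
A minimal sketch of the check described above, not an implementation inside ripgrep-all. `check_book`, `detect_percent`, and `WARNING_TEMPLATE` are hypothetical names, and the warning wording is adapted from the example text. The detector is passed in as a function so that CLD2 or any other tool can be plugged in. The sketch warns if any sampled page falls below the threshold, which is only one possible reading of the rule.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical wording, adapted from the example warning in this issue.
WARNING_TEMPLATE = (
    "WARNING! The text in {filename} may not be written in a natural language. "
    "The cause may be an incorrect or poor-quality OCR layer. "
    "Please check {filename}."
)


def check_book(
    filename: str,
    pages: Sequence[str],
    detect_percent: Callable[[str], int],
    sample_size: int = 4,
    threshold: int = 95,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return a warning string if any sampled page scores below the threshold."""
    rng = rng or random.Random()
    non_empty = [page for page in pages if page.strip()]
    if not non_empty:
        return None  # no extracted text at all; not covered by this sketch
    sample = rng.sample(non_empty, min(sample_size, len(non_empty)))
    if all(detect_percent(page) >= threshold for page in sample):
        return None
    return WARNING_TEMPLATE.format(filename=filename)


# With cld2-cffi, the detector could plausibly be written as
#   lambda text: cld2.detect(text)[2][0].percent
# assuming the result has the shape shown in the outputs above.
```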

6. Note

Some language recognition tools may not solve this problem, because they don't detect that a text is not written in a real natural language.

For example, I tried the langdetect, TextBlob, guess_language and langid examples from this Stack Overflow answer → they report that my 4.2 and 4.3 examples are written in real natural languages.

Thanks.

phiresky · 2020-06-08T18:43:33Z

I think this is out of scope for this project - currently it doesn't even detect if a PDF has no text content at all, just images - which in my opinion is also okay. If you need something to filter documents, maybe the (not yet implemented) custom extractor/transformer will help.

I also don't think this would be possible to do in a reliable manner that does not produce lots of false positives. Think about PDFs with mostly tabular data / numbers, abbreviations etc.

phiresky · 2020-06-08T18:43:59Z

Thanks for the links though, cld2 is useful to me for a different project :)

Kristinita · 2020-06-09T13:26:05Z

Type: Reply 💬

1. Summary

I still think that detecting incorrect symbols is within the scope of ripgrep-all. I don't find the arguments presented convincing.

(It looks like a common situation when a programmer is not personally interested in introducing a feature, and quickly comes up with arguments to close the request)

2. Out of scope

I think this is out of scope for this project

If you need something to filter documents maybe the (not yet implemented) custom extractor/transformer will help.

The expected algorithm of a program that would solve the problem described in my issue:

1. Check all files in a directory and its subdirectories.
2. Extract text from all books in any format (PDF, DjVu, RTF, CHM and so on).
3. Check random parts of the text via CLD and return the results.

I hope #60 will be implemented. After that, ripgrep-all will be able to handle steps 1 and 2; only step 3 remains.

A new program for solving this problem would require a lot of work to reimplement features that ripgrep-all already has.
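
To make the three steps concrete, here is a rough standalone sketch. All names (`find_books`, `random_parts`, `scan`, `BOOK_SUFFIXES`) are hypothetical. Text extraction and language detection are passed in as functions because they are the parts that ripgrep-all adapters and CLD would provide; nothing here uses an actual ripgrep-all API.

```python
import random
from pathlib import Path
from typing import Callable, Iterator, List, Optional, Tuple

# Assumed set of book formats; extend as needed.
BOOK_SUFFIXES = {".pdf", ".djvu", ".rtf", ".chm"}


def find_books(root: Path) -> Iterator[Path]:
    """Step 1: yield book files in root and all its subdirectories."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in BOOK_SUFFIXES:
            yield path


def random_parts(
    text: str,
    count: int = 4,
    size: int = 1000,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Step 3 helper: pick `count` random slices of `size` characters."""
    rng = rng or random.Random()
    if len(text) <= size:
        return [text] if text else []
    starts = [rng.randrange(0, len(text) - size + 1) for _ in range(count)]
    return [text[start:start + size] for start in starts]


def scan(
    root: Path,
    extract_text: Callable[[Path], str],    # step 2, e.g. an adapter
    detect_percent: Callable[[str], int],   # step 3, e.g. CLD
) -> List[Tuple[Path, int]]:
    """Return (book, lowest percent among its sampled parts) for each book."""
    results = []
    for book in find_books(root):
        parts = random_parts(extract_text(book))
        if parts:
            results.append((book, min(detect_percent(part) for part in parts)))
    return results
```

Books whose lowest percent is under the chosen threshold would be the ones to check by hand.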

3. False positives

I also don't think this would be possible to do in a reliable manner that does not contain lots of false positives.

Think about pdfs with mostly tabular data / numbers, abbreviations etc.

Have you run any tests to support the claim of “lots of false positives”?

As for me, before posting this issue I tried cld2-cffi on my science books.

Here are well-scanned texts with tables, abbreviations, numbers, and formulas.

Chemistry — 96%

Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import cld2
>>> isReliable, textBytesFound, details = cld2.detect("ПОЧЁМ РЕДКИЕ ЗЕМЛИ? Так как у лантанидов наружные электронные оболочки построены одинаково, их химические свойства весьма сходны. Казалось бы, и встречаться в природе, и цениться они должны тоже одинаково. Однако таблица цен и распространённости лантанидов показывает, что это не так: Элемент Z иена Кларк Се 58 2,1 66 Рг 59 5,2 9,1 Nd 60 3,4 40 Pm 61 4Ю-20 Sm 62 4,3 7,0 Ей 63 117 2,1 Gd 64 6,1 6,1 Tb 65 55 1,2 Dy 66 5,0 4,5 Но 67 20 1,4 Er 68 8,2 3,5 Tm 69 110 0,5 Yb 70 15 3,1 Lu 71 134 0,8 Значит, надо учитывать «чётность» не только протонов, но и нейтронов в ядрах элементов. Обратимся теперь к таблице, в которой собраны данные для стабильных нуклидов всех элементов, содержащихся в земной коре (их известно около 300). В ней приведено содержание нуклидов каждого типа (исключая кислород, на который приходится 52 %). Число протонов в ядре Число нейтронов в ядре Обшее содержание, % чётное чётное 21 нечётное чётное 26 чётное нечётное 1 нечётное нечётное 0,03 Во второй строке таблицы приведён порядковый номер элемента Z, в третьей — округлённая цена в долларах за 1 г металла в слитке чистотой 99,9 % (в ценах 2000 г. компании «Олдрич»). В графе прометия стоит прочерк: у этого элемента нет стабильных изотопов, один из самых долгоживуших — прометий-147 (период полураспада 2,62 года) — получают искусственно и используют в миниатюрных атомных батарейках. В 1998 г. 1 г 147Рт стоил примерно 10 млрд долларов. Конечно, никто прометий граммами (и даже микрограммами) не покупает: его количество измеряют единицами активности — беккерелями и мегабеккерелями; для 147Рт 1 МБк соответствует 3 • 10-9 г прометия.")
>>> print('  details: %s' % str(details))
  details: (Detection(language_name='RUSSIAN', language_code='ru', percent=96, score=584.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))

Physics — 96%

>>> isReliable, textBytesFound, details = cld2.detect("Рассмотрим ешё один пример. Вос пользуемся методом анализа размер ностей для установления зависимости периода свободных колебаний матема тического маятника от его параметров. Таковыми являются масса маятника т , длина нити /, а также ускорение сво бодного падения g, характеризующее поле тяжести, в котором маятник со вершает колебания. Представим период колебаний в виде T ~ m x iyg z . Размерности левой и правой ча стей этого выражения должны быть равны с = (кг)Чм)Чм-с 2 ) г или (кг) 0 •(M) 0 · с 1 = (кг)ЧмГЧс)22 . Приравнивая показатели степени при килограммах, метрах и секундах, получаем систему уравнений О = х, 0 = у + ζ, 1 = -2z Решив её, находим: χ = 0, у = 1/2, ζ = -1/2. Таким образом, т ~Ь-")
>>> print('  details: %s' % str(details))
  details: (Detection(language_name='RUSSIAN', language_code='ru', percent=96, score=636.0), Detection(language_name='GREEK', language_code='el', percent=1, score=1024.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))

I think that CLD (or possibly one of its alternatives) will work successfully for most pages of real books.

We can check this: for example, you can make a representative sample by selecting random pages of random books from my list and running cld2-cffi on their text.
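
As an illustration of how such a sample could be evaluated, here is a small sketch that computes the share of pages falling below the threshold. `share_below_threshold` is a hypothetical helper; the numbers are the top-language percent values reported in this thread, which is far too few pages to be representative.

```python
from typing import Iterable


def share_below_threshold(percents: Iterable[int], threshold: int = 95) -> float:
    """Fraction of pages whose top-language percent is below the threshold."""
    values = list(percents)
    if not values:
        raise ValueError("need at least one page result")
    return sum(1 for value in values if value < threshold) / len(values)


# Top-language percent values reported in this thread.
good_pages = [99, 96, 96]  # example 4.1, chemistry, physics
bad_pages = [0, 59]        # examples 4.2 and 4.3

print(share_below_threshold(good_pages))  # 0.0 -> no false positives here
print(share_below_threshold(bad_pages))   # 1.0 -> both bad texts flagged
```

On a real sample, the first number estimates the false positive rate for a given threshold.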

Thanks.

phiresky · 2020-06-09T20:23:11Z

i'll think about it. by the way, outputting warnings is not currently possible pending BurntSushi/ripgrep#1612

phiresky · 2020-06-09T20:26:23Z

also, I'll probably add "pre / post" processing capability so you can run a second adapter after a first one (since this is already needed for the pdf-add-page-number functionality). Then this could be added as a post-processor and implemented as an external script.

phiresky closed this as completed Jun 8, 2020

Kristinita mentioned this issue Jun 9, 2020

feature_request(debug): detailed debug information #63

Closed

5 tasks

Kristinita mentioned this issue Oct 7, 2020

feature_request(plugin): posthtml-declaring-language posthtml/posthtml-plugins-idea#11

Open
