feature_request(books): detect incorrect and poor quality text #62
I think this is out of scope for this project - currently it doesn't even detect whether a PDF has no text content at all, just images - which in my opinion is also okay. If you need something to filter documents, maybe the (not yet implemented) custom extractor/transformer will help. I also don't think this could be done reliably without lots of false positives. Think about PDFs with mostly tabular data / numbers, abbreviations, etc.
Thanks for the links though, cld2 is useful to me for a different project :)
1. Summary
I still think that detecting incorrect symbols is not out of ripgrep-all's scope, and I don't find the presented arguments convincing. (It looks like the common situation where a programmer is not personally interested in introducing a feature and quickly comes up with arguments to close the request.)
2. Out of scope
The expected algorithm of the program that should solve the problem, as described in my issue: (1) ripgrep-all adapters extract text from books; (2) CLD or a similar tool checks a few random pages of every book; (3) the user gets a warning when detection confidence is too low.
I hope #60 will be implemented; thereafter, ripgrep-all could cover steps 1 and 2, leaving only step 3. A new program written to solve this problem alone would have to reimplement a lot of functionality that is already in ripgrep-all.
3. False positives
Have you run any tests to back up “lots of false positives”? Speaking for myself, before posting this issue I tried cld2-cffi on my science books. Below are well-scanned texts containing tables, abbreviations, numbers, and formulas:
```
Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import cld2
>>> isReliable, textBytesFound, details = cld2.detect("ПОЧЁМ РЕДКИЕ ЗЕМЛИ? Так как у лантанидов наружные электронные оболочки построены одинаково, их химические свойства весьма сходны. Казалось бы, и встречаться в природе, и цениться они должны тоже одинаково. Однако таблица цен и распространённости лантанидов показывает, что это не так: Элемент Z иена Кларк Се 58 2,1 66 Рг 59 5,2 9,1 Nd 60 3,4 40 Pm 61 4Ю-20 Sm 62 4,3 7,0 Ей 63 117 2,1 Gd 64 6,1 6,1 Tb 65 55 1,2 Dy 66 5,0 4,5 Но 67 20 1,4 Er 68 8,2 3,5 Tm 69 110 0,5 Yb 70 15 3,1 Lu 71 134 0,8 Значит, надо учитывать «чётность» не только протонов, но и нейтронов в ядрах элементов. Обратимся теперь к таблице, в которой собраны данные для стабильных нуклидов всех элементов, содержащихся в земной коре (их известно около 300). В ней приведено содержание нуклидов каждого типа (исключая кислород, на который приходится 52 %). Число протонов в ядре Число нейтронов в ядре Обшее содержание, % чётное чётное 21 нечётное чётное 26 чётное нечётное 1 нечётное нечётное 0,03 Во второй строке таблицы приведён порядковый номер элемента Z, в третьей — округлённая цена в долларах за 1 г металла в слитке чистотой 99,9 % (в ценах 2000 г. компании «Олдрич»). В графе прометия стоит прочерк: у этого элемента нет стабильных изотопов, один из самых долгоживуших — прометий-147 (период полураспада 2,62 года) — получают искусственно и используют в миниатюрных атомных батарейках. В 1998 г. 1 г 147Рт стоил примерно 10 млрд долларов. Конечно, никто прометий граммами (и даже микрограммами) не покупает: его количество измеряют единицами активности — беккерелями и мегабеккерелями; для 147Рт 1 МБк соответствует 3 • 10-9 г прометия.")
>>> print(' details: %s' % str(details))
details: (Detection(language_name='RUSSIAN', language_code='ru', percent=96, score=584.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))
>>> isReliable, textBytesFound, details = cld2.detect("Рассмотрим ешё один пример. Вос пользуемся методом анализа размер ностей для установления зависимости периода свободных колебаний матема тического маятника от его параметров. Таковыми являются масса маятника т , длина нити /, а также ускорение сво бодного падения g, характеризующее поле тяжести, в котором маятник со вершает колебания. Представим период колебаний в виде T ~ m x iyg z . Размерности левой и правой ча стей этого выражения должны быть равны с = (кг)Чм)Чм-с 2 ) г или (кг) 0 •(M) 0 · с 1 = (кг)ЧмГЧс)22 . Приравнивая показатели степени при килограммах, метрах и секундах, получаем систему уравнений О = х, 0 = у + ζ, 1 = -2z Решив её, находим: χ = 0, у = 1/2, ζ = -1/2. Таким образом, т ~Ь-")
>>> print(' details: %s' % str(details))
details: (Detection(language_name='RUSSIAN', language_code='ru', percent=96, score=636.0), Detection(language_name='GREEK', language_code='el', percent=1, score=1024.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))
```
I think that CLD (or possibly one of its alternatives) will work successfully for most pages of real books. We can check it: for example, you could make a representative sample — select random pages from random books on my list and run cld2-cffi on their text (see the sketch below). Thanks.
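One way to run that experiment, as a rough sketch — it assumes cld2-cffi is installed and that the text of each sampled page has already been dumped into its own file under a hypothetical `pages/` directory:

```python
import glob

import cld2  # from the cld2-cffi package

# Rough harness for eyeballing the false-positive rate:
# expects one pre-extracted text file per sampled page.
for path in sorted(glob.glob('pages/*.txt')):
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # cld2-cffi rejects some control characters
    # (e.g. Information Separator One), so drop them first.
    cleaned = ''.join(ch for ch in text if ch.isprintable() or ch.isspace())
    is_reliable, _, details = cld2.detect(cleaned)
    top = details[0]
    print(f'{path}: {top.language_name} {top.percent}% reliable={is_reliable}')
```

Pages of table-heavy or formula-heavy books would be exactly the interesting cases to inspect in such a run.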
I'll think about it. By the way, outputting warnings is not currently possible, pending BurntSushi/ripgrep#1612.
Also, I'll probably add a "pre/post" processing capability so you can run a second adapter after a first one (this is already needed for the pdf-add-page-number functionality). This feature could then be added as a post-processor and implemented as an external script.
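For illustration, a post-processor of that kind could be a small script along these lines. This is a sketch only: the stdin/stdout contract is an assumption, since the pre/post-processing interface does not exist yet, and cld2-cffi is assumed as the detector:

```python
#!/usr/bin/env python3
# Hypothetical rga post-processor: pass the extracted text through
# unchanged, but warn on stderr when language detection looks bad.
# The stdin/stdout contract is assumed, not a real rga interface.
import sys

import cld2  # from the cld2-cffi package

text = sys.stdin.read()
sys.stdout.write(text)  # forward the adapter output untouched

# Strip control characters that cld2-cffi rejects.
cleaned = ''.join(ch for ch in text if ch.isprintable() or ch.isspace())
is_reliable, _, details = cld2.detect(cleaned)
if not is_reliable or details[0].percent < 95:
    print(f'warning: extracted text may be low-quality OCR '
          f'(top guess: {details[0].language_name} at {details[0].percent}%)',
          file=sys.stderr)
```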
1. Summary
It would be nice if ripgrep-all showed a warning when the text in a book is written incorrectly or is of bad quality.
2. Problem
2.1. Summary
Some books have a bad OCR layer, and it is impossible to search for normal words in them. It would be nice if ripgrep-all detected these books.
2.2. Details
Books may have low-quality searchable text for various reasons.
I couldn't find a way to automatically detect these books in my book list. Currently, I have to check the OCR layer quality of every book manually, which takes a lot of time.
3. Compact Language Detector
Possibly, the Compact Language Detector (CLD) can solve this problem.
I installed cld2-cffi (yes, CLD3 exists, but I had problems installing it on my Windows machine) and ran test code in my Python interpreter; the full session is shown in the comment above.
It may be possible to get similar behavior with Rust tools; for example, see Whatlang and CLD3 langdetect.
4. Example texts
4.1. Normal Russian text
4.2. English OCR language for Russian text
Note: I removed Information Separator One gremlin characters from this text before passing it to cld2-cffi; otherwise it raises an error (see the sketch below).
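A minimal sketch of that cleanup step (the function name is hypothetical):

```python
def strip_gremlins(text: str) -> str:
    # Remove control characters such as U+001F (Information Separator One),
    # which make cld2-cffi raise an error, while keeping normal whitespace.
    return ''.join(ch for ch in text if ch.isprintable() or ch.isspace())
```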
4.3. Bad OCR
5. Example of expected behavior
1. ripgrep-all adapters extract text from books.
2. CLD (or a similar tool) checks 2 (possibly 4, which may be better) random pages of every book.
3. If the percent value is 95 or more (another threshold may work better; this needs practical testing), do nothing; if it is below 95, the ripgrep-all user gets a warning. An example warning text appears in the sketch after this list.
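A minimal sketch of steps 2 and 3, assuming cld2-cffi and that the adapter output has already been split into per-page strings; the warning wording is only an example, not rga's actual output:

```python
import random
import sys

import cld2  # from the cld2-cffi package

def check_book(book_name, pages, sample_size=2, threshold=95):
    """Warn if randomly sampled pages don't look like reliable natural language."""
    for page in random.sample(pages, min(sample_size, len(pages))):
        # Strip control characters that cld2-cffi rejects.
        cleaned = ''.join(ch for ch in page if ch.isprintable() or ch.isspace())
        is_reliable, _, details = cld2.detect(cleaned)
        top = details[0]
        if not is_reliable or top.percent < threshold:
            # Example warning text -- an assumption, not rga's actual output.
            print(f'rga: warning: "{book_name}" may have a bad OCR layer '
                  f'(detected {top.language_name} at {top.percent}%)',
                  file=sys.stderr)
            return
```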
6. Note
Some language-recognition tools may not solve this problem: they don't detect that a text is not written in a real natural language.
For example, I tried the langdetect, TextBlob, guess_language, and langid examples from this Stack Overflow answer → they claim that my 4.2 and 4.3 examples are written in real natural languages.
Thanks.