Encoding problems are not logged correctly #259

Open
eshellman opened this issue Feb 5, 2025 · 12 comments
@eshellman
Collaborator

for example:

2025-02-05 06:56:44,872 INFO      #11557  rdf made in 0:00:00.029474
2025-02-05 06:56:44,880 INFO      #11557 Got charset ASCII from pg header
2025-02-05 06:56:44,881 ERROR     #11557 Text not in charset ascii ('ascii' codec can't decode byte 0xef in position 46303: ordinal not in range(128))
2025-02-05 06:56:44,986 INFO      #11558  txt.utf-8 made in 0:00:00.012022
2025-02-05 06:56:44,993 INFO      #11558 using an HTML5 parser

This reports a problem in the text file for #11558 (where the word naïad is rendered as naļad) but logs the error under #11557. The cause is that the header declares ASCII while the file is actually Latin-1.
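For illustration, the mismatch can be reproduced in a few lines of Python (the bytes here are constructed for the example; only the 0xEF byte value comes from the log above):

```python
# "naïad" stored as Latin-1 bytes: "ï" becomes the single byte 0xEF.
data = "naïad".encode("latin-1")
assert data == b"na\xefad"

# Decoding with the declared charset (ASCII) fails on that byte,
# which matches the error in the log above:
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    print(f"'ascii' codec can't decode byte 0x{data[err.start]:02x} "
          f"in position {err.start}")

# Decoding with the file's actual charset recovers the text:
assert data.decode("latin-1") == "naïad"
```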

To correct these efficiently, the error should be logged with the filename and the correct book number.

@asylumcs
Contributor

asylumcs commented Feb 5, 2025

There is an information line in the 11558.txt file that says: "Character set encoding: ASCII".
That line is not consulted by any editor or program that I know of. Yes, it is incorrect to claim ASCII, but that won't matter: whatever opens the file will make its best guess, and if it doesn't guess Latin-1, you get what is being reported.

We have a convention that 11558.txt should be ASCII, but ASCII cannot represent the character needed. The file is actually Latin-1, as reported above, so it should be named 11558-8.txt if it is to exist at all. It turns out that when we have a valid UTF-8 version (and we do, as 11558-0.txt), we have been retiring the -8.txt (Latin-1) file.

The "fix" for this, it seems to me, is to remove 11558.zip and 11558.txt, since the content is already in old/.

@eshellman
Collaborator Author

@asylumcs Here's where ebookmaker looks at the metadata header (only for txt files):

def get_charset_from_meta(self):

The fallback in use is cchardet, which works most of the time but is somewhat expensive. The current error rate on #10000–#14000 is 0.2%, and typically only one or two characters are misinterpreted. With better logging, I think fixing these errors should be feasible, especially if all we have to do is retire a file.
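A minimal sketch of the requested logging, under the assumption that decoding happens roughly like this (`decode_with_fallbacks`, `path`, and `book_id` are hypothetical names, not ebookmaker's actual code): try the declared charset first, fall back to common encodings, and include the filename and book number in every error line.

```python
import logging

logger = logging.getLogger("ebookmaker")

def decode_with_fallbacks(path, raw, declared, book_id):
    """Decode raw bytes, trying the charset declared in the PG header
    first, then common fallbacks. Each failure is logged with the
    filename and the book number, as requested in this issue."""
    for charset in (declared, "utf-8", "latin-1"):
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError) as err:
            logger.error("#%s %s: text not in charset %s (%s)",
                         book_id, path, charset, err)
    # latin-1 maps every byte value, so this point is only reached
    # if it was already tried (and somehow failed) as the declared charset.
    return raw.decode("latin-1", errors="replace"), "latin-1"
```

With the bytes from the example above, a file declared ASCII would fail ASCII and UTF-8 decoding (each failure logged with its filename and book number) and come back as Latin-1.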

There is another oddity here: 11558-0.txt is not in the PG database. I see it was added in late October; thus it is ignored by ebookconverter. I'm assuming that many files were added at the same time, and that they are all invisible to ebookconverter.

Conventionally, pushed files are added to the database as part of the "dopush" process. So it is not enough to remove the old file, because the new one will not be used. The module that adds files to the database is EbookConverter.FileInfo.

@asylumcs
Contributor

asylumcs commented Feb 6, 2025

It is alarming to me if some books are posted but not in the database. Many books being "invisible to ebookmaker" would be serious. My question: which database has them missing? For 11558 I see an entry in what I thought was the database you would be referring to:

gutenberg=> select title, author from v_books where fk_books=11558;
title | author
-------+---------------------------------------
Poems | Goodrich, Samuel G. (Samuel Griswold)

Please tell me which database is missing posted books so I can repair it, so that the -0.txt book would be used on the sweep and 11558.txt can be retired.

@eshellman
Collaborator Author

The book is in the database, but the -0.txt file is not; it should be in the files table of the same database. Don't add it by hand (that has a high probability of error); use FileInfo.

@asylumcs
Contributor

asylumcs commented Feb 6, 2025

I've pushed a new -0.txt and -h.htm, and I've asked Greg to remove the faulty "ASCII" file and the two zips, all of which are already in old/. If there are other books whose -0.txt exists in 1/2/3 but is missing from the database, I'd like to know about it so I can figure out how that could have happened. I don't send anything out of pglaf without it going through dopush, so I don't think this should happen.

@eshellman
Collaborator Author

Wasn't there some issue with files having the wrong permissions? That could explain it.

@asylumcs
Contributor

asylumcs commented Feb 7, 2025

I don't recall a permissions problem, but then again I wouldn't have been the one who processed it. I believe that 11558 is now fixed. I don't see that fixing one book closes this issue ("Encoding problems are not logged correctly"), so let me know if there is anything else I need to do.

@eshellman
Collaborator Author

On testing, the files-not-in-database problem turned out to be only the first layer. The bigger problem is that the -0.txt files are the least preferred by ebookconverter: gutenbergtools/ebookconverter#57

@asylumcs
Contributor

Seems to me the -0.txt files should be the most preferred among the text files, since "-0.txt" is the UTF-8 file.
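The preference order described here could be sketched as follows (`text_file_priority` is a hypothetical ranking helper for illustration, not ebookconverter's actual code):

```python
def text_file_priority(filename):
    """Rank a book's plain-text variants: UTF-8 (-0.txt) first,
    then Latin-1 (-8.txt), then the plain ASCII file (NNNNN.txt)."""
    if filename.endswith("-0.txt"):
        return 0
    if filename.endswith("-8.txt"):
        return 1
    return 2

candidates = ["11558.txt", "11558-0.txt", "11558-8.txt"]
best = min(candidates, key=text_file_priority)
print(best)  # 11558-0.txt
```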

@eshellman
Collaborator Author

That change went into production last night. When this logging issue is addressed (it's a bit more complicated), we should be able to see whether there's a fourth bug.

@eshellman
Collaborator Author

So the fallout from the code change will be to flush out our bad UTF-8 txt files. For example: files/28789/28789-0.txt

@asylumcs
Contributor

Interesting. The file you mention is a UTF-8 file except for the header, which is mojibake. It's not clear how that happened or how prevalent it is. Is something in your process identifying the faulty files, or should I start a scan of 1/2/3?
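A scan like the one proposed could be as simple as walking the tree and attempting a strict UTF-8 decode of every -0.txt file (a sketch, assuming the files fit in memory; `find_bad_utf8` is a hypothetical name):

```python
import os

def find_bad_utf8(root):
    """Yield (path, byte_offset) for every -0.txt file under root
    that is not valid UTF-8, e.g. a file with a mojibake header."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith("-0.txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                raw = fh.read()
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                yield path, err.start

for bad_path, offset in find_bad_utf8("1"):
    print(f"{bad_path}: invalid UTF-8 at byte {offset}")
```

The byte offset points at the first bad sequence, which in the 28789-0.txt case would land in the mojibake header.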
