Encoding problems are not logged correctly #259

Open
eshellman opened this issue Feb 5, 2025 · 12 comments
@eshellman
Collaborator

for example:

2025-02-05 06:56:44,872 INFO      #11557  rdf made in 0:00:00.029474
2025-02-05 06:56:44,880 INFO      #11557 Got charset ASCII from pg header
2025-02-05 06:56:44,881 ERROR     #11557 Text not in charset ascii ('ascii' codec can't decode byte 0xef in position 46303: ordinal not in range(128))
2025-02-05 06:56:44,986 INFO      #11558  txt.utf-8 made in 0:00:00.012022
2025-02-05 06:56:44,993 INFO      #11558 using an HTML5 parser

This reports a problem in the text file for #11558 (where the word naïad is rendered as naļad) but logs the error under #11557. The cause is that the header declares ASCII while the file is actually Latin-1.
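For illustration, the mismatch can be reproduced in a few lines of Python (the bytes here are constructed for the example; only the 0xEF byte value comes from the log above):

```python
# "naïad" stored as Latin-1 bytes: "ï" becomes the single byte 0xEF.
data = "naïad".encode("latin-1")
assert data == b"na\xefad"

# Decoding with the declared charset (ASCII) fails on that byte,
# which matches the error in the log above:
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    print(f"'ascii' codec can't decode byte 0x{data[err.start]:02x} "
          f"in position {err.start}")

# Decoding with the file's actual charset recovers the text:
assert data.decode("latin-1") == "naïad"
```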

To correct these efficiently, the error should be logged with the filename and the correct book number.

@asylumcs
Contributor

asylumcs commented Feb 5, 2025

There is an information line in the 11558.txt file that says: "Character set encoding: ASCII".
That line is not consulted by any editor or program that I know of. Yes, it is incorrect to claim ASCII, but that won't matter: whatever opens the file will make its best guess, and if it doesn't guess Latin-1, you get what is being reported.

We have a convention that 11558.txt should be ASCII, but ASCII cannot represent the character needed. The file is actually Latin-1, as reported above, so it should be named 11558-8.txt if it is to exist at all. It turns out that when we have a valid UTF-8 version (and we do, as 11558-0.txt), we have been retiring the -8.txt (Latin-1) file.

The "fix" for this, it seems to me, is to remove 11558.zip and 11558.txt, since the content is already in old/.

@eshellman
Collaborator Author

@asylumcs Here's where ebookmaker looks at the metadata header (only for txt files):

def get_charset_from_meta(self):

The fallback in use is cchardet, which works most of the time but is somewhat expensive. The current error rate on #10000–#14000 is 0.2%, and typically only one or two characters are misinterpreted. With better logging, I think fixing these errors should be feasible, especially if all we have to do is retire a file.
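A minimal sketch of the requested logging, under the assumption that decoding happens roughly like this (`decode_with_fallbacks`, `path`, and `book_id` are hypothetical names, not ebookmaker's actual code): try the declared charset first, fall back to common encodings, and include the filename and book number in every error line.

```python
import logging

logger = logging.getLogger("ebookmaker")

def decode_with_fallbacks(path, raw, declared, book_id):
    """Decode raw bytes, trying the charset declared in the PG header
    first, then common fallbacks. Each failure is logged with the
    filename and the book number, as requested in this issue."""
    for charset in (declared, "utf-8", "latin-1"):
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError) as err:
            logger.error("#%s %s: text not in charset %s (%s)",
                         book_id, path, charset, err)
    # latin-1 maps every byte value, so this point is only reached
    # if it was already tried (and somehow failed) as the declared charset.
    return raw.decode("latin-1", errors="replace"), "latin-1"
```

With the bytes from the example above, a file declared ASCII would fail ASCII and UTF-8 decoding (each failure logged with its filename and book number) and come back as Latin-1.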

There is another oddity here: 11558-0.txt is not in the PG database. I see it was added in late October; thus it is ignored by ebookconverter. I'm assuming that many files were added at the same time, and that they are all invisible to ebookconverter.

Conventionally, pushed files are added to the database as part of the "dopush" process. So it is not enough to remove the old file, because the new one will not be used. The module that adds files to the database is EbookConverter.FileInfo.

@asylumcs
Contributor

asylumcs commented Feb 6, 2025

It is alarming to me if some books are posted but not in the database. Many books being "invisible to ebookmaker" would be serious. My question: which database has them missing? For 11558 I see an entry in what I thought was the database you would be referring to:

gutenberg=> select title, author from v_books where fk_books=11558;
title | author
-------+---------------------------------------
Poems | Goodrich, Samuel G. (Samuel Griswold)

Please tell me which database is missing posted books so I can repair it, so that the -0.txt book would be used on the sweep and 11558.txt can be retired.

@eshellman
Collaborator Author

The book is in the database, but the -0.txt file is not; it should be in the files table of the same database. Don't add it by hand (that has a high probability of error); use FileInfo.

@asylumcs
Contributor

asylumcs commented Feb 6, 2025

I've pushed a new -0.txt and -h.htm, and I've asked Greg to remove the faulty "ASCII" file and the two zips, all of which are already in old/. If there are other books whose -0.txt exists in 1/2/3 but is missing from the database, I'd like to know about it so I can figure out how that could have happened. I don't send anything out of pglaf without it going through dopush, so I don't think this should happen.

@eshellman
Collaborator Author

Wasn't there some issue with files having the wrong permissions? That could explain it.

@asylumcs
Contributor

asylumcs commented Feb 7, 2025

I don't recall a permissions problem, but then again I wouldn't have been the one who processed it. I believe that 11558 is now fixed. I don't see that fixing one book closes this issue ("Encoding problems are not logged correctly"), so let me know if there is anything else I need to do.

@eshellman
Collaborator Author

On testing, the files-not-in-database problem turned out to be only the first layer. The bigger problem is that the -0.txt files are the least preferred by ebookconverter: gutenbergtools/ebookconverter#57

@asylumcs
Contributor

Seems to me the -0.txt files should be the most preferred among the text files, since "-0.txt" is the UTF-8 file.
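The preference order described here could be sketched as follows (`text_file_priority` is a hypothetical ranking helper for illustration, not ebookconverter's actual code):

```python
def text_file_priority(filename):
    """Rank a book's plain-text variants: UTF-8 (-0.txt) first,
    then Latin-1 (-8.txt), then the plain ASCII file (NNNNN.txt)."""
    if filename.endswith("-0.txt"):
        return 0
    if filename.endswith("-8.txt"):
        return 1
    return 2

candidates = ["11558.txt", "11558-0.txt", "11558-8.txt"]
best = min(candidates, key=text_file_priority)
print(best)  # 11558-0.txt
```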

@eshellman
Collaborator Author

That change went into production last night. When this logging issue is addressed (it's a bit more complicated), we should be able to see whether there's a fourth bug.

@eshellman
Collaborator Author

So the fallout from the code change will be to flush out our bad UTF-8 txt files. For example: files/28789/28789-0.txt

@asylumcs
Contributor

Interesting. The file you mention is a UTF-8 file except for the header, which is mojibake. It's not clear how that happened or how prevalent it is. Is something in your process identifying the faulty files, or should I start a scan of 1/2/3?
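A scan like the one proposed could be as simple as walking the tree and attempting a strict UTF-8 decode of every -0.txt file (a sketch, assuming the files fit in memory; `find_bad_utf8` is a hypothetical name):

```python
import os

def find_bad_utf8(root):
    """Yield (path, byte_offset) for every -0.txt file under root
    that is not valid UTF-8, e.g. a file with a mojibake header."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith("-0.txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                raw = fh.read()
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                yield path, err.start

for bad_path, offset in find_bad_utf8("1"):
    print(f"{bad_path}: invalid UTF-8 at byte {offset}")
```

The byte offset points at the first bad sequence, which in the 28789-0.txt case would land in the mojibake header.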
