Encoding problems are not logged correctly #259
There is an information line in the 11558.txt file that says: "Character set encoding: ASCII". We have a convention that 11558.txt should be ASCII, but ASCII cannot represent the characters needed. The file is actually Latin-1, so it should be named 11558-8.txt if it is to exist at all. It turns out that when we have a valid UTF-8 version (and we do, as 11558-0.txt), we have been retiring the -8.txt (Latin-1) file. The "fix" for this, it seems to me, is to remove 11558.zip and 11558.txt, since both are already in old/.
@asylumcs Here's where ebookmaker looks at the metadata header (only for txt files).
The fallback in use is cchardet, which works most of the time, but is a bit expensive. The current error rate on #10000-14000 is 0.2%, and typically only one or two characters are misinterpreted. With better logging, I think that fixing these errors should be feasible, especially if all we have to do is retire a file. There is another oddity here: 11558-0.txt is not in the PG database. I see it was added in late October. Thus it is ignored by ebookconverter. I'm assuming that if many files were added at the same time, they are all invisible to ebookconverter. Conventionally pushed files are added to the database as part of the "dopush" process. So it is not enough to remove the old file, because the new one will not be used. The module that adds files to the database is EbookConverter.FileInfo.
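The header-first, detector-fallback approach described above can be sketched roughly as follows. This is a minimal illustration, not ebookmaker's actual code: the function names are hypothetical, and a crude utf-8/latin-1 cascade stands in for the cchardet fallback.

```python
import re

def sniff_declared_encoding(raw: bytes):
    """Hypothetical helper: read the 'Character set encoding:' line
    from a Project Gutenberg text header, if one is present."""
    m = re.search(rb"Character set encoding:\s*([-\w]+)", raw[:2000])
    return m.group(1).decode("ascii").lower() if m else None

def decode_pg_text(raw: bytes):
    """Trust the declared encoding when it actually decodes the bytes;
    otherwise fall back (cchardet in ebookmaker; a crude
    utf-8-then-latin-1 cascade in this sketch)."""
    declared = sniff_declared_encoding(raw)
    if declared:
        try:
            return raw.decode(declared), declared
        except (UnicodeDecodeError, LookupError):
            pass  # the header lies -- exactly the bug in this issue
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"
```

Note that a header claiming ASCII over Latin-1 bytes fails the trial decode, so the sketch falls through to detection rather than silently producing mojibake.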
It is alarming to me if some books are posted but not in the database. Many books being "invisible to ebookmaker" is serious. My question: which database has them missing? For 11558 I see an entry in what I thought was the database you would be referring to: gutenberg=> select title, author from v_books where fk_books=11558; Please tell me which database is missing posted books so I can repair it, so that the -0.txt book would be used on the sweep and 11558.txt can be retired.
The book is in the database, but the -0.txt file is not. It should be in the files table of the same database. Don't add it by hand; that has a high probability of error. Use FileInfo.
I've pushed a new -0.txt and -h.htm. I've asked Greg to remove the faulty "ASCII" file and the two zips, all of which are already in 'old'. If there are other books without a -0.txt in the database that have the -0.txt in 1/2/3, I'd like to know about it so I can figure out how that could have happened. I don't send anything out of pglaf without it going through dopush, so I don't think it should happen.
Wasn't there some issue with files having the wrong permissions? That could explain it.
I don't recall a permissions problem, but then again I wouldn't have been the one who processed it. I believe that 11558 is now fixed. I don't see that fixing one book closes this issue ("Encoding problems are not logged correctly"). Let me know if there is anything else I need to do.
On testing, the files-not-in-database problem was only the first layer of this problem. The bigger problem is that the -0.txt files are the least preferred by ebookconverter: gutenbergtools/ebookconverter#57
It seems to me the -0.txt files should be the most preferred among the text files, since the "-0.txt" is the UTF-8 file.
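That intended ordering can be expressed as a simple rank over the filename suffixes. This is a sketch of the preference, not the actual logic in ebookconverter (which is what gutenbergtools/ebookconverter#57 changed):

```python
# Suffix preference among the plain-text variants of one book:
# "-0.txt" (UTF-8) best, then "-8.txt" (Latin-1), then ".txt" (ASCII).
PREFERENCE = ("-0.txt", "-8.txt", ".txt")

def rank(filename: str) -> int:
    # Check the more specific suffixes first, since every variant
    # also ends in ".txt".
    for score, suffix in enumerate(PREFERENCE):
        if filename.endswith(suffix):
            return score
    return len(PREFERENCE)  # not a recognized text variant

files = ["11558.txt", "11558-8.txt", "11558-0.txt"]
best = min(files, key=rank)  # picks the UTF-8 file, "11558-0.txt"
```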
That change went into production last night. When this logging issue is addressed (a bit more complicated), we should be able to see if there's a fourth bug.
So the fallout from the code change will be to flush out our bad UTF-8 txt files. For example: files/28789/28789-0.txt
Interesting. The file you mentioned is a UTF-8 file except for the header, which is mojibake. It's not clear how that happened or how prevalent it is. Is something in your process identifying the faulty files, or should I start a scan of 1/2/3?
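A scan of that shape could be quite small. A sketch, assuming the 1/2/3 tree is reachable as a local directory; this is not the project's actual tooling:

```python
import pathlib

def bad_utf8_files(root):
    """Yield (path, byte_offset) for every *-0.txt file under root
    whose contents are not valid UTF-8."""
    for path in sorted(pathlib.Path(root).rglob("*-0.txt")):
        data = path.read_bytes()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as err:
            yield path, err.start  # offset of the first bad byte
```

Reporting the byte offset along with the path makes it easy to see whether a file is mojibake throughout or, like 28789-0.txt, only in its header.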
For example, a problem is reported in a text file for #11558 where the word "naïad" is interpreted as "naļad", because the header declares ASCII but the file is Latin-1. To efficiently correct these, the error should be reported with the filename and the correct book number.
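The specific corruption is consistent with the fallback detector guessing a Baltic charset for a Latin-1 file. A minimal demonstration; the particular wrong codec, ISO-8859-13, is my guess and is not confirmed anywhere in this thread:

```python
raw = b"na\xefad"  # "naïad" encoded as Latin-1: 0xEF is "ï"

print(raw.decode("latin-1"))  # the correct reading: naïad
# In ISO-8859-13 (Baltic), the same byte 0xEF maps to "ļ",
# which reproduces the reported "naļad":
print(raw.decode("iso8859-13"))
# The declared encoding, ASCII, cannot decode the byte at all:
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    print("undecodable at byte", err.start)
```

This is why logging only "encoding error" is not enough: without the filename and book number, there is no way to tell which of the three readings a given file actually got.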