Invalid UTF8-character in non-UTF8 file is detected too early, so parsing can not be continued #132
Comments
Yes, this is due to buffering. In the CSV backend, character decoding is separate from tokenization (unlike in the JSON backend, where the two are integrated for performance reasons, which also helps with exact error reporting), so decoding proceeds block by block, ahead of tokenization. I wonder if it might actually be possible to improve the UTF-8 reader to postpone error reporting: if at least one character has already been decoded, that character (and whatever else was successfully decoded) would be returned, and an exception would be thrown only if the problem occurs with the first character to decode. What do you think?
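To make the idea of postponed error reporting concrete, here is a minimal sketch using only the JDK's `CharsetDecoder`. It is not the Jackson UTF-8 reader; the class name `DeferredDecodeSketch` and the method `decodeBlock` are hypothetical, and the sketch assumes each call receives a complete block (it does not handle a multi-byte sequence split across two blocks).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of Jackson.
final class DeferredDecodeSketch {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

    // Error seen in a previous call, after some characters had been decoded.
    private CoderResult pendingError;

    /**
     * Decodes as many characters as possible from {@code in}.
     * If a bad byte is hit after at least one character was decoded,
     * the decoded characters are returned and the error is reported
     * on the next call. If the very first byte is bad, it throws at once.
     */
    String decodeBlock(ByteBuffer in) throws CharacterCodingException {
        if (pendingError != null) {
            CoderResult err = pendingError;
            pendingError = null;
            err.throwException(); // always throws for an error result
        }
        // UTF-8 never produces more chars than it consumes bytes.
        CharBuffer out = CharBuffer.allocate(in.remaining());
        CoderResult result = decoder.decode(in, out, true);
        out.flip();
        if (result.isError()) {
            if (out.hasRemaining()) {
                pendingError = result;    // defer: hand back what we have first
            } else {
                result.throwException();  // nothing decoded: fail immediately
            }
        }
        return out.toString();
    }
}
```

With the bytes `a`, `b`, `0xE9`, `c`, `d`, the first call returns `"ab"` and only the second call throws, so a tokenizer consuming the output would see the failure at the position of the bad byte rather than at the start of the block.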
It would be nice to postpone error reporting if possible, as you said. Say lines 50-100 have already been buffered and line 80 contains a character that cannot be decoded: the error should then only surface once parsing reaches line 80. So yes, I think what you describe sounds good. I could also understand the viewpoint that an error should be thrown as early as possible, but for my use case it would be better to postpone it. It would be even better if the reader could recover from the error and continue with the next lines in the file, but this may not be possible.
@flappingeagle I agree with "synchronized" failure, and think that failing too early is not beneficial in most (or perhaps any) cases. So the question is just whether I can figure out how to make this work without adding processing overhead. I think that is possible; I just need to find time to play with the code. Thank you again for reporting this: I think this would be a great improvement. Similar improvements have already been made in the CSV module to allow dealing with occasional malformed/mismapping rows, all in all vastly improving the developer experience.
Will be included in 2.7.7 and 2.8.2, when released.
The following code example can be tested with the attached file (test8.csv). The file is in ISO-8859 format and contains a UTF-8 character: é
Parsing crashes in line 152, at the call to "nextValue()", but the problematic UTF-8 character is in line 185. So parsing does not crash at the position of the problematic character but much earlier (presumably because of buffering?).
I am asking because if parsing crashed at the exact position of the UTF-8 character, we could simply ignore that line and continue with the next one. As it is, parsing crashes earlier and cannot be recovered or continued.
The following parse exception is output:
The problematic character in test8.csv can be found in the vi editor with ":goto 4861".
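As a rough illustration of the "ignore this line and continue" idea, the following sketch pre-filters raw bytes line by line with the JDK's strict UTF-8 decoder, independent of Jackson. The class `LineSkippingSketch` and the method `decodeGoodLines` are hypothetical names. The sketch assumes records are separated by `\n` and that no quoted CSV value contains an embedded newline; a trailing `\r` is left in place.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Jackson.
final class LineSkippingSketch {
    /**
     * Returns the lines that decode cleanly as UTF-8 and appends the
     * 1-based numbers of lines that do not to {@code skipped}.
     */
    static List<String> decodeGoodLines(byte[] data, List<Integer> skipped) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        List<String> good = new ArrayList<>();
        int lineNo = 1;
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            // 0x0A never occurs inside a UTF-8 multi-byte sequence,
            // so splitting on it at the byte level is safe.
            if (i == data.length || data[i] == '\n') {
                // Ignore the empty remainder after a final newline.
                if (i > start || i < data.length) {
                    try {
                        // decode(ByteBuffer) resets the decoder before each use.
                        good.add(decoder.decode(
                                ByteBuffer.wrap(data, start, i - start)).toString());
                    } catch (CharacterCodingException e) {
                        skipped.add(lineNo);
                    }
                }
                lineNo++;
                start = i + 1;
            }
        }
        return good;
    }
}
```

The surviving lines could then be joined and handed to the CSV parser, while the skipped line numbers can be logged for later inspection.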
test8.csv.zip