Use CharsetNormalizer alternative for Chardet #112
Conversation
sagars729 commented on Mar 29, 2021
- Uses the charset_normalizer library instead of the chardet UniversalDetector
  - Pros: More robust than the chardet UniversalDetector
  - Cons: Much slower
    - Chardet on the guns.csv file: 0.000698089599609375 seconds
    - Charset Normalizer on the guns.csv file: 0.3633871078491211 seconds
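For reference, a minimal timing sketch along these lines (assuming a local guns.csv path, the pre-2.0 CharsetNormalizerMatches API used in this PR, and chardet's one-shot detect for brevity) could look like:

import time

import chardet
from charset_normalizer import CharsetNormalizerMatches as CnM

file_path = "data/guns.csv"  # hypothetical local path

# chardet: one-shot detection on a bounded sample of the file.
start = time.perf_counter()
with open(file_path, "rb") as fp:
    sample = fp.read(20 * 1024)
print("chardet:", chardet.detect(sample)["encoding"], time.perf_counter() - start)

# charset_normalizer (pre-2.0 API, as used in this PR).
start = time.perf_counter()
matches = CnM.from_path(file_path, steps=20, chunk_size=1024)
print("charset_normalizer:", matches.best().first().encoding,
      time.perf_counter() - start)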
Pull Changes
Pull Upstream Main
Pull Upstream Changes
Improve delimiter detection (capitalone#75)
Pull Upstream Changes
try:
    from charset_normalizer import CharsetNormalizerMatches as CnM
    result = CnM.from_path(file_path, steps=max_lines, chunk_size=buffer_size)
    encoding = result.best().first().encoding
except:
Used try/except rather than adding a new requirement.
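To make the fallback concrete, here is a minimal sketch of that optional-import pattern, with the chardet path assumed from the surrounding detect_file_encoding function. Note the PR uses a bare except (so detection errors also fall through), while this sketch narrows it to ImportError for illustration:

from chardet.universaldetector import UniversalDetector

def _detect_encoding_sketch(file_path, buffer_size=1024, max_lines=20):
    """Prefer charset_normalizer when installed, otherwise fall back to chardet."""
    try:
        # Optional dependency: an ImportError here means it is not installed.
        from charset_normalizer import CharsetNormalizerMatches as CnM
        result = CnM.from_path(file_path, steps=max_lines, chunk_size=buffer_size)
        return result.best().first().encoding
    except ImportError:
        # Fallback: chardet's incremental detector on a bounded sample.
        detector = UniversalDetector()
        with open(file_path, "rb") as fp:
            for _ in range(max_lines):
                chunk = fp.read(buffer_size)
                if not chunk:
                    break
                detector.feed(chunk)
                if detector.done:
                    break
        detector.close()
        return detector.result["encoding"]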
Can you add a test that fails without this and works with this?
Sure, let me look into that
I think a test needs to be added.
In terms of detection, does it need the whole file?
And I hate to say it, but would this be better? Just not sure on the differences in support and whatnot:
https://github.com/chomechome/charamel
Charamel seems fast and accurate; going to measure the RAM though.
try:
    from charset_normalizer import CharsetNormalizerMatches as CnM
    detected_encoding = \
        data_utils.detect_file_encoding(file_path=os.path.join(test_dir, 'txt/utf8.txt'))
    self.assertEqual("utf-8", detected_encoding.lower())
except:
    detected_encoding = \
        data_utils.detect_file_encoding(file_path=os.path.join(test_dir, 'txt/utf8.txt'))
    self.assertEqual("windows-1254", detected_encoding.lower())
@lettergram I added a test that uses the utf8.txt file from https://github.com/stain/encoding-test-files. Currently, I have it set to use CharsetNormalizer if it can and assert that it's detected as utf-8. Otherwise, it should be incorrectly detected as a windows-1254 file by chardet.
I would leave a comment on this case to explain that one of the detections is incorrect, but it is knowingly incorrect in order to show that the CharsetNormalizer is superior and should be tried first.
Added a comment to clarify that we are looking to verify that chardet returns an incorrect detection as compared to charset_normalizer
It doesn't seem like it does; both chardet and charset_normalizer support detecting in chunks of bytes and only using a max number of chunks, but both libraries also provide a simple command to run on the entire file. I don't think charamel supports reading in chunks, though, as far as I can tell.
@sagars729 it appears you can read in a set number of bytes - https://github.com/chomechome/charamel/blob/f5d664f19b8be70c2361f4037a4f065959e050bd/scripts/benchmark.py#L385 All you'd have to do is wrap it in a lightweight function and load, say, 5 KB or something. I know it's more work, but would you mind trying that out? The reason I ask is really performance: taking half a second just to sample for encoding seems expensive. I'd like to avoid it if we can.
Sure! I will take a look. @lettergram charamel is much faster at getting its predicted encoding (0.0047 seconds), but there is still the overhead of creating the detector object (0.5468 seconds). If the detector object is created every time, I believe the total time taken would be about the same.
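For what it's worth, a lightweight wrapper of the kind suggested above might look like the sketch below; it assumes charamel's Detector.detect API and reuses a single module-level detector so the ~0.55 second construction cost is paid only once:

import charamel

# Construct the detector once at import time; per the numbers above,
# construction is the expensive part, detection itself is fast.
_DETECTOR = charamel.Detector()

def detect_encoding_sample(file_path, sample_size=5 * 1024):
    """Detect the encoding using only the first ~5 KB of the file."""
    with open(file_path, "rb") as fp:
        sample = fp.read(sample_size)
    encoding = _DETECTOR.detect(sample)
    # charamel returns an Encoding enum member whose value is a Python codec name.
    return encoding.value if encoding is not None else None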
Also, one thing to note is that charset_normalizer may be failing on the iris-utf-8.csv file - it is being recognized as "gb18030" because of the last line in the file. I'm not sure if this is an error or not, though. I'll see what happens with charamel.
With charamel, the guns.csv file is recognized as CP_775 encoding and iris-utf-8.csv is recognized as TIS_620. Again, this may not be an error, but it is not the expected ascii or utf-8 encoding.
@sagars729 what is the current implementation doing?
@JGSweets moved the Windows-1254 bug check before CnM
@@ -301,7 +300,46 @@ def detect_file_encoding(file_path, buffer_size=1024, max_lines=20):
     encoding = detector.result["encoding"]

     # Typical file representation is utf-8 instead of ascii, treat as such.
-    if not encoding or encoding == 'ascii':
+    if not encoding or encoding == 'ascii' or encoding == 'Windows-1254':
Why is ascii not okay? Ignore me.
I would suggest using lower here as well and using an in statement to clean up the if statement:
if not encoding or encoding.lower() in ['ascii', 'windows-1254']:
    ...
Added in new commit
# Check if encoding can be used to decode without throwing an error
def _decode_is_valid(encoding):
    try:
        open(file_path, "rb").read().decode(encoding.lower())
If you're going to use this, you should limit the number of bytes read (the file could be 1000 GB). I recommend no more than 1 MB.
I also recommend opening it using the encoding type (vs. decoding & encoding).
I read in 1 MB = 1024*1024 bytes rather than the entire file and switched to using encoding=encoding in the open command in the new commit.
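A sketch of what that revised helper could look like (file_path is assumed to be in scope from the enclosing function, as in the snippet above; note that a text-mode read(n) caps characters rather than bytes, roughly 1 MB for single-byte encodings):

def _decode_is_valid(encoding):
    """Return True if a ~1 MB sample of the file decodes cleanly with `encoding`."""
    try:
        with open(file_path, encoding=encoding.lower()) as fp:
            fp.read(1024 * 1024)  # read at most ~1M characters, not the whole file
        return True
    except (UnicodeDecodeError, LookupError):
        return False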
# Try with small sample
with open(file_path, 'rb') as input_file:
    raw_data = input_file.read(2560)
I'd read in a 10k sample; above you're already reading the entire file.
# Failing Test
# dict(path=os.path.join(test_dir, 'csv/zomato.csv'),
#      encoding='ISO-8859-1'),
This should work, correct?
Here were the results I got:
- Chardet (20L, 1024B): Windows-1254 => utf-8; fails to decode
- Chardet (full file): Windows-1254 => utf-8; fails to decode
- Charset Normalizer (10k sample): iso8859_10; decodes, but content is different
- Charset Normalizer (full file): cp1125; decodes, but content is different
Charset Normalizer with a 10k sample is very close, though. It gets 99.987% of characters correctly decoded in a 1 MB sample of zomato.csv; in other words, 135/1048576 characters were decoded incorrectly.
I could modify the test to check for getting 99.9% of characters, but it would still fail with Chardet (and cause the GitHub Actions to fail too), as Charset Normalizer isn't a required package. That's why I just commented out the test and added a comment for now.
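For illustration, a threshold-based check of that kind might be sketched as follows; match_accuracy is a hypothetical helper comparing decodings of the same byte sample under the detected and the known-correct encodings:

def match_accuracy(file_path, detected_encoding, expected_encoding,
                   sample_size=1024 * 1024):
    """Fraction of characters that agree between the two decodings of a sample."""
    with open(file_path, "rb") as fp:
        raw = fp.read(sample_size)
    detected = raw.decode(detected_encoding, errors="replace")
    expected = raw.decode(expected_encoding, errors="replace")
    same = sum(1 for a, b in zip(detected, expected) if a == b)
    return same / max(len(expected), 1)

# e.g. self.assertGreater(match_accuracy(path, detected, 'ISO-8859-1'), 0.999)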
Hi there! Thanks for considering Charset-Normalizer. Since your initial merge, a lot has happened: version 2.0+ is now faster than Chardet (twice as fast on average). Would be awesome to collect your thoughts. Regards,
@Ousret I tried running it. The code I tried:

import charset_normalizer
file_path = 'dataprofiler/tests/data/csv/zomato.csv'
print(str(charset_normalizer.from_path(file_path).best()))
# output: None

If this is not the appropriate usage, please LMK.
Hi @JGSweets Thanks for bringing that to my attention. I will look into it as soon as possible.
I have a patch ready. See jawah/charset_normalizer#72
Fixed in version 2.0.4, feel free to give us feedback. @JGSweets
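With the 2.0.x API, a quick re-check of the earlier zomato.csv attempt could look something like this sketch; best() returns None when no match is found, otherwise a match object exposing an encoding attribute:

import charset_normalizer

best = charset_normalizer.from_path("dataprofiler/tests/data/csv/zomato.csv").best()
print(best.encoding if best is not None else None)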
* Add Manual Build Test
* Use CharsetNormalizer
* Remove Manual Job Run
* Add test for chardet vs charset_normalizer
* Fix typo
* Add comment to describe invalid check
* Move detect_encoding outside of try except block
* Change utf-8 to utf_8
* Run normalizer only if None or known error
* Use CharsetNormalizer if decode fails
* Check results=None
* Modify test to pass w/o CnM
* Faster default to utf8 for Windows-1254
* Fix reads and decoding, use lowercase encoding
* Add zomato.csv file
* Add zomato.csv test
* Add reddit_wsb test
* Fix Incorrect Open
* Verify decoded content rather than encoding
* Add CnM to requirements, Use threshold match accuracy