Use CharsetNormalizer alternative for Chardet #112

Merged · 38 commits · Apr 12, 2021

Conversation

sagars729
Contributor

  • Uses the charset_normalizer library instead of the chardet UniversalDetector
  • Pros: More robust than chardet UniversalDetector
  • Cons: Much slower
    • Chardet on guns.csv file: 0.000698089599609375 seconds
    • Charset Normalizer on guns.csv file: 0.3633871078491211 seconds
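A rough sketch of how the timing comparison above could be reproduced (the guns.csv path is an assumption, and CharsetNormalizerMatches is the 1.x API used in this PR):

import time

import chardet
from charset_normalizer import CharsetNormalizerMatches as CnM

# Hypothetical location of the guns.csv test dataset mentioned above.
file_path = 'dataprofiler/tests/data/csv/guns.csv'

with open(file_path, 'rb') as fp:
    raw_data = fp.read()

# Time chardet on the raw bytes.
start = time.time()
print('chardet:', chardet.detect(raw_data)['encoding'], time.time() - start)

# Time charset_normalizer on the same file.
start = time.time()
print('charset_normalizer:', CnM.from_path(file_path).best().first().encoding,
      time.time() - start)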

Comment on lines 291 to 295
try:
    from charset_normalizer import CharsetNormalizerMatches as CnM
    result = CnM.from_path(file_path, steps=max_lines, chunk_size=buffer_size)
    encoding = result.best().first().encoding
except:
Contributor Author

Used try except rather than adding new requirement

Contributor

Can you add a test that fails without this and works with this?

Contributor Author

Sure, let me look into that

Contributor

@lettergram left a comment

I think a test needs to be added.

In terms of detection, does it need the whole file?

And I hate to say it, but would this be better?

translate/translate#4039

Just not sure on the differences in support and whatnot.

https://github.com/chomechome/charamel

Charamel seems fast and accurate; going to measure the RAM usage though.

Comment on lines 41 to 49
try:
    from charset_normalizer import CharsetNormalizerMatches as CnM
    detected_encoding = \
        data_utils.detect_file_encoding(
            file_path=os.path.join(test_dir, 'txt/utf8.txt'))
    self.assertEqual("utf-8", detected_encoding.lower())
except ImportError:
    detected_encoding = \
        data_utils.detect_file_encoding(
            file_path=os.path.join(test_dir, 'txt/utf8.txt'))
    self.assertEqual("windows-1254", detected_encoding.lower())
Contributor Author

@sagars729 commented Mar 29, 2021

@lettergram I added a test that uses the utf8.txt file from https://github.com/stain/encoding-test-files. Currently, I have it set to use CharsetNormalizer if it can and assert that it's detected as utf-8. Otherwise, it should be incorrectly detected as a windows-1254 file by chardet.

Contributor

I would leave a comment on this case to explain that one of the detections is incorrect, but it is knowingly incorrect in order to show that the CharsetNormalizer is superior and should be tried first.

Contributor Author

Added a comment to clarify that we are looking to verify that chardet returns an incorrect detection compared to charset_normalizer.

@sagars729
Contributor Author

sagars729 commented Mar 29, 2021

In terms of detection, does it need the whole file?

It doesn't seem like it does; both chardet and charset_normalizer support detecting from chunks of bytes and using only a maximum number of chunks, and both libraries also provide a simple call to run on the entire file.

As far as I can tell, charamel doesn't support reading in chunks, though.
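For reference, a minimal sketch of chunked detection with chardet's UniversalDetector; the buffer_size and max_lines parameters mirror detect_file_encoding, but the loop itself is illustrative rather than the PR's code:

from chardet.universaldetector import UniversalDetector

def detect_in_chunks(file_path, buffer_size=1024, max_lines=20):
    # Feed at most max_lines chunks of buffer_size bytes, stopping early
    # once the detector is confident.
    detector = UniversalDetector()
    with open(file_path, 'rb') as fp:
        for _ in range(max_lines):
            chunk = fp.read(buffer_size)
            if not chunk or detector.done:
                break
            detector.feed(chunk)
    detector.close()
    return detector.result['encoding']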

@lettergram
Contributor

@sagars729 it appears you can read in a set number of bytes - https://github.com/chomechome/charamel/blob/f5d664f19b8be70c2361f4037a4f065959e050bd/scripts/benchmark.py#L385

All you'd have to do is wrap it in a lightweight function and load, say, 5 KB or something.

I know it's more work, but would you mind trying that out? The reason I ask is really performance: taking half a second just to sample for encoding seems expensive. I'd like to avoid it if we can.

@sagars729
Contributor Author

sagars729 commented Mar 29, 2021


Sure! I will take a look and add a new commit for it

@lettergram charamel is much faster at getting its predicted encoding (0.0047 seconds), but there is still the overhead of creating the detector object (0.5468 seconds). If the detector object is created on every call, I believe the total time taken would be about the same.
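One way to amortize that cost would be to build the detector once and reuse it; a minimal sketch, assuming charamel's Detector().detect(bytes) interface and the 5 KB sample size suggested above:

import charamel

# Constructing the Detector is the expensive step (~0.55 s above), so do it once.
_CHARAMEL_DETECTOR = charamel.Detector()

def detect_with_charamel(file_path, sample_size=5 * 1024):
    # Detection on a small sample is fast (~0.005 s above).
    with open(file_path, 'rb') as fp:
        sample = fp.read(sample_size)
    encoding = _CHARAMEL_DETECTOR.detect(sample)
    return encoding.value if encoding is not None else None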

@sagars729
Contributor Author

Also, one thing to note is that charset_normalizer may be failing on the iris-utf-8.csv file - it is being recognized as "gb18030" because of the last line in the file. I'm not sure if this is an error or not though. I'll see what happens with charamel

@sagars729
Contributor Author

sagars729 commented Mar 29, 2021


With charamel, the guns.csv file is recognized as CP_775 encoding and iris-utf-8.csv is recognized as TIS_620. Again, this may not be an error, but it is not the expected ascii or utf-8 encoding.

@lettergram
Contributor

@sagars729 what is the current implementation doing?

@sagars729
Contributor Author

@JGSweets moved the Windows-1254 bug check before CnM

@@ -301,7 +300,46 @@ def detect_file_encoding(file_path, buffer_size=1024, max_lines=20):
     encoding = detector.result["encoding"]

     # Typical file representation is utf-8 instead of ascii, treat as such.
-    if not encoding or encoding == 'ascii':
+    if not encoding or encoding == 'ascii' or encoding == 'Windows-1254':
Contributor

@lettergram commented Apr 6, 2021

Why is ascii not okay? ignore me

Contributor

I would suggest using lower here as well and using an in statement to clean up the if statement:

if not encoding or encoding.lower() in ['ascii', 'windows-1254']:
    ...

Contributor Author

Added in new commit

# Check if encoding can be used to decode without throwing an error
def _decode_is_valid(encoding):
    try:
        open(file_path, "rb").read().decode(encoding.lower())
Contributor

@lettergram commented Apr 6, 2021

If you're going to use this, you should limit the number of bytes read (the file could be 1000 GB). I'd recommend no more than 1 MB.

I also recommend opening the file with the encoding type (vs. reading raw bytes and decoding).

Contributor Author

In the new commit, I read in 1 MB (1024*1024 bytes) rather than the entire file and switched to using encoding=encoding in the open call.
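A sketch of what that revision could look like (the helper name and closure over file_path follow the snippet above; the exact code in the commit may differ):

def _decode_is_valid(encoding):
    # Check whether the first ~1 MB of the file decodes cleanly with the
    # candidate encoding, instead of decoding the whole file.
    try:
        with open(file_path, encoding=encoding.lower()) as input_file:
            input_file.read(1024 * 1024)  # roughly 1 MB
        return True
    except (UnicodeDecodeError, LookupError):
        return False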


# Try with small sample
with open(file_path, 'rb') as input_file:
    raw_data = input_file.read(2560)
Contributor

I'd read in 10k samples here; above, you're already reading the entire file.

lettergram previously approved these changes Apr 7, 2021
Comment on lines 36 to 38
# Failing Test
# dict(path=os.path.join(test_dir, 'csv/zomato.csv'),
#      encoding='ISO-8859-1'),
Contributor

This should work, correct?

Contributor Author

Here were the results I got:

  • Chardet (20 lines, 1024 bytes): Windows-1254 => utf-8; fails to decode
  • Chardet (full file): Windows-1254 => utf-8; fails to decode
  • Charset Normalizer (10k sample): iso8859_10; decodes, but content is different
  • Charset Normalizer (full file): cp1125; decodes, but content is different

Contributor Author

@sagars729 commented Apr 12, 2021

Charset Normalizer with a 10k sample is very close, though. It decodes 99.987% of characters correctly in a 1 MB sample of zomato.csv; in other words, 135/1048576 characters were decoded incorrectly.
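A rough sketch of how such a character-level comparison might be computed, assuming ISO-8859-1 (the encoding in the commented-out test) as the reference decoding:

def decode_accuracy(file_path, detected_encoding,
                    reference_encoding='ISO-8859-1',
                    sample_size=1024 * 1024):
    # Compare a 1 MB sample decoded with the detected encoding against the
    # reference decoding, character by character.
    with open(file_path, 'rb') as fp:
        raw = fp.read(sample_size)
    detected = raw.decode(detected_encoding, errors='replace')
    reference = raw.decode(reference_encoding, errors='replace')
    mismatches = sum(a != b for a, b in zip(detected, reference))
    return 1 - mismatches / max(len(reference), 1)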

Contributor Author

I could modify the test to check that 99.9% of characters are decoded correctly, but it would still fail with Chardet (and cause the GitHub Actions run to fail too) since Charset Normalizer isn't a required package. That's why I just commented out the test and added a comment for now.

@JGSweets enabled auto-merge (squash) April 12, 2021 16:38
@JGSweets merged commit b5b963b into capitalone:main Apr 12, 2021
@Ousret

Ousret commented Jul 21, 2021

Hi there!
Sorry to dig up this closed PR.

Thanks for considering Charset-Normalizer. Since your initial merge, a lot has happened: version 2.0+ is now faster than Chardet (twice as fast on average). It would be awesome to collect your thoughts.
Do you have any feedback that I could use for future improvements?

Regards,

@JGSweets
Contributor

@Ousret I tried running Charset-Normalizer on the zomato.csv dataset in our library. However, it is not detecting an encoding.

The code I tried:

import charset_normalizer

file_path = 'dataprofiler/tests/data/csv/zomato.csv'
print(str(charset_normalizer.from_path(file_path).best()))
# output: None

If this is not the appropriate usage, please LMK.

@Ousret

Ousret commented Jul 22, 2021

Hi @JGSweets

Thanks for bringing that to my attention. I will look into it as soon as possible.
In the meantime, you may increase the default maximum chaos/mess threshold. Indeed, this file should return something other than None.
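For example, a best match can usually be recovered by loosening that threshold; a minimal sketch, assuming the threshold keyword of charset_normalizer 2.x's from_path (the 0.5 value is arbitrary):

import charset_normalizer

file_path = 'dataprofiler/tests/data/csv/zomato.csv'

# Allow a higher maximum chaos/mess ratio than the default so noisier files
# still return a best match.
result = charset_normalizer.from_path(file_path, threshold=0.5).best()
print(result.encoding if result is not None else None)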

@Ousret

Ousret commented Jul 23, 2021

I have a patch ready. See jawah/charset_normalizer#72

@Ousret

Ousret commented Jul 30, 2021

Fixed in version 2.0.4; feel free to give us feedback. @JGSweets

stevensecreti pushed a commit to stevensecreti/DataProfiler that referenced this pull request Jun 15, 2022
* Add Manual Build Test

* Use CharsetNormalizer

* Remove Manual Job Run

* Add test for chardet vs charset_normalizer

* Fix typo

* Add comment to describe invalid check

* Move detect_encoding outside of try except block

* Change utf-8 to utf_8

* Run normalizer only if None or known error

* Use CharsetNormalizer if decode fails

* Check results=None

* Modify test to pass w/o CnM

* Faster default to utf8 for Windows-1254

* Fix reads and decoding, use lowercase encoding

* Add zomato.csv file

* Add zomato.csv test

* Add reddit_wsb test

* Fix Incorrect Open

* Verify decoded content rather than encoding

* Add CnM to requirements, Use threshold match accuracy