Use CharsetNormalizer alternative for Chardet #112

Merged · 38 commits · Apr 12, 2021

Conversation

sagars729
Contributor

  • Uses the charset_normalizer library instead of the chardet UniversalDetector
  • Pros: More robust than chardet UniversalDetector
  • Cons: Much slower
    • Chardet on guns.csv file: 0.000698089599609375 seconds
    • Charset Normalizer on guns.csv file: 0.3633871078491211 seconds
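A rough sketch of how the timing comparison above could be reproduced (the guns.csv path is an assumption, and CharsetNormalizerMatches is the 1.x API used in this PR):

import time

import chardet
from charset_normalizer import CharsetNormalizerMatches as CnM

# Hypothetical location of the guns.csv test dataset mentioned above.
file_path = 'dataprofiler/tests/data/csv/guns.csv'

with open(file_path, 'rb') as fp:
    raw_data = fp.read()

# Time chardet on the raw bytes.
start = time.time()
print('chardet:', chardet.detect(raw_data)['encoding'], time.time() - start)

# Time charset_normalizer on the same file.
start = time.time()
print('charset_normalizer:', CnM.from_path(file_path).best().first().encoding,
      time.time() - start)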

Comment on lines 291 to 295
try:
    from charset_normalizer import CharsetNormalizerMatches as CnM
    result = CnM.from_path(file_path, steps=max_lines, chunk_size=buffer_size)
    encoding = result.best().first().encoding
except:
Contributor Author

Used try except rather than adding new requirement

Contributor

Can you add a test that fails without this and works with this?

Contributor Author

Sure, let me look into that

Contributor

@lettergram left a comment

I think a test needs to be added.

In terms of detection, does it need the whole file?

And I hate to say it, but would this be better?

translate/translate#4039

Just not sure on the differences in support and whatnot.

https://github.com/chomechome/charamel

Charamel seems fast and accurate; going to measure the RAM usage though.

Comment on lines 41 to 49
try:
    from charset_normalizer import CharsetNormalizerMatches as CnM
    detected_encoding = \
        data_utils.detect_file_encoding(
            file_path=os.path.join(test_dir, 'txt/utf8.txt'))
    self.assertEqual("utf-8", detected_encoding.lower())
except ImportError:
    detected_encoding = \
        data_utils.detect_file_encoding(
            file_path=os.path.join(test_dir, 'txt/utf8.txt'))
    self.assertEqual("windows-1254", detected_encoding.lower())
Contributor Author

@sagars729 commented Mar 29, 2021

@lettergram I added a test that uses the utf8.txt file from https://github.com/stain/encoding-test-files. Currently, I have it set to use CharsetNormalizer if it can and assert that it's detected as utf-8. Otherwise, it should be incorrectly detected as a windows-1254 file by chardet.

Contributor

I would leave a comment on this case to explain that one of the detections is incorrect, but it is knowingly incorrect in order to show that the CharsetNormalizer is superior and should be tried first.

Contributor Author

Added a comment to clarify that we are looking to verify that chardet returns an incorrect detection compared to charset_normalizer.

@sagars729
Contributor Author

sagars729 commented Mar 29, 2021

In terms of detection, does it need the whole file?

It doesn't seem like it does; both chardet and charset_normalizer support detecting from chunks of bytes and using only a maximum number of chunks, and both libraries also provide a simple call to run on the entire file.

As far as I can tell, charamel doesn't support reading in chunks, though.
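For reference, a minimal sketch of chunked detection with chardet's UniversalDetector; the buffer_size and max_lines parameters mirror detect_file_encoding, but the loop itself is illustrative rather than the PR's code:

from chardet.universaldetector import UniversalDetector

def detect_in_chunks(file_path, buffer_size=1024, max_lines=20):
    # Feed at most max_lines chunks of buffer_size bytes, stopping early
    # once the detector is confident.
    detector = UniversalDetector()
    with open(file_path, 'rb') as fp:
        for _ in range(max_lines):
            chunk = fp.read(buffer_size)
            if not chunk or detector.done:
                break
            detector.feed(chunk)
    detector.close()
    return detector.result['encoding']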

@lettergram
Contributor

@sagars729 it appears you can read in a set number of bytes - https://github.com/chomechome/charamel/blob/f5d664f19b8be70c2361f4037a4f065959e050bd/scripts/benchmark.py#L385

All you'd have to do is wrap it in a lightweight function and load, say, 5 KB or something.

I know it's more work, but would you mind trying that out? The reason I ask is really performance: taking half a second just to sample for encoding seems expensive. I'd like to avoid it if we can.

@sagars729
Contributor Author

sagars729 commented Mar 29, 2021


Sure! I will take a look and add a new commit for it

@lettergram charamel is much faster at getting its predicted encoding (0.0047 seconds), but there is still the overhead of creating the detector object (0.5468 seconds). If the detector object is created on every call, I believe the total time taken would be about the same.
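One way to amortize that cost would be to build the detector once and reuse it; a minimal sketch, assuming charamel's Detector().detect(bytes) interface and the 5 KB sample size suggested above:

import charamel

# Constructing the Detector is the expensive step (~0.55 s above), so do it once.
_CHARAMEL_DETECTOR = charamel.Detector()

def detect_with_charamel(file_path, sample_size=5 * 1024):
    # Detection on a small sample is fast (~0.005 s above).
    with open(file_path, 'rb') as fp:
        sample = fp.read(sample_size)
    encoding = _CHARAMEL_DETECTOR.detect(sample)
    return encoding.value if encoding is not None else None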

@sagars729
Contributor Author

Also, one thing to note is that charset_normalizer may be failing on the iris-utf-8.csv file - it is being recognized as "gb18030" because of the last line in the file. I'm not sure if this is an error or not though. I'll see what happens with charamel

@sagars729
Contributor Author

sagars729 commented Mar 29, 2021


With charamel, the guns.csv file is recognized as CP_775 encoding and iris-utf-8.csv is recognized as TIS_620. Again, this may not be an error, but it is not the expected ascii or utf-8 encoding.

@lettergram
Contributor

@sagars729 what is the current implementation doing?

@sagars729
Contributor Author

@JGSweets moved the Windows-1254 bug check before CnM

@@ -301,7 +300,46 @@ def detect_file_encoding(file_path, buffer_size=1024, max_lines=20):
     encoding = detector.result["encoding"]

     # Typical file representation is utf-8 instead of ascii, treat as such.
-    if not encoding or encoding == 'ascii':
+    if not encoding or encoding == 'ascii' or encoding == 'Windows-1254':
Contributor

@lettergram commented Apr 6, 2021

Why is ascii not okay? ignore me

Contributor

I would suggest using lower here as well and using an in statement to clean up the if statement:

if not encoding or encoding.lower() in ['ascii', 'windows-1254']:
    ...

Contributor Author

Added in new commit

# Check if encoding can be used to decode without throwing an error
def _decode_is_valid(encoding):
    try:
        open(file_path, "rb").read().decode(encoding.lower())
Contributor

@lettergram commented Apr 6, 2021

If you're going to use this, you should limit the number of bytes read (the file could be 1000 GB). I'd recommend no more than 1 MB.

I also recommend opening the file with the encoding type (vs. reading raw bytes and decoding).

Contributor Author

In the new commit, I read in 1 MB (1024*1024 bytes) rather than the entire file and switched to using encoding=encoding in the open call.
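A sketch of what that revision could look like (the helper name and closure over file_path follow the snippet above; the exact code in the commit may differ):

def _decode_is_valid(encoding):
    # Check whether the first ~1 MB of the file decodes cleanly with the
    # candidate encoding, instead of decoding the whole file.
    try:
        with open(file_path, encoding=encoding.lower()) as input_file:
            input_file.read(1024 * 1024)  # roughly 1 MB
        return True
    except (UnicodeDecodeError, LookupError):
        return False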


# Try with small sample
with open(file_path, 'rb') as input_file:
    raw_data = input_file.read(2560)
Contributor

I'd read in 10k samples here; above, you're already reading the entire file.

lettergram previously approved these changes Apr 7, 2021
Comment on lines 36 to 38
# Failing Test
# dict(path=os.path.join(test_dir, 'csv/zomato.csv'),
#      encoding='ISO-8859-1'),
Contributor

This should work, correct?

Contributor Author

Here were the results I got:

  • Chardet (20 lines, 1024 bytes): Windows-1254 => utf-8; fails to decode
  • Chardet (full file): Windows-1254 => utf-8; fails to decode
  • Charset Normalizer (10k sample): iso8859_10; decodes, but content is different
  • Charset Normalizer (full file): cp1125; decodes, but content is different

Contributor Author

@sagars729 commented Apr 12, 2021

Charset Normalizer with a 10k sample is very close, though. It decodes 99.987% of characters correctly in a 1 MB sample of zomato.csv; in other words, 135/1048576 characters were decoded incorrectly.
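A rough sketch of how such a character-level comparison might be computed, assuming ISO-8859-1 (the encoding in the commented-out test) as the reference decoding:

def decode_accuracy(file_path, detected_encoding,
                    reference_encoding='ISO-8859-1',
                    sample_size=1024 * 1024):
    # Compare a 1 MB sample decoded with the detected encoding against the
    # reference decoding, character by character.
    with open(file_path, 'rb') as fp:
        raw = fp.read(sample_size)
    detected = raw.decode(detected_encoding, errors='replace')
    reference = raw.decode(reference_encoding, errors='replace')
    mismatches = sum(a != b for a, b in zip(detected, reference))
    return 1 - mismatches / max(len(reference), 1)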

Contributor Author

I could modify the test to check that 99.9% of characters are decoded correctly, but it would still fail with Chardet (and cause the GitHub Actions run to fail too) since Charset Normalizer isn't a required package. That's why I just commented out the test and added a comment for now.

@JGSweets enabled auto-merge (squash) April 12, 2021 16:38
@JGSweets merged commit b5b963b into capitalone:main Apr 12, 2021
@Ousret

Ousret commented Jul 21, 2021

Hi there!
Sorry to dig up this closed PR.

Thanks for considering Charset-Normalizer. Since your initial merge, a lot has happened: version 2.0+ is now faster than Chardet (twice as fast on average). It would be awesome to collect your thoughts.
Do you have any feedback that I could use for future improvements?

Regards,

@JGSweets
Contributor

@Ousret I tried running Charset-Normalizer on the zomato.csv dataset in our library. However, it is not detecting an encoding.

The code I tried:

import charset_normalizer

file_path = 'dataprofiler/tests/data/csv/zomato.csv'
print(str(charset_normalizer.from_path(file_path).best()))
# output: None

If this is not the appropriate usage, please LMK.

@Ousret

Ousret commented Jul 22, 2021

Hi @JGSweets

Thanks for bringing that to my attention. I will look into it as soon as possible.
In the meantime, you may increase the default maximum chaos/mess threshold. Indeed, this file should return something other than None.
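For example, a best match can usually be recovered by loosening that threshold; a minimal sketch, assuming the threshold keyword of charset_normalizer 2.x's from_path (the 0.5 value is arbitrary):

import charset_normalizer

file_path = 'dataprofiler/tests/data/csv/zomato.csv'

# Allow a higher maximum chaos/mess ratio than the default so noisier files
# still return a best match.
result = charset_normalizer.from_path(file_path, threshold=0.5).best()
print(result.encoding if result is not None else None)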

@Ousret

Ousret commented Jul 23, 2021

I have a patch ready. See jawah/charset_normalizer#72

@Ousret

Ousret commented Jul 30, 2021

Fixed in version 2.0.4; feel free to give us feedback. @JGSweets

stevensecreti pushed a commit to stevensecreti/DataProfiler that referenced this pull request Jun 15, 2022
* Add Manual Build Test

* Use CharsetNormalizer

* Remove Manual Job Run

* Add test for chardet vs charset_normalizer

* Fix typo

* Add comment to describe invalid check

* Move detect_encoding outside of try except block

* Change utf-8 to utf_8

* Run normalizer only if None or known error

* Use CharsetNormalizer if decode fails

* Check results=None

* Modify test to pass w/o CnM

* Faster default to utf8 for Windows-1254

* Fix reads and decoding, use lowercase encoding

* Add zomato.csv file

* Add zomato.csv test

* Add reddit_wsb test

* Fix Incorrect Open

* Verify decoded content rather than encoding

* Add CnM to requirements, Use threshold match accuracy