Incorrect endlines handling with compressed input #269
I think the culprit here is the Unicode line separator character (U+2028) hiding in your data.
$ python
Python 3.7.1 (default, Oct 22 2018, 11:21:55)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import collections
>>> collections.Counter(open('1.txt').read())
Counter({' ': 874, 'e': 519, 'n': 444, 't': 388, 'o': 366, 'i': 360, 'a': 338, 'r': 263, 's': 258, 'l': 213, 'd': 143, 'c': 140, 'h': 139, 'm': 134, 'u': 94, 'y': 94, 'f': 93, 'p': 88, ',': 76, 'b': 75, 'g': 70, 'w': 63, '\\': 54, 'v': 48, '.': 41, 'E': 35, 'A': 35, 'T': 33, 'I': 27, 'k': 26, 'L': 26, 'N': 26, 'S': 25, 'M': 20, 'D': 17, 'R': 16, 'P': 14, 'C': 14, 'O': 13, 'B': 12, '/': 11, ':': 9, 'F': 8, 'W': 7, 'H': 6, '-': 6, '"': 5, 'x': 5, '0': 5, 'V': 4, '2': 4, '“': 4, 'q': 4, 'Y': 4, '”': 4, 'G': 4, '(': 3, ')': 3, '1': 3, '3': 3, '’': 3, '\u2028': 3, 'U': 3, '_': 2, '{': 1, '|': 1, '&': 1, '←': 1, '→': 1, 'J': 1, '?': 1, '=': 1, 'z': 1, 'K': 1, '!': 1, '—': 1, '}': 1, '\n': 1})
The reason this affects smart_open is that we use the codecs module of the standard library to perform byte-to-text decoding. Here's how that module behaves:
>>> import codecs
>>> fin_bin = open('1.txt', 'rb')
>>> fin_txt = codecs.getreader('utf-8')(fin_bin)
>>> sum([1 for _ in fin_txt])
4
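The same difference can be reproduced without the original file. Below is a minimal sketch, assuming a made-up in-memory sample (the sample bytes are hypothetical, not the reporter's data). It compares the codecs reader with io.TextIOWrapper, which is what the built-in open returns in text mode:

```python
import codecs
import io

# Hypothetical sample: one U+2028 (LINE SEPARATOR) inside the first line.
sample = "first\u2028still first\nsecond\n".encode("utf-8")

# codecs stream readers split lines with str.splitlines, which also
# treats U+2028 as a line boundary.
codecs_lines = list(codecs.getreader("utf-8")(io.BytesIO(sample)))

# io.TextIOWrapper only recognises \n, \r and \r\n as line endings.
wrapper_lines = list(io.TextIOWrapper(io.BytesIO(sample), encoding="utf-8"))

print(len(codecs_lines), len(wrapper_lines))
```

With this sample the codecs reader yields one more line than io.TextIOWrapper, which matches the extra lines seen in the session above.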
This may be relevant: https://stackoverflow.com/questions/17273598/python-codecs-line-ending
I suspect that gzip does this:
With the codecs route, I think what's happening is:
So, the first method is unaware of the Unicode line separator, and happily ignores it. I'm not sure what the desired behavior should be: do we want to put in the effort to mimic the standard library's open here? @piskvorky
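A rough sketch of the two routes, under the assumption that the gzip route splits lines at the byte level (on b"\n" only) and decodes afterwards, while the codecs route decodes first and then splits with str.splitlines. The sample text is made up for illustration:

```python
import gzip
import io

# Hypothetical sample with a U+2028 inside the first line.
text = "first\u2028still first\nsecond\n"
compressed = gzip.compress(text.encode("utf-8"))

# Route 1: iterate over the binary gzip stream, then decode each line.
# Lines are split on b"\n"; the UTF-8 bytes of U+2028 are left alone.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as fin:
    byte_route = [line.decode("utf-8") for line in fin]

# Route 2: decode everything, then split the way codecs readers do.
# str.splitlines treats U+2028 as a line boundary.
decoded = gzip.decompress(compressed).decode("utf-8")
text_route = decoded.splitlines(keepends=True)

print(len(byte_route), len(text_route))
```

Both routes see the same characters; only the place where the line boundaries fall differs.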
My slight preference would be to mimic
I just got bit by this as well. smart_open reads a \u2028 character fine from a plain text file; however, as soon as that same file with the same contents is compressed, the line containing the \u2028 character gets split across multiple lines. I suspect it's because \u2028 is somehow being treated as an actual newline rather than as ordinary character data. Wouldn't the correct default behaviour be to mimic open on the uncompressed text file, for consistency?
I'm wondering if there is a way to bypass the issue with something like this:
I've looked through the source but can't find a good place to try this out. Any ideas?
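One possible bypass, sketched here with only the standard library: read the compressed stream in binary mode and let io.TextIOWrapper do the decoding, since it splits on \n, \r and \r\n only. The records below are hypothetical, and gzip.GzipFile stands in for whatever binary stream is actually in use:

```python
import gzip
import io
import json

# Hypothetical jsonl records; the first string value contains U+2028.
records = [{"text": "before\u2028after"}, {"text": "plain"}]
payload = "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
compressed = gzip.compress(payload.encode("utf-8"))

# Open the compressed stream in binary mode, then wrap it for decoding.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as fin_bin:
    fin_txt = io.TextIOWrapper(fin_bin, encoding="utf-8")
    # Each line is one complete JSON document, so json.loads succeeds.
    parsed = [json.loads(line) for line in fin_txt]

print(parsed == records)
```

The same wrapping could presumably be applied to a binary-mode stream obtained from smart_open, though that is an assumption and is not tested here.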
Problem: I tried to read a gzipped jsonl file with smart_open and got stuck: smart_open handles line endings incorrectly (it picks them up from inside the JSON instead of using the "real" line endings) -> breaks lines -> breaks jsonl.
Input data:
Code: