Incorrect endlines handling with compressed input #269

menshikh-iv · 2019-03-07T06:18:25Z

Problem: I tried to read gzipped JSONL with smart_open and ran into a problem: smart_open handles line endings incorrectly. It picks up line breaks from inside the JSON content instead of using only the "real" end of line, which breaks the lines and therefore the JSONL.

Input data:

echo '{"content":"Reptilian and Dragon like Encounters | EveLorgen.com\nEveLorgen.com\n& The Alien Love Bite\nSearch\nMain menu\nSkip to primary content\nHome\nNews\nArticles\nAlien Abduction\nAlien Love Bite Related\nAlien or Demonic\nAnomalous Trauma\nEmotional and Psychic Vampirism\nMedical and Scientific Aspects of Alien Abduction\nMilitary Abduction (MILABS) and Reptilians\nMind Control\nMiscellaneous\nPoetry and Mystic Prose\nPsychology and Relationships\nSpiritual Warfare and the Human Soul\nBooks\nDrawings\nRadio\nVideos\nSubscribe\nBio and Colleagues\nContact\nTestimonials\nPost navigation\n← Previous Next →\nReptilian and Dragon like Encounters\nPosted on May 10, 2013 by eve\t\nThe first article link is another interview with Matt R, on his reptilian encounters and DNA activations. The reptilians are very particular about pedigrees and will follow these bloodlines like hound dogs. Here is an exerpt of Matt’s article:\n“Reptilians often inform their abductees that they are descended from reptilian bloodlines. They are very specific about the nature of this pedigree. So specific, that they are able to determine which abductees is from what reptilian family line. Joe Montaldo actually did a show on this topic a few years ago http://www.youtube.com/watch?v=EiuTqCDE0zY.”\nThe full article can be found here: http://naturalplane.blogspot.com/2013/03/here-be-dragons-new-katrina-abductions.html\nThis next article , “A Summons to Appear: Blood of Dragons, Part 2” is the sequel to Ken Bakeman’s encounter with several entities who engaged him in a forced baptism like ritual. 
The primary beings involved in this baptism encounter is a toad like reptilian, a mantis type creature and eyeless reptilians in robes, as well as the royal large tall Dragon like beings with wings.\nhttp://www.kenbakeman.com/reptilian_baptism_p2.html\nBe Sociable, Share!\nTweet\nThis entry was posted in Alien Abduction, Military Abduction (MILABS) and Reptilians, News and tagged dragon bloodlines, dragons, milabs, reptilians by eve. Bookmark the permalink.\t\nTerms and Conditions of www.evelorgen.com Website Material:\nThe content written by myself or other authors, and people I have interviewed are for information only, intended for the benefit of people seeking truth, freedom, personal growth and expansion of awareness. I may not agree with all content or opinions of other contributing authors or interviewees.\nThis website as an independent “entity” shall not permit and is protected from any malevolent intended attack, to undermine, subvert, harm or intent of any strategy of attack to be permitted to affect myself, family members, colleagues, contributors to my web site or any loved ones that have ties to me on all levels, and all dimensions of time.\nAnyone, or group who goes to my web site to partake in reading the information can not use it to harm anyone or anything, or use it in any way whatsoever for purposes of deception, harm or any agreements of entrapment or snares of any kind. I hold this to be in effect on all levels and dimensions of time.\nDeclaration of NON CONSENT FOR INTERFERENCE:\nLet it be known, I do not consent to any agreement of entrapment that bears intention to deceive, misinform, manipulate, exploit, control, steal, harvest, seduce, harm or negatively influence my being, in mind, soul, spirit, body and physical place of habitation, business, website or published works in any way across all levels, dimensions and time, whether they are fabricated linear or synthetic creations or times on all levels and dimensions. 
Through my not consenting, I intend protection from harm and maintain neutrality, so that my presence of being honors Truth, compassion, wisdom, harmony, healing, constant awakening and life, so as to not be trapped, to the best of my ability in every situation.\nI do not consent to false limiting beliefs or false soul “programs” driving my body and consciousness, but rather my highest Spirit’s truth within without limitation as a Creator as integrated mind, soul and spirit of original Primordial consciousness. Let it be known that by my choice to NOT CONSENT to any agreement of entrapment on any level, on all levels, across all dimensions and for all time, it is in effect now and forevermore. I hold that such is true and in effect, that any such agreement of entrapment, deception, and harmful intention, now be DEEMED null and void based on the intention of its creator to harm and not honor my life, my sovereign being and free will. No singular or collective entity, or artificial intelligence is under any circumstances given permission (of malintent) to enter my Universe, life, dimensions, levels or time. If there are such attempts to ignore the LAW, they are responsible for one thousand times the consequences of that breach in self-destruction—and are fully legally responsible for their choices. The choice given is to not interfere or accept the consequences as stated. Should you choose to override our LAW, knowing the full terms and conditions stated, I in no way can be held responsible or harmed for any choice that breaches my LAW on any level, on all dimensions across all times and future cycles of time. I claim the Law and I Am the Law. I forbid any singular or collective entities to attempt to breach my Law and Not Consent to my LAW, and therefore am protected from entering any Game, or ANY and ALL Games set out to ensnare me out of my own SOVEREIGN BEING. 
They will bring upon themselves their own intention in harm.\nI HOLD THIS TO BE IN EFFECT IMMEDIATELY ON ALL LEVELS AND ALL DIMENSIONS OF TIME AND SPACE, PAST PRESENT AND FOR THE FUTURE CYCLES OF TIME.\"\nI do not offer legal, medical, psychiatric or clinical psychological diagnosis and therefore am not liable for any claims against such.\nProudly powered by WordPress\n"}' > 1.txt
cat 1.txt | gzip > 1.txt.gz

Code:

from smart_open import smart_open
import gzip

with smart_open("1.txt", "r") as infile:
    num_lines = sum(1 for _ in infile) 
    assert num_lines == 1  # correct

with gzip.open("1.txt.gz", "r") as infile:
    num_lines = sum(1 for _ in infile) 
    assert num_lines == 1  # correct

with smart_open("1.txt.gz", "r") as infile:
    num_lines = sum(1 for _ in infile) 
    assert num_lines == 1,  num_lines  # wrong, num_lines=4

mpenkov · 2019-03-07T07:35:27Z

I think the culprit here is the Unicode line separator character hiding in your data.

$ python
Python 3.7.1 (default, Oct 22 2018, 11:21:55) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import collections
>>> collections.Counter(open('1.txt').read())
Counter({' ': 874, 'e': 519, 'n': 444, 't': 388, 'o': 366, 'i': 360, 'a': 338, 'r': 263, 's': 258, 'l': 213, 'd': 143, 'c': 140, 'h': 139, 'm': 134, 'u': 94, 'y': 94, 'f': 93, 'p': 88, ',': 76, 'b': 75, 'g': 70, 'w': 63, '\\': 54, 'v': 48, '.': 41, 'E': 35, 'A': 35, 'T': 33, 'I': 27, 'k': 26, 'L': 26, 'N': 26, 'S': 25, 'M': 20, 'D': 17, 'R': 16, 'P': 14, 'C': 14, 'O': 13, 'B': 12, '/': 11, ':': 9, 'F': 8, 'W': 7, 'H': 6, '-': 6, '"': 5, 'x': 5, '0': 5, 'V': 4, '2': 4, '“': 4, 'q': 4, 'Y': 4, '”': 4, 'G': 4, '(': 3, ')': 3, '1': 3, '3': 3, '’': 3, '\u2028': 3, 'U': 3, '_': 2, '{': 1, '|': 1, '&': 1, '←': 1, '→': 1, 'J': 1, '?': 1, '=': 1, 'z': 1, 'K': 1, '!': 1, '—': 1, '}': 1, '\n': 1})

The reason why this affects smart_open is: we use the codecs module of the standard library to perform byte-to-text decoding. Here's how that module performs:

>>> import codecs
>>> fin_bin = open('1.txt', 'rb')
>>> fin_txt = codecs.getreader('utf-8')(fin_bin)
>>> sum([1 for _ in fin_txt])
4
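
A smaller reproduction of the same effect, without the original file, might look like the sketch below. The sample record is made up for illustration; it compares the `codecs` reader with `io.TextIOWrapper` on identical bytes containing U+2028.

```python
import codecs
import io

# One JSON-like record containing U+2028 (LINE SEPARATOR) twice, ending in "\n".
# The record is illustrative sample data, not taken from the original file.
data = '{"content": "a\u2028b\u2028c"}\n'.encode("utf-8")

# codecs route: the StreamReader splits decoded text with str.splitlines(),
# which treats U+2028 as a line boundary.
codecs_lines = list(codecs.getreader("utf-8")(io.BytesIO(data)))

# io route: TextIOWrapper only recognises "\n", "\r" and "\r\n" as line endings.
io_lines = list(io.TextIOWrapper(io.BytesIO(data), encoding="utf-8"))

print(len(codecs_lines))  # 3
print(len(io_lines))      # 1
```

Two separators inside a single record yield three lines through `codecs`, which matches the pattern reported above (three U+2028 characters, four lines).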

mpenkov · 2019-03-10T08:06:55Z

This may be relevant: https://stackoverflow.com/questions/17273598/python-codecs-line-ending

I suspect that gzip does this:

1. Split the bytes by end-of-line characters (ASCII values)
2. Decode each resulting line separately

With the codecs route, I think what's happening is:

1. Decode everything from bytes to Unicode on the fly
2. Split the Unicode by end-of-line characters (Unicode)

So, the first method is unaware of the Unicode line separator, and happily ignores it. I'm not sure what the desired behavior should be - do we want to apply effort to mimic the standard library's open here? @piskvorky
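
The two routes described above can be sketched with plain `bytes` and `str` operations. This only illustrates the two strategies; it is not a copy of what gzip or codecs do internally, and the sample text is made up.

```python
# Sample input: two "real" lines, the first containing a U+2028 separator.
raw = "first\u2028still first\nsecond\n".encode("utf-8")

# Route 1: split the bytes on ASCII line endings, then decode each piece.
# bytes.splitlines() only knows "\n", "\r" and "\r\n".
bytes_first = [chunk.decode("utf-8") for chunk in raw.splitlines(keepends=True)]

# Route 2: decode everything, then split the text on Unicode line boundaries.
# str.splitlines() also breaks on U+2028, U+2029, vertical tab, form feed, etc.
text_first = raw.decode("utf-8").splitlines(keepends=True)

print(len(bytes_first))  # 2
print(len(text_first))   # 3
```

Route 1 keeps U+2028 inside the first line; route 2 cuts the line at it.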

piskvorky · 2019-03-10T09:46:42Z

My slight preference would be to mimic open, yes. But not important either way (not worth any massive refactoring IMO).

wordtracker · 2019-12-01T19:32:57Z

I just got bitten by this as well. smart_open reads a \u2028 character fine from a text file; however, as soon as that same file with the same contents is compressed, the line with the \u2028 character gets split across multiple lines.

I suspect it's because \u2028 is somehow evaluated as a newline rather than kept as the Unicode character it represents. Wouldn't the correct default behaviour be to mimic open on the uncompressed text file, for consistency?

wordtracker · 2019-12-01T20:38:26Z

I'm wondering if there is a way to bypass the issue with something like this:
file_contents = fh.read().replace('\u2028',' ')

I've looked through the source but can't find a good place to try this out. Any ideas?
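
One possible workaround that avoids rewriting the data is to read the compressed file in binary mode and do the decoding with `io.TextIOWrapper`. The sketch below uses only the standard library (`gzip`, `io`); the file name and record are made up for illustration, and it is an assumption that a binary stream obtained from smart_open could be wrapped the same way.

```python
import gzip
import io
import json
import os
import tempfile

# Illustrative record with a raw U+2028 inside the string value.
record = {"content": "before\u2028after"}

# Write one JSON line; ensure_ascii=False keeps U+2028 unescaped in the output.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(path, "wb") as fout:
    fout.write((json.dumps(record, ensure_ascii=False) + "\n").encode("utf-8"))

# Read in binary mode and decode with TextIOWrapper, which treats only
# "\n", "\r" and "\r\n" as line endings, so U+2028 stays inside the line.
with gzip.open(path, "rb") as fin_bin:
    with io.TextIOWrapper(fin_bin, encoding="utf-8") as fin_txt:
        records = [json.loads(line) for line in fin_txt]

print(len(records))  # 1
```

Unlike replacing `'\u2028'` after `read()`, this keeps the original character in the data and does not require loading the whole file into memory.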

menshikh-iv added the bug label Mar 7, 2019

menshikh-iv assigned mpenkov Mar 7, 2019

mpenkov added the hard and help wanted labels Sep 28, 2019

mpenkov mentioned this issue Nov 30, 2019

inconsistent with bz2.open on files containing vertical tab ^K #394

Closed

3 tasks

markopy mentioned this issue Jan 10, 2021

Line splitting on \u2028 from S3 #557

Closed

3 tasks

markopy mentioned this issue Jan 17, 2021

Replace codecs with TextIOWrapper to fix newline issues when reading text files #578

Merged

mpenkov closed this as completed in #578 Jan 18, 2021
