Incorrect endlines handling with compressed input #269

Closed
menshikh-iv opened this issue Mar 7, 2019 · 5 comments · Fixed by #578
Labels: bug, hard, help wanted (We can't figure this out, if you can, then please help!)

@menshikh-iv
Contributor

menshikh-iv commented Mar 7, 2019

Problem: I tried to read a gzipped JSONL file with smart_open and got stuck: smart_open handles line endings incorrectly. It picks up line separators from inside the JSON content instead of the "real" end of line, which splits a single record across several lines and breaks the JSONL.

Input data:

echo '{"content":"Reptilian and Dragon like Encounters | EveLorgen.com\nEveLorgen.com\n& The Alien Love Bite\nSearch\nMain menu\nSkip to primary content\nHome\nNews\nArticles\nAlien Abduction\nAlien Love Bite Related\nAlien or Demonic\nAnomalous Trauma\nEmotional and Psychic Vampirism\nMedical and Scientific Aspects of Alien Abduction\nMilitary Abduction (MILABS) and Reptilians\nMind Control\nMiscellaneous\nPoetry and Mystic Prose\nPsychology and Relationships\nSpiritual Warfare and the Human Soul\nBooks\nDrawings\nRadio\nVideos\nSubscribe\nBio and Colleagues\nContact\nTestimonials\nPost navigation\n← Previous Next →\nReptilian and Dragon like Encounters\nPosted on May 10, 2013 by eve\t\nThe first article link is another interview with Matt R, on his reptilian encounters and DNA activations. The reptilians are very particular about pedigrees and will follow these bloodlines like hound dogs. Here is an exerpt of Matt’s article:\n“Reptilians often inform their abductees that they are descended from reptilian bloodlines. They are very specific about the nature of this pedigree. So specific, that they are able to determine which abductees is from what reptilian family line. Joe Montaldo actually did a show on this topic a few years ago http://www.youtube.com/watch?v=EiuTqCDE0zY.”\nThe full article can be found here: http://naturalplane.blogspot.com/2013/03/here-be-dragons-new-katrina-abductions.html\nThis next article , “A Summons to Appear: Blood of Dragons, Part 2” is the sequel to Ken Bakeman’s encounter with several entities who engaged him in a forced baptism like ritual. 
The primary beings involved in this baptism encounter is a toad like reptilian, a mantis type creature and eyeless reptilians in robes, as well as the royal large tall Dragon like beings with wings.\nhttp://www.kenbakeman.com/reptilian_baptism_p2.html\nBe Sociable, Share!\nTweet\nThis entry was posted in Alien Abduction, Military Abduction (MILABS) and Reptilians, News and tagged dragon bloodlines, dragons, milabs, reptilians by eve. Bookmark the permalink.\t\nTerms and Conditions of www.evelorgen.com Website Material:\nThe content written by myself or other authors, and people I have interviewed are for information only, intended for the benefit of people seeking truth, freedom, personal growth and expansion of awareness. I may not agree with all content or opinions of other contributing authors or interviewees.\nThis website as an independent “entity” shall not permit and is protected from any malevolent intended attack, to undermine, subvert, harm or intent of any strategy of attack to be permitted to affect myself, family members, colleagues, contributors to my web site or any loved ones that have ties to me on all levels, and all dimensions of time.\nAnyone, or group who goes to my web site to partake in reading the information can not use it to harm anyone or anything, or use it in any way whatsoever for purposes of deception, harm or any agreements of entrapment or snares of any kind. I hold this to be in effect on all levels and dimensions of time.\nDeclaration of NON CONSENT FOR INTERFERENCE:\nLet it be known, I do not consent to any agreement of entrapment that bears intention to deceive, misinform, manipulate, exploit, control, steal, harvest, seduce, harm or negatively influence my being, in mind, soul, spirit, body and physical place of habitation, business, website or published works in any way across all levels, dimensions and time, whether they are fabricated linear or synthetic creations or times on all levels and dimensions.
Through my not consenting, I intend protection from harm and maintain neutrality, so that my presence of being honors Truth, compassion, wisdom, harmony, healing, constant awakening and life, so as to not be trapped, to the best of my ability in every situation.\nI do not consent to false limiting beliefs or false soul “programs” driving my body and consciousness, but rather my highest Spirit’s truth within without limitation as a Creator as integrated mind, soul and spirit of original Primordial consciousness.
Let it be known that by my choice to NOT CONSENT to any agreement of entrapment on any level, on all levels, across all dimensions and for all time, it is in effect now and forevermore. I hold that such is true and in effect, that any such agreement of entrapment, deception, and harmful intention, now be DEEMED null and void based on the intention of its creator to harm and not honor my life, my sovereign being and free will.
No singular or collective entity, or artificial intelligence is under any circumstances given permission (of malintent) to enter my Universe, life, dimensions, levels or time. If there are such attempts to ignore the LAW, they are responsible for one thousand times the consequences of that breach in self-destruction—and are fully legally responsible for their choices. The choice given is to not interfere or accept the consequences as stated. Should you choose to override our LAW, knowing the full terms and conditions stated, I in no way can be held responsible or harmed for any choice that breaches my LAW on any level, on all dimensions across all times and future cycles of time. I claim the Law and I Am the Law. I forbid any singular or collective entities to attempt to breach my Law and Not Consent to my LAW, and therefore am protected from entering any Game, or ANY and ALL Games set out to ensnare me out of my own SOVEREIGN BEING. They will bring upon themselves their own intention in harm.\nI HOLD THIS TO BE IN EFFECT IMMEDIATELY ON ALL LEVELS AND ALL DIMENSIONS OF TIME AND SPACE, PAST PRESENT AND FOR THE FUTURE CYCLES OF TIME.\"\nI do not offer legal, medical, psychiatric or clinical psychological diagnosis and therefore am not liable for any claims against such.\nProudly powered by WordPress\n"}' > 1.txt
cat 1.txt | gzip > 1.txt.gz

Code:

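from smart_open import smart_open
import gzip

# Plain text file in text mode: one JSONL record, one line.
with smart_open("1.txt", "r") as infile:
    num_lines = sum(1 for _ in infile)
    assert num_lines == 1  # correct

# gzip.open with mode "r" reads bytes, so lines are split on b"\n" only.
with gzip.open("1.txt.gz", "r") as infile:
    num_lines = sum(1 for _ in infile)
    assert num_lines == 1  # correct

# The same gzipped file through smart_open in text mode: the record is split.
with smart_open("1.txt.gz", "r") as infile:
    num_lines = sum(1 for _ in infile)
    assert num_lines == 1, num_lines  # wrong, num_lines=4
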
@mpenkov
Collaborator

mpenkov commented Mar 7, 2019

I think the culprit here is the Unicode line separator character hiding in your data.

$ python
Python 3.7.1 (default, Oct 22 2018, 11:21:55) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import collections
>>> collections.Counter(open('1.txt').read())
Counter({' ': 874, 'e': 519, 'n': 444, 't': 388, 'o': 366, 'i': 360, 'a': 338, 'r': 263, 's': 258, 'l': 213, 'd': 143, 'c': 140, 'h': 139, 'm': 134, 'u': 94, 'y': 94, 'f': 93, 'p': 88, ',': 76, 'b': 75, 'g': 70, 'w': 63, '\\': 54, 'v': 48, '.': 41, 'E': 35, 'A': 35, 'T': 33, 'I': 27, 'k': 26, 'L': 26, 'N': 26, 'S': 25, 'M': 20, 'D': 17, 'R': 16, 'P': 14, 'C': 14, 'O': 13, 'B': 12, '/': 11, ':': 9, 'F': 8, 'W': 7, 'H': 6, '-': 6, '"': 5, 'x': 5, '0': 5, 'V': 4, '2': 4, '“': 4, 'q': 4, 'Y': 4, '”': 4, 'G': 4, '(': 3, ')': 3, '1': 3, '3': 3, '’': 3, '\u2028': 3, 'U': 3, '_': 2, '{': 1, '|': 1, '&': 1, '←': 1, '→': 1, 'J': 1, '?': 1, '=': 1, 'z': 1, 'K': 1, '!': 1, '—': 1, '}': 1, '\n': 1})
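
A more targeted check than reading through the full character counts might look like the sketch below. It is illustrative only: `find_extra_line_breaks` is a hypothetical name, and the character set is the list of line boundaries that `str.splitlines` recognizes in addition to `\n` and `\r`.

```python
import collections

# Characters that str.splitlines() treats as line boundaries, besides "\n" and "\r".
EXTRA_LINE_BREAKS = frozenset("\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029")


def find_extra_line_breaks(text):
    """Count characters that only Unicode-aware line splitting treats as newlines."""
    return collections.Counter(ch for ch in text if ch in EXTRA_LINE_BREAKS)


# Hypothetical sample standing in for the contents of 1.txt.
sample = "first\u2028second\u2028third\n"
extra = find_extra_line_breaks(sample)
print(extra)
```

A non-empty result suggests that a Unicode-aware line splitter will see more lines than a byte-oriented one.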

This affects smart_open because we use the standard library's codecs module to perform byte-to-text decoding. Here's how that module behaves:

>>> import codecs
>>> fin_bin = open('1.txt', 'rb')
>>> fin_txt = codecs.getreader('utf-8')(fin_bin)
>>> sum([1 for _ in fin_txt])
4
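
The same difference can be reproduced without any files. The sketch below uses a made-up in-memory payload (not the data from this issue) and compares a codecs stream reader with `io.TextIOWrapper`, which is what the built-in `open` returns in text mode.

```python
import codecs
import io

# One logical JSONL record with two U+2028 characters inside the string value.
payload = '{"text": "a\u2028b\u2028c"}\n'.encode("utf-8")

# codecs stream readers split lines with str.splitlines(), which honors U+2028.
codecs_lines = list(codecs.getreader("utf-8")(io.BytesIO(payload)))

# io.TextIOWrapper splits only on \n, \r and \r\n.
wrapper_lines = list(io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8"))

print(len(codecs_lines), len(wrapper_lines))
```

The codecs reader reports three lines for this payload, while the text wrapper reports one.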

@mpenkov
Collaborator

mpenkov commented Mar 10, 2019

This may be relevant: https://stackoverflow.com/questions/17273598/python-codecs-line-ending

I suspect that gzip does this:

  1. Split the bytes by end-of-line characters (ASCII values)
  2. Decode each resulting line separately

With the codecs route, I think what's happening is:

  1. Decode everything from bytes to Unicode on the fly
  2. Split the Unicode by end-of-line characters (Unicode)

So, the first method is unaware of the Unicode line separator and happily ignores it, while the second treats it as a line break. I'm not sure what the desired behavior should be: do we want to put in the effort to mimic the standard library's open here? @piskvorky
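
The two orderings above can be sketched on a small made-up payload. This only illustrates the suspected difference; it is not taken from the gzip or codecs source.

```python
import json

# Two JSONL records; the first has a U+2028 inside its string value.
raw = '{"text": "a\u2028b"}\n{"text": "c"}\n'.encode("utf-8")

# Ordering 1: split on the ASCII newline byte first, then decode each line.
split_then_decode = [line.decode("utf-8") for line in raw.split(b"\n") if line]

# Ordering 2: decode first, then split on every Unicode line boundary.
decode_then_split = raw.decode("utf-8").splitlines()

# Every line from ordering 1 is a complete JSON document.
records = [json.loads(line) for line in split_then_decode]

# The first fragment from ordering 2 is a truncated JSON document.
try:
    json.loads(decode_then_split[0])
    first_fragment_parses = True
except json.JSONDecodeError:
    first_fragment_parses = False
```

With this payload, the first ordering yields two parseable records and the second yields three fragments, the first of which is not valid JSON.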

@piskvorky
Owner

My slight preference would be to mimic open, yes. But not important either way (not worth any massive refactoring IMO).
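
One way to mimic `open` without touching much code might be to wrap the decompressed byte stream in `io.TextIOWrapper` instead of a codecs reader. The sketch below is an assumption about how that could look, not smart_open's actual implementation; `open_gzip_text` is a hypothetical helper.

```python
import gzip
import io


def open_gzip_text(fileobj, encoding="utf-8"):
    """Hypothetical helper: decompress `fileobj` and decode it the way open() would."""
    binary = gzip.GzipFile(fileobj=fileobj, mode="rb")
    # With default settings, TextIOWrapper recognizes only \n, \r and \r\n as line endings.
    return io.TextIOWrapper(binary, encoding=encoding)


# Build a small gzipped JSONL payload in memory, with U+2028 inside the record.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write('{"text": "a\u2028b\u2028c"}\n'.encode("utf-8"))
buffer.seek(0)

with open_gzip_text(buffer) as infile:
    num_lines = sum(1 for _ in infile)

print(num_lines)
```

Here the record stays on a single line, which matches what `open` does for the uncompressed file.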

mpenkov added the hard and help wanted labels on Sep 28, 2019
@wordtracker

wordtracker commented Dec 1, 2019

I just got bitten by this as well. smart_open reads a \u2028 character fine from a plain text file; however, as soon as that same file with the same contents is compressed, the line containing the \u2028 character gets split across multiple lines.

I suspect it's because \u2028 is somehow treated as an actual line break rather than as an ordinary character within the line. Wouldn't the correct default behaviour be to mimic open on the uncompressed text file, for consistency?

@wordtracker

I'm wondering if there is a way to bypass the issue with something like this:
file_contents = fh.read().replace('\u2028',' ')

I've looked through the source but can't find a good place to try this out. Any ideas?
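
Another possible workaround that avoids replacing characters: read the file as bytes and decode each line yourself, so that only `\n` ends a line. The sketch below uses an in-memory stream as a stand-in; `iter_jsonl_records` is a hypothetical helper, and whether a binary-mode handle from smart_open behaves the same way for compressed input is an assumption I have not verified.

```python
import io
import json


def iter_jsonl_records(binary_stream, encoding="utf-8"):
    """Hypothetical helper: yield one parsed record per newline-terminated line of bytes."""
    for raw_line in binary_stream:  # binary streams split lines on b"\n" only
        if raw_line.strip():  # skip blank lines
            yield json.loads(raw_line.decode(encoding))


# Stand-in for a file handle opened in binary mode ("rb").
stream = io.BytesIO('{"text": "a\u2028b"}\n{"text": "c"}\n'.encode("utf-8"))
records = list(iter_jsonl_records(stream))
print(len(records))
```

This keeps the \u2028 characters in the decoded values instead of turning them into spaces.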
