inconsistent with bz2.open on files containing vertical tab ^K #394

Closed
2 of 3 tasks
yunjiangster opened this issue Nov 30, 2019 · 6 comments · Fixed by #578

Comments

@yunjiangster

yunjiangster commented Nov 30, 2019

Problem description

Be sure your description clearly answers the following questions:

  • What are you trying to achieve?
    Use smart_open as a drop-in replacement for bz2.open.
  • What is the expected result?
    The same behavior as bz2.open with respect to recognizing line breaks.
  • What are you seeing instead?
    A long line is truncated at a vertical tab (^K), which is not a line-break character.

Steps/code to reproduce the problem

In order for us to be able to solve your problem, we have to be able to reproduce it on our end.
Without reproducing the problem, it is unlikely that we'll be able to help you.

Include full tracebacks, logs and datasets if necessary.
Please keep the examples minimal (minimal reproducible example).

Take, for instance, the following excerpt of the uncompressed text, shown as raw bytes. Compress the file with bz2. The number of columns in the first line, as returned by smart_open(..).readline(), differs between the uncompressed file and the bz2-compressed one.

\xe5\x93\x81\x0b\xe3\x80\n
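
A minimal sketch of how this report could be turned into a self-contained reproduction. The sample line and the file name `sample.tsv.bz2` are made up for illustration, since the original data was not posted. Only standard-library calls are used, so this establishes the `bz2.open` baseline that smart_open would be compared against.

```python
import bz2
import os
import tempfile

# Hypothetical sample: three tab-separated columns, with a vertical tab
# (\x0b, displayed as ^K in some editors) inside the second column.
line = "a\tb\x0bc\td\n"

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.tsv.bz2")  # illustrative file name

# Write the UTF-8 bytes through bz2 so the file on disk is compressed.
with bz2.open(path, mode="wb") as f:
    f.write(line.encode("utf-8"))

# Read it back in text mode, the way the report does with bz2.open.
with bz2.open(path, mode="rt", encoding="utf-8") as f:
    columns = f.readline().strip("\r\n").split("\t")

print(len(columns))  # number of columns seen by bz2.open
```

Opening the same `path` with smart_open and comparing the column count would then show whether the mismatch described above occurs on a given version.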

Versions

Please provide the output of:

import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)

print("smart_open", smart_open.version)
Traceback (most recent call last):
  File "", line 1, in <module>
AttributeError: module 'smart_open' has no attribute 'version'

Instead

pip show smart_open
Name: smart-open
Version: 1.7.1

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software
@yunjiangster
Author

I tried with 1.9.0 as well and see the same problem.

>>> with smart_open.smart_open('debug.tsv.bz2', mode='r') as f:
...     tmp = f.readline().strip('\r\n').split('\t')
...
>>> len(tmp)
444

>>> with smart_open.smart_open('debug.tsv', mode='r') as f:
...     tmp = f.readline().strip('\r\n').split('\t')
...
>>> len(tmp)
852

>>> with open('debug.tsv', mode='r') as f:
...     tmp = f.readline().strip('\r\n').split('\t')
...
>>> len(tmp)
852

>>> with bz2.open('debug.tsv.bz2', mode='rt') as f:
...     tmp = f.readline().strip('\r\n').split('\t')
...
>>> len(tmp)
852

@piskvorky
Owner

piskvorky commented Nov 30, 2019

Is it OK to open binary files in text mode (mode='r') and without encoding?

Neither of your outputs (444 and 852) seems to match the line length of your example (7). I don't understand what you're trying to show.
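
One way to take text-mode decoding out of the comparison is sketched below, under the assumption that the data is UTF-8 (the thread does not state the encoding). In binary mode, `readline()` splits only on `b"\n"`, and the decoding step is explicit. The payload is a hypothetical stand-in for the unpublished `debug.tsv.bz2`.

```python
import bz2
import io

# Hypothetical payload: two tab-separated fields, the second one
# containing a vertical tab (\x0b).
raw = "x\ty\x0bz\n".encode("utf-8")
compressed = bz2.compress(raw)

# bz2.open accepts an existing file object as well as a path.
with bz2.open(io.BytesIO(compressed), mode="rb") as f:
    first_line = f.readline()  # bytes, terminated by b"\n" only

# Decode explicitly, then split into columns.
fields = first_line.decode("utf-8").rstrip("\r\n").split("\t")
print(fields)
```

If the binary-mode result agrees across readers while the text-mode result does not, the difference likely lies in how each reader splits decoded text into lines.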

@mpenkov
Collaborator

mpenkov commented Nov 30, 2019

I think this is a duplicate of #269

@yunjiangster
Author

Is it OK to open binary files in text mode (mode='r') and without encoding?

Neither of your outputs (444 and 852) seems to match the line length of your example (7). I don't understand what you're trying to show.

Those are two different examples. The input contains special characters, so I quoted only an excerpt of its binary representation.

@yunjiangster
Author

I think this is a duplicate of #269

I think the culprit here is the vertical tab character \x0b. I'm not sure why it is being treated as a line break.
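
For context, here is a small sketch of a documented Python behavior that may be related, although nothing in this thread confirms it is the cause. `str.splitlines()` treats the vertical tab `\x0b` as a line boundary, while a text-mode stream (`io.TextIOWrapper`) recognizes only `\n`, `\r`, and `\r\n`. The sample string is made up for illustration.

```python
import io

text = "left\x0bright\n"

# str.splitlines() treats \x0b (vertical tab) as a line boundary.
parts = text.splitlines()
print(parts)

# A text-mode stream only recognizes \n, \r and \r\n as line endings,
# so readline() returns the whole line, vertical tab included.
stream = io.TextIOWrapper(io.BytesIO(text.encode("utf-8")), encoding="utf-8")
first = stream.readline()
print(repr(first))
```

Any reader that splits decoded text with `splitlines()` rather than on `\n` would therefore cut a line short at the vertical tab.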

@piskvorky
Owner

piskvorky commented Nov 30, 2019

Please post a minimal reproducible example, including any required data.
