Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change from of 'c' to 'python' engine in read_csv gives unexpected side-effect #21131

Closed
olajir opened this issue May 19, 2018 · 7 comments · Fixed by #29424
Closed

Change from of 'c' to 'python' engine in read_csv gives unexpected side-effect #21131

olajir opened this issue May 19, 2018 · 7 comments · Fixed by #29424
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@olajir
Copy link

olajir commented May 19, 2018

Code Sample

>>> import pandas as pd
>>>
>>> csv_file_contents=(
... """File: small.csv,,
... 10010010233,0123,654
... foo,,bar
... 01001000155,4530,898""")
>>> with open('small.csv', 'w') as fp:
...     fp.write(csv_file_contents)
>>>
>>> pd.read_csv('small.csv', header=None, engine='c',
...             names=['col1', 'col2', 'col3'],
...             dtype={'col1': str, 'col2': str, 'col3': str})
              col1  col2 col3
0  File: small.csv   NaN  NaN
1      10010010233  0123  654
2              foo   NaN  bar
3      01001000155  4530  898
>>> pd.read_csv('small.csv', header=None, engine='c',
...             names=['col1', 'col2', 'col3'],
...             dtype={'col1': str, 'col2': str, 'col3': str}).dropna()
          col1  col2 col3
1  10010010233  0123  654
3  01001000155  4530  898
>>> pd.read_csv('small.csv', header=None, engine='python',
...             names=['col1', 'col2', 'col3'],
...             dtype={'col1': str, 'col2': str, 'col3': str}).dropna()
              col1  col2 col3
0  File: small.csv   nan  nan
1      10010010233  0123  654
2              foo   nan  bar
3      01001000155  4530  898

Problem description

Reading an empty field from a csv-file returns NaN (dtype float) if read with the 'c' engine and 'nan' (string representation) if read with the 'python' engine.
In the example above the digits that shall be treated as strings and lines with empty data-fields shall be dropped. This works fine with the above code snippet if the 'c' engine is used. However if the 'python' engine is used the 'nan' fields are not dropped. This may break working code if one needs to switch from the 'c' engine to the 'python' engine.

Expected Output

It is expected that the two engines gives consistent output.

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.5.1
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@olajir olajir changed the title Change from of _c_ to _python_ engine in read_csv gives unexpected side-effect Change from of _c to _python_ engine in read_csv gives unexpected side-effect May 19, 2018
@olajir olajir changed the title Change from of _c to _python_ engine in read_csv gives unexpected side-effect Change from of 'c' to 'python' engine in read_csv gives unexpected side-effect May 19, 2018
@WillAyd
Copy link
Member

WillAyd commented May 19, 2018

Is the issue just that the Python engine returns a string representation of NaN whereas the C parser returns the actually NaN value? If so, can you update your example to reflect that?

@WillAyd WillAyd added IO CSV read_csv, to_csv Needs Info Clarification about behavior needed to assess issue labels May 19, 2018
@olajir
Copy link
Author

olajir commented May 21, 2018

Yes, that's the issue. The example has been updated.

@WillAyd WillAyd added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Info Clarification about behavior needed to assess issue labels May 22, 2018
@WillAyd WillAyd added this to the Contributions Welcome milestone Oct 9, 2018
@mroeschke
Copy link
Member

mroeschke commented Oct 21, 2019

This looks to work on master now. Could use a test.

In [72]: >>> pd.read_csv('small.csv', header=None, engine='c',
    ...: ...             names=['col1', 'col2', 'col3'],
    ...: ...             dtype={'col1': str, 'col2': str, 'col3': str}).dropna()
Out[72]:
          col1  col2 col3
1  10010010233  0123  654
3  01001000155  4530  898

In [73]: >>> pd.read_csv('small.csv', header=None, engine='python',
    ...: ...             names=['col1', 'col2', 'col3'],
    ...: ...             dtype={'col1': str, 'col2': str, 'col3': str}).dropna()

Out[73]:
          col1  col2 col3
1  10010010233  0123  654
3  01001000155  4530  898

In [74]: pd.__version__
Out[74]: '0.26.0.dev0+593.g9d45934af'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed IO CSV read_csv, to_csv Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Oct 21, 2019
@aditya-hari
Copy link

aditya-hari commented Oct 22, 2019

@mroeschke Hey, this is my first time contributing to pandas. The only thing left to do in this is writing a test, right?

@mroeschke
Copy link
Member

@aditya-hari correct.

@aditya-hari
Copy link

@mroeschke Would it be enough for the test to assert that the output using both engines is the same? And where will this test go?

@mroeschke
Copy link
Member

Best to construct the result manually result = pd.DataFrame(...), and compare both read_csv engines to that.

The test should go in a file in pandas/tests/io/parser

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 5, 2019
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 5, 2019
@gfyoung gfyoung modified the milestones: Contributions Welcome, 1.0 Nov 5, 2019
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 5, 2019
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 6, 2019
Reksbril pushed a commit to Reksbril/pandas that referenced this issue Nov 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants