Change from of 'c' to 'python' engine in read_csv gives unexpected side-effect #21131

olajir · 2018-05-19T08:21:17Z

Code Sample

>>> import pandas as pd
>>>
>>> csv_file_contents=(
... """File: small.csv,,
... 10010010233,0123,654
... foo,,bar
... 01001000155,4530,898""")
>>> with open('small.csv', 'w') as fp:
...     fp.write(csv_file_contents)
>>>
>>> pd.read_csv('small.csv', header=None, engine='c',
...             names=['col1', 'col2', 'col3'],
...             dtype={'col1': str, 'col2': str, 'col3': str})

              col1  col2 col3
0  File: small.csv   NaN  NaN
1      10010010233  0123  654
2              foo   NaN  bar
3      01001000155  4530  898

>>> pd.read_csv('small.csv', header=None, engine='c',
...             names=['col1', 'col2', 'col3'],
...             dtype={'col1': str, 'col2': str, 'col3': str}).dropna()

          col1  col2 col3
1  10010010233  0123  654
3  01001000155  4530  898

>>> pd.read_csv('small.csv', header=None, engine='python',
...             names=['col1', 'col2', 'col3'],
...             dtype={'col1': str, 'col2': str, 'col3': str}).dropna()

              col1  col2 col3
0  File: small.csv   nan  nan
1      10010010233  0123  654
2              foo   nan  bar
3      01001000155  4530  898

Problem description

Reading an empty field from a csv-file returns NaN (dtype float) if read with the 'c' engine and 'nan' (string representation) if read with the 'python' engine.
In the example above the digits that shall be treated as strings and lines with empty data-fields shall be dropped. This works fine with the above code snippet if the 'c' engine is used. However if the 'python' engine is used the 'nan' fields are not dropped. This may break working code if one needs to switch from the 'c' engine to the 'python' engine.

Expected Output

It is expected that the two engines gives consistent output.

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.5.1
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

WillAyd · 2018-05-19T17:14:13Z

Is the issue just that the Python engine returns a string representation of NaN whereas the C parser returns the actually NaN value? If so, can you update your example to reflect that?

olajir · 2018-05-21T15:01:36Z

Yes, that's the issue. The example has been updated.

mroeschke · 2019-10-21T04:01:30Z

This looks to work on master now. Could use a test.

In [72]: >>> pd.read_csv('small.csv', header=None, engine='c',
    ...: ...             names=['col1', 'col2', 'col3'],
    ...: ...             dtype={'col1': str, 'col2': str, 'col3': str}).dropna()
Out[72]:
          col1  col2 col3
1  10010010233  0123  654
3  01001000155  4530  898

In [73]: >>> pd.read_csv('small.csv', header=None, engine='python',
    ...: ...             names=['col1', 'col2', 'col3'],
    ...: ...             dtype={'col1': str, 'col2': str, 'col3': str}).dropna()

Out[73]:
          col1  col2 col3
1  10010010233  0123  654
3  01001000155  4530  898

In [74]: pd.__version__
Out[74]: '0.26.0.dev0+593.g9d45934af'

aditya-hari · 2019-10-22T03:19:19Z

@mroeschke Hey, this is my first time contributing to pandas. The only thing left to do in this is writing a test, right?

mroeschke · 2019-10-22T03:52:56Z

@aditya-hari correct.

aditya-hari · 2019-10-22T07:28:28Z

@mroeschke Would it be enough for the test to assert that the output using both engines is the same? And where will this test go?

mroeschke · 2019-10-22T15:15:52Z

Best to construct the result manually result = pd.DataFrame(...), and compare both read_csv engines to that.

The test should go in a file in pandas/tests/io/parser

Closes pandas-dev#21131

Closes #21131

Closes pandas-dev#21131

olajir changed the title ~~Change from of _c_ to _python_ engine in read_csv gives unexpected side-effect~~ Change from of _c to _python_ engine in read_csv gives unexpected side-effect May 19, 2018

olajir changed the title ~~Change from of _c to _python_ engine in read_csv gives unexpected side-effect~~ Change from of 'c' to 'python' engine in read_csv gives unexpected side-effect May 19, 2018

WillAyd added IO CSV read_csv, to_csv Needs Info Clarification about behavior needed to assess issue labels May 19, 2018

WillAyd added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Info Clarification about behavior needed to assess issue labels May 22, 2018

WillAyd added this to the Contributions Welcome milestone Oct 9, 2018

This was referenced Oct 9, 2018

Inconsistent behaviour of read_csv() between python and c engine with null values. #23056

Closed

BUG: Don't parse NaN as 'nan' in Data IO #23162

Merged

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed IO CSV read_csv, to_csv Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Oct 21, 2019

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 5, 2019

TST: consistent result in dropping NA from CSV

87186c9

Closes pandas-dev#21131

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 5, 2019

TST: consistent result in dropping NA from CSV

850bd6b

Closes pandas-dev#21131

gfyoung mentioned this issue Nov 5, 2019

TST: consistent result in dropping NA from CSV #29424

Merged

gfyoung modified the milestones: Contributions Welcome, 1.0 Nov 5, 2019

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 5, 2019

TST: consistent result in dropping NA from CSV

68e870c

Closes pandas-dev#21131

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 6, 2019

TST: consistent result in dropping NA from CSV

54bd6ae

Closes pandas-dev#21131

jreback closed this as completed in #29424 Nov 6, 2019

jreback pushed a commit that referenced this issue Nov 6, 2019

TST: consistent result in dropping NA from CSV (#29424)

cacc9e2

Closes #21131

Reksbril pushed a commit to Reksbril/pandas that referenced this issue Nov 18, 2019

TST: consistent result in dropping NA from CSV (pandas-dev#29424)

30800ee

Closes pandas-dev#21131

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

TST: consistent result in dropping NA from CSV (pandas-dev#29424)

40fa182

Closes pandas-dev#21131

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

TST: consistent result in dropping NA from CSV (pandas-dev#29424)

e7074e9

Closes pandas-dev#21131

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change from of 'c' to 'python' engine in read_csv gives unexpected side-effect #21131

Change from of 'c' to 'python' engine in read_csv gives unexpected side-effect #21131

olajir commented May 19, 2018 •

edited

Loading

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

WillAyd commented May 19, 2018

olajir commented May 21, 2018

mroeschke commented Oct 21, 2019 •

edited

Loading

aditya-hari commented Oct 22, 2019 •

edited

Loading

mroeschke commented Oct 22, 2019

aditya-hari commented Oct 22, 2019

mroeschke commented Oct 22, 2019

Change from of 'c' to 'python' engine in read_csv gives unexpected side-effect #21131

Change from of 'c' to 'python' engine in read_csv gives unexpected side-effect #21131

Comments

olajir commented May 19, 2018 • edited Loading

Code Sample

Problem description

Expected Output

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line] INSTALLED VERSIONS

WillAyd commented May 19, 2018

olajir commented May 21, 2018

mroeschke commented Oct 21, 2019 • edited Loading

aditya-hari commented Oct 22, 2019 • edited Loading

mroeschke commented Oct 22, 2019

aditya-hari commented Oct 22, 2019

mroeschke commented Oct 22, 2019

olajir commented May 19, 2018 •

edited

Loading

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

mroeschke commented Oct 21, 2019 •

edited

Loading

aditya-hari commented Oct 22, 2019 •

edited

Loading