read_csv in Pandas 0.14 loads NaNs when namedtuple is used for column names #7589

Closed
JackKelly opened this issue Jun 27, 2014 · 8 comments · Fixed by #21994
Labels: Bug, Dtype Conversions, good first issue, MultiIndex, Needs Tests

Comments

@JackKelly
Contributor

Using a namedtuple as a column name for read_csv in Pandas 0.14 results in NaNs being loaded.

Here is a simple demonstration of the problem (this code works in Pandas 0.13.1):

import pandas as pd
from collections import namedtuple
from StringIO import StringIO

TestTuple = namedtuple('test', ['a'])

CSV = """10
20
30"""

pd.read_csv(StringIO(CSV), header=None, names=[TestTuple('foo')], 
            tupleize_cols=True)

In Pandas 0.14, this is the output:

     (foo,)
0    NaN
1    NaN
2    NaN

Strangely enough, Pandas 0.14 works fine if we use a tuple instead of a namedtuple:

pd.read_csv(StringIO(CSV), header=None, names=[('foo')], tupleize_cols=False)

Here is the output:

   foo
0   10
1   20
2   30

So, for some reason, read_csv in Pandas 0.14 doesn't like using a namedtuple as a column name. (The ugly fix is to not pass any column names to read_csv and then, once the DataFrame is loaded, replace the column names with df.columns = [TestTuple('foo')])
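
The workaround described above could look roughly like the sketch below. It assumes Python 3 (so `io.StringIO` replaces the Python 2 `StringIO` module used in the report) and reuses the `TestTuple` and `CSV` names from the demonstration. How pandas stores the namedtuple label internally may differ between versions, so the sketch only relies on the data being parsed before the labels are replaced.

```python
from collections import namedtuple
from io import StringIO

import pandas as pd

TestTuple = namedtuple('test', ['a'])

CSV = """10
20
30"""

# Load with the default integer column labels; no names are passed to read_csv.
df = pd.read_csv(StringIO(CSV), header=None)

# Replace the column labels only after the values have been parsed.
df.columns = [TestTuple('foo')]

# The values are intact because the namedtuple never reaches the parser.
print(df.iloc[:, 0].tolist())
```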

(Really love Pandas by the way, thanks so much for all your work!)

My software versions:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-30-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.14.0
nose: 1.3.3
Cython: 0.20
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
bq: None
apiclient: None
rpy2: 2.3.8
sqlalchemy: None
pymysql: None
psycopg2: 2.5.3 (dt dec pq3 ext)
@JackKelly changed the title from "read_csv loads NaNs when namedtuple is used for column names" to "read_csv in Pandas 0.14 loads NaNs when namedtuple is used for column names" Jun 27, 2014
@jreback
Contributor

jreback commented Jun 27, 2014

see here about the change in set_index: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#whatsnew-0140-api

I don't think this is a bug; using tuples (or tuple-likes) as the name of a column is just asking for trouble, as these represent multi-indexes.

@JackKelly
Contributor Author

OK, thanks loads for your help! I guess we'll move to using hierarchical indexes and, on each level, we'll just use strings.

@JackKelly
Contributor Author

For our own project, we are now using MultiIndex for our columns (e.g. [('power', 'active'), ('power', 'reactive'), ('energy', 'apparent'), ('voltage', '')]).
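
As a rough sketch of what such columns look like in practice, the snippet below builds a MultiIndex from the tuples listed above with `MultiIndex.from_tuples`. The numeric readings are made up purely for illustration and are not taken from our project.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ('power', 'active'),
    ('power', 'reactive'),
    ('energy', 'apparent'),
    ('voltage', ''),
])

# Made-up readings, purely for illustration.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]],
    columns=columns,
)

# Selecting on the first level returns every column under 'power'.
power = df['power']

# Selecting with a full tuple returns a single column as a Series.
active = df[('power', 'active')]
```

Each level holds plain strings, so selection by first level or by full tuple is unambiguous.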

But I am a bit confused about whether or not it's OK to use tuples as column names in Pandas DataFrames. The Pandas docs say:

It’s worth keeping in mind that there’s nothing preventing
you from using tuples as atomic labels on an axis:

In [326]: Series(randn(8), index=tuples)
Out[326]: 
(bar, one)   -0.557549
(bar, two)    0.126204
(baz, one)    1.643615
(baz, two)   -0.067716
(foo, one)    0.127064
(foo, two)    0.396144
(qux, one)    1.043289
(qux, two)   -0.229627
dtype: float64

And, in my testing, vanilla tuples do seem fine as column names (in both v0.13.1 and v0.14). Namedtuples also worked in Pandas 0.13.1.

On the other hand, you said that using tuples (or tuple-likes) as the name of a column is "just asking for trouble" because they represent multi-indexes, and namedtuples don't work in 0.14.

It sounds like tuples can work but they're unsafe and they might become totally unusable as column names in the future of Pandas. Is that the correct conclusion?

@jreback
Contributor

jreback commented Jul 1, 2014

There is nothing wrong with you using tuples. They just IMHO don't offer any benefit over multi-indexes. If they work for you, then great. I don't mean to imply they are unsafe; my 2c is just that selection is VERY confusing in written code.

e.g. df.loc[:,('foo',1)] looks to a code reader like it's a multi-index, as that is what you would write for one. This WILL work with a tuple column name; it's just confusing IMHO.

I don't think the usage will change in the future. I'll mark this issue as a bug. Please feel free to submit a pull request to fix it!

@jreback jreback reopened this Jul 1, 2014
@jreback jreback added this to the 0.15.0 milestone Jul 1, 2014
@armaganthis3

This was a breaking change in 0.14. See the discussion in #3323.
Tuple labels do clash with the new multiindex slicing syntax,
as @jreback notes, but that's a bit of a dodge, since that change
was made several months earlier and was unrelated to the new syntax.

As for the claim that tuples are asking for trouble because they "represent multi-indexes":
that is not true either for pandas up to 0.14. Tuples did not "represent"
multi-indexes before that; your formerly working code and the documentation
you mention are clear enough on that point.

@JackKelly
Contributor Author

Thanks loads for the replies, @jreback and @armaganthis3. For our own project, I think we will stick with using MultiIndex instead of namedtuples. It is, quite probably, a better solution than using namedtuples anyway (as @jreback points out). So I'm afraid I'm unlikely to find time to hack away at Pandas to explore this bug with namedtuples, I'm sorry.

@jreback
Contributor

jreback commented Jul 6, 2018

This looks to be working in master:

In [15]: pd.__version__
Out[15]: '0.24.0.dev0+243.g30eb48cc4'

for both versions, so we can close this with a test.

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Jul 6, 2018
@jreback jreback added the good first issue and Needs Tests labels Jul 6, 2018
@dahlbaek
Contributor

dahlbaek commented Jul 19, 2018

@jreback: Just to be clear, you would like tests to ensure that both tuples and namedtuples give rise to multiindices, right? So something like

from collections import namedtuple
from io import StringIO

import pandas.util.testing as tm
from pandas import DataFrame, MultiIndex, read_csv


TestTuple = namedtuple('columns', ['first', 'second'])
CSV = """foo,bar
baz,baz
1,2
3,4"""

expected_columns = MultiIndex(
    levels=[['foo', 'bar'], ['baz']],
    labels=[[0, 1], [0, 0]]
)
expected_df = DataFrame(data=[[1, 2], [3, 4]], columns=expected_columns)

multi_df = read_csv(StringIO(CSV), header=[0, 1])
tm.assert_frame_equal(expected_df, multi_df, check_column_type=True)

tuple_df = read_csv(
    StringIO(CSV),
    header=None,
    skiprows=2,
    names=[('foo', 'baz'), ('bar', 'baz')]
)
tm.assert_frame_equal(expected_df, tuple_df, check_column_type=True)

namedtuple_df = read_csv(
    StringIO(CSV),
    header=None,
    skiprows=2,
    names=[TestTuple('foo', 'baz'), TestTuple('bar', 'baz')]
)
tm.assert_frame_equal(expected_df, namedtuple_df, check_column_type=True)

would work?
