read_csv in Pandas 0.14 loads NaNs when namedtuple is used for column names #7589

JackKelly · 2014-06-27T14:20:31Z

Using a namedtuple as a column name for read_csv in Pandas 0.14 results in NaNs being loaded.

Here is a simple demonstration of the problem (this code works in Pandas 0.13.1):

import pandas as pd
from collections import namedtuple
from StringIO import StringIO

TestTuple = namedtuple('test', ['a'])

CSV = """10
20
30"""

pd.read_csv(StringIO(CSV), header=None, names=[TestTuple('foo')], 
            tupleize_cols=True)

In Pandas 0.14, this is the output:

     (foo,)
0    NaN
1    NaN
2    NaN

Strangely enough, Pandas 0.14 works fine if we use a tuple instead of a namedtuple:

pd.read_csv(StringIO(CSV), header=None, names=[('foo')], tupleize_cols=False)

Here is the output:

So, for some reason, read_csv in Pandas 0.14 doesn't like using a namedtuple as a column name. (The ugly fix is to not pass any column names to read_csv and then, once the DataFrame is loaded, replace the column names with df.columns = [TestTuple('foo')].)
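
When comparing the tuple and namedtuple cases, it may help to check what each literal actually is in plain Python. The sketch below uses only the standard library and reuses the TestTuple name from the report; the variable names plain, one_tuple and named are illustrative assumptions. It shows that ('foo') without a trailing comma is a string, that ('foo',) is a one-element tuple, and that a namedtuple instance is a tuple subclass that compares equal to the matching plain tuple. It says nothing about how any particular Pandas version treats these values.

```python
from collections import namedtuple

TestTuple = namedtuple('test', ['a'])

plain = ('foo')           # parentheses only: this is the string 'foo'
one_tuple = ('foo',)      # the trailing comma makes a one-element tuple
named = TestTuple('foo')  # namedtuple instance with a single field 'a'

print(type(plain).__name__)      # type name of the parenthesised literal
print(type(one_tuple).__name__)  # type name of the one-element tuple
print(isinstance(named, tuple))  # namedtuples subclass tuple
print(named == one_tuple)        # equality is ordinary tuple equality
```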

(Really love Pandas by the way, thanks so much for all your work!)

My software versions:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-30-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.14.0
nose: 1.3.3
Cython: 0.20
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
bq: None
apiclient: None
rpy2: 2.3.8
sqlalchemy: None
pymysql: None
psycopg2: 2.5.3 (dt dec pq3 ext)

jreback · 2014-06-27T16:01:44Z

see here about the change in set_index: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#whatsnew-0140-api

I don't think this is a bug; using tuples (or tuple-likes) as the name of a column is just asking for trouble, as these represent multi-indexes.

JackKelly · 2014-06-27T16:05:08Z

OK, thanks loads for your help! I guess we'll move to using hierarchical indexes and, on each level, we'll just use strings.

JackKelly · 2014-07-01T08:13:23Z

For our own project, we are now using MultiIndex for our columns (e.g. [('power', 'active'), ('power', 'reactive'), ('energy', 'apparent'), ('voltage', '')]).
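
A minimal sketch of what such MultiIndex columns might look like, assuming a recent Pandas. The level names and the numeric values are made-up placeholders for illustration, not taken from the project.

```python
import pandas as pd

# Column tuples as listed above; the level names are hypothetical.
columns = pd.MultiIndex.from_tuples(
    [('power', 'active'), ('power', 'reactive'),
     ('energy', 'apparent'), ('voltage', '')],
    names=['physical_quantity', 'type'],
)

# Placeholder data: two rows, one value per column.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]],
    columns=columns,
)

power = df['power']                # DataFrame with the sub-columns under 'power'
active = df[('power', 'active')]   # Series for a single (level 0, level 1) pair

print(power)
print(active)
```

Selecting by the first level alone returns every column under it, which is one thing plain string labels on each level make straightforward.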

But I am a bit confused about whether or not it's OK to use tuples as column names in Pandas DataFrames. The Pandas docs say:

It’s worth keeping in mind that there’s nothing preventing
you from using tuples as atomic labels on an axis:

In [326]: Series(randn(8), index=tuples)
Out[326]: 
(bar, one)   -0.557549
(bar, two)    0.126204
(baz, one)    1.643615
(baz, two)   -0.067716
(foo, one)    0.127064
(foo, two)    0.396144
(qux, one)    1.043289
(qux, two)   -0.229627
dtype: float64

And, in my testing, vanilla tuples do seem fine as column names (in both v0.13.1 and v0.14). And namedtuples worked in Pandas 0.13.1.

On the other hand, you said that "using tuples (or tuple-likes) as the name of a column is just asking for trouble, as these represent multi-indexes", and namedtuples don't work in 0.14.

It sounds like tuples can work but they're unsafe and they might become totally unusable as column names in the future of Pandas. Is that the correct conclusion?

jreback · 2014-07-01T10:07:32Z

Nothing wrong with you using tuples. They just IMHO don't offer any benefit over multi-indexes. If they work for you, then great. I don't mean to imply they are unsafe; my 2c is that selection is just VERY confusing in written code.

E.g. df.loc[:,('foo',1)] looks to a code reader like it's a multi-index, as that is what you would do. This WILL work with a tuple column name; it's just confusing IMHO.

I don't think the usage will change in the future. I'll mark this issue as a bug. Pls feel free to submit a pull-request to fix!

armaganthis3 · 2014-07-01T11:30:28Z

This was a breaking change in 0.14. See the discussion in #3323.
Tuple labels do clash with the new multiindex slicing syntax,
as @jreback notes, but that's a bit of a dodge, since that change
was made several months earlier and was unrelated to the new syntax.

As for "these... just asking for trouble, as these represent multi-indexes":
that's not true either for pandas up to 0.14. Tuples did not "represent"
multiindexes before that; your formerly working code and the documentation
you mention are clear enough on that point.

JackKelly · 2014-07-01T16:02:28Z

Thanks loads for the replies, @jreback and @armaganthis3. For our own project, I think we will stick with using MultiIndex instead of namedtuples. It is, quite probably, a better solution than using namedtuples anyway (as @jreback points out). So I'm afraid I'm unlikely to find time to hack away at Pandas to try to explore this bug with namedtuples, I'm sorry.

jreback · 2018-07-06T22:42:01Z

This looks to be working in master:

In [15]: pd.__version__
Out[15]: '0.24.0.dev0+243.g30eb48cc4'

for both versions, so this can be closed with a test.

dahlbaek · 2018-07-19T12:30:28Z

@jreback: Just to be clear, you would like tests to ensure that both tuples and namedtuples give rise to multiindices, right? So something like

from collections import namedtuple
from io import StringIO

import pandas.util.testing as tm
from pandas import DataFrame, MultiIndex, read_csv


TestTuple = namedtuple('columns', ['first', 'second'])
CSV = """foo,bar
baz,baz
1,2
3,4"""

expected_columns = MultiIndex(
    levels=[['foo', 'bar'], ['baz']],
    labels=[[0, 1], [0, 0]]
)
expected_df = DataFrame(data=[[1, 2], [3, 4]], columns=expected_columns)

multi_df = read_csv(StringIO(CSV), header=[0, 1])
tm.assert_frame_equal(expected_df, multi_df, check_column_type=True)

tuple_df = read_csv(
    StringIO(CSV),
    header=None,
    skiprows=2,
    names=[('foo', 'baz'), ('bar', 'baz')]
)
tm.assert_frame_equal(expected_df, tuple_df, check_column_type=True)

namedtuple_df = read_csv(
    StringIO(CSV),
    header=None,
    skiprows=2,
    names=[TestTuple('foo', 'baz'), TestTuple('bar', 'baz')]
)
tm.assert_frame_equal(expected_df, namedtuple_df, check_column_type=True)

would work?

JackKelly changed the title ~~read_csv loads NaNs when namedtuple is used for column names~~ read_csv in Pandas 0.14 loads NaNs when namedtuple is used for column names Jun 27, 2014

JackKelly mentioned this issue Jun 27, 2014

Update NILMTK to be compatible with Pandas v0.14 nilmtk/nilmtk#128

Closed

JackKelly closed this as completed Jun 27, 2014

jreback reopened this Jul 1, 2014

jreback added Bug labels Jul 1, 2014

jreback added this to the 0.15.0 milestone Jul 1, 2014

JackKelly mentioned this issue Jul 1, 2014

How best to represent Measurements nilmtk/nilmtk#122

Closed

ischwabacher mentioned this issue Jul 15, 2014

BUG/ERR: read_csv(header=[0]) should raise/warn #7757

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

toobaz mentioned this issue Mar 3, 2016

read_csv not respecting MultiIndex names #12518

Closed

jreback modified the milestones: Contributions Welcome, 0.24.0 Jul 6, 2018

jreback added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels Jul 6, 2018

dahlbaek mentioned this issue Jul 20, 2018

TST: tuple and namedtuple multiindex tests for read_csv #21994

Merged

3 tasks

jreback closed this as completed in #21994 Jul 25, 2018
