read_csv in combination with index_col and usecols #2654

floux · 2013-01-07T17:39:54Z

Starting point:

http://pandas.pydata.org/pandas-docs/stable/io.html#index-columns-and-trailing-delimiters

If there is one more column of data than there are column names, usecols exhibits some (at least to me) unintuitive behavior:

>>> data = 'a,b,c\n4,apple,bat,5.7\n8,orange,cow,10'
>>> pd.read_csv(StringIO(data))
        a    b     c
4   apple  bat   5.7
8  orange  cow  10.0
>>> pd.read_csv(StringIO(data), usecols=['a', 'b'])
   a       b
0  4   apple
1  8  orange
>>>

I was expecting it to be equal to

>>> pd.read_csv(StringIO(data))[['a', 'b']]
        a    b
4   apple  bat
8  orange  cow

I am not sure whether my expectation is unfounded, though. Is this behavior indeed intentional?

wesm · 2013-01-07T18:47:40Z

This feels buggy or at minimum not intuitive to me. I think it's just an edge case that's not addressed in the test suite. I'll have a look.

Version 0.10.1

* tag 'v0.10.1': (195 commits)
  RLS: set released to true
  RLS: Version 0.10.1
  TST: skip problematic xlrd test
  Merging in MySQL support pandas-dev#2482
  Revert "Merging in MySQL support pandas-dev#2482"
  BUG: don't let np.prod overflow int64
  RLS: note changed return type in DatetimeIndex.unique
  RLS: more what's new for 0.10.1
  RLS: some what's new for 0.10.1
  API: restore inplace=TRue returns self, add FutureWarnings. re pandas-dev#1893
  Merging in MySQL support pandas-dev#2482
  BUG: fix python 3 dtype issue
  DOC: fix what's new 0.10 doc bug re pandas-dev#2651
  BUG: fix C parser thread safety. verify gil release close pandas-dev#2608
  BUG: usecols bug with implicit first index column. close pandas-dev#2654
  BUG: plotting bug when base is nonzero pandas-dev#2571
  BUG: period resampling bug when all values fall into a single bin. close pandas-dev#2070
  BUG: fix memory error in sortlevel when many multiindex levels. close pandas-dev#2684
  STY: CRLF
  BUG: perf_HEAD reports wrong vbench name when an exception is raised
  ...

JamesRamm · 2013-09-24T13:01:02Z

Hi,

I'm new to GitHub, so apologies if I'm out of place here. The issue has been closed, but the fix does not seem to give the expected behaviour.

When using usecols with an implicit index column, the index column is now ignored and pandas returns its own indexing (0, 1, 2, 3, etc.). There is no way to specify that the index column should be used, as it doesn't have a header...
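
One possible workaround, sketched here as an assumption rather than anything confirmed in this thread: skip the header line and pass explicit `names`, so the unlabeled first column gets a name that both `usecols` and `index_col` can refer to. The label `idx` is made up for illustration.

```python
from io import StringIO

import pandas as pd

data = 'a,b,c\n4,apple,bat,5.7\n8,orange,cow,10'

# Skip the original header row and supply one name per data field.
# 'idx' is a hypothetical label for the otherwise unnamed first column.
df = pd.read_csv(
    StringIO(data),
    skiprows=1,
    header=None,
    names=['idx', 'a', 'b', 'c'],
    usecols=['idx', 'a', 'b'],
    index_col='idx',
)
print(df)
```

Naming the column explicitly avoids relying on how a given pandas version treats an implicit index together with usecols.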

jreback · 2013-09-24T14:56:37Z

Please provide an explicit example (in a new issue) that shows your result.

laserson · 2017-10-06T18:06:40Z

Here is a related problem where read_csv in combination with index_col and usecols is broken:

These commands work fine:

>>> import pandas as pd
>>> from io import StringIO
>>> data = 'a,b,c\napple,bat,5.7\norange,cow,10'
>>> pd.read_csv(StringIO(data))
        a    b     c
0   apple  bat   5.7
1  orange  cow  10.0

>>> pd.read_csv(StringIO(data), index_col=[0, 1])
               c
a      b        
apple  bat   5.7
orange cow  10.0

>>> pd.read_csv(StringIO(data), usecols=[2])
      c
0   5.7
1  10.0

But combining index_col and usecols breaks:

>>> pd.read_csv(StringIO(data), index_col=[0, 1], usecols=[2])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-95-c02d1a70fe51> in <module>()
----> 1 pd.read_csv(StringIO(data), index_col=[0, 1], usecols=[2])

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654 
--> 655         return _read(filepath_or_buffer, kwds)
    656 
    657     parser_f.__name__ = name

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    403 
    404     # Create the parser.
--> 405     parser = TextFileReader(filepath_or_buffer, **kwds)
    406 
    407     if chunksize or iterator:

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    762             self.options['has_index_names'] = kwds['has_index_names']
    763 
--> 764         self._make_engine(self.engine)
    765 
    766     def close(self):

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    983     def _make_engine(self, engine='c'):
    984         if engine == 'c':
--> 985             self._engine = CParserWrapper(self.f, **self.options)
    986         else:
    987             if engine == 'python':

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1668                 (index_names, self.names,
   1669                  self.index_col) = _clean_index_names(self.names,
-> 1670                                                       self.index_col)
   1671 
   1672                 if self.index_names is None:

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _clean_index_names(columns, index_col)
   3081                     break
   3082         else:
-> 3083             name = cp_cols[c]
   3084             columns.remove(name)
   3085             index_names.append(name)

IndexError: list index out of range

The same is true if I use usecols=[0], usecols=[1], or usecols=['c'].

gfyoung · 2017-10-06T19:09:04Z

That's not a bug. index_col is relative to usecols. In this case, you only have one column that you want to extract from the CSV, but you want two columns for the index.
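
A minimal sketch of what this implies in practice, assuming label-based selection: if the columns intended for the index are also listed in `usecols`, the combination parses. It uses the same CSV text as laserson's example; it is an illustration, not wording from the pandas docstring.

```python
from io import StringIO

import pandas as pd

data = 'a,b,c\napple,bat,5.7\norange,cow,10'

# Columns meant for the index must also be selected by usecols,
# otherwise there is nothing left to build the index from.
df = pd.read_csv(
    StringIO(data),
    usecols=['a', 'b', 'c'],
    index_col=['a', 'b'],
)

# After parsing, 'a' and 'b' form the MultiIndex and only 'c' is a data column.
print(df)
```

Referring to columns by label sidesteps the question of whether integer positions count against the full file or against the usecols subset.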

laserson · 2017-10-06T19:25:46Z

I see. Could be worth clarifying in the docstring. Thanks!

gfyoung · 2017-10-06T19:57:35Z

Exactly. That's part of what I was proposing you do in #9098.

ghost assigned wesm Jan 21, 2013

wesm closed this as completed in bbfb95d Jan 21, 2013

garaud mentioned this issue Jan 23, 2013

read_csv: usecols doesn't work if separator is not "," #2733

Closed

laserson mentioned this issue Oct 6, 2017

index_col and usecols do not work reliably together in read_csv #9098

Closed
