read_csv in combination with index_col and usecols #2654

Closed
floux opened this issue Jan 7, 2013 · 7 comments
Labels
Bug IO Data IO issues that don't fit into a more specific label

floux commented Jan 7, 2013

Starting point:

http://pandas.pydata.org/pandas-docs/stable/io.html#index-columns-and-trailing-delimiters

If there is one more column of data than there are column names, usecols exhibits some (at least for me) unintuitive behavior:

>>> from io import StringIO
>>> import pandas as pd
>>> data = 'a,b,c\n4,apple,bat,5.7\n8,orange,cow,10'
>>> pd.read_csv(StringIO(data))
        a    b     c
4   apple  bat   5.7
8  orange  cow  10.0
>>> pd.read_csv(StringIO(data), usecols=['a', 'b'])
   a       b
0  4   apple
1  8  orange
>>>

I was expecting it to be equal to

>>> pd.read_csv(StringIO(data))[['a', 'b']]
        a    b
4   apple  bat
8  orange  cow

I am not sure whether my expectation is unfounded, though. Is this behavior actually intentional?
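
For anyone who wants to check this on their own installation, here is a minimal sketch that puts the two calls from the report side by side. It assumes Python 3 and `from io import StringIO`. The variable names `expected` and `with_usecols` are only illustrative, and the final comparison is printed rather than assumed, since the result depends on the pandas version.

```python
from io import StringIO

import pandas as pd

# Each data row has one more field than the header has names, so pandas
# treats the first field as an implicit (unnamed) index column.
data = "a,b,c\n4,apple,bat,5.7\n8,orange,cow,10"

# Read everything, then select columns: the implicit index (4, 8) is kept.
expected = pd.read_csv(StringIO(data))[["a", "b"]]

# Select the columns while parsing; the report is about whether this matches.
with_usecols = pd.read_csv(StringIO(data), usecols=["a", "b"])

print(expected)
print(with_usecols)
print("same result:", with_usecols.equals(expected))
```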

Member

wesm commented Jan 7, 2013

This feels buggy or at minimum not intuitive to me. I think it's just an edge case that's not addressed in the test suite. I'll have a look.

ghost assigned wesm Jan 21, 2013
wesm closed this as completed in bbfb95d Jan 21, 2013
yarikoptic added a commit to neurodebian/pandas that referenced this issue Jan 23, 2013
Version 0.10.1

* tag 'v0.10.1': (195 commits)
  RLS: set released to true
  RLS: Version 0.10.1
  TST: skip problematic xlrd test
  Merging in MySQL support pandas-dev#2482
  Revert "Merging in MySQL support pandas-dev#2482"
  BUG: don't let np.prod overflow int64
  RLS: note changed return type in DatetimeIndex.unique
  RLS: more what's new for 0.10.1
  RLS: some what's new for 0.10.1
  API: restore inplace=TRue returns self, add FutureWarnings. re pandas-dev#1893
  Merging in MySQL support pandas-dev#2482
  BUG: fix python 3 dtype issue
  DOC: fix what's new 0.10 doc bug re pandas-dev#2651
  BUG: fix C parser thread safety. verify gil release close pandas-dev#2608
  BUG: usecols bug with implicit first index column. close pandas-dev#2654
  BUG: plotting bug when base is nonzero pandas-dev#2571
  BUG: period resampling bug when all values fall into a single bin. close pandas-dev#2070
  BUG: fix memory error in sortlevel when many multiindex levels. close pandas-dev#2684
  STY: CRLF
  BUG: perf_HEAD reports wrong vbench name when an exception is raised
  ...
@JamesRamm

Hi
I'm new to GitHub, so apologies if I'm out of place here. The issue has been closed, but the fix does not seem to give the expected behaviour.

When using usecols with an implicit index column, the index column is now ignored and pandas returns its own indexing (0, 1, 2, 3, etc.). There is no way to specify that the index column should be used, as it doesn't have a header...
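
One possible way around the missing header, sketched here as an assumption rather than as the approach pandas recommends, is to skip the original header line and supply a full set of names, so that the first column can be referred to explicitly. The name `idx` is hypothetical and chosen only for this illustration.

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n4,apple,bat,5.7\n8,orange,cow,10"

# "idx" is a made-up name for the otherwise unnamed first column.
names = ["idx", "a", "b", "c"]

df = pd.read_csv(
    StringIO(data),
    skiprows=1,        # drop the original three-name header line
    header=None,       # every remaining line is data
    names=names,       # one name per field, including the index column
    usecols=["idx", "a", "b"],
    index_col="idx",   # the index column is also listed in usecols
)
print(df)
```

Because every column now has a name, usecols and index_col can both refer to the first column directly.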

Contributor

jreback commented Sep 24, 2013

Please provide an explicit example (in a new issue) that shows your result.


laserson commented Oct 6, 2017

Here is a related problem where read_csv in combination with index_col and usecols is broken:

These commands work fine:

>>> data = 'a,b,c\napple,bat,5.7\norange,cow,10'
>>> pd.read_csv(StringIO(data))
        a    b     c
0   apple  bat   5.7
1  orange  cow  10.0

>>> pd.read_csv(StringIO(data), index_col=[0, 1])
               c
a      b        
apple  bat   5.7
orange cow  10.0

>>> pd.read_csv(StringIO(data), usecols=[2])
      c
0   5.7
1  10.0

But combining index_col and usecols breaks:

>>> pd.read_csv(StringIO(data), index_col=[0, 1], usecols=[2])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-95-c02d1a70fe51> in <module>()
----> 1 pd.read_csv(StringIO(data), index_col=[0, 1], usecols=[2])

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654 
--> 655         return _read(filepath_or_buffer, kwds)
    656 
    657     parser_f.__name__ = name

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    403 
    404     # Create the parser.
--> 405     parser = TextFileReader(filepath_or_buffer, **kwds)
    406 
    407     if chunksize or iterator:

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    762             self.options['has_index_names'] = kwds['has_index_names']
    763 
--> 764         self._make_engine(self.engine)
    765 
    766     def close(self):

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    983     def _make_engine(self, engine='c'):
    984         if engine == 'c':
--> 985             self._engine = CParserWrapper(self.f, **self.options)
    986         else:
    987             if engine == 'python':

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1668                 (index_names, self.names,
   1669                  self.index_col) = _clean_index_names(self.names,
-> 1670                                                       self.index_col)
   1671 
   1672                 if self.index_names is None:

~/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _clean_index_names(columns, index_col)
   3081                     break
   3082         else:
-> 3083             name = cp_cols[c]
   3084             columns.remove(name)
   3085             index_names.append(name)

IndexError: list index out of range

The same is true if I use usecols=[0], usecols=[1], or usecols=['c'].

Member

gfyoung commented Oct 6, 2017

That's not a bug. index_col is relative to usecols. In this case, you only have one column that you want to extract from the CSV, but you want two columns for the index.
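
A minimal sketch of a call that is consistent with this explanation: the columns meant for the index are listed in usecols together with the data column. It reuses the data string from the example above. The variable names `df` and `by_position` are illustrative only.

```python
from io import StringIO

import pandas as pd

data = "a,b,c\napple,bat,5.7\norange,cow,10"

# The index columns must be among the columns kept by usecols,
# so 'a' and 'b' are listed alongside the data column 'c'.
df = pd.read_csv(
    StringIO(data),
    usecols=["a", "b", "c"],
    index_col=["a", "b"],
)
print(df)

# Positional form: all three columns are kept, so positions 0 and 1
# refer to 'a' and 'b' both in the file and within usecols.
by_position = pd.read_csv(StringIO(data), usecols=[0, 1, 2], index_col=[0, 1])
print(by_position)
```

Only column c remains as data, with a and b forming the index.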


laserson commented Oct 6, 2017

I see. Could be worth clarifying in the docstring. Thanks!

Member

gfyoung commented Oct 6, 2017

Exactly. That's part of what I was proposing you do in #9098.
