BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser #6889

Merged
merged 1 commit into from
Apr 23, 2014
25 changes: 24 additions & 1 deletion doc/source/io.rst
@@ -92,7 +92,8 @@ They can take a number of arguments:
- ``dialect``: string or :class:`python:csv.Dialect` instance to expose more
ways to specify the file format
- ``dtype``: A data type name or a dict of column name to data type. If not
specified, data types will be inferred.
specified, data types will be inferred. (Unsupported with
``engine='python'``)
- ``header``: row number(s) to use as the column names, and the start of the
data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
pass ``header=0`` to be able to replace existing names. The header can be
@@ -154,6 +155,7 @@ They can take a number of arguments:
pieces. Will cause a ``TextFileReader`` object to be returned. More on this
below in the section on :ref:`iterating and chunking <io.chunking>`
- ``skip_footer``: number of lines to skip at bottom of file (default 0)
(Unsupported with ``engine='c'``)
- ``converters``: a dictionary of functions for converting values in certain
columns, where keys are either integers or column labels
- ``encoding``: a string representing the encoding to use for decoding
@@ -275,6 +277,11 @@ individual columns:
df = pd.read_csv(StringIO(data), dtype={'b': object, 'c': np.float64})
df.dtypes

can you update the skip_footer/dtype in io.parsers.rst (at the top where the arguments are laid out) for the same way you did in the doc-strings?

.. note::
The ``dtype`` option is currently only supported by the C engine.
Specifying ``dtype`` with ``engine`` other than 'c' raises a
``ValueError``.
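
For readers who hit this limitation: the error message added in this PR points to ``converters`` as a substitute for ``dtype`` when the python engine is required. A minimal sketch, assuming Python 3 (``io.StringIO``) and made-up sample data; it is an illustration, not part of the patch:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data, similar in shape to the docs example.
data = "a,b,c\n1,2,3\n4,5,6\n"

# With the python engine, a per-column converter plays the role that
# dtype={'c': np.float64} plays with the C engine.
df = pd.read_csv(StringIO(data), engine="python", converters={"c": float})

c_values = df["c"].tolist()
```

Each converter receives the raw string of a cell, so any callable that accepts a string (``float``, ``int``, a custom function) can be used.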

.. _io.headers:

Handling column names
@@ -1029,6 +1036,22 @@ Specifying ``iterator=True`` will also return the ``TextFileReader`` object:
os.remove('tmp.sv')
os.remove('tmp2.sv')

Specifying the parser engine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Under the hood, pandas uses a fast and efficient parser implemented in C as
well as a Python implementation which is currently more feature-complete.
Where possible, pandas uses the C parser (specified as ``engine='c'``), but may
fall back to Python if C-unsupported options are specified. Currently,
C-unsupported options include:

- ``sep`` other than a single character (e.g. regex separators)
- ``skip_footer``
- ``sep=None`` with ``delim_whitespace=False``

Specifying any of the above options will produce a ``ParserWarning`` unless the
python engine is selected explicitly using ``engine='python'``.
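
The fallback described in this section can be observed from user code. The sketch below assumes Python 3 and a made-up two-character separator. It matches the warning by class name because the import location of ``ParserWarning`` (``pandas.io.parsers`` in this PR) may differ between pandas versions:

```python
import warnings
from io import StringIO

import pandas as pd

# Hypothetical data using a two-character separator.
data = "a;;b\n1;;2\n3;;4\n"


def read_and_collect(**kwargs):
    """Read the sample data and return (frame, list of ParserWarning records)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        frame = pd.read_csv(StringIO(data), sep=";;", **kwargs)
    parser_warnings = [w for w in caught
                       if w.category.__name__ == "ParserWarning"]
    return frame, parser_warnings


# No engine given: the C parser cannot handle the separator, so pandas
# falls back to the python parser and warns about it.
df_fallback, fallback_warnings = read_and_collect()

# Engine selected explicitly: same result, no fallback warning.
df_python, explicit_warnings = read_and_collect(engine="python")
```

Passing ``engine='python'`` whenever a regex separator or ``skip_footer`` is known to be needed keeps the output free of fallback warnings.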

.. _io.store_in_csv:

Writing to CSV format
16 changes: 16 additions & 0 deletions doc/source/release.rst
@@ -176,6 +176,8 @@ API Changes
- ``.quantile`` on a ``datetime[ns]`` series now returns ``Timestamp`` instead
of ``np.datetime64`` objects (:issue:`6810`)
- change ``AssertionError`` to ``TypeError`` for invalid types passed to ``concat`` (:issue:`6583`)
- Add :class:`~pandas.io.parsers.ParserWarning` class for fallback and option
validation warnings in :func:`read_csv`/:func:`read_table` (:issue:`6607`)

Deprecations
~~~~~~~~~~~~
@@ -280,6 +282,9 @@ Improvements to existing features
- Added ``how`` option to rolling-moment functions to dictate how to handle resampling; :func:`rolling_max` defaults to max,
:func:`rolling_min` defaults to min, and all others default to mean (:issue:`6297`)
- ``pd.stats.moments.rolling_var`` now uses Welford's method for increased numerical stability (:issue:`6817`)
- Translate ``sep='\s+'`` to ``delim_whitespace=True`` in
:func:`read_csv`/:func:`read_table` if no other C-unsupported options
specified (:issue:`6607`)

.. _release.bug_fixes-0.14.0:

@@ -402,6 +407,17 @@ Bug Fixes
- Bug in ``DataFrame.plot`` and ``Series.plot`` where the legend behaves inconsistently when plotting to the same axes repeatedly (:issue:`6678`)
- Internal tests for patching ``__finalize__`` / bug in merge not finalizing (:issue:`6923`, :issue:`6927`)
- accept ``TextFileReader`` in ``concat``, which was affecting a common user idiom (:issue:`6583`)
- Raise :class:`ValueError` when ``sep`` specified with
``delim_whitespace=True`` in :func:`read_csv`/:func:`read_table`
(:issue:`6607`)
- Raise :class:`ValueError` when ``engine='c'`` specified with unsupported
options (:issue:`6607`)
- Raise :class:`ValueError` when fallback to python parser causes options to be
ignored (:issue:`6607`)
- Produce :class:`~pandas.io.parsers.ParserWarning` on fallback to python
parser when no options are ignored (:issue:`6607`)
- Bug in C parser with leading whitespace (:issue:`3374`)
- Bug in C parser with ``delim_whitespace=True`` and ``\r``-delimited lines
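
The ``engine='c'`` entry above can be illustrated from the caller's side. This is a sketch under assumptions: Python 3, made-up data, and the ``skipfooter`` spelling of the footer option (the docstring in this PR lists ``skipfooter``; the docs also use ``skip_footer``). The helper ``try_read`` is hypothetical and only captures the error message:

```python
from io import StringIO

import pandas as pd


def try_read(text, **kwargs):
    """Return the ValueError message raised by read_csv, or None on success."""
    try:
        pd.read_csv(StringIO(text), **kwargs)
    except ValueError as exc:
        return str(exc)
    return None


# A multi-character separator is not supported by the C parser, so asking
# for engine='c' explicitly is an error rather than a silent fallback.
regex_sep_error = try_read("a;;b\n1;;2\n", sep=";;", engine="c")

# Skipping footer lines is likewise C-unsupported.
skipfooter_error = try_read("a,b\n1,2\nfooter\n", skipfooter=1, engine="c")

# The same options are accepted when the python engine is requested.
python_ok = try_read("a;;b\n1;;2\n", sep=";;", engine="python")
```

The exact wording of the messages is not asserted here, since it may vary between versions.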

pandas 0.13.1
-------------
102 changes: 81 additions & 21 deletions pandas/io/parsers.py
@@ -6,6 +6,7 @@
from pandas import compat
import re
import csv
import warnings

import numpy as np

@@ -24,6 +25,8 @@
import pandas.tslib as tslib
import pandas.parser as _parser

class ParserWarning(Warning):
    pass

_parser_params = """Also supports optionally iterating or breaking of the file
into chunks.
@@ -50,6 +53,7 @@
One-character string used to escape delimiter when quoting is QUOTE_NONE.
dtype : Type name or dict of column -> type
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
(Unsupported with engine='python')
compression : {'gzip', 'bz2', None}, default None
For on-the-fly decompression of on-disk data
dialect : string or csv.Dialect instance, default None
@@ -113,7 +117,7 @@
chunksize : int, default None
Return TextFileReader object for iteration
skipfooter : int, default 0
Number of line at bottom of file to skip
Number of lines at bottom of file to skip (Unsupported with engine='c')
converters : dict. optional
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels
@@ -125,24 +129,24 @@
Encoding to use for UTF when reading/writing (ex. 'utf-8')
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
na_filter: boolean, default True
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In
data without any NAs, passing na_filter=False can improve the performance
of reading a large file
usecols : array-like
Return a subset of the columns.
Results in much faster parsing time and lower memory usage.
mangle_dupe_cols: boolean, default True
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
tupleize_cols: boolean, default False
tupleize_cols : boolean, default False
Leave a list of tuples on columns as is (default is to convert to
a Multi Index on the columns)
error_bad_lines: boolean, default True
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned.
If False, then these "bad lines" will be dropped from the DataFrame that is
returned. (Only valid with C parser).
warn_bad_lines: boolean, default True
returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output. (Only valid with C parser).
infer_datetime_format : boolean, default False
@@ -154,25 +158,30 @@
result : DataFrame or TextParser
"""

_csv_sep = """sep : string, default ','
_csv_params = """sep : string, default ','
Delimiter to use. If sep is None, will try to automatically determine
this. Regular expressions are accepted.
"""
engine : {'c', 'python'}
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete."""

_table_sep = """sep : string, default \\t (tab-stop)
Delimiter to use. Regular expressions are accepted."""
_table_params = """sep : string, default \\t (tab-stop)
Delimiter to use. Regular expressions are accepted.
engine : {'c', 'python'}
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete."""

_read_csv_doc = """
Read CSV (comma-separated) file into DataFrame

%s
""" % (_parser_params % _csv_sep)
""" % (_parser_params % _csv_params)

_read_table_doc = """
Read general delimited file into DataFrame

%s
""" % (_parser_params % _table_sep)
""" % (_parser_params % _table_params)

_fwf_widths = """\
colspecs : list of pairs (int, int) or 'infer'. optional
@@ -297,6 +306,8 @@ def _read(filepath_or_buffer, kwds):

def _make_parser_function(name, sep=','):

    default_sep = sep

    def parser_f(filepath_or_buffer,
                 sep=sep,
                 dialect=None,
@@ -325,7 +336,7 @@ def parser_f(filepath_or_buffer,
dtype=None,
usecols=None,

engine='c',
engine=None,
delim_whitespace=False,
as_recarray=False,
na_filter=True,
@@ -362,10 +373,21 @@ def parser_f(filepath_or_buffer,
if delimiter is None:
delimiter = sep

if delim_whitespace and delimiter is not default_sep:
    raise ValueError("Specified a delimiter with both sep and"\
                     " delim_whitespace=True; you can only specify one.")

if engine is not None:
    engine_specified = True
else:
    engine = 'c'
    engine_specified = False

kwds = dict(delimiter=delimiter,
engine=engine,
dialect=dialect,
compression=compression,
engine_specified=engine_specified,

doublequote=doublequote,
escapechar=escapechar,
@@ -468,10 +490,18 @@ class TextFileReader(object):

"""

def __init__(self, f, engine='python', **kwds):
def __init__(self, f, engine=None, **kwds):

self.f = f

if engine is not None:
    engine_specified = True
else:
    engine = 'python'
    engine_specified = False

self._engine_specified = kwds.get('engine_specified', engine_specified)

if kwds.get('dialect') is not None:
dialect = kwds['dialect']
kwds['delimiter'] = dialect.delimiter
@@ -530,30 +560,60 @@ def _get_options_with_defaults(self, engine):
def _clean_options(self, options, engine):
result = options.copy()

engine_specified = self._engine_specified
fallback_reason = None

sep = options['delimiter']
delim_whitespace = options['delim_whitespace']

# C engine not supported yet
if engine == 'c':
    if options['skip_footer'] > 0:
        fallback_reason = "the 'c' engine does not support"\
                          " skip_footer"
        engine = 'python'

if sep is None and not delim_whitespace:
    if engine == 'c':
        fallback_reason = "the 'c' engine does not support"\
                          " sep=None with delim_whitespace=False"
        engine = 'python'
elif sep is not None and len(sep) > 1:
# wait until regex engine integrated
if engine not in ('python', 'python-fwf'):
if engine == 'c' and sep == '\s+':
result['delim_whitespace'] = True
del result['delimiter']
elif engine not in ('python', 'python-fwf'):
# wait until regex engine integrated
fallback_reason = "the 'c' engine does not support"\
" regex separators"
engine = 'python'

# C engine not supported yet
if engine == 'c':
if options['skip_footer'] > 0:
engine = 'python'
if fallback_reason and engine_specified:
    raise ValueError(fallback_reason)

if engine == 'c':
    for arg in _c_unsupported:
        del result[arg]

if 'python' in engine:
    for arg in _python_unsupported:
        if fallback_reason and result[arg] != _c_parser_defaults[arg]:
            msg = ("Falling back to the 'python' engine because"
                   " {reason}, but this causes {option!r} to be"
                   " ignored as it is not supported by the 'python'"
                   " engine.").format(reason=fallback_reason, option=arg)
            if arg == 'dtype':
                msg += " (Note the 'converters' option provides"\
                       " similar functionality.)"
            raise ValueError(msg)
        del result[arg]

if fallback_reason:
    warnings.warn(("Falling back to the 'python' engine because"
                   " {0}; you can avoid this warning by specifying"
                   " engine='python'.").format(fallback_reason),
                  ParserWarning)

index_col = options['index_col']
names = options['names']
converters = options['converters']
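
To make the control flow of the ``_clean_options`` changes easier to review, here is a stand-alone sketch of the engine decision. ``choose_engine`` is a hypothetical helper, not a pandas function. It mirrors the branches in the hunk above but leaves out the later check that raises when falling back would drop python-unsupported options such as ``dtype``:

```python
def choose_engine(sep, delim_whitespace=False, skip_footer=0, engine=None):
    """Hypothetical stand-alone version of the engine decision.

    Returns (engine, delim_whitespace, fallback_reason).
    """
    engine_specified = engine is not None
    if engine is None:
        engine = 'c'
    fallback_reason = None

    # skip_footer is not supported by the C parser.
    if engine == 'c' and skip_footer > 0:
        fallback_reason = "the 'c' engine does not support skip_footer"
        engine = 'python'

    if sep is None and not delim_whitespace:
        if engine == 'c':
            fallback_reason = ("the 'c' engine does not support"
                               " sep=None with delim_whitespace=False")
            engine = 'python'
    elif sep is not None and len(sep) > 1:
        if engine == 'c' and sep == r'\s+':
            # Translated to delim_whitespace instead of falling back.
            delim_whitespace = True
        elif engine not in ('python', 'python-fwf'):
            fallback_reason = ("the 'c' engine does not support"
                               " regex separators")
            engine = 'python'

    # An explicitly requested engine is never silently overridden.
    if fallback_reason and engine_specified:
        raise ValueError(fallback_reason)

    return engine, delim_whitespace, fallback_reason


print(choose_engine(','))      # ('c', False, None)
print(choose_engine(r'\s+'))   # ('c', True, None)
```

A non-``None`` ``fallback_reason`` in the returned tuple corresponds to the case where the patch emits a ``ParserWarning``.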
3 changes: 3 additions & 0 deletions pandas/io/tests/test_cparser.py
@@ -323,6 +323,9 @@ def _test(text, **kwargs):
data = 'A B C\r 2 3\r4 5 6'
_test(data, delim_whitespace=True)

data = 'A B C\r2 3\r4 5 6'
_test(data, delim_whitespace=True)

def test_empty_field_eof(self):
data = 'a,b,c\n1,2,3\n4,,'
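
The ``\r``-delimited cases added above exercise the low-level reader directly. A rough user-level equivalent is sketched below, assuming Python 3, made-up data with complete rows, and ``sep=r'\s+'`` (which this PR translates to ``delim_whitespace=True`` for the C parser). It only checks that ``\r`` and ``\n`` line endings parse to the same frame:

```python
from io import StringIO

import pandas as pd

# Whitespace-delimited text with bare carriage returns as line endings.
cr_text = "A B C\r1 2 3\r4 5 6"
# The same content with ordinary newlines, used as the reference.
lf_text = cr_text.replace("\r", "\n")

cr_frame = pd.read_csv(StringIO(cr_text), sep=r"\s+")
lf_frame = pd.read_csv(StringIO(lf_text), sep=r"\s+")

same_result = cr_frame.equals(lf_frame)
```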
