What's new in 1.2.0 (December 26, 2020)
----------------------------------------

These are the changes in pandas 1.2.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

.. warning::

   The ``xlwt`` package for writing old-style ``.xls`` Excel files is no longer maintained. The ``xlrd`` package is now only for reading old-style ``.xls`` files.

   Previously, the default argument ``engine=None`` to :func:`~pandas.read_excel` would result in using the ``xlrd`` engine in many cases, including new Excel 2007+ (``.xlsx``) files. If ``openpyxl`` is installed, many of these cases will now default to using the ``openpyxl`` engine. See the :func:`read_excel` documentation for more details.

   Thus, it is strongly encouraged to install ``openpyxl`` to read Excel 2007+ (``.xlsx``) files. Please do not report issues when using ``xlrd`` to read ``.xlsx`` files; this is no longer supported, so switch to ``openpyxl`` instead.

   Attempting to use the ``xlwt`` engine will raise a ``FutureWarning`` unless the option :attr:`io.excel.xls.writer` is set to ``"xlwt"``. While this option is now deprecated and will also raise a ``FutureWarning``, it can be globally set and the warning suppressed. Users are recommended to write ``.xlsx`` files using the ``openpyxl`` engine instead.

Enhancements
~~~~~~~~~~~~

Optionally disallow duplicate labels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`Series` and :class:`DataFrame` can now be created with an ``allows_duplicate_labels=False`` flag to control whether the index or columns can contain duplicate labels (:issue:`28394`). This can be used to prevent accidental introduction of duplicate labels, which can affect downstream operations.

By default, duplicates continue to be allowed.

.. code-block:: ipython

   In [1]: pd.Series([1, 2], index=['a', 'a'])
   Out[1]:
   a    1
   a    2
   Length: 2, dtype: int64

   In [2]: pd.Series([1, 2], index=['a', 'a']).set_flags(allows_duplicate_labels=False)
   ...
   DuplicateLabelError: Index has duplicates.
         positions
   label
   a        [0, 1]

pandas will propagate the ``allows_duplicate_labels`` property through many operations.

.. code-block:: ipython

   In [3]: a = (
      ...:     pd.Series([1, 2], index=['a', 'b'])
      ...:       .set_flags(allows_duplicate_labels=False)
      ...: )

   In [4]: a
   Out[4]:
   a    1
   b    2
   Length: 2, dtype: int64

   # An operation introducing duplicates
   In [5]: a.reindex(['a', 'b', 'a'])
   ...
   DuplicateLabelError: Index has duplicates.
         positions
   label
   a        [0, 2]

   [1 rows x 1 columns]

.. warning::

   This is an experimental feature. Currently, many methods fail to propagate the ``allows_duplicate_labels`` value. In future versions it is expected that every method taking or returning one or more DataFrame or Series objects will propagate ``allows_duplicate_labels``.

See :ref:`duplicates` for more.

The ``allows_duplicate_labels`` flag is stored in the new :attr:`DataFrame.flags` attribute. This stores global attributes that apply to the pandas object. This differs from :attr:`DataFrame.attrs`, which stores information that applies to the dataset.
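The distinction between the two attributes can be illustrated with a minimal sketch (the ``"source"`` metadata key is a made-up example, not a pandas convention):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# Flags apply to the pandas object itself; duplicate labels are
# allowed by default
df.flags.allows_duplicate_labels  # True

# set_flags returns a new object with the requested flag value
deduped = df.set_flags(allows_duplicate_labels=False)
deduped.flags.allows_duplicate_labels  # False

# attrs, by contrast, holds metadata describing the dataset the
# object wraps
df.attrs["source"] = "example.csv"
```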

Passing arguments to fsspec backends
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Many read/write functions have acquired the ``storage_options`` optional argument, to pass a dictionary of parameters to the storage backend. This allows, for example, passing credentials to S3 and GCS storage. The details of what parameters can be passed to which backends can be found in the documentation of the individual storage backends (detailed from the fsspec docs for builtin implementations and linked to external ones). See Section :ref:`io.remote`.

:issue:`35655` added fsspec support (including ``storage_options``) for reading Excel files.

Support for binary file handles in ``to_csv``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`to_csv` supports file handles in binary mode (:issue:`19827` and :issue:`35058`) with ``encoding`` (:issue:`13068` and :issue:`23854`) and ``compression`` (:issue:`22555`). If pandas cannot automatically detect whether the file handle is opened in binary or text mode, it is necessary to provide ``mode="wb"``.

For example:

.. ipython:: python

   import io

   data = pd.DataFrame([0, 1, 2])
   buffer = io.BytesIO()
   data.to_csv(buffer, encoding="utf-8", compression="gzip")
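A round trip through a binary handle makes the behavior concrete; a minimal sketch using only pandas and the standard library, with ``mode="wb"`` passed explicitly:

```python
import io

import pandas as pd

data = pd.DataFrame({"x": [0, 1, 2]})

# mode="wb" states explicitly that the handle is binary, for cases
# where pandas cannot detect this on its own
buffer = io.BytesIO()
data.to_csv(buffer, mode="wb", encoding="utf-8", compression="gzip", index=False)

# the gzip-compressed bytes can be read straight back from the same handle
buffer.seek(0)
roundtrip = pd.read_csv(buffer, compression="gzip")
```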

Support for short caption and table position in ``to_latex``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`DataFrame.to_latex` now allows one to specify a floating table position (:issue:`35281`) and a short caption (:issue:`36267`).

The keyword ``position`` has been added to set the table's placement in the LaTeX output.

.. ipython:: python
   :okwarning:

   data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
   table = data.to_latex(position='ht')
   print(table)

Usage of the keyword ``caption`` has been extended. Besides taking a single string as an argument, one can optionally provide a tuple ``(full_caption, short_caption)`` to add a short caption macro.

.. ipython:: python
   :okwarning:

   data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
   table = data.to_latex(caption=('the full long caption', 'short caption'))
   print(table)

Change in default floating precision for ``read_csv`` and ``read_table``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For the C parsing engine, the methods :meth:`read_csv` and :meth:`read_table` previously defaulted to a parser that could read floating point numbers slightly incorrectly with respect to the last bit in precision. The option ``float_precision="high"`` has always been available to avoid this issue. Beginning with this version, the default is now to use the more accurate parser: ``float_precision=None`` corresponds to the high-precision parser, and the new option ``float_precision="legacy"`` selects the legacy parser. The change to the higher-precision parser by default should have no impact on performance (:issue:`17154`).
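Note that the :meth:`read_csv` keyword is spelled ``float_precision``. A minimal sketch of selecting each parser (the input value is an arbitrary example):

```python
import io

import pandas as pd

csv_data = "a\n0.30661019938070954715\n"

# float_precision=None now selects the high-precision parser
default = pd.read_csv(io.StringIO(csv_data))

# the previous parser remains available under float_precision="legacy"
legacy = pd.read_csv(io.StringIO(csv_data), float_precision="legacy")
```

Both calls return a float column; they may differ only in the least significant bit of the parsed value.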

Experimental nullable data types for float data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We've added :class:`Float32Dtype` / :class:`Float64Dtype` and :class:`~arrays.FloatingArray`. These are extension data types dedicated to floating point data that can hold the ``pd.NA`` missing value indicator (:issue:`32265`, :issue:`34307`).

While the default float data type already supports missing values using ``np.nan``, these new data types use ``pd.NA`` (and its corresponding behavior) as the missing value indicator, in line with the already existing nullable :ref:`integer <integer_na>` and :ref:`boolean <boolean>` data types.

One example where the behavior of ``np.nan`` and ``pd.NA`` differs is comparison operations:

.. ipython:: python

  # the default NumPy float64 dtype
  s1 = pd.Series([1.5, None])
  s1
  s1 > 1

.. ipython:: python

  # the new nullable float64 dtype
  s2 = pd.Series([1.5, None], dtype="Float64")
  s2
  s2 > 1

See the :ref:`missing_data.NA` doc section for more details on the behavior when using the ``pd.NA`` missing value indicator.

As shown above, the dtype can be specified using the "Float64" or "Float32" string (capitalized to distinguish it from the default "float64" data type). Alternatively, you can also use the dtype object:

.. ipython:: python

   pd.Series([1.5, None], dtype=pd.Float32Dtype())

Operations with the existing integer or boolean nullable data types that give float results will now also use the nullable floating data types (:issue:`38178`).
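For instance, dividing a nullable-integer Series now yields the nullable ``Float64`` dtype rather than NumPy ``float64``; a small sketch:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# a float-producing operation keeps the nullable semantics: the result
# dtype is Float64 and the missing entry stays pd.NA
result = s / 2
```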

.. warning::

   Experimental: the new floating data types are currently experimental, and their behavior or API may still change without warning. Especially the behavior regarding ``NaN`` (distinct from ``NA`` missing values) is subject to change.

Index/column name preservation when aggregating
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When aggregating using :func:`concat` or the :class:`DataFrame` constructor, pandas will now attempt to preserve index and column names whenever possible (:issue:`35847`). In the case where all inputs share a common name, this name will be assigned to the result. When the input names do not all agree, the result will be unnamed. Here is an example where the index name is preserved:

.. ipython:: python

    idx = pd.Index(range(5), name='abc')
    ser = pd.Series(range(5, 10), index=idx)
    pd.concat({'x': ser[1:], 'y': ser[:-1]}, axis=1)

The same is true for :class:`MultiIndex`, but the logic is applied separately on a level-by-level basis.
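Both outcomes, preserved and dropped names, can be shown with a short sketch (the index names ``key`` and ``other`` are arbitrary):

```python
import pandas as pd

a = pd.Series([1, 2], index=pd.Index([0, 1], name="key"))
b = pd.Series([3, 4], index=pd.Index([0, 1], name="key"))
c = pd.Series([5, 6], index=pd.Index([0, 1], name="other"))

# all inputs share the index name "key", so the result keeps it
same = pd.concat([a, b], axis=1)

# the input names disagree, so the result index is unnamed
mixed = pd.concat([a, c], axis=1)
```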

GroupBy supports EWM operations directly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`.DataFrameGroupBy` now supports exponentially weighted window operations directly (:issue:`16037`).

.. ipython:: python

    df = pd.DataFrame({'A': ['a', 'b', 'a', 'b'], 'B': range(4)})
    df
    df.groupby('A').ewm(com=1.0).mean()

Additionally, ``mean`` supports execution via Numba with the ``engine`` and ``engine_kwargs`` arguments. Numba must be installed as an optional dependency to use this feature.

Other enhancements
^^^^^^^^^^^^^^^^^^

Notable bug fixes
~~~~~~~~~~~~~~~~~

These are bug fixes that might have notable behavior changes.

Consistency of DataFrame Reductions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`DataFrame.any` and :meth:`DataFrame.all` with ``bool_only=True`` now determine whether to exclude object-dtype columns on a column-by-column basis, instead of checking whether all object-dtype columns can be considered boolean.

This prevents pathological behavior where applying the reduction on a subset of columns could result in a larger Series result (:issue:`37799`).

.. ipython:: python

    df = pd.DataFrame({"A": ["foo", "bar"], "B": [True, False]}, dtype=object)
    df["C"] = pd.Series([True, True])


*Previous behavior*:

.. code-block:: ipython

   In [5]: df.all(bool_only=True)
   Out[5]:
   C    True
   dtype: bool

   In [6]: df[["B", "C"]].all(bool_only=True)
   Out[6]:
   B    False
   C    True
   dtype: bool

*New behavior*:

.. ipython:: python
   :okwarning:

   df.all(bool_only=True)

   df[["B", "C"]].all(bool_only=True)


Other DataFrame reductions with ``numeric_only=None`` will also avoid this pathological behavior (:issue:`37827`):

.. ipython:: python

    df = pd.DataFrame({"A": [0, 1, 2], "B": ["a", "b", "c"]}, dtype=object)


*Previous behavior*:

.. code-block:: ipython

   In [3]: df.mean()
   Out[3]: Series([], dtype: float64)

   In [4]: df[["A"]].mean()
   Out[4]:
   A    1.0
   dtype: float64

*New behavior*:

.. code-block:: ipython

   In [3]: df.mean()
   Out[3]:
   A    1.0
   dtype: float64

   In [4]: df[["A"]].mean()
   Out[4]:
   A    1.0
   dtype: float64

Moreover, DataFrame reductions with ``numeric_only=None`` will now be consistent with their Series counterparts. In particular, for reductions where the Series method raises ``TypeError``, the DataFrame reduction will now consider that column non-numeric instead of casting to a NumPy array which may have different semantics (:issue:`36076`, :issue:`28949`, :issue:`21020`).

.. ipython:: python
   :okwarning:

    ser = pd.Series([0, 1], dtype="category", name="A")
    df = ser.to_frame()


*Previous behavior*:

.. code-block:: ipython

   In [5]: df.any()
   Out[5]:
   A    True
   dtype: bool

*New behavior*:

.. code-block:: ipython

   In [5]: df.any()
   Out[5]: Series([], dtype: bool)

Increased minimum version for Python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pandas 1.2.0 supports Python 3.7.1 and higher (:issue:`35214`).

Increased minimum versions for dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some minimum supported versions of dependencies were updated (:issue:`35214`). If installed, we now require:

=============== =============== ======== =======
Package         Minimum Version Required Changed
=============== =============== ======== =======
numpy           1.16.5          X        X
pytz            2017.3          X        X
python-dateutil 2.7.3           X
bottleneck      1.2.1
numexpr         2.6.8                    X
pytest (dev)    5.0.1                    X
mypy (dev)      0.782                    X
=============== =============== ======== =======

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

============== =============== =======
Package        Minimum Version Changed
============== =============== =======
beautifulsoup4 4.6.0
fastparquet    0.3.2
fsspec         0.7.4
gcsfs          0.6.0
lxml           4.3.0           X
matplotlib     2.2.3           X
numba          0.46.0
openpyxl       2.6.0           X
pyarrow        0.15.0          X
pymysql        0.7.11          X
pytables       3.5.1           X
s3fs           0.4.0
scipy          1.2.0
sqlalchemy     1.2.8           X
xarray         0.12.3          X
xlrd           1.2.0           X
xlsxwriter     1.0.2           X
xlwt           1.3.0           X
pandas-gbq     0.12.0
============== =============== =======

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Other API changes
~~~~~~~~~~~~~~~~~

Deprecations
~~~~~~~~~~~~

Calling NumPy ufuncs on non-aligned DataFrames
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Calling NumPy ufuncs on non-aligned DataFrames changed behaviour in pandas 1.2.0 (to align the inputs before calling the ufunc), but this change was reverted in pandas 1.2.1. The behaviour of not aligning is now deprecated instead; see :ref:`the 1.2.1 release notes <whatsnew_121.ufunc_deprecation>` for more details.
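A sketch of the operation in question; the exact result depends on the pandas version in use (1.2.0 aligned the inputs, 1.2.1 reverted that, and later versions deprecate and then enforce alignment), so only version-independent properties are shown:

```python
import numpy as np
import pandas as pd

# the frames share a column but have only partially overlapping indexes
df1 = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"a": [3, 4]}, index=[1, 2])

# pandas 1.2.0 aligned df1 and df2 on their indexes before applying the
# ufunc; 1.2.1 reverted this and deprecated the non-aligned behaviour
result = np.add(df1, df2)
```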

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Categorical
^^^^^^^^^^^

Datetime-like
^^^^^^^^^^^^^

Timedelta
^^^^^^^^^

Timezones
^^^^^^^^^

Numeric
^^^^^^^

Conversion
^^^^^^^^^^

Strings
^^^^^^^

Interval
^^^^^^^^

Indexing
^^^^^^^^

Missing
^^^^^^^

MultiIndex
^^^^^^^^^^

I/O
^^^

Period
^^^^^^

Plotting
^^^^^^^^

Styler
^^^^^^

Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^

Reshaping
^^^^^^^^^

ExtensionArray
^^^^^^^^^^^^^^

Other
^^^^^

Contributors
~~~~~~~~~~~~

.. contributors:: v1.1.5..v1.2.0