merge

jbandlow committed Feb 6, 2018
2 parents af37225 + 93c86aa commit 2fb23d6
Showing 59 changed files with 1,521 additions and 932 deletions.
17 changes: 17 additions & 0 deletions asv_bench/benchmarks/index_object.py
@@ -147,6 +147,11 @@ def setup(self, dtype):
self.idx = getattr(tm, 'make{}Index'.format(dtype))(N)
self.array_mask = (np.arange(N) % 3) == 0
self.series_mask = Series(self.array_mask)
self.sorted = self.idx.sort_values()
half = N // 2
self.non_unique = self.idx[:half].append(self.idx[:half])
self.non_unique_sorted = self.sorted[:half].append(self.sorted[:half])
self.key = self.sorted[N // 4]

def time_boolean_array(self, dtype):
self.idx[self.array_mask]
@@ -163,6 +168,18 @@ def time_slice(self, dtype):
def time_slice_step(self, dtype):
self.idx[::2]

def time_get_loc(self, dtype):
self.idx.get_loc(self.key)

def time_get_loc_sorted(self, dtype):
self.sorted.get_loc(self.key)

def time_get_loc_non_unique(self, dtype):
self.non_unique.get_loc(self.key)

def time_get_loc_non_unique_sorted(self, dtype):
self.non_unique_sorted.get_loc(self.key)
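For context on why the benchmark covers all four cases: ``get_loc`` takes different code paths and returns different result types depending on uniqueness and sortedness. A small illustrative sketch (hypothetical toy indexes, not the benchmark data):

```python
import pandas as pd

# Unique index: get_loc returns an integer position
unique_idx = pd.Index(['a', 'b', 'c'])
print(unique_idx.get_loc('b'))

# Sorted but non-unique: get_loc returns a slice
non_unique_sorted = pd.Index(['a', 'b', 'b', 'c'])
print(non_unique_sorted.get_loc('b'))

# Unsorted and non-unique: get_loc returns a boolean mask
non_unique = pd.Index(['b', 'a', 'b'])
print(non_unique.get_loc('b'))
```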


class Float64IndexMethod(object):
# GH 13166
2 changes: 1 addition & 1 deletion doc/source/advanced.rst
@@ -672,7 +672,7 @@ The ``CategoricalIndex`` is **preserved** after indexing:
df2.loc['a'].index
Sorting the index will sort by the order of the categories (Recall that we
- created the index with with ``CategoricalDtype(list('cab'))``, so the sorted
+ created the index with ``CategoricalDtype(list('cab'))``, so the sorted
order is ``cab``.).
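The category-order sort that paragraph describes can be sketched with a small self-contained example (hypothetical data, using the same ``cab`` category order as the doc):

```python
import pandas as pd

# Hypothetical index whose category order is c < a < b
idx = pd.CategoricalIndex(list('aabbca'), categories=list('cab'))
df = pd.DataFrame({'value': range(6)}, index=idx)

# sort_index sorts by category order ('c' rows first), not alphabetically
df_sorted = df.sort_index()
print(list(df_sorted.index))
```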

.. ipython:: python
4 changes: 2 additions & 2 deletions doc/source/comparison_with_sas.rst
@@ -279,7 +279,7 @@ date/datetime columns.
The equivalent pandas operations are shown below. In addition to these
functions pandas supports other Time Series features
- not available in Base SAS (such as resampling and and custom offsets) -
+ not available in Base SAS (such as resampling and custom offsets) -
see the :ref:`timeseries documentation<timeseries>` for more details.

.. ipython:: python
@@ -584,7 +584,7 @@ For example, in SAS you could do this to filter missing values.
if value_x ^= .;
run;
- Which doesn't work in in pandas. Instead, the ``pd.isna`` or ``pd.notna`` functions
+ Which doesn't work in pandas. Instead, the ``pd.isna`` or ``pd.notna`` functions
should be used for comparisons.
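A minimal pandas sketch of that filter, mirroring the SAS ``if value_x ^= .;`` step (hypothetical frame, as after an outer merge):

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame with a missing value_x
outer = pd.DataFrame({'key': ['A', 'B', 'C'],
                      'value_x': [1.0, np.nan, 3.0]})

# Keep only rows where value_x is not missing
filtered = outer[pd.notna(outer['value_x'])]
print(filtered['key'].tolist())  # ['A', 'C']
```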

.. ipython:: python
2 changes: 1 addition & 1 deletion doc/source/computation.rst
@@ -512,7 +512,7 @@ a same sized result as the input.

When using ``.resample()`` with an offset. Construct a new index that is the frequency of the offset. For each frequency
bin, aggregate points from the input within a backwards-in-time looking window that fall in that bin. The result of this
- aggregation is the output for that frequency point. The windows are fixed size size in the frequency space. Your result
+ aggregation is the output for that frequency point. The windows are fixed size in the frequency space. Your result
will have the shape of a regular frequency between the min and the max of the original input object.

To summarize, ``.rolling()`` is a time-based window operation, while ``.resample()`` is a frequency-based window operation.
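The contrast can be sketched with a toy daily series (hypothetical data): ``.rolling()`` keeps one output point per input point, while ``.resample()`` produces one output point per frequency bin:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10.0),
              index=pd.date_range('2018-01-01', periods=10, freq='D'))

# Time-based window: same length as the input, one point per input point
rolled = s.rolling('3D').sum()

# Frequency-based window: one point per 3-day bin
resampled = s.resample('3D').sum()

print(len(rolled), len(resampled))  # 10 4
```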
7 changes: 4 additions & 3 deletions doc/source/groupby.rst
@@ -1219,8 +1219,8 @@ see :ref:`here <basics.pipe>`.
Combining ``.groupby`` and ``.pipe`` is often useful when you need to reuse
GroupBy objects.

- For an example, imagine having a DataFrame with columns for stores, products,
- revenue and sold quantity. We'd like to do a groupwise calculation of *prices*
+ As an example, imagine having a DataFrame with columns for stores, products,
+ revenue and quantity sold. We'd like to do a groupwise calculation of *prices*
(i.e. revenue/quantity) per store and per product. We could do this in a
multi-step operation, but expressing it in terms of piping can make the
code more readable. First we set the data:
@@ -1230,7 +1230,8 @@ code more readable. First we set the data:
import numpy as np
n = 1000
df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
- 'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
+ 'Product': np.random.choice(['Product_1',
+                              'Product_2'], n),
'Revenue': (np.random.random(n)*50+10).round(2),
'Quantity': np.random.randint(1, 10, size=n)})
df.head(2)
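The groupwise price calculation via ``.pipe`` that the text describes can be sketched end to end (mirroring the docs' setup; the ``seed`` call is added here for reproducibility):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # added for reproducibility
n = 1000
df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
                   'Product': np.random.choice(['Product_1', 'Product_2'], n),
                   'Revenue': (np.random.random(n) * 50 + 10).round(2),
                   'Quantity': np.random.randint(1, 10, size=n)})

# Pipe the GroupBy object straight into the price (revenue/quantity) calculation
prices = (df.groupby(['Store', 'Product'])
            .pipe(lambda grp: grp.Revenue.sum() / grp.Quantity.sum())
            .unstack().round(2))
print(prices)
```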
2 changes: 1 addition & 1 deletion doc/source/io.rst
@@ -4529,7 +4529,7 @@ Several caveats.
on an attempt at serialization.

You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, or ``fastparquet``, or ``auto``.
- If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``, then
+ If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
then ``pyarrow`` is tried, and falling back to ``fastparquet``.

See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__
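The resolution order described above can be sketched as follows (a hypothetical helper for illustration, not the actual pandas internals):

```python
import pandas as pd

def resolve_parquet_engine(engine='auto'):
    """Sketch of the engine-resolution order: explicit argument,
    then the pd.options.io.parquet.engine option, then pyarrow
    with a fallback to fastparquet."""
    if engine == 'auto':
        engine = pd.get_option('io.parquet.engine')
    if engine == 'auto':
        # Try pyarrow first, then fall back to fastparquet
        for candidate in ('pyarrow', 'fastparquet'):
            try:
                __import__(candidate)
                return candidate
            except ImportError:
                continue
        raise ImportError("Unable to find a usable parquet engine")
    return engine
```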
10 changes: 5 additions & 5 deletions doc/source/release.rst
@@ -406,7 +406,7 @@ of all enhancements and bugs that have been fixed in 0.20.1.

.. note::

- This is a combined release for 0.20.0 and and 0.20.1.
+ This is a combined release for 0.20.0 and 0.20.1.
Version 0.20.1 contains one additional change for backwards-compatibility with downstream projects using pandas' ``utils`` routines. (:issue:`16250`)

Thanks
@@ -2918,7 +2918,7 @@ Improvements to existing features
- clipboard functions use pyperclip (no dependencies on Windows, alternative
dependencies offered for Linux) (:issue:`3837`).
- Plotting functions now raise a ``TypeError`` before trying to plot anything
- if the associated objects have have a dtype of ``object`` (:issue:`1818`,
+ if the associated objects have a dtype of ``object`` (:issue:`1818`,
:issue:`3572`, :issue:`3911`, :issue:`3912`), but they will try to convert object
arrays to numeric arrays if possible so that you can still plot, for example, an
object array with floats. This happens before any drawing takes place which
@@ -4082,7 +4082,7 @@ Bug Fixes
columns (:issue:`1943`)
- Fix time zone localization bug causing improper fields (e.g. hours) in time
zones that have not had a UTC transition in a long time (:issue:`1946`)
- - Fix errors when parsing and working with with fixed offset timezones
+ - Fix errors when parsing and working with fixed offset timezones
(:issue:`1922`, :issue:`1928`)
- Fix text parser bug when handling UTC datetime objects generated by
dateutil (:issue:`1693`)
@@ -4383,7 +4383,7 @@ Bug Fixes
error (:issue:`1090`)
- Consistently set name on groupby pieces (:issue:`184`)
- Treat dict return values as Series in GroupBy.apply (:issue:`823`)
- - Respect column selection for DataFrame in in GroupBy.transform (:issue:`1365`)
+ - Respect column selection for DataFrame in GroupBy.transform (:issue:`1365`)
- Fix MultiIndex partial indexing bug (:issue:`1352`)
- Enable assignment of rows in mixed-type DataFrame via .ix (:issue:`1432`)
- Reset index mapping when grouping Series in Cython (:issue:`1423`)
@@ -5040,7 +5040,7 @@ New Features
- Add `melt` function to `pandas.core.reshape`
- Add `level` parameter to group by level in Series and DataFrame
descriptive statistics (:issue:`313`)
- - Add `head` and `tail` methods to Series, analogous to to DataFrame (PR
+ - Add `head` and `tail` methods to Series, analogous to DataFrame (PR
:issue:`296`)
- Add `Series.isin` function which checks if each value is contained in a
passed sequence (:issue:`289`)
3 changes: 2 additions & 1 deletion doc/source/text.rst
@@ -218,7 +218,8 @@ Extract first match in each subject (extract)
``DataFrame``, depending on the subject and regular expression
pattern (same behavior as pre-0.18.0). When ``expand=True`` it
always returns a ``DataFrame``, which is more consistent and less
- confusing from the perspective of a user.
+ confusing from the perspective of a user. ``expand=True`` is the
+ default since version 0.23.0.
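A short sketch of the two return shapes (hypothetical strings, not from the docs):

```python
import pandas as pd

s = pd.Series(['a1', 'b2', 'c3'])

# expand=True (the default from 0.23.0) always returns a DataFrame,
# even for a single group; non-matching rows become NaN
frame = s.str.extract(r'([ab])(\d)', expand=True)

# A single group with expand=False returns a Series
series = s.str.extract(r'([ab])', expand=False)
```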

The ``extract`` method accepts a `regular expression
<https://docs.python.org/3/library/re.html>`__ with at least one
2 changes: 1 addition & 1 deletion doc/source/tutorials.rst
@@ -19,7 +19,7 @@ pandas Cookbook
The goal of this cookbook (by `Julia Evans <http://jvns.ca>`_) is to
give you some concrete examples for getting started with pandas. These
are examples with real-world data, and all the bugs and weirdness that
- that entails.
+ entails.

Here are links to the v0.1 release. For an up-to-date table of contents, see the `pandas-cookbook GitHub
repository <http://github.com/jvns/pandas-cookbook>`_. To run the examples in this tutorial, you'll need to
103 changes: 101 additions & 2 deletions doc/source/whatsnew/v0.23.0.txt
@@ -204,6 +204,50 @@ Please note that the string `index` is not supported with the round trip format,
new_df
print(new_df.index.name)

.. _whatsnew_0230.enhancements.index_division_by_zero:

Index Division By Zero Fills Correctly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Division operations on ``Index`` and subclasses will now fill division of positive numbers by zero with ``np.inf``, division of negative numbers by zero with ``-np.inf`` and ``0 / 0`` with ``np.nan``. This matches existing ``Series`` behavior. (:issue:`19322`, :issue:`19347`)

Previous Behavior:

.. code-block:: ipython

In [6]: index = pd.Int64Index([-1, 0, 1])

In [7]: index / 0
Out[7]: Int64Index([0, 0, 0], dtype='int64')

# Previous behavior yielded different results depending on the type of zero in the divisor
In [8]: index / 0.0
Out[8]: Float64Index([-inf, nan, inf], dtype='float64')

In [9]: index = pd.UInt64Index([0, 1])

In [10]: index / np.array([0, 0], dtype=np.uint64)
Out[10]: UInt64Index([0, 0], dtype='uint64')

In [11]: pd.RangeIndex(1, 5) / 0
ZeroDivisionError: integer division or modulo by zero

Current Behavior:

.. ipython:: python

index = pd.Int64Index([-1, 0, 1])
# division by zero gives -infinity where negative, +infinity where positive, and NaN for 0 / 0
index / 0

# The result of division by zero should not depend on whether the zero is int or float
index / 0.0

index = pd.UInt64Index([0, 1])
index / np.array([0, 0], dtype=np.uint64)

pd.RangeIndex(1, 5) / 0
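An executable check of the fill behavior (using the generic ``pd.Index`` constructor, since the specialized ``Int64Index`` constructor was later deprecated in pandas):

```python
import numpy as np
import pandas as pd

index = pd.Index([-1, 0, 1])
result = index / 0

# negative / 0 -> -inf, 0 / 0 -> nan, positive / 0 -> inf
print(list(result))
```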

.. _whatsnew_0230.enhancements.other:

Other Enhancements
@@ -289,13 +333,64 @@ Convert to an xarray DataArray
p.to_xarray()


.. _whatsnew_0230.api_breaking.build_changes:

Build Changes
^^^^^^^^^^^^^

- Building pandas for development now requires ``cython >= 0.24`` (:issue:`18613`)
- Building from source now explicitly requires ``setuptools`` in ``setup.py`` (:issue:`18113`)
- Updated conda recipe to be in compliance with conda-build 3.0+ (:issue:`18002`)

.. _whatsnew_0230.api_breaking.extract:

Extraction of matching patterns from strings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, extracting matching patterns from strings with :func:`str.extract` used to return a
``Series`` if a single group was being extracted (a ``DataFrame`` if more than one group was
extracted). As of pandas 0.23.0, :func:`str.extract` always returns a ``DataFrame``, unless
``expand`` is set to ``False`` (:issue:`11386`).

Also, ``None`` was an accepted value for the ``expand`` parameter (which was equivalent to
``False``), but now raises a ``ValueError``.

Previous Behavior:

.. code-block:: ipython

In [1]: s = pd.Series(['number 10', '12 eggs'])

In [2]: extracted = s.str.extract('.*(\d\d).*')

In [3]: extracted
Out [3]:
0 10
1 12
dtype: object

In [4]: type(extracted)
Out [4]:
pandas.core.series.Series

New Behavior:

.. ipython:: python

s = pd.Series(['number 10', '12 eggs'])
extracted = s.str.extract('.*(\d\d).*')
extracted
type(extracted)

To restore previous behavior, simply set ``expand`` to ``False``:

.. ipython:: python

s = pd.Series(['number 10', '12 eggs'])
extracted = s.str.extract('.*(\d\d).*', expand=False)
extracted
type(extracted)
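The type change can be verified directly with the same example data as above:

```python
import pandas as pd

s = pd.Series(['number 10', '12 eggs'])

as_frame = s.str.extract(r'.*(\d\d).*')                 # expand=True default: DataFrame
as_series = s.str.extract(r'.*(\d\d).*', expand=False)  # Series, as before 0.23.0

print(type(as_frame).__name__, type(as_series).__name__)
```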

.. _whatsnew_0230.api:

Other API Changes
@@ -455,6 +550,7 @@ Datetimelike
- Bug in :func:`Series.truncate` which raises ``TypeError`` with a monotonic ``PeriodIndex`` (:issue:`17717`)
- Bug in :func:`~DataFrame.pct_change` using ``periods`` and ``freq`` returned different length outputs (:issue:`7292`)
- Bug in comparison of :class:`DatetimeIndex` against ``None`` or ``datetime.date`` objects raising ``TypeError`` for ``==`` and ``!=`` comparisons instead of all-``False`` and all-``True``, respectively (:issue:`19301`)
- Bug in :class:`Timestamp` and :func:`to_datetime` where a string representing a barely out-of-bounds timestamp would be incorrectly rounded down instead of raising ``OutOfBoundsDatetime`` (:issue:`19382`)
-

Timezones
@@ -531,6 +627,7 @@ I/O
- Bug in :func:`DataFrame.to_parquet` where an exception was raised if the write destination is S3 (:issue:`19134`)
- :class:`Interval` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`)
- :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for xls file type (:issue:`19242`, :issue:`9155`)
- Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)

Plotting
^^^^^^^^
@@ -547,15 +644,16 @@ Groupby/Resample/Rolling
- Fixed regression in :func:`DataFrame.groupby` which would not emit an error when called with a tuple key not in the index (:issue:`18798`)
- Bug in :func:`DataFrame.resample` which silently ignored unsupported (or mistyped) options for ``label``, ``closed`` and ``convention`` (:issue:`19303`)
- Bug in :func:`DataFrame.groupby` where tuples were interpreted as lists of keys rather than as keys (:issue:`17979`, :issue:`18249`)
- - Bug in ``transform`` where particular aggregation functions were being incorrectly cast to match the dtype(s) of the grouped data (:issue:`19200`)
+ - Bug in :func:`DataFrame.groupby` where aggregation by ``first``/``last``/``min``/``max`` was causing timestamps to lose precision (:issue:`19526`)
+ - Bug in :func:`DataFrame.transform` where particular aggregation functions were being incorrectly cast to match the dtype(s) of the grouped data (:issue:`19200`)
- Bug in :func:`DataFrame.groupby` passing the `on=` kwarg, and subsequently using ``.apply()`` (:issue:`17813`)

Sparse
^^^^^^

- Bug in which creating a ``SparseDataFrame`` from a dense ``Series`` or an unsupported type raised an uncontrolled exception (:issue:`19374`)
- Bug in :class:`SparseDataFrame.to_csv` causing exception (:issue:`19384`)
- -
+ - Bug in :class:`SparseSeries.memory_usage` which caused segfault by accessing non sparse elements (:issue:`19368`)

Reshaping
^^^^^^^^^
@@ -571,6 +669,7 @@ Reshaping
- Bug in :func:`DataFrame.stack`, :func:`DataFrame.unstack`, :func:`Series.unstack` which were not returning subclasses (:issue:`15563`)
- Bug in timezone comparisons, manifesting as a conversion of the index to UTC in ``.concat()`` (:issue:`18523`)
- Bug in :func:`concat` when concatting sparse and dense series it returns only a ``SparseDataFrame``. Should be a ``DataFrame``. (:issue:`18914`, :issue:`18686`, and :issue:`16874`)
- Improved error message for :func:`DataFrame.merge` when there is no common merge key (:issue:`19427`)
-

