Merge remote branch 'jreback/dtypes'

* jreback/dtypes:
  ENH: allow propagation and coexistence of numeric dtypes (closes GH #622)
    construction of multi numeric dtypes with other types in a dict
    validated get_numeric_data returns correct dtypes
    added blocks attribute (and as_blocks()) method that returns a dict of dtype -> homogeneous Frame to DataFrame
    added keyword 'raise_on_error' to astype, which can be set to false to exclude non-numeric columns
    fixed merging to correctly merge on multiple dtypes with blocks (e.g. float64 and float32 in other merger)
    changed implementation of get_dtype_counts() to use .blocks
    revised DataFrame.convert_objects to use blocks to be more efficient
    added Dtype printing to show on default with a Series
    added convert_dates='coerce' option to convert_objects, to force conversions to datetime64[ns]
    where can upcast integer to float as needed (on inplace ops #2793)
    added fully cythonized support for int8/int16
    no support for float16 (it can exist, but no cython methods for it)
pandas-dev · Feb 10, 2013 · 8ad9598
2 parents eb505fd + 166a80d
commit 8ad9598

Showing 37 changed files with 9,634 additions and 3,178 deletions.
diff --git a/RELEASE.rst b/RELEASE.rst
@@ -22,6 +22,42 @@ Where to get it
 * Binary installers on PyPI: http://pypi.python.org/pypi/pandas
 * Documentation: http://pandas.pydata.org
 
+pandas 0.10.2
+=============
+
+**Release date:** 2013-??-??
+
+**New features**
+
+  - Allow mixed dtypes (e.g. ``float32/float64/int32/int16/int8``) to coexist in DataFrames and propagate in operations
+
+**Improvements to existing features**
+
+  - added ``blocks`` attribute to DataFrames, to return a dict of dtypes to homogeneously dtyped DataFrames
+  - added keyword ``convert_numeric`` to ``convert_objects()`` to try to convert object dtypes to numeric types
+  - ``convert_dates`` in ``convert_objects`` can now be ``coerce`` which will return a datetime64[ns] dtype
+    with non-convertibles set as ``NaT``; will preserve an all-nan object (e.g. strings)
+  - Series print output now includes the dtype by default
+
+**API Changes**
+
+  - Do not automatically upcast numeric specified dtypes to ``int64`` or ``float64`` (GH622_ and GH797_)
+  - Guarantee that ``convert_objects()`` for Series/DataFrame always returns a copy
+  - groupby operations will respect dtypes for numeric float operations (float32/float64); other types will be operated on,
+    and will try to cast back to the input dtype (e.g. if an int is passed, as long as the output doesn't have nans, 
+    then an int will be returned)
+  - backfill/pad/take/diff/ohlc will now support ``float32/int16/int8`` operations
+  - Integer block types will upcast as needed in where operations (GH2793_)
+
+**Bug Fixes**
+
+  - Fix seg fault on empty data frame when fillna with ``pad`` or ``backfill`` (GH2778_)
+
+.. _GH622: https://github.com/pydata/pandas/issues/622
+.. _GH797: https://github.com/pydata/pandas/issues/797
+.. _GH2778: https://github.com/pydata/pandas/issues/2778
+.. _GH2793: https://github.com/pydata/pandas/issues/2793
+
 pandas 0.10.1
 =============
 
@@ -36,6 +72,7 @@ pandas 0.10.1
   - Restored inplace=True behavior returning self (same object) with
     deprecation warning until 0.11 (GH1893_)
   - ``HDFStore``
+
     - refactored HFDStore to deal with non-table stores as objects, will allow future enhancements
     - removed keyword ``compression`` from ``put`` (replaced by keyword
       ``complib`` to be consistent across library)
@@ -49,7 +86,7 @@ pandas 0.10.1
     - support data column indexing and selection, via ``data_columns`` keyword in append
     - support write chunking to reduce memory footprint, via ``chunksize``
       keyword to append
-    - support automagic indexing via ``index`` keywork to append
+    - support automagic indexing via ``index`` keyword to append
     - support ``expectedrows`` keyword in append to inform ``PyTables`` about
       the expected tablesize
     - support ``start`` and ``stop`` keywords in select to limit the row
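
Note on the RELEASE.rst entries above: the "do not automatically upcast" and "Series print output now includes the dtype" items can be checked with a small sketch like the one below. It assumes a reasonably recent pandas and NumPy; the column names (``small_int``, ``single``, ``double``) are made up for illustration and do not come from the commit.

```python
import numpy as np
import pandas as pd

# Build a frame from arrays that already carry narrow dtypes.
# The expectation from the release note is that they are kept as-is
# rather than widened to int64/float64.
df = pd.DataFrame({
    "small_int": np.array([1, 2, 3], dtype="int8"),
    "single": np.array([1.0, 2.0, 3.0], dtype="float32"),
    "double": np.array([1.0, 2.0, 3.0]),  # NumPy default, float64
})
dtype_names = {col: str(dt) for col, dt in df.dtypes.items()}
print(dtype_names)

# The last line of a Series repr reports its dtype.
series_repr = repr(pd.Series([1.5, 2.5], dtype="float32"))
print(series_repr)
```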

diff --git a/doc/source/dsintro.rst b/doc/source/dsintro.rst
@@ -450,15 +450,101 @@ DataFrame:
    df.xs('b')
    df.ix[2]
 
-Note if a DataFrame contains columns of multiple dtypes, the dtype of the row
-will be chosen to accommodate all of the data types (dtype=object is the most
-general).
-
 For a more exhaustive treatment of more sophisticated label-based indexing and
 slicing, see the :ref:`section on indexing <indexing>`. We will address the
 fundamentals of reindexing / conforming to new sets of lables in the
 :ref:`section on reindexing <basics.reindexing>`.
 
+DataTypes
+~~~~~~~~~
+
+.. _dsintro.column_types:
+
+The main types stored in pandas objects are float, int, boolean, datetime64[ns],
+and object. A convenient ``dtypes`` attribute returns a Series with the data type of
+each column.
+
+.. ipython:: python
+
+   df['integer'] = 1
+   df['int32']   = df['integer'].astype('int32')
+   df['float32'] = Series([1.0]*len(df),dtype='float32')
+   df['timestamp'] = Timestamp('20010102')
+   df.dtypes
+
+If a DataFrame contains columns of multiple dtypes, the dtype of the column
+will be chosen to accommodate all of the data types (dtype=object is the most
+general).
+
+The related method ``get_dtype_counts`` will return the number of columns of
+each type:
+
+.. ipython:: python
+
+   df.get_dtype_counts()
+
+Numeric dtypes will propagate and can coexist in DataFrames (starting in v0.10.2).
+If a dtype is passed (either directly via the ``dtype`` keyword, a passed ``ndarray``,
+or a passed ``Series``), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will **NOT** be combined. The following example will give you a taste.
+
+.. ipython:: python
+
+   df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
+   df1
+   df1.dtypes
+   df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'), 
+                         B = Series(randn(8)), 
+                         C = Series(np.array(randn(8),dtype='uint8')) ))
+   df2
+   df2.dtypes
+
+   # here you get some upcasting
+   df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
+   df3
+   df3.dtypes
+
+   # this is lowest-common-denominator upcasting (meaning you get the dtype which can accommodate all of the types)
+   df3.values.dtype
+
+Upcasting is always according to the **numpy** rules. If two different dtypes are involved in an operation, then the more *general* one will be used as the result of the operation.
+
+DataType Conversion
+~~~~~~~~~~~~~~~~~~~
+
+You can use the ``astype`` method to convert dtypes from one to another. This *always* returns a copy.
+In addition, ``convert_objects`` will attempt a *soft* conversion of any *object* dtypes, meaning that if all the objects in a Series are of the same type, the Series
+will have that dtype.
+
+.. ipython:: python
+
+   df3
+   df3.dtypes
+
+   # conversion of dtypes
+   df3.astype('float32').dtypes
+
+To force conversion to numeric types, pass ``convert_numeric = True``.
+This will force strings and numbers alike to be numbers if possible; otherwise they will be set to ``np.nan``.
+To force conversion to ``datetime64[ns]``, pass ``convert_dates = 'coerce'``.
+This will convert any datetimelike object to dates, forcing other values to ``NaT``.
+
+.. ipython:: python
+
+   # mixed type conversions
+   df3['D'] = '1.'
+   df3['E'] = '1'
+   df3.convert_objects(convert_numeric=True).dtypes
+
+   # same, but specific dtype conversion
+   df3['D'] = df3['D'].astype('float16')
+   df3['E'] = df3['E'].astype('int32')
+   df3.dtypes
+
+   # forcing date coercion
+   s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1, Timestamp('20010104'), '20010105'],dtype='O')
+   s
+   s.convert_objects(convert_dates='coerce')
+
 Data alignment and arithmetic
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -633,26 +719,6 @@ You can also disable this feature via the ``expand_frame_repr`` option:
    reset_option('expand_frame_repr')
 
 
-DataFrame column types
-~~~~~~~~~~~~~~~~~~~~~~
-
-.. _dsintro.column_types:
-
-The four main types stored in pandas objects are float, int, boolean, and
-object. A convenient ``dtypes`` attribute return a Series with the data type of
-each column:
-
-.. ipython:: python
-
-   baseball.dtypes
-
-The related method ``get_dtype_counts`` will return the number of columns of
-each type:
-
-.. ipython:: python
-
-   baseball.get_dtype_counts()
-
 DataFrame column attribute access and IPython completion
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
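
Note on the dsintro.rst text above ("Upcasting is always according to the numpy rules"): the sketch below shows how those rules can be inspected directly and how they surface in DataFrame arithmetic. It assumes a recent pandas and NumPy; the frame names ``df32`` and ``df64`` are illustrative only.

```python
import numpy as np
import pandas as pd

# NumPy's promotion rules pick the more general of two dtypes.
f32_f64 = str(np.promote_types(np.float32, np.float64))
u8_f16 = str(np.promote_types(np.uint8, np.float16))

# The same rule shows up when two frames with different float widths are added.
df32 = pd.DataFrame({"A": np.array([1.0, 2.0], dtype="float32")})
df64 = pd.DataFrame({"A": np.array([0.5, 0.5], dtype="float64")})
same_width = str((df32 + df32).dtypes["A"])   # both operands are float32
mixed_width = str((df32 + df64).dtypes["A"])  # float32 combined with float64

# Flattening a mixed frame to one array needs a single dtype that can
# hold every column (the lowest common denominator).
mixed = pd.DataFrame({"narrow": df32["A"], "wide": df64["A"]})
flat_dtype = str(mixed.to_numpy().dtype)
print(f32_f64, u8_f16, same_width, mixed_width, flat_dtype)
```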

diff --git a/doc/source/indexing.rst b/doc/source/indexing.rst
@@ -304,6 +304,34 @@ so that the original data can be modified without creating a copy:
 
    df.mask(df >= 0)
 
+Upcasting Gotchas
+~~~~~~~~~~~~~~~~~
+
+Performing indexing operations on ``integer`` type data can easily upcast the data to ``floating``.
+The dtype of the input data will be preserved in cases where ``nans`` are not introduced (coming soon).
+
+.. ipython:: python
+
+   dfi = df.astype('int32')
+   dfi['E'] = 1
+   dfi
+   dfi.dtypes
+
+   casted = dfi[dfi>0]
+   casted
+   casted.dtypes
+
+Float dtypes, by contrast, are unchanged.
+
+.. ipython:: python
+
+   df2 = df.copy()
+   df2['A'] = df2['A'].astype('float32')
+   df2.dtypes
+
+   casted = df2[df2>0]
+   casted
+   casted.dtypes
 
 Take Methods
 ~~~~~~~~~~~~
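
Note on the "Upcasting Gotchas" section above: a compact way to see the behaviour is to mask an integer frame so that some cells become missing, then repeat with a float frame. This is a sketch assuming a recent pandas; the sample values and the names ``masked_int`` and ``masked_float`` are made up for illustration.

```python
import pandas as pd

# Integer frame: boolean indexing replaces non-matching cells with NaN,
# which an integer dtype cannot hold, so the columns are upcast to float.
dfi = pd.DataFrame({"A": [1, -2, 3], "B": [4, 5, -6]}, dtype="int32")
masked_int = dfi[dfi > 0]
int_result = str(masked_int.dtypes["A"])
missing_cells = int(masked_int.isna().sum().sum())

# Float frame: NaN already fits, so the float32 dtype is kept.
dff = dfi.astype("float32")
masked_float = dff[dff > 0]
float_result = str(masked_float.dtypes["A"])

print(int_result, missing_cells, float_result)
```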

diff --git a/doc/source/v0.10.2.txt b/doc/source/v0.10.2.txt
@@ -0,0 +1,95 @@
+.. _whatsnew_0102:
+
+v0.10.2 (February ??, 2013)
+---------------------------
+
+This is a minor release from 0.10.1 and includes many new features and
+enhancements along with a large number of bug fixes. There are also a number of
+important API changes that long-time pandas users should pay close attention
+to.
+
+API changes
+~~~~~~~~~~~
+
+Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the ``dtype`` keyword, a passed ``ndarray``, or a passed ``Series``), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will **NOT** be combined. The following example will give you a taste.
+
+**Dtype Specification**
+
+.. ipython:: python
+
+   df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
+   df1
+   df1.dtypes
+   df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'), B = Series(randn(8)), C = Series(randn(8),dtype='uint8') ))
+   df2
+   df2.dtypes
+
+   # here you get some upcasting
+   df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
+   df3
+   df3.dtypes
+
+**Dtype conversion**
+
+.. ipython:: python
+
+   # this is lowest-common-denominator upcasting (meaning you get the dtype which can accommodate all of the types)
+   df3.values.dtype
+
+   # conversion of dtypes
+   df3.astype('float32').dtypes
+
+   # mixed type conversions
+   df3['D'] = '1.'
+   df3['E'] = '1'
+   df3.convert_objects(convert_numeric=True).dtypes
+
+   # same, but specific dtype conversion
+   df3['D'] = df3['D'].astype('float16')
+   df3['E'] = df3['E'].astype('int32')
+   df3.dtypes
+
+   # forcing date coercion
+   s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1, 
+               Timestamp('20010104'), '20010105'],dtype='O')
+   s.convert_objects(convert_dates='coerce')
+
+**Upcasting Gotchas**
+
+Performing indexing operations on integer type data can easily upcast the data.
+The dtype of the input data will be preserved in cases where ``nans`` are not introduced (coming soon).
+
+.. ipython:: python
+
+   dfi = df3.astype('int32')
+   dfi['D'] = dfi['D'].astype('int64')
+   dfi
+   dfi.dtypes
+
+   casted = dfi[dfi>0]
+   casted
+   casted.dtypes
+
+Float dtypes, by contrast, are unchanged.
+
+.. ipython:: python
+
+   df4 = df3.copy()
+   df4['A'] = df4['A'].astype('float32')
+   df4.dtypes
+
+   casted = df4[df4>0]
+   casted
+   casted.dtypes
+
+New features
+~~~~~~~~~~~~
+
+**Enhancements**
+
+**Bug Fixes**
+
+See the `full release notes
+<https://github.com/pydata/pandas/blob/master/RELEASE.rst>`__ or issue tracker
+on GitHub for a complete list.
+
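
Note on the dtype conversion examples in v0.10.2.txt above: the sketch below checks that ``astype`` hands back a copy and illustrates numeric coercion of strings. It is an assumption-laden sketch: it targets a recent pandas, where ``convert_objects`` may not be available, so ``pd.to_numeric(..., errors="coerce")`` is swapped in as a stand-in for ``convert_numeric=True``. That substitution is not part of this commit.

```python
import pandas as pd

# astype returns a new object; editing the result leaves the source alone.
original = pd.Series([1.0, 2.0, 3.0])
converted = original.astype("float32")
converted.iloc[0] = 99.0
original_first = float(original.iloc[0])
converted_dtype = str(converted.dtype)

# Stand-in for convert_numeric=True (assumption: recent pandas):
# parseable strings become numbers, everything else becomes NaN.
raw = pd.Series(["1.", "1", "foo"], dtype=object)
coerced = pd.to_numeric(raw, errors="coerce")
coerced_dtype = str(coerced.dtype)
coerced_missing = int(coerced.isna().sum())

print(original_first, converted_dtype, coerced_dtype, coerced_missing)
```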
diff --git a/doc/source/whatsnew.rst b/doc/source/whatsnew.rst
@@ -16,6 +16,8 @@ What's New
 
 These are new features and improvements of note in each release.
 
+.. include:: v0.10.2.txt
+
 .. include:: v0.10.1.txt
 
 .. include:: v0.10.0.txt