ENH: allow propagation and coexistence of numeric dtypes (closes GH pandas-dev#622)

     construction of multi numeric dtypes with other types in a dict
     validated get_numeric_data returns correct dtypes
     added blocks attribute (and as_blocks() method) to DataFrame, which returns a dict of dtype -> homogeneous Frame
     added keyword 'raise_on_error' to astype, which can be set to False to exclude non-numeric columns
     fixed merging to correctly merge on multiple dtypes with blocks (e.g. float64 and float32 in other merger)
     changed implementation of get_dtype_counts() to use .blocks
     revised DataFrame.convert_objects to use blocks to be more efficient
     added Dtype printing to show on default with a Series
     added convert_dates='coerce' option to convert_objects, to force conversions to datetime64[ns]
     where can upcast integer to float as needed (on inplace ops pandas-dev#2793)
     added fully cythonized support for int8/int16
     no support for float16 (it can exist, but no cython methods for it)

TST: fixed test in test_from_records_sequencelike (dict orders can be different on different arch!)
       NOTE: using tuples will remove dtype info from the input stream (using a record array is ok though!)
     test updates for merging (multi-dtypes)
     added tests for replace (but skipped for now, algos not set for float32/16)
     tests for astype and convert in internals
     fixes for test_excel on 32-bit
     fixed test_resample_median_bug_1688 I believe
     separated out test_from_records_dictlike
     testing of panel constructors (GH pandas-dev#797)
     where ops now have a full test suite
     allow slightly less sensitive decimal tests for less precise dtypes

BUG: fixed GH pandas-dev#2778, fillna on empty frame causes seg fault
     fixed bug in groupby where types were not being cast back to the original dtype
     respect the dtype of non-natural numeric (Decimal)
     don't upcast ints/bools to floats (e.g. if you are aggregating on len, you can get an int)
DOC: added astype conversion examples to whatsnew and docs (dsintro)
     updated RELEASE notes
     whatsnew for 0.10.2
     added upcasting gotchas docs

CLN: updated convert_objects to be more consistent across frame/series
     moved most groupby functions out of algos.pyx to generated.pyx
     fully support cython functions for pad/bfill/take/diff/groupby for float32
     moved more block-like conversion loops from frame.py to internals.py (created apply method)
       (e.g. diff,fillna,where,shift,replace,interpolate,combining), to top-level methods in BlockManager
jreback committed Feb 8, 2013
1 parent 3ba3119 commit 166a80d
Showing 37 changed files with 9,634 additions and 3,178 deletions.
39 changes: 38 additions & 1 deletion RELEASE.rst
@@ -22,6 +22,42 @@ Where to get it
* Binary installers on PyPI: http://pypi.python.org/pypi/pandas
* Documentation: http://pandas.pydata.org

pandas 0.10.2
=============

**Release date:** 2013-??-??

**New features**

- Allow mixed dtypes (e.g. ``float32/float64/int32/int16/int8``) to coexist in DataFrames and propagate in operations

**Improvements to existing features**

- added ``blocks`` attribute to DataFrames, which returns a dict mapping each dtype to a homogeneously dtyped DataFrame
- added keyword ``convert_numeric`` to ``convert_objects()`` to try to convert object dtypes to numeric types
- ``convert_dates`` in ``convert_objects`` can now be ``coerce``, which will return a ``datetime64[ns]`` dtype
  with non-convertibles set as ``NaT``; will preserve an all-nan object (e.g. strings)
- Series print output now includes the dtype by default
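
As a rough illustration of what a dict of dtype to homogeneously dtyped frame looks like, the sketch below rebuilds a similar mapping by hand. It uses only public calls from a current pandas install; the helper name ``frames_by_dtype`` is hypothetical, and this is not the implementation added in this commit.

```python
import numpy as np
import pandas as pd


def frames_by_dtype(df):
    """Group the columns of ``df`` by dtype (hypothetical helper, not pandas API)."""
    columns_for = {}
    for column, dtype in df.dtypes.items():
        # Key on the dtype name so the result is easy to inspect and print.
        columns_for.setdefault(str(dtype), []).append(column)
    # Each value is a sub-frame whose columns all share one dtype.
    return {name: df[columns] for name, columns in columns_for.items()}


df = pd.DataFrame({
    "a": np.array([1.0, 2.0], dtype="float32"),
    "b": np.array([3.0, 4.0], dtype="float64"),
    "c": np.array([5, 6], dtype="int8"),
    "d": np.array([7.0, 8.0], dtype="float32"),
})

by_dtype = frames_by_dtype(df)
for name, frame in by_dtype.items():
    print(name, list(frame.columns))
```

Keying on the dtype name keeps the two ``float32`` columns together and separate from the ``float64`` column, which is the coexistence this release is about.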

**API Changes**

- Do not automatically upcast numeric specified dtypes to ``int64`` or ``float64`` (GH622_ and GH797_)
- Guarantee that ``convert_objects()`` for Series/DataFrame always returns a copy
- groupby operations will respect dtypes for numeric float operations (float32/float64); other types will be operated on,
and will try to cast back to the input dtype (e.g. if an int is passed, as long as the output doesn't have nans,
then an int will be returned)
- backfill/pad/take/diff/ohlc will now support ``float32/int16/int8`` operations
- Integer block types will upcast as needed in where operations (GH2793_)
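
The first API change can be checked with a small sketch like the one below, which builds frames in the three ways the documentation names (``dtype`` keyword, a passed ``ndarray``, a passed ``Series``) and prints the resulting dtypes. The variable names are illustrative, and the code assumes a current pandas install rather than 0.10.2 itself.

```python
import numpy as np
import pandas as pd

# dtype given directly via the ``dtype`` keyword
from_keyword = pd.DataFrame(np.random.randn(4, 1), columns=["A"], dtype="float32")

# dtype carried by a passed ndarray
from_ndarray = pd.DataFrame(
    np.arange(6, dtype="int16").reshape(3, 2), columns=["x", "y"]
)

# dtypes carried by passed Series, one per column
from_series = pd.DataFrame({
    "i8": pd.Series([1, 2, 3], dtype="int8"),
    "f32": pd.Series([1.0, 2.0, 3.0], dtype="float32"),
    "f64": pd.Series([1.0, 2.0, 3.0]),
})

print(from_keyword.dtypes)
print(from_ndarray.dtypes)
print(from_series.dtypes)
```

None of the columns is widened to ``int64`` or ``float64`` unless it was created that way.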

**Bug Fixes**

- Fix seg fault on empty data frame when fillna with ``pad`` or ``backfill`` (GH2778_)

.. _GH622: https://github.com/pydata/pandas/issues/622
.. _GH797: https://github.com/pydata/pandas/issues/797
.. _GH2778: https://github.com/pydata/pandas/issues/2778
.. _GH2793: https://github.com/pydata/pandas/issues/2793

pandas 0.10.1
=============

@@ -36,6 +72,7 @@ pandas 0.10.1
- Restored inplace=True behavior returning self (same object) with
deprecation warning until 0.11 (GH1893_)
- ``HDFStore``

- refactored HDFStore to deal with non-table stores as objects, will allow future enhancements
- removed keyword ``compression`` from ``put`` (replaced by keyword
``complib`` to be consistent across library)
@@ -49,7 +86,7 @@ pandas 0.10.1
- support data column indexing and selection, via ``data_columns`` keyword in append
- support write chunking to reduce memory footprint, via ``chunksize``
keyword to append
- support automagic indexing via ``index`` keywork to append
- support automagic indexing via ``index`` keyword to append
- support ``expectedrows`` keyword in append to inform ``PyTables`` about
the expected tablesize
- support ``start`` and ``stop`` keywords in select to limit the row
114 changes: 90 additions & 24 deletions doc/source/dsintro.rst
@@ -450,15 +450,101 @@ DataFrame:
df.xs('b')
df.ix[2]
Note if a DataFrame contains columns of multiple dtypes, the dtype of the row
will be chosen to accommodate all of the data types (dtype=object is the most
general).

For a more exhaustive treatment of more sophisticated label-based indexing and
slicing, see the :ref:`section on indexing <indexing>`. We will address the
fundamentals of reindexing / conforming to new sets of labels in the
:ref:`section on reindexing <basics.reindexing>`.

DataTypes
~~~~~~~~~

.. _dsintro.column_types:

The main types stored in pandas objects are float, int, boolean, datetime64[ns],
and object. A convenient ``dtypes`` attribute returns a Series with the data type of
each column.

.. ipython:: python

   df['integer'] = 1
   df['int32'] = df['integer'].astype('int32')
   df['float32'] = Series([1.0]*len(df),dtype='float32')
   df['timestamp'] = Timestamp('20010102')
   df.dtypes

If a DataFrame contains columns of multiple dtypes, the dtype of the column
will be chosen to accommodate all of the data types (dtype=object is the most
general).

The related method ``get_dtype_counts`` will return the number of columns of
each type:

.. ipython:: python

   df.get_dtype_counts()

Numeric dtypes will propagate and can coexist in DataFrames (starting in v0.10.2).
If a dtype is passed (either directly via the ``dtype`` keyword, a passed ``ndarray``,
or a passed ``Series``), then it will be preserved in DataFrame operations. Furthermore,
different numeric dtypes will **NOT** be combined. The following example will give you a taste.

.. ipython:: python

   df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
   df1
   df1.dtypes
   df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'),
                         B = Series(randn(8)),
                         C = Series(np.array(randn(8),dtype='uint8')) ))
   df2
   df2.dtypes

   # here you get some upcasting
   df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
   df3
   df3.dtypes

   # this is lowest-common-denominator upcasting (meaning you get the dtype which can accommodate all of the types)
   df3.values.dtype

Upcasting is always according to the **numpy** rules. If two different dtypes are involved in an operation, then the more *general* one will be used as the result of the operation.
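
To make the "more general dtype" rule concrete, the sketch below asks NumPy directly which dtype two inputs promote to, then shows the same outcome when two Series of different float widths are added. ``np.promote_types`` is a standard NumPy function; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Promotion between two dtypes, as NumPy defines it.
float_pair = np.promote_types("float32", "float64")
half_pair = np.promote_types("float16", "float32")
int_float_pair = np.promote_types("int32", "float32")
print(float_pair, half_pair, int_float_pair)

# The same rule applies when two Series of different dtypes are combined.
narrow = pd.Series([1.0, 2.0], dtype="float32")
wide = pd.Series([0.5, 0.5], dtype="float64")
total = narrow + wide
print(total.dtype)
```

Note that ``int32`` with ``float32`` promotes to ``float64``, since ``float32`` cannot represent every ``int32`` value exactly.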

DataType Conversion
~~~~~~~~~~~~~~~~~~~

You can use the ``astype`` method to convert dtypes from one to another. It *always* returns a copy.
In addition, ``convert_objects`` will attempt a *soft* conversion of any *object* dtypes, meaning that if all the objects in a Series are of the same type, the Series
will have that dtype.
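
The copy semantics of ``astype`` can be seen in a short sketch: the converted Series is modified and the original is checked afterwards. The variable names are illustrative and the code assumes a current pandas install.

```python
import numpy as np
import pandas as pd

original = pd.Series([1.0, 2.0, 3.0], dtype="float64")
converted = original.astype("float32")

# Changing the converted Series leaves the original untouched.
converted.iloc[0] = 99.0

print(original.tolist())
print(converted.dtype)
```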

.. ipython:: python

   df3
   df3.dtypes

   # conversion of dtypes
   df3.astype('float32').dtypes

To force conversion to numeric types, pass ``convert_numeric = True``.
This will force strings and numbers alike to be numbers if possible; otherwise they will be set to ``np.nan``.
To force conversion to ``datetime64[ns]``, pass ``convert_dates = 'coerce'``.
This will convert any datetime-like object to dates, forcing other values to ``NaT``.

.. ipython:: python

   # mixed type conversions
   df3['D'] = '1.'
   df3['E'] = '1'
   df3.convert_objects(convert_numeric=True).dtypes

   # same, but specific dtype conversion
   df3['D'] = df3['D'].astype('float16')
   df3['E'] = df3['E'].astype('int32')
   df3.dtypes

   # forcing date coercion
   s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1, Timestamp('20010104'), '20010105'],dtype='O')
   s
   s.convert_objects(convert_dates='coerce')
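
``convert_objects`` is the API of this release and is not available in recent pandas versions. For readers on a newer install, the same kind of forced coercion can be sketched with ``pd.to_numeric`` and ``pd.to_datetime``. The example below assumes such a newer install and is not part of this commit.

```python
import pandas as pd

raw_numbers = pd.Series(["1.5", "2", "foo"], dtype="object")
raw_dates = pd.Series(["2001-01-04", "foo", "2001-01-05"], dtype="object")

# Values that cannot be parsed become NaN / NaT instead of raising.
numbers = pd.to_numeric(raw_numbers, errors="coerce")
dates = pd.to_datetime(raw_dates, errors="coerce")

print(numbers)
print(dates)
```

In both cases the unparseable ``'foo'`` entry is replaced by a missing value, mirroring the ``np.nan`` and ``NaT`` behaviour described above.
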
Data alignment and arithmetic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand Down Expand Up @@ -633,26 +719,6 @@ You can also disable this feature via the ``expand_frame_repr`` option:
reset_option('expand_frame_repr')
DataFrame column types
~~~~~~~~~~~~~~~~~~~~~~

.. _dsintro.column_types:

The four main types stored in pandas objects are float, int, boolean, and
object. A convenient ``dtypes`` attribute return a Series with the data type of
each column:

.. ipython:: python
baseball.dtypes
The related method ``get_dtype_counts`` will return the number of columns of
each type:

.. ipython:: python
baseball.get_dtype_counts()
DataFrame column attribute access and IPython completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

28 changes: 28 additions & 0 deletions doc/source/indexing.rst
@@ -304,6 +304,34 @@ so that the original data can be modified without creating a copy:
df.mask(df >= 0)
Upcasting Gotchas
~~~~~~~~~~~~~~~~~

Performing indexing operations on ``integer`` type data can easily upcast the data to ``floating``.
The dtype of the input data will be preserved in cases where ``nans`` are not introduced (coming soon).

.. ipython:: python

   dfi = df.astype('int32')
   dfi['E'] = 1
   dfi
   dfi.dtypes

   casted = dfi[dfi>0]
   casted
   casted.dtypes

While float dtypes are unchanged.

.. ipython:: python

   df2 = df.copy()
   df2['A'] = df2['A'].astype('float32')
   df2.dtypes

   casted = df2[df2>0]
   casted
   casted.dtypes
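
The two cases above depend on a ``df`` defined earlier in the document. A self-contained sketch of the same gotcha, with illustrative names and assuming a current pandas install, might look like this:

```python
import numpy as np
import pandas as pd

ints = pd.DataFrame({"A": np.array([-1, 2, 3], dtype="int32")})
floats = pd.DataFrame({"A": np.array([-1.0, 2.0, 3.0], dtype="float32")})

# Boolean-frame indexing replaces non-matching cells with NaN.
masked_ints = ints[ints > 0]
masked_floats = floats[floats > 0]

print(masked_ints.dtypes)    # the int32 column had to become a float dtype
print(masked_floats.dtypes)  # the float32 column can hold NaN as-is
```

An integer column has no way to represent the introduced ``NaN``, so it is upcast; a float column already can, so its dtype is kept.
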
Take Methods
~~~~~~~~~~~~
95 changes: 95 additions & 0 deletions doc/source/v0.10.2.txt
@@ -0,0 +1,95 @@
.. _whatsnew_0102:

v0.10.2 (February ??, 2013)
---------------------------

This is a minor release from 0.10.1 and includes many new features and
enhancements along with a large number of bug fixes. There are also a number of
important API changes that long-time pandas users should pay close attention
to.

API changes
~~~~~~~~~~~

Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the ``dtype`` keyword, a passed ``ndarray``, or a passed ``Series``), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will **NOT** be combined. The following example will give you a taste.

**Dtype Specification**

.. ipython:: python

   df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
   df1
   df1.dtypes
   df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'), B = Series(randn(8)), C = Series(randn(8),dtype='uint8') ))
   df2
   df2.dtypes

   # here you get some upcasting
   df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
   df3
   df3.dtypes

**Dtype conversion**

.. ipython:: python

   # this is lowest-common-denominator upcasting (meaning you get the dtype which can accommodate all of the types)
   df3.values.dtype

   # conversion of dtypes
   df3.astype('float32').dtypes

   # mixed type conversions
   df3['D'] = '1.'
   df3['E'] = '1'
   df3.convert_objects(convert_numeric=True).dtypes

   # same, but specific dtype conversion
   df3['D'] = df3['D'].astype('float16')
   df3['E'] = df3['E'].astype('int32')
   df3.dtypes

   # forcing date coercion
   s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1,
               Timestamp('20010104'), '20010105'],dtype='O')
   s.convert_objects(convert_dates='coerce')

**Upcasting Gotchas**

Performing indexing operations on integer type data can easily upcast the data.
The dtype of the input data will be preserved in cases where ``nans`` are not introduced (coming soon).

.. ipython:: python

   dfi = df3.astype('int32')
   dfi['D'] = dfi['D'].astype('int64')
   dfi
   dfi.dtypes

   casted = dfi[dfi>0]
   casted
   casted.dtypes

While float dtypes are unchanged.

.. ipython:: python

   df4 = df3.copy()
   df4['A'] = df4['A'].astype('float32')
   df4.dtypes

   casted = df4[df4>0]
   casted
   casted.dtypes

New features
~~~~~~~~~~~~

**Enhancements**

**Bug Fixes**

See the `full release notes
<https://github.com/pydata/pandas/blob/master/RELEASE.rst>`__ or issue tracker
on GitHub for a complete list.

2 changes: 2 additions & 0 deletions doc/source/whatsnew.rst
@@ -16,6 +16,8 @@ What's New

These are new features and improvements of note in each release.

.. include:: v0.10.2.txt

.. include:: v0.10.1.txt

.. include:: v0.10.0.txt