Merge commit 'v0.4.1' into debian
* commit 'v0.4.1': (53 commits)
  RLS: Version 0.4.1
  BUG: use int64
  BUG: reverted Series constructor NumPy < 1.6 bug
  TST: wrap up test coverage
  TST: test coverage, minor refactoring
  TST: test coverage and minor bugfix in NDFrame.swaplevel
  DOC: documented reading CSV/table into MultiIndex, address GH pandas-dev#165
  DOC: documented swaplevel, address GH pandas-dev#150
  ENH: better JR join function
  ENH: add join panel function for testing and later integration
  BUG: do not allow appending with different item order
  ENH: don't raise exception when calling remove on non-existent node
  ENH: tinkering with other join impl
  ENH: speed up assert_almost_equal
  BUG: DateRange.copy did not produce well-formed object. fixes GH pandas-dev#168
  DOC: update release notes
  BUG: count_level did not handle zero-length data case, caused segfault with NumPy < 1.6 for some. Fixes GH pandas-dev#169
  ENH: sped up inner/outer_join_indexer cython functions
  ENH: don't boundscheck or wraparound
  ENH: bug fixes, speed enh, benchmark suite to compare with xts
  ...
yarikoptic committed Sep 26, 2011
2 parents 645d611 + cdc607c commit a1ae6f2
Showing 37 changed files with 1,720 additions and 227 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -6,6 +6,7 @@ MANIFEST
*.pyd
pandas/src/tseries.c
pandas/src/sparse.c
pandas/version.py
doc/source/generated
*flymake*
scikits
99 changes: 84 additions & 15 deletions RELEASE.rst
@@ -1,10 +1,83 @@
========================
pandas 0.4 Release Notes
========================
=============
Release Notes
=============

What is it
This is the list of changes to pandas between each release. For full details,
see the commit logs at http://github.com/wesm/pandas


pandas 0.4.1
============

**Release date:** Not yet released

This is primarily a bug-fix release, but it also includes some new features and
improvements.

**New features / modules**

- Added new `DataFrame` methods `get_dtype_counts` and property `dtypes`
- Setting of values using ``.ix`` indexing attribute in mixed-type DataFrame
objects has been implemented (fixes GH #135)
- `read_csv` can read multiple columns into a `MultiIndex`. DataFrame's
`to_csv` method will properly write out a `MultiIndex` which can be read
back (PR #151, thanks to Skipper Seabold)
- Wrote fast time series merging / joining methods in Cython. Will be
integrated later into DataFrame.join and related functions
- Added `ignore_index` option to `DataFrame.append` for combining unindexed
records stored in a DataFrame
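
The multi-column index feature noted above can be sketched with the modern
pandas API (the sample data below is made up for illustration; only the
`read_csv`/`to_csv` behaviour comes from the release note):

```python
# Sketch of the MultiIndex round-trip described above, using the modern
# pandas API; the sample data here is hypothetical.
import io

import pandas as pd

csv_text = (
    "year,indiv,zit,xit\n"
    "1977,A,1.2,0.6\n"
    "1977,B,1.5,0.5\n"
    "1978,A,0.2,0.06\n"
)

# index_col takes a list of column positions to build a MultiIndex
df = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])

# to_csv writes the MultiIndex out so the frame can be read back unchanged
buf = io.StringIO()
df.to_csv(buf)
round_tripped = pd.read_csv(io.StringIO(buf.getvalue()), index_col=[0, 1])
```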

**Improvements to existing features**

- Some speed enhancements in the internal Index type-checking function
- `DataFrame.rename` has a new `copy` parameter which, when set to `False`,
  renames a DataFrame in place
- Enable unstacking by level name (PR #142)
- Enable sortlevel to work by level name (PR #141)
- `read_csv` can automatically "sniff" other kinds of delimiters using
`csv.Sniffer` (PR #146)
- Improved speed of unit test suite by about 40%
- An exception is no longer raised when calling `HDFStore.remove` on a
  non-existent node with a where clause
- Optimized `_ensure_index` function resulting in performance savings in
type-checking Index objects
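
The delimiter "sniffing" improvement above relies on the standard library;
a minimal sketch of the underlying stdlib mechanism:

```python
# csv.Sniffer inspects a text sample and guesses the dialect, including
# the delimiter - the mechanism the release note says read_csv now uses.
import csv

sample = "a|b|c\n1|2|3\n4|5|6\n"
dialect = csv.Sniffer().sniff(sample)
# the sniffed dialect can then be handed to an ordinary csv.reader
rows = list(csv.reader(sample.splitlines(), dialect))
```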

**Bug fixes**

- Fixed DataFrame constructor bug causing downstream problems (e.g. .copy()
failing) when passing a Series as the values along with a column name and
index
- Fixed single-key groupby on DataFrame with as_index=False (GH #160)
- `Series.shift` was failing on integer Series (GH #154)
- `unstack` methods were producing incorrect output in the case of duplicate
hierarchical labels. An exception will now be raised (GH #147)
- Calling `count` with level argument caused reduceat failure or segfault in
earlier NumPy (GH #169)
- Fixed `DataFrame.corrwith` to automatically exclude non-numeric data (GH
#144)
- Unicode handling bug fixes in `DataFrame.to_string` (GH #138)
- Excluded a degenerate OLS unit test case that was causing a
  platform-specific failure (GH #149)
- Skip blosc-dependent unit tests for PyTables < 2.2 (PR #137)
- Calling `copy` on `DateRange` did not copy over attributes to the new object
(GH #168)
- Fixed bug in `HDFStore` in which Panel data could be appended to a Table
  with a different item order, resulting in incorrect data when read back

Thanks
------
- Yaroslav Halchenko
- Jeff Reback
- Skipper Seabold
- Dan Lovell
- Nick Pentreath

pandas 0.4
==========

What is it
----------

**pandas** is a library of powerful labeled-axis data structures, statistical
tools, and general code for working with relational data sets, including time
series and cross-sectional data. It was designed with the practical needs of
@@ -13,14 +13,86 @@ particularly well suited for, among other things, financial data analysis
applications.

Where to get it
===============
---------------

Source code: http://github.com/wesm/pandas
Binary installers on PyPI: http://pypi.python.org/pypi/pandas
Documentation: http://pandas.sourceforge.net

Release notes
=============
-------------

**Release date:** 9/12/2011

@@ -279,12 +352,8 @@ Thanks
- Skipper Seabold
- Chris Jordan-Squire

========================
pandas 0.3 Release Notes
========================

Release Notes
=============
pandas 0.3
==========

This major release of pandas represents approximately 1 year of continuous
development work and brings with it many new features, bug fixes, speed
@@ -293,22 +362,22 @@ change from the 0.2 release has been the completion of a rigorous unit test
suite covering all of the core functionality.

What is it
==========
----------

**pandas** is a library of labeled data structures, statistical models, and
general code for working with time series and cross-sectional data. It was
designed with the practical needs of statistical modeling and large,
inhomogeneous data sets in mind.

Where to get it
===============
---------------

Source code: http://github.com/wesm/pandas
Binary installers on PyPI: http://pypi.python.org/pypi/pandas
Documentation: http://pandas.sourceforge.net

Release notes
=============
-------------

**Release date:** February 20, 2011

77 changes: 77 additions & 0 deletions bench/bench_join_panel.py
@@ -0,0 +1,77 @@
# reasonably efficient

def create_panels_append(cls, panels):
    """Append a list of panels into a single panel."""
    panels = [a for a in panels if a is not None]
    # corner cases
    if len(panels) == 0:
        return None
    elif len(panels) == 1:
        return panels[0]
    elif len(panels) == 2 and panels[0] == panels[1]:
        return panels[0]

    # create a joint index for the axis
    def joint_index_for_axis(panels, axis):
        s = set()
        for p in panels:
            s.update(list(getattr(p, axis)))
        return sorted(list(s))

    def reindex_on_axis(panels, axis, axis_reindex):
        new_axis = joint_index_for_axis(panels, axis)
        new_panels = [p.reindex(**{axis_reindex: new_axis, 'copy': False})
                      for p in panels]
        return new_panels, new_axis

    # create the joint major index; don't reindex the sub-panels - we are appending
    major = joint_index_for_axis(panels, 'major_axis')
    # reindex on the minor axis
    panels, minor = reindex_on_axis(panels, 'minor_axis', 'minor')
    # reindex on items
    panels, items = reindex_on_axis(panels, 'items', 'items')
    # concatenate values along the major axis
    try:
        values = np.concatenate([p.values for p in panels], axis=1)
    except Exception, detail:
        raise Exception("cannot append values that don't match dimensions! -> [%s] %s"
                        % (','.join(["%s" % p for p in panels]), str(detail)))
    return Panel(values, items=items, major_axis=major, minor_axis=minor)



# does the job but is inefficient (better to handle it the way a table is
# read in PyTables, e.g. create a LongPanel and then convert it to wide format)

def create_panels_join(cls, panels):
    """Given a list of panels, create a single panel."""
    panels = [a for a in panels if a is not None]
    # corner cases
    if len(panels) == 0:
        return None
    elif len(panels) == 1:
        return panels[0]
    elif len(panels) == 2 and panels[0] == panels[1]:
        return panels[0]
    d = dict()
    minor, major, items = set(), set(), set()
    for panel in panels:
        items.update(panel.items)
        major.update(panel.major_axis)
        minor.update(panel.minor_axis)
        values = panel.values
        for item, item_index in panel.items.indexMap.items():
            for minor_i, minor_index in panel.minor_axis.indexMap.items():
                for major_i, major_index in panel.major_axis.indexMap.items():
                    try:
                        d[(minor_i, major_i, item)] = values[item_index, major_index, minor_index]
                    except Exception:
                        pass
    # stack the values
    minor = sorted(list(minor))
    major = sorted(list(major))
    items = sorted(list(items))
    # create the 3d stack (items x columns x indices)
    data = np.dstack([np.asarray([np.asarray([d.get((minor_i, major_i, item), np.nan)
                                              for item in items])
                                  for major_i in major]).transpose()
                      for minor_i in minor])
    # construct the panel
    return Panel(data, items, major, minor)

add_class_method(Panel, create_panels_join, 'join_many')

52 changes: 52 additions & 0 deletions bench/bench_take_indexing.py
@@ -0,0 +1,52 @@
import numpy as np

from pandas import *
import pandas._tseries as lib

import timeit

setup = """
from pandas import Series
import pandas._tseries as lib
import random
import numpy as np
n = %d
k = %d
arr = np.random.randn(n, k)
indexer = np.arange(n, dtype=np.int32)
indexer = indexer[::-1]
"""

sizes = [100, 1000, 10000, 100000]
iters = [1000, 1000, 100, 1]

fancy_2d = []
take_2d = []
cython_2d = []

def _timeit(stmt, size, k=5, iters=1000):
    # time `stmt` for the given array size, averaging over `iters` runs
    timer = timeit.Timer(stmt=stmt, setup=setup % (size, k))
    return timer.timeit(iters) / iters

for sz, its in zip(sizes, iters):
    print sz
    fancy_2d.append(_timeit('arr[indexer]', sz, iters=its))
    take_2d.append(_timeit('arr.take(indexer, axis=0)', sz, iters=its))
    cython_2d.append(_timeit('lib.take_axis0(arr, indexer)', sz, iters=its))

df = DataFrame({'fancy': fancy_2d,
                'take': take_2d,
                'cython': cython_2d})

print df

from pandas.rpy.common import r
r('mat <- matrix(rnorm(50000), nrow=10000, ncol=5)')
r('set.seed(12345)')
r('indexer <- sample(1:10000)')
r('mat[indexer,]')
16 changes: 16 additions & 0 deletions doc/data/mindex_ex.csv
@@ -0,0 +1,16 @@
year,indiv,zit,xit
1977,"A",1.2,.6
1977,"B",1.5,.5
1977,"C",1.7,.8
1978,"A",.2,.06
1978,"B",.7,.2
1978,"C",.8,.3
1978,"D",.9,.5
1978,"E",1.4,.9
1979,"C",.2,.15
1979,"D",.14,.05
1979,"E",.5,.15
1979,"F",1.2,.5
1979,"G",3.4,1.9
1979,"H",5.4,2.7
1979,"I",6.4,1.2
5 changes: 0 additions & 5 deletions doc/source/dsintro.rst
@@ -513,11 +513,6 @@ The API for insertion and deletion is the same as for DataFrame.
Indexing / Selection
~~~~~~~~~~~~~~~~~~~~

As of this writing, indexing with Panel is a bit more restrictive than in
DataFrame. Notably, :ref:`advanced indexing <indexing>` via the **ix** property
has not yet been integrated in Panel. This will be done, however, in a
future release.

.. csv-table::
:header: "Operation", "Syntax", "Result"
:widths: 30, 20, 10
19 changes: 13 additions & 6 deletions doc/source/indexing.rst
@@ -291,19 +291,16 @@ than integer locations. Therefore, advanced indexing with ``.ix`` will always
Setting values in mixed-type objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Setting values on a mixed-type DataFrame or Panel is not yet supported:
Setting values on a mixed-type DataFrame or Panel is supported when using scalar
values, though setting arbitrary vectors is not yet supported:

.. ipython:: python

   df2 = df[:4]
   df2['foo'] = 'bar'
   df2.ix[3]
   df2.ix[3] = np.nan
The reason it has not been implemented yet is simply due to difficulty of
implementation relative to its utility. Handling the full spectrum of
exceptional cases for setting values is trickier than getting values (which is
relatively straightforward).
df2
.. _indexing.hierarchical:

@@ -523,6 +520,16 @@ However:
>>> s.ix[('a', 'b'):('b', 'a')]
Exception: MultiIndex lexsort depth 1, key was length 2

Swapping levels with ``swaplevel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``swaplevel`` function can switch the order of two levels:

.. ipython:: python

   df[:5]
   df[:5].swaplevel(0, 1, axis=0)

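
As a standalone sketch (modern pandas API; the small series below is
hypothetical, not the `df` from the surrounding docs), `swaplevel`
exchanges two levels of a `MultiIndex`:

```python
# swaplevel reorders the levels of a MultiIndex without changing the data
import pandas as pd

idx = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)],
                                names=["outer", "inner"])
s = pd.Series([10, 20], index=idx)

swapped = s.swaplevel(0, 1)
# the same values are now keyed (inner, outer) instead of (outer, inner)
```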
The ``delevel`` DataFrame function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

18 changes: 18 additions & 0 deletions doc/source/io.rst
@@ -96,6 +96,24 @@ fragile. Type inference is a pretty big deal. So if a column can be coerced to
integer dtype without altering the contents, it will do so. Any non-numeric
columns will come through as object dtype as with the rest of pandas objects.

Reading DataFrame objects with ``MultiIndex``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppose you have data indexed by two columns:

.. ipython:: python

   print open('data/mindex_ex.csv').read()

The ``index_col`` argument to ``read_csv`` and ``read_table`` can take a list of
column numbers to turn multiple columns into a ``MultiIndex``:

.. ipython:: python

   df = read_csv("data/mindex_ex.csv", index_col=[0,1])
   df
   df.ix[1978]

Excel 2003 files
----------------
