Arguably better handling of input data in constructor for DataFrame (fix for #4297) #4317

lgautier · 2013-07-22T16:11:19Z

The problem was basically an odd restriction to list ,tuple, dict, or numpy.ndarray for data structure given as input to any constructor for DataFrame. Much cruder than a possible issue around buffer protocols I thought (the whole issue report went off on a tangent anyway ;-) ).

The source did not have much (any, actually) comments in that region, and I could not find unit tests documenting the behaviour, but I am guessing that the restriction stemmed from the fact that a string is a sequence in Python while the constructors should consider one string passed as data a scalar rather than a sequence.

The change made goes the other way around and accepts any sequence (not just tuple, list, or numpy.ndarray), except str or unicode (the later only applies to Python 2.x). bytes and bytearray are allowed (but I am not sure they should).

The pull request was tested (only tests/test_frame.py was run) on Python 2.7.4 and 3.3.1

The problem finally seem to be an odd restriction: acceptable sequences were list, tuple, or numpy.ndarray. A guess is that strings in Python wanted to be considered scalars, but the above restriction was implemented rather than take anything with a __len__ except str (or unicode for Python 2.x) in. Considerations about the use of the buffer interface are not needed for now. note: I cannot run tests on Python 2.7. ``` nosetests pandas ``` ends with: ``` from . import hashtable, tslib, lib ImportError: cannot import name hashtable ```

- Upper case for the first letter in exception messages

- Added unit test for new capabilities introduced with commit e6c3cd6 (ran only the unit tests for test_frame.py - `nosetests pandas` aborts with an import error)

lgautier · 2013-07-22T16:14:13Z

fix for #4297

jreback · 2013-07-22T16:24:16Z

ok...you will need to hook up to travis (see contributing.md) and push again

also...use com._is_sequence() instead of your direct try: except around len

lgautier · 2013-07-22T18:19:06Z

@jreback : I can track _is_sequence() to be in pandas/core/common.py:

def _is_sequence(x):
    try:
        iter(x)
        len(x) # it has a length
        return not isinstance(x, basestring) and True
    except Exception:
        return False

I think that there is a (silent) possible issue with the logic at the following line:

19dc2847 (Wes McKinney     2012-11-18 23:59:31 -0500 1618)         return not isinstance(x, basestring) and True

(and True does not bring anything).

lgautier · 2013-07-22T19:06:12Z

@jreback
Hook to Travis seems to be in place.

jreback · 2013-07-22T19:10:36Z

@lgautier

other thing is this is messing with the constructor, so need perf testing

.test_perf -b <base_commit> -t <last commit of this PR>

post (only care about > 1.15 ratio)...(some random things will appear, that's ok)

lgautier · 2013-07-22T20:04:46Z

@jreback

Isn't this leaning a little toward hairsplitting ?

The only effective code difference is:

-            elif isinstance(v, (list, tuple, np.ndarray)):
+            elif com._is_sequence(v):

the rest is comments, or editing the message string associated with raising exceptions when a problem occurs...

jreback · 2013-07-22T20:16:24Z

pandas/core/frame.py

@@ -5700,12 +5705,22 @@ def extract_index(data):
            elif isinstance(v, dict):
                have_dicts = True
                indexes.append(v.keys())
-            elif isinstance(v, (list, tuple, np.ndarray)):
+            elif com._is_sequence(v):


oh....Iyes....either way is prob fine

Unordered sequences were not allowed here previously, why the change?

Where was the test that the sequences are ordered ?

list, tuple, and ndarray objects are ordered sequences meaning if you put element X at index i it will remain there for the lifetime of the object (modulo any operations that you do to remove it or modify it). dict objects and Python Mapping objects cannot generally make the same guarantee (specific objects may enforce an ordering such as collections.OrderedDict). So, in the following code

def _is_sequence(x): try: iter(x) len(x) # it has a length return not isinstance(x, basestring) and True except Exception: return False

a dict object is both iterable and has a length. The orderedness is guaranteed by the type of objects that test True in the conditional that you changed.

I see. I had understood it as /sorted/ sequence.

While the original code seemingly (see below) guarantees that all items have an immutable order (when modifications of the order are not explicitly made), it does so by also excluding items satisfying this property. Similarly, your comment is leading to the observation that dict is not the only possible mapping. collections.OrderedDict would go as a "sequence", since mappings are accompanied by a warning. [note: I am unsure that the persistence of the structure is required to make an object ordered, which would mean that xrange (Python 2) and range (Python 3) would qualify)].

While I initially looked into this because of array.array, an sequence in the Python stdlib, I felt like generalization would be beneficial.

At the C level there is PySequence_Check() and PyMapping_Check(). Interestingly I don't think that there is an equivalent at the Python level... may be design decision made to encourage duck-typing. Anyway, while PyMapping_Check() is the most interesting source of information about what to check: https://github.com/python-git/python/blob/master/Objects/abstract.c#L2387 . PySequence_Check() is here: https://github.com/python-git/python/blob/master/Objects/abstract.c#L1806 there is no way to ensure that the property is satisfied.

I'd argue that the requirement for orderedness should be mentioned in the documention, and that extract_index() (should be called build_index(), really) relaxes its test for mapping and sequences in order to allow other objects with the right properties. The argument about guarantee does not hold, as shown right below (before the crowd starts brandishing torches and pitchforks, I'd like to stress that the argumentation is moving to the ground of proofs and certainties, and that the sole purpose of the example is to show that there is none - the real argumentation should be around practicality):

class MyWeirdList(list): def __getitem__(self, i): return super(MyWeirdList, self).__getitem__(0)

>>> l = MyWeirdList(range(5)) >>> l[3] 0 >>> isinstance(l, list) True

The reason for this in numpy-land that you need to know the size of the object ahead of time to efficiently allocate space for it. There's no standard, widely-used way for a generator to express this information.

Generators can have a __len__, as I wrote.

import sys if sys.version_info[0] == 2: range = xrange # make python 2 behave r = range(10) print(len(r)) # prints 10

If you pass a generator to pandas, it treats it just like numpy does, which is to say just considers it a Python object.

# using r as defined above import numpy print(numpy.array(r)) # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Moreover, if the rationale is to treat objects "just like numpy does" we shall agree that the current behaviour of the constructors for DataFrame is not accommodating enough.

That said, using collections isn't compatible with using the array module in 2.7, e.g. the following

I had no doubt about that.

@lgautier right, but (x for x in range(10)) would fail.

You can compute the space necessary for range/xrange arithmetically

For arbitrary generator expressions you cannot.

@cpcloud

You can compute the space necessary for range/xrange arithmetically

For arbitrary generator expressions you cannot.

We are very explicitly talking about generators with a __len__. How the length is calculated (implemented in __len__) is pretty much irrelevant.

On a related note, the question of the size of iterators motivated a PEP 424 and might be more frequent than one thinks.

@jtratner

right, but (x for x in range(10)) would fail.

Yes it would fail. The length is not passed to downstream iterators when chained. Thinking of it, may be it could for generator expressions, but that's a question for the python-dev list.

an exception should be raised

jtratner · 2013-07-24T03:31:06Z

pandas/tests/test_frame.py

+        recons_ar = DataFrame.from_items([('A', array.array('i', range(10)))])
+        recons_rg = DataFrame.from_items([('A', range(10))])
+        recons_np = DataFrame.from_items([('A', np.array(range(10)))])
+        self.assertEquals(tuple(recons_ar['A']),


probably better to use assert_frame_equals(recons_ar, recons_rg) here. where assert_frame_equals is from pandas.util.testing

Indeed.
I'll have to put all this on hold until time again (may be as long as in few weeks).

In the unlikely case someone is in a hurry to see this merged, or irrevocably rejected, he/she should feel free to edit further.

The use of _is_sequence() is convenient but not optimal (not basestring implicitly tested twice)

jreback · 2013-07-24T15:22:45Z

@lgautier I think it would also be nice to incorporate #3478/#3479 in this same PR

and really no hurry on this....0.13 will be brewing for a while

jreback · 2013-09-13T15:19:08Z

closing in favor of #4829

lgautier added 4 commits July 21, 2013 22:10

Typo in error message

d2b85ff

- Comment the source

209c01c

- Upper case for the first letter in exception messages

- Fixed unit test for commit d2b85ff

04732e0

- Added unit test for new capabilities introduced with commit e6c3cd6 (ran only the unit tests for test_frame.py - `nosetests pandas` aborts with an import error)

lgautier mentioned this pull request Jul 22, 2013

API: Missing support for creating a DataFrame from Python native arrays (or other objects supporting the buffer protocol) #4297

Closed

refactored to use core.common._is_sequence

9172436

jreback reviewed Jul 22, 2013
View reviewed changes

lgautier added 2 commits July 23, 2013 13:10

Corrected a comment in the code that was suggesting incorrectly that

4f71e98

an exception should be raised

removed unnecessary test

f131b71

jtratner reviewed Jul 24, 2013
View reviewed changes

Fixed behaviour for mapping vs sequence

8787ef5

The use of _is_sequence() is convenient but not optimal (not basestring implicitly tested twice)

jreback closed this Sep 13, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Arguably better handling of input data in constructor for DataFrame (fix for #4297) #4317

Arguably better handling of input data in constructor for DataFrame (fix for #4297) #4317

lgautier commented Jul 22, 2013

lgautier commented Jul 22, 2013

jreback commented Jul 22, 2013

lgautier commented Jul 22, 2013

lgautier commented Jul 22, 2013

jreback commented Jul 22, 2013

lgautier commented Jul 22, 2013

jreback Jul 22, 2013

ghost Jul 23, 2013

lgautier Jul 23, 2013

cpcloud Jul 23, 2013

lgautier Jul 23, 2013

lgautier Jul 24, 2013

jtratner Jul 24, 2013

cpcloud Jul 24, 2013

lgautier Jul 24, 2013

lgautier Jul 24, 2013

jtratner Jul 24, 2013

lgautier Jul 24, 2013

jreback commented Jul 24, 2013

jreback commented Sep 13, 2013

Arguably better handling of input data in constructor for DataFrame (fix for #4297) #4317

Arguably better handling of input data in constructor for DataFrame (fix for #4297) #4317

Conversation

lgautier commented Jul 22, 2013

lgautier commented Jul 22, 2013

jreback commented Jul 22, 2013

lgautier commented Jul 22, 2013

lgautier commented Jul 22, 2013

jreback commented Jul 22, 2013

lgautier commented Jul 22, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jul 24, 2013

jreback commented Sep 13, 2013