Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arguably better handling of input data in constructor for DataFrame (fix for #4297) #4317

Closed
wants to merge 8 commits into from

Conversation

lgautier
Copy link
Contributor

The problem was basically an odd restriction to list ,tuple, dict, or numpy.ndarray for data structure given as input to any constructor for DataFrame. Much cruder than a possible issue around buffer protocols I thought (the whole issue report went off on a tangent anyway ;-) ).

The source did not have much (any, actually) comments in that region, and I could not find unit tests documenting the behaviour, but I am guessing that the restriction stemmed from the fact that a string is a sequence in Python while the constructors should consider one string passed as data a scalar rather than a sequence.

The change made goes the other way around and accepts any sequence (not just tuple, list, or numpy.ndarray), except str or unicode (the later only applies to Python 2.x). bytes and bytearray are allowed (but I am not sure they should).

The pull request was tested (only tests/test_frame.py was run) on Python 2.7.4 and 3.3.1

lgautier added 4 commits July 21, 2013 22:10
The problem finally seem to be an odd restriction: acceptable sequences
were list, tuple, or numpy.ndarray.

A guess is that strings in Python wanted to be considered scalars,
but the above restriction was implemented rather than take anything
with a __len__ except str (or unicode for Python 2.x) in.

Considerations about the use of the buffer interface are not needed
for now.

note: I cannot run tests on Python 2.7.

```
nosetests pandas
```
ends with:
```
    from . import hashtable, tslib, lib
ImportError: cannot import name hashtable
```
- Upper case for the first letter in exception messages
- Added unit test for new capabilities introduced with commit e6c3cd6

(ran only the unit tests for test_frame.py - `nosetests pandas` aborts
with an import error)
@lgautier
Copy link
Contributor Author

fix for #4297

@jreback
Copy link
Contributor

jreback commented Jul 22, 2013

ok...you will need to hook up to travis (see contributing.md) and push again

also...use com._is_sequence() instead of your direct try: except around len

@lgautier
Copy link
Contributor Author

@jreback : I can track _is_sequence() to be in pandas/core/common.py:

def _is_sequence(x):
    try:
        iter(x)
        len(x) # it has a length
        return not isinstance(x, basestring) and True
    except Exception:
        return False

I think that there is a (silent) possible issue with the logic at the following line:

19dc2847 (Wes McKinney     2012-11-18 23:59:31 -0500 1618)         return not isinstance(x, basestring) and True

(and True does not bring anything).

@lgautier
Copy link
Contributor Author

@jreback
Hook to Travis seems to be in place.

@jreback
Copy link
Contributor

jreback commented Jul 22, 2013

@lgautier

other thing is this is messing with the constructor, so need perf testing

.test_perf -b <base_commit> -t <last commit of this PR>

post (only care about > 1.15 ratio)...(some random things will appear, that's ok)

@lgautier
Copy link
Contributor Author

@jreback

Isn't this leaning a little toward hairsplitting ?

The only effective code difference is:

-            elif isinstance(v, (list, tuple, np.ndarray)):
+            elif com._is_sequence(v):

the rest is comments, or editing the message string associated with raising exceptions when a problem occurs...

@@ -5700,12 +5705,22 @@ def extract_index(data):
elif isinstance(v, dict):
have_dicts = True
indexes.append(v.keys())
elif isinstance(v, (list, tuple, np.ndarray)):
elif com._is_sequence(v):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh....Iyes....either way is prob fine

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unordered sequences were not allowed here previously, why the change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where was the test that the sequences are ordered ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list, tuple, and ndarray objects are ordered sequences meaning if you put element X at index i it will remain there for the lifetime of the object (modulo any operations that you do to remove it or modify it). dict objects and Python Mapping objects cannot generally make the same guarantee (specific objects may enforce an ordering such as collections.OrderedDict). So, in the following code

def _is_sequence(x):
    try:
        iter(x)
        len(x) # it has a length
        return not isinstance(x, basestring) and True
    except Exception:
        return False

a dict object is both iterable and has a length. The orderedness is guaranteed by the type of objects that test True in the conditional that you changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I had understood it as /sorted/ sequence.

While the original code seemingly (see below) guarantees that all items have an immutable order (when modifications of the order are not explicitly made), it does so by also excluding items satisfying this property. Similarly, your comment is leading to the observation that dict is not the only possible mapping. collections.OrderedDict would go as a "sequence", since mappings are accompanied by a warning. [note: I am unsure that the persistence of the structure is required to make an object ordered, which would mean that xrange (Python 2) and range (Python 3) would qualify)].

While I initially looked into this because of array.array, an sequence in the Python stdlib, I felt like generalization would be beneficial.

At the C level there is PySequence_Check() and PyMapping_Check(). Interestingly I don't think that there is an equivalent at the Python level... may be design decision made to encourage duck-typing. Anyway, while PyMapping_Check() is the most interesting source of information about what to check: https://github.com/python-git/python/blob/master/Objects/abstract.c#L2387 . PySequence_Check() is here: https://github.com/python-git/python/blob/master/Objects/abstract.c#L1806 there is no way to ensure that the property is satisfied.

I'd argue that the requirement for orderedness should be mentioned in the documention, and that extract_index() (should be called build_index(), really) relaxes its test for mapping and sequences in order to allow other objects with the right properties. The argument about guarantee does not hold, as shown right below (before the crowd starts brandishing torches and pitchforks, I'd like to stress that the argumentation is moving to the ground of proofs and certainties, and that the sole purpose of the example is to show that there is none - the real argumentation should be around practicality):

class MyWeirdList(list):
    def __getitem__(self, i):
        return super(MyWeirdList, self).__getitem__(0)    
>>> l = MyWeirdList(range(5))
>>> l[3]
0
>>> isinstance(l, list)
True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this in numpy-land that you need to know the size of the object ahead of time to efficiently allocate space for it. There's no standard, widely-used way for a generator to express this information.

Generators can have a __len__, as I wrote.

import sys
if sys.version_info[0] == 2:
    range = xrange # make python 2 behave
r = range(10)
print(len(r)) # prints 10

If you pass a generator to pandas, it treats it just like numpy does, which is to say just considers it a Python object.

# using r as defined above
import numpy
print(numpy.array(r))
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Moreover, if the rationale is to treat objects "just like numpy does" we shall agree that the current behaviour of the constructors for DataFrame is not accommodating enough.

That said, using collections isn't compatible with using the array module in 2.7, e.g. the following

I had no doubt about that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgautier right, but (x for x in range(10)) would fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can compute the space necessary for range/xrange arithmetically

For arbitrary generator expressions you cannot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud

You can compute the space necessary for range/xrange arithmetically

For arbitrary generator expressions you cannot.

We are very explicitly talking about generators with a __len__. How the length is calculated (implemented in __len__) is pretty much irrelevant.

On a related note, the question of the size of iterators motivated a PEP 424 and might be more frequent than one thinks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtratner

right, but (x for x in range(10)) would fail.

Yes it would fail. The length is not passed to downstream iterators when chained. Thinking of it, may be it could for generator expressions, but that's a question for the python-dev list.

recons_ar = DataFrame.from_items([('A', array.array('i', range(10)))])
recons_rg = DataFrame.from_items([('A', range(10))])
recons_np = DataFrame.from_items([('A', np.array(range(10)))])
self.assertEquals(tuple(recons_ar['A']),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably better to use assert_frame_equals(recons_ar, recons_rg) here. where assert_frame_equals is from pandas.util.testing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.
I'll have to put all this on hold until time again (may be as long as in few weeks).

In the unlikely case someone is in a hurry to see this merged, or irrevocably rejected, he/she should feel free to edit further.

The use of _is_sequence() is convenient but not optimal (not basestring
implicitly tested twice)
@jreback
Copy link
Contributor

jreback commented Jul 24, 2013

@lgautier I think it would also be nice to incorporate #3478/#3479 in this same PR

and really no hurry on this....0.13 will be brewing for a while

@jreback
Copy link
Contributor

jreback commented Sep 13, 2013

closing in favor of #4829

@jreback jreback closed this Sep 13, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants