-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arguably better handling of input data in constructor for DataFrame (fix for #4297) #4317
Conversation
The problem finally seem to be an odd restriction: acceptable sequences were list, tuple, or numpy.ndarray. A guess is that strings in Python wanted to be considered scalars, but the above restriction was implemented rather than take anything with a __len__ except str (or unicode for Python 2.x) in. Considerations about the use of the buffer interface are not needed for now. note: I cannot run tests on Python 2.7. ``` nosetests pandas ``` ends with: ``` from . import hashtable, tslib, lib ImportError: cannot import name hashtable ```
- Upper case for the first letter in exception messages
- Added unit test for new capabilities introduced with commit e6c3cd6 (ran only the unit tests for test_frame.py - `nosetests pandas` aborts with an import error)
fix for #4297 |
ok...you will need to hook up to travis (see contributing.md) and push again also...use |
@jreback : I can track def _is_sequence(x):
try:
iter(x)
len(x) # it has a length
return not isinstance(x, basestring) and True
except Exception:
return False I think that there is a (silent) possible issue with the logic at the following line:
( |
@jreback |
other thing is this is messing with the constructor, so need perf testing
post (only care about > 1.15 ratio)...(some random things will appear, that's ok) |
Isn't this leaning a little toward hairsplitting ? The only effective code difference is: - elif isinstance(v, (list, tuple, np.ndarray)):
+ elif com._is_sequence(v): the rest is comments, or editing the message string associated with raising exceptions when a problem occurs... |
@@ -5700,12 +5705,22 @@ def extract_index(data): | |||
elif isinstance(v, dict): | |||
have_dicts = True | |||
indexes.append(v.keys()) | |||
elif isinstance(v, (list, tuple, np.ndarray)): | |||
elif com._is_sequence(v): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh....Iyes....either way is prob fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unordered sequences were not allowed here previously, why the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where was the test that the sequences are ordered ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list
, tuple
, and ndarray
objects are ordered sequences meaning if you put element X
at index i
it will remain there for the lifetime of the object (modulo any operations that you do to remove it or modify it). dict
objects and Python Mapping
objects cannot generally make the same guarantee (specific objects may enforce an ordering such as collections.OrderedDict
). So, in the following code
def _is_sequence(x):
try:
iter(x)
len(x) # it has a length
return not isinstance(x, basestring) and True
except Exception:
return False
a dict
object is both iterable and has a length. The orderedness is guaranteed by the type of objects that test True
in the conditional that you changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I had understood it as /sorted/ sequence.
While the original code seemingly (see below) guarantees that all items have an immutable order (when modifications of the order are not explicitly made), it does so by also excluding items satisfying this property. Similarly, your comment is leading to the observation that dict
is not the only possible mapping. collections.OrderedDict
would go as a "sequence", since mappings are accompanied by a warning. [note: I am unsure that the persistence of the structure is required to make an object ordered, which would mean that xrange
(Python 2) and range
(Python 3) would qualify)].
While I initially looked into this because of array.array
, an sequence in the Python stdlib, I felt like generalization would be beneficial.
At the C level there is PySequence_Check()
and PyMapping_Check()
. Interestingly I don't think that there is an equivalent at the Python level... may be design decision made to encourage duck-typing. Anyway, while PyMapping_Check()
is the most interesting source of information about what to check: https://github.com/python-git/python/blob/master/Objects/abstract.c#L2387 . PySequence_Check()
is here: https://github.com/python-git/python/blob/master/Objects/abstract.c#L1806 there is no way to ensure that the property is satisfied.
I'd argue that the requirement for orderedness should be mentioned in the documention, and that extract_index()
(should be called build_index()
, really) relaxes its test for mapping and sequences in order to allow other objects with the right properties. The argument about guarantee does not hold, as shown right below (before the crowd starts brandishing torches and pitchforks, I'd like to stress that the argumentation is moving to the ground of proofs and certainties, and that the sole purpose of the example is to show that there is none - the real argumentation should be around practicality):
class MyWeirdList(list):
def __getitem__(self, i):
return super(MyWeirdList, self).__getitem__(0)
>>> l = MyWeirdList(range(5))
>>> l[3]
0
>>> isinstance(l, list)
True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for this in numpy-land that you need to know the size of the object ahead of time to efficiently allocate space for it. There's no standard, widely-used way for a generator to express this information.
Generators can have a __len__
, as I wrote.
import sys
if sys.version_info[0] == 2:
range = xrange # make python 2 behave
r = range(10)
print(len(r)) # prints 10
If you pass a generator to pandas, it treats it just like numpy does, which is to say just considers it a Python object.
# using r as defined above
import numpy
print(numpy.array(r))
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Moreover, if the rationale is to treat objects "just like numpy does" we shall agree that the current behaviour of the constructors for DataFrame
is not accommodating enough.
That said, using collections isn't compatible with using the array module in 2.7, e.g. the following
I had no doubt about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgautier right, but (x for x in range(10))
would fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can compute the space necessary for range
/xrange
arithmetically
For arbitrary generator expressions you cannot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can compute the space necessary for range/xrange arithmetically
For arbitrary generator expressions you cannot.
We are very explicitly talking about generators with a __len__
. How the length is calculated (implemented in __len__
) is pretty much irrelevant.
On a related note, the question of the size of iterators motivated a PEP 424 and might be more frequent than one thinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, but (x for x in range(10)) would fail.
Yes it would fail. The length is not passed to downstream iterators when chained. Thinking of it, may be it could for generator expressions, but that's a question for the python-dev list.
an exception should be raised
recons_ar = DataFrame.from_items([('A', array.array('i', range(10)))]) | ||
recons_rg = DataFrame.from_items([('A', range(10))]) | ||
recons_np = DataFrame.from_items([('A', np.array(range(10)))]) | ||
self.assertEquals(tuple(recons_ar['A']), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably better to use assert_frame_equals(recons_ar, recons_rg)
here. where assert_frame_equals
is from pandas.util.testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed.
I'll have to put all this on hold until time again (may be as long as in few weeks).
In the unlikely case someone is in a hurry to see this merged, or irrevocably rejected, he/she should feel free to edit further.
The use of _is_sequence() is convenient but not optimal (not basestring implicitly tested twice)
closing in favor of #4829 |
The problem was basically an odd restriction to
list
,tuple
,dict
, ornumpy.ndarray
for data structure given as input to any constructor for DataFrame. Much cruder than a possible issue around buffer protocols I thought (the whole issue report went off on a tangent anyway ;-) ).The source did not have much (any, actually) comments in that region, and I could not find unit tests documenting the behaviour, but I am guessing that the restriction stemmed from the fact that a string is a sequence in Python while the constructors should consider one string passed as data a scalar rather than a sequence.
The change made goes the other way around and accepts any sequence (not just
tuple
,list
, ornumpy.ndarray
), exceptstr
orunicode
(the later only applies to Python 2.x).bytes
andbytearray
are allowed (but I am not sure they should).The pull request was tested (only
tests/test_frame.py
was run) on Python 2.7.4 and 3.3.1