API: Missing support for creating a DataFrame from Python native arrays (or other objects supporting the buffer protocol) #4297

Closed
lgautier opened this issue Jul 19, 2013 · 17 comments · Fixed by #4829
Labels: API Design, IO Data

@lgautier
Contributor

```
import pandas
import array

x = array.array('i', range(10))

pdf = pandas.DataFrame.from_items([('a', x)])
# fails with:
# ValueError: If use all scalar values, must pass index
```

Also, since x has a `__len__`, maybe the error message should not refer to a scalar.

Explicit casting to numpy arrays is a workaround, but it would seem convenient to have it done within the DataFrame constructor:

```
import numpy
pdf = pandas.DataFrame.from_items([('a', numpy.asarray(x))])
```
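
Since array.array already exposes the buffer protocol, the wrapping can even be zero-copy; a sketch (numpy.frombuffer shares the memory rather than copying it, and the int32 dtype assumes a platform where typecode 'i' is 32 bits):

```
import array
import numpy
import pandas

x = array.array('i', range(10))
# wrap the array's buffer without copying; the dtype must match the typecode
a = numpy.frombuffer(x, dtype=numpy.int32)
pdf = pandas.DataFrame.from_items([('a', a)])
```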
@jreback
Contributor

jreback commented Jul 19, 2013

Out of curiosity, why are you using a Python array as opposed to a numpy array to begin with?

@lgautier
Contributor Author

I am not seeing an opposition between the use of one rather than the other; I use either (or both) depending on the need.

When I pick Python's array, the reason can be one of:

  • Shipping with Python
  • Fast:

```
$ python -m timeit -s 'import array' -s 'a = array.array("i", range(1000))' -- 'for e in a:' '  pass'
100000 loops, best of 3: 15.8 usec per loop
$ python -m timeit -s 'import numpy' -s 'a = numpy.array(range(1000))' -- 'for e in a:' '  pass'
10000 loops, best of 3: 77.2 usec per loop
```

  • Lightweight implementation of the buffer protocol (to try things out)

@jreback
Contributor

jreback commented Jul 19, 2013

Your first point is valid, though numpy is widely installed nowadays, as it's a base for most other packages.

Your speed test is not testing anything; try doing actual operations:

```
In [25]: x = array.array('i', range(100000))

In [28]: %timeit sum(x)
1000 loops, best of 3: 1.09 ms per loop

In [29]: y = np.arange(100000)

In [30]: %timeit y.sum()
10000 loops, best of 3: 72.7 us per loop
```

numpy supports the buffer protocol.
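
A quick illustration of that two-way support (a sketch; the int32 dtype and the 'i' format code assume a platform with 32-bit C ints):

```
import array
import numpy as np

a = array.array('i', range(5))
v = np.frombuffer(a, dtype=np.int32)          # numpy consuming a buffer, zero-copy
m = memoryview(np.arange(5, dtype=np.int32))  # numpy exporting a buffer
print(v.sum(), m.format)                      # 10 i
```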

@lgautier
Contributor Author

Note: the issue report is about a missing functionality. Should you wish (and have the power) to decide not to implement it, just close the report (grey button). If you wish to tell the world that Python's arrays have no use, the Python mailing lists are where you should head ;-)

Otherwise:

  1. The claim that numpy is a base for most Python packages is somewhat bold. Data would help back it up.
  2. I am comparing the time to access elements in the respective arrays with the exact same code. You are comparing a generic Python function with a specialized C extension. Outside of the canned C extensions, Python's arrays are faster.
  3. "lightweight" (as in "shipping with Python - no dependency involved")

@jreback
Contributor

jreback commented Jul 19, 2013

@lgautier I don't have a problem with your request.

I am simply pointing out that IMHO there is no reason to use array over numpy, ever, for any kind of actual computation. Yes, array.array has its uses, but 99% of the time numpy or lists are the answer.

Your 3rd point is also very odd; if you want lightweight, then how would you expect to use pandas, which relies on numpy?

Your second point is also not a use case. Accessing elements is not a relevant benchmark, as it tests a very specific criterion (which maybe array is optimized for); numpy is optimized for a much broader range of computations.

As to your first point, http://sourceforge.net/projects/numpy/files/NumPy/stats/timeline?dates=2013-01-01+to+2013-07-19 seems like a lot to me.

@lgautier
Contributor Author

I'd like to keep the issue report on track: this is about suggesting that objects supporting the buffer protocol should be accepted by the DataFrame constructors (Python's arrays being the easiest way to get such objects, since the array module is part of the Python standard library).

To keep answering what has, in my opinion, moved into an extraneous discussion (maybe because of mutual incomprehension):

Point 1: I looked at the link, and I really could not get the slightest hint that "numpy is a base for most Python packages" (if the thread were more relaxed I could have remarked that the graph caught my attention first, and that a striking feature is the decreasing number of downloads over time... ;-) ). If the number of downloads is what I should be looking at, that's not a small number, but a little perspective would not hurt:

```
$ vanity sqlalchemy | tail -n 1
SQLAlchemy has been downloaded 2183196 times!
$ vanity numpy | tail -n 1
numpy has been downloaded 883206 times!
```

I would certainly not claim that most Python packages are based on SQLAlchemy, nor even that most Python packages having anything to do with SQL use SQLAlchemy.
My point was that one of the general reasons I occasionally use array, in answer to your initial question, is that it ships with Python (so it is always there, including on alternative Python implementations).

Point 2: accessing elements in an array quickly is either not an operation, or too specialized?! Come on!
Again, my point here is to give one of the general reasons I occasionally use array, in answer to your initial question. I said that Python arrays are fast, and backed it up with the exact same Python code on arrays from numpy and array. I initially did not write /faster/, but you looked for it ;-)

Point 3: that is also in answer to your question about why one would ever use Python's arrays. I just listed what sometimes makes me use array over numpy. pandas requires numpy, but other packages do not necessarily (see point 1; by the way, did I just see a gimmick "99%" thrown around without backing data? Tsk, tsk, tsk... not when arguing about scientific computing... ;-) ). The buffer interface is there to allow loose coupling between packages, and the array module in Python is there precisely to allow that. Should one wish to do operations with numpy, it can be imported. This issue was unfortunately hijacked, but in the original context this means that pandas could be imported whenever required, and the vectors bundled into a DataFrame, as sketched below.
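
A minimal sketch of that deferred-import pattern (the helper name is hypothetical, and DataFrame.from_items is used only because it appears earlier in this thread):

```
import array

def to_dataframe(named_vectors):
    # hypothetical helper: numpy/pandas are imported only when a
    # DataFrame is actually wanted, so the calling code stays
    # dependency-free until then
    import numpy
    import pandas
    return pandas.DataFrame.from_items(
        [(name, numpy.asarray(v)) for name, v in named_vectors])

x = array.array('i', range(10))
df = to_dataframe([('a', x)])
```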

@jreback
Contributor

jreback commented Jul 20, 2013

ok... I have marked this for 0.13. Not a big deal.

Your point 3 is obviously completely false; you cannot use pandas without numpy. It doesn't matter whether it fully supports array.array or not. That is irrelevant.

@lgautier
Contributor Author

> ok... I have marked this for 0.13.

Thanks.

> Not a big deal.

I don't want to be arguing over a controversial issue then... ;-)

> Your point 3 is obviously completely false; you cannot use pandas without numpy. It doesn't matter whether it fully supports array.array or not. That is irrelevant.

Could you point out which part is (allegedly) false? I'd be happy to try clarifying.

Your comment suggests that you might have understood me as claiming that pandas can be used without numpy, which leaves me quite puzzled, since I wrote the exact opposite:

> pandas requires numpy

@cpcloud
Member

cpcloud commented Jul 20, 2013

@lgautier

  1. In point 3, you seem to be implying that users should have a choice between using array or numpy when using pandas. This simply will never happen. Furthermore, array instances will be turned into numpy arrays anyway, if this is supported, so at best it introduces a redundancy for the very rare case (and no, I don't have any empirical evidence for this, just anecdotal) of using array instead of numpy. IMHO, it's very silly to use array instead of numpy. Element access is a weak benchmark: most of the time you're not using arrays because you want fast element-by-element access; you want to do operations on the whole array, so that benchmark isn't very realistic. If you're going to use pandas then you need numpy, so again, why not just use that?
  2. Yes, other packages don't require numpy, but so what? pandas will probably have numpy as a dependency for the foreseeable future.
  3. Can you honestly provide a real-world use case where array outperforms numpy, other than element access?
  4. Python's arrays are a "canned" C extension:

```
Python 2.7.5 (default, May 17 2013, 07:55:04)
[GCC 4.8.0 20130502 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import array
>>> array
<module 'array' from '/home/phillip/.virtualenvs/pandas/lib/python2.7/lib-dynload/array.so'>
```

> When outside of the canned C-extensions, Python's arrays are faster.

What evidence is there to support this, other than the rarely-used element-by-element access in pure Python? A factor of ~3 for a rare use case is not evidence.

@jreback
Contributor

jreback commented Jul 20, 2013

@lgautier

Sorry this got out of hand :)

I was honestly trying to find a use case for arrays, so I was interested in why you were using them.

@cpcloud
Member

cpcloud commented Jul 20, 2013

I think supporting objects that implement the buffer protocol is probably okay. E.g., you could pass a Cython memoryview array into DataFrame. This would be useful if we get the ball rolling on using fused types, since we could just return Cython arrays and not have to worry about converting them to numpy arrays in Cython. Although, I'm not sure any of the Cython routines are used at such a high level; I would imagine they are called by Python functions that DataFrame calls, and thus it would never even see any Cython arrays.
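
A plain memoryview can stand in for a Cython typed memoryview to sketch what that would look like; here np.frombuffer plays the role of the conversion DataFrame would otherwise have to do internally (names and dtypes are illustrative):

```
import array
import numpy as np
import pandas as pd

# anything exposing the buffer protocol would go down the same path
mv = memoryview(array.array('d', [1.0, 2.0, 3.0]))
col = np.frombuffer(mv, dtype=np.float64)  # typecode 'd' maps to float64
df = pd.DataFrame({'a': col})
```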

@jreback
Contributor

jreback commented Jul 20, 2013

@cpcloud you could easily do that in wrappers; DataFrame is too high level for that, and in any event you usually need the three different arrays (data, index, columns) to really do anything.

@cpcloud
Member

cpcloud commented Jul 20, 2013

Yeah, I was thinking something similar. It would cross too much of the API for something probably not that useful.

@lgautier
Contributor Author

@cpcloud

> IMHO, it's very silly to use array instead of numpy. Element access is a weak benchmark: most of the time you're not using arrays because you want fast element-by-element access; you want to do operations on the whole array, so that benchmark isn't very realistic.

Now we have a combo: unsubstantiated judgment (first part), lecturing people about what their code is doing (without knowing much of the said code), and comments (there and in the rest of the post) largely showing a miscomprehension of what is being discussed.

Sorry for teasing, but can you read the thread?

@cpcloud
Member

cpcloud commented Jul 20, 2013

@lgautier Why don't you just submit a PR, then, instead of wasting time on pedantic arguments? It sounds like you're the only one who understands what's going on in your head, so let the code speak for itself.

@lgautier
Contributor Author

@jreback

> Sorry this got out of hand :)

I am tempted to say: "I did not expect the Spanish Inquisition" ;-)

> I was honestly trying to find a use case for arrays, so I was interested in why you were using them.

Sure.

Note that I understand the question as:
"why would one ever use array instead of numpy?"

("where does array fit when pandas requires numpy?" is for later).

While numpy provides a number of useful operations on arrays (initially matrix multiplication and linear algebra, by wrapping BLAS/LAPACK), there are situations where the most important part is to have an efficient structure for storing fixed-size numerical arrays (lists or tuples quickly use much more memory, as the sketch below shows)... and Python's array provides just that.
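
A quick demonstration of the memory point (a sketch; the exact numbers depend on the interpreter, here assumed to be a 64-bit CPython):

```
import array
import sys

n = 1000
lst = list(range(n))
arr = array.array('i', range(n))

# the list stores pointers to boxed int objects; the array stores raw C ints
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)
print(list_bytes)           # roughly 36 KB
print(sys.getsizeof(arr))   # roughly 4 KB
```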

To give a concrete example, I can have a simple module "foo.py" (the typecodes are assumptions about how the .dat files were written):

```
from array import array
from os.path import getsize
import bisect

def load_data():
    # sorted genomic positions
    genome_pos = array('l')
    with open('genome_pos.dat', 'rb') as f:
        genome_pos.fromfile(f, getsize('genome_pos.dat') // genome_pos.itemsize)
    # some signal (let's keep it short, this is an example)
    signal = array('d')
    with open('signal.dat', 'rb') as f:
        signal.fromfile(f, getsize('signal.dat') // signal.itemsize)
    return (genome_pos, signal)

def interval_indices(genome_pos, pos_a, pos_b):
    # genome_pos is sorted, so bisection gives the slice bounds
    i_a = bisect.bisect(genome_pos, pos_a)
    i_b = bisect.bisect(genome_pos, pos_b)
    return (i_a, i_b)
```

That module can be used to put together a quick application:

```
import foo
genome_pos, signal = foo.load_data()

from flask import Flask, render_template
app = Flask(__name__)

@app.route('/slice/<int:pos_a>/<int:pos_b>')
def hello(pos_a=None, pos_b=None):
    # the int converters in the route ensure positions arrive as integers
    i_a, i_b = foo.interval_indices(genome_pos, pos_a, pos_b)
    return render_template('xyplot.html',
                           x=genome_pos[i_a:i_b], y=signal[i_a:i_b])

if __name__ == '__main__':
    app.run()
```

numpy could be used in place of array, but array is up to the task and ships with Python. I often have to write code to be run elsewhere, and I weigh every dependency on third-party modules.

This leads to the second question:
"where does array fit when pandas requires numpy?"

The convenience of the buffer interface is that, should computation with numpy (or OpenCV... just to be annoying ;-) ) be wished for, the underlying data can be wrapped up seamlessly. I can ensure minimal capabilities without requiring numpy or anything else.

lgautier added a commit to lgautier/pandas that referenced this issue Jul 21, 2013
The problem finally seems to be an odd restriction: acceptable sequences
were list, tuple, or numpy.ndarray.

A guess is that strings in Python were meant to be considered scalars,
but the above restriction was implemented rather than accepting anything
with a __len__ except str (or unicode for Python 2.x).

Considerations about the use of the buffer interface are not needed
for now.

note: I cannot run tests on Python 2.7.

```
nosetests pandas
```
ends with:
```
    from . import hashtable, tslib, lib
ImportError: cannot import name hashtable
```
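
A sketch of the kind of check the commit message describes (the helper name is hypothetical; the actual change is in the pull request):

```
def is_sequence_like(obj):
    # accept anything sized except strings, instead of whitelisting
    # list/tuple/numpy.ndarray; on Python 2 the check would also need
    # to exclude unicode
    return hasattr(obj, '__len__') and not isinstance(obj, str)
```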
@lgautier
Contributor Author

The buffer interface turned out to be out of the equation for the fix.

Pull request: #4317
