Cannot produce DataFrame from dict-yielding iterator. #2193
Comments
To the best of my knowledge, DataFrames are not lazy. In theory, you might have been able to avoid having two copies of the data in memory at the same time. |
Could you please let me know what API you would want on this issue? |
perhaps something like |
Sorry for the late reply. :) It seems the DataFrame constructor is already pretty heavily overloaded; still, my initial desire would be for passing the iterator straight to the constructor to result in a valid frame. Looking at the code, it seems possible to add (yet another) fall-through for this case, however ryanwitt's suggestion is probably saner, given that the more tests are added to that constructor, the more magical and unpredictable it will become. The other problem is how to check for an iterable sanely: options appear to be attempting iter() and catching the TypeError, or testing for an __iter__ attribute. |
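A minimal sketch of the two duck-typing options alluded to above (assuming Python 2, as in the rest of the thread; the function names are invented for illustration):

```python
import collections

def is_iterable_lbyl(obj):
    # "Look before you leap": abstract base class check (Python 2.6+).
    return isinstance(obj, collections.Iterable)

def is_iterable_eafp(obj):
    # "Easier to ask forgiveness": just try to obtain an iterator.
    try:
        iter(obj)
        return True
    except TypeError:
        return False

# Caveat: both treat strings as iterable, which a DataFrame
# constructor fall-through would need to special-case.
```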
+1 for this. I think something like it would be nice to have. For example, I have the following utility function, which imitates expand.grid() in R:

```python
import itertools as it
import operator as op
import pandas as pd

def expand_grid(*args, **kwargs):
    columns = []
    lst = []
    if args:
        columns += xrange(len(args))
        lst += args
    if kwargs:
        columns += kwargs.iterkeys()
        lst += kwargs.itervalues()
    return pd.DataFrame(list(it.product(*lst)), columns=columns)

print expand_grid([0, 1], [1, 2, 3])
print expand_grid(a=[0, 1], b=[1, 2, 3])
print expand_grid([0, 1], b=[1, 2, 3])
```

It would be nice if we didn't need the list() constructor in the last line of the function and could avoid creating a temporary copy of all the tuples. |
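For what it's worth, later pandas releases added iterator support to DataFrame.from_records, including an nrows argument to bound how much of the stream is consumed; assuming such a version, the temporary list can be dropped along these lines (a sketch, with the exact version where this landed left unverified):

```python
import itertools as it
import pandas as pd

# On a pandas whose from_records accepts iterators, the product
# iterator is consumed directly, with no intermediate list of tuples.
rows = it.product([0, 1], [1, 2, 3])
df = pd.DataFrame.from_records(rows, columns=['a', 'b'])
```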
Since in my case the total number of rows can be computed upfront, I tried to see if I could get it to work with some duck typing. Alas, I got stuck when it calls out to lib.to_object_array_tuples:

```python
def expand_grid3(*args, **kwargs):
    columns = []
    lst = []
    if args:
        columns += xrange(len(args))
        lst += args
    if kwargs:
        columns += kwargs.iterkeys()
        lst += kwargs.itervalues()

    class _proditer(list):
        def __init__(self, l):
            self.N = reduce(op.mul, map(len, l), 1)
            self.iter_ = it.product(*l)

        def __iter__(self):
            return self

        def next(self):
            return next(self.iter_)

        def __len__(self):
            return self.N

        def __getitem__(self, item):
            # from_records peeks at data[0] to sniff the row type, so
            # support item 0 by pushing the first row back onto the iterator.
            if item == 0:
                value = next(self.iter_)
                self.iter_ = it.chain(iter([value]), self.iter_)
                return value
            else:
                raise NotImplementedError("item=={}".format(item))

    lst = _proditer(lst)
    return pd.DataFrame.from_records(lst, index=range(len(lst)), columns=columns)

print expand_grid3([0, 1], [1, 2, 3])
```

but this fails with:

```
In [3]: expand_grid3([0,1], [1,2,3])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-f632bd646903> in <module>()
----> 1 expand_grid3([0,1], [1,2,3])

/home/tobias/code/dev/argon/scripts/expand_grid.py in expand_grid3(*args, **kwargs)
     68                 raise NotImplementedError("item=={}".format(item))
     69     lst = _proditer(lst)
---> 70     return pd.DataFrame.from_records(lst, index=range(len(lst)), columns=columns)
     71
     72 print expand_grid3([0,1], [1,2,3])

/home/tobias/code/envs/mac/local/lib/python2.7/site-packages/pandas-0.9.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in from_records(cls, data, index, exclude, columns, names, coerce_float)
    844         else:
    845             sdict, columns = _to_sdict(data, columns,
--> 846                                        coerce_float=coerce_float)
    847
    848         if exclude is None:

/home/tobias/code/envs/mac/local/lib/python2.7/site-packages/pandas-0.9.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _to_sdict(data, columns, coerce_float)
   4958         return {}, columns
   4959     if isinstance(data[0], (list, tuple)):
-> 4960         return _list_to_sdict(data, columns, coerce_float=coerce_float)
   4961     elif isinstance(data[0], dict):
   4962         return _list_of_dict_to_sdict(data, columns, coerce_float=coerce_float)

/home/tobias/code/envs/mac/local/lib/python2.7/site-packages/pandas-0.9.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _list_to_sdict(data, columns, coerce_float)
   4971 def _list_to_sdict(data, columns, coerce_float=False):
   4972     if len(data) > 0 and isinstance(data[0], tuple):
-> 4973         content = list(lib.to_object_array_tuples(data).T)
   4974     elif len(data) > 0:
   4975         # list of lists

TypeError: Argument 'rows' has incorrect type (expected list, got _proditer)
```
|
The form with an explicit list() around the generator works; passing the generator itself does not:

```
In [46]: def data_source():
    ...:     for row_count in xrange(10):
    ...:         yield np.random.rand(5)
    ...: pd.DataFrame(list(data_source()))
    ...:
Out[46]:
          0         1         2         3         4
0  0.241975  0.336069  0.584298  0.990567  0.628024
1  0.491197  0.153000  0.069927  0.923961  0.321994
2  0.331411  0.135619  0.742957  0.917101  0.751598
3  0.259744  0.566237  0.854419  0.478699  0.322557
4  0.237514  0.725468  0.258899  0.106149  0.238813
5  0.703378  0.905303  0.200714  0.492119  0.050031
6  0.881858  0.199581  0.057018  0.855477  0.094684
7  0.006712  0.081865  0.359633  0.758901  0.587118
8  0.511695  0.788931  0.753998  0.159687  0.678947
9  0.267654  0.840976  0.278592  0.051802  0.704802

In [47]: pd.DataFrame(data_source())
---------------------------------------------------------------------------
PandasError                               Traceback (most recent call last)
<ipython-input-47-5667ec91c127> in <module>()
----> 1 pd.DataFrame(data_source())

/home/user1/src/pandas/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    445                                          copy=False)
    446         else:
--> 447             raise PandasError('DataFrame constructor not properly called!')
    448
    449         NDFrame.__init__(self, mgr)

PandasError: DataFrame constructor not properly called!

In [48]: type(data_source())
Out[48]: generator
```
|
+1, this would be nice to have. The API could be similar to numpy.fromiter, where memory can be preallocated if you know the number of rows. The main use would be to avoid having two copies of the same data in memory at the same time. |
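To illustrate the numpy.fromiter pattern being referenced (a sketch; note that fromiter only gained structured-dtype support in later NumPy releases, and the field names here are invented for the example):

```python
import itertools as it
import numpy as np
import pandas as pd

rows = it.product([0, 1], [1, 2, 3])
n = 2 * 3  # total row count, computable upfront

# count=n lets fromiter preallocate the whole buffer, so the iterator
# is consumed straight into it with no intermediate list of tuples.
arr = np.fromiter(rows, dtype=[('a', 'i8'), ('b', 'i8')], count=n)
df = pd.DataFrame.from_records(arr)
```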
you can't really preallocate this; instead you allocate chunks and fill. Appending as needed could be done, but that's only necessary for really, really large structures |
@jreback, this just came to mind, and it seems like there's a possible memory win here after all. 0.14? from_records seems to be doing the wrong thing memory-wise as well. Am I missing something obvious? |
yep not that hard |
FYI, could refactor from_records in any event to be more integrated into the regular DataFrame ctor |
somewhat related (if from_records is refactored), #4464 |
Comparing memory footprint in a roundabout way:
so a roughly 4x reduction in footprint by not materializing fully in memory. |
I don't know why you think that is a valid measurement. They are 2 different formats: npz is a specific binary format, whereas pickle is a text-ified binary repr |
I was trying to measure the savings from eliminating the per-object overhead of python objects. Will look around for a better way to measure in-memory python footprint. |
Second try:
Python objects require 32 bytes to store an additional int64, while a numpy array requires only 8 (64 bits), a 4x difference. That's an added bonus on top of the basic win of not having an extra copy of the data in memory during construction. Is that a valid conclusion in your eyes? |
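A rough way to see the per-element overhead being described (a sketch in the thread's Python 2; exact byte counts vary by platform and interpreter version):

```python
import sys
import numpy as np

n = 1000000
ints = range(n)                      # list of Python int objects
arr = np.arange(n, dtype=np.int64)   # contiguous int64 buffer

# The list holds a pointer per element (8 bytes on 64-bit) plus a
# separate int object per element (~24 bytes each on CPython 2).
list_bytes = sys.getsizeof(ints) + sum(sys.getsizeof(i) for i in ints)
arr_bytes = arr.nbytes               # 8 bytes per int64 element

print list_bytes / float(n)  # roughly 32 bytes per element
print arr_bytes / float(n)   # 8.0
```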
did you try memory_profiler (use via %memit) |
It monitors overall process memory; I don't trust it for detailed measurement. |
oh I agree numpy arrays are more efficient, but the point is that they ARE starting as python objects, so I'm not sure this makes much difference in practice, as it should be chunked anyhow (THAT is how to save memory) |
They are starting as python objects, but potentially only one row at a time; that's the iterator's business. Isn't that right? Update: modulo the fact that O(x) = O(5x)... :) Far better constants is what I meant. |
I agree.... I just think the simpler solution is to chunk the stream into, say, x rows, then create frames and concatenate at the end, rather than handling this one row at a time. That makes the code much simpler, with a guaranteed peak max mem usage. |
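A minimal sketch of that chunking strategy (the helper name and chunk size here are made up for illustration):

```python
import itertools as it
import pandas as pd

def frame_from_iter(records, columns, chunksize=10000):
    # Consume the iterator chunk by chunk: peak memory is bounded by
    # one chunk of raw tuples plus the frames built so far.
    frames = []
    while True:
        chunk = list(it.islice(records, chunksize))
        if not chunk:
            break
        frames.append(pd.DataFrame.from_records(chunk, columns=columns))
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)

df = frame_from_iter(it.product([0, 1], [1, 2, 3]), columns=['a', 'b'])
```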
Absolutely. I read your comments in the other thread, and it's a losing battle to support ever-larger datasets, but a 5x reduction in max memory usage is worth doing, I think, even if it won't solve everyone's not-small-data problems. |
agreed.... #5902 looks interesting. IMHO it's going to be non-trivial to do, but possible. |
yeah, though we discussed this recently before that, and @elyase made a comment to the same effect |
What's the rationale for DataFrame() accepting a generator as input but not an iterator? It converts the generator to a list anyway, so why not do the same with an iterator? (Note I'm not talking about performance, just convenience.) |
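For reference, sufficiently recent pandas versions do appear to normalize any iterator the same way the generator is handled (a sketch, with the exact version where this changed left unverified):

```python
import pandas as pd

gen = (i * i for i in range(3))   # generator
itr = iter([0, 1, 4])             # plain iterator

# On a recent pandas, both inputs are materialized with list()
# internally and yield the same single-column frame.
print(pd.DataFrame(gen))
print(pd.DataFrame(itr))
```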
Hi there,
I have a codebase that (so far) consumes data series in the form of dict-yielding iterators, similar in form to the list-of-dicts input that DataFrame's constructor already accepts. Unfortunately, it seems from the current implementation of that constructor that there is no straightforward way to consume an iterator straight into a DataFrame.
For example, I have a row factory for sqlite3, written in C, that produces dicts. If pandas accepted an iterator, it would be possible to consume a series directly from speedy storage code into speedy data-structure code. :)
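For concreteness, a pure-Python stand-in for such a row factory (the C version mentioned above isn't shown; this is the standard sqlite3 row_factory hook, with list() as the extra copy this issue hopes to eliminate):

```python
import sqlite3
import pandas as pd

def dict_factory(cursor, row):
    # Map each result row to a {column_name: value} dict.
    return dict((col[0], row[i]) for i, col in enumerate(cursor.description))

conn = sqlite3.connect(':memory:')
conn.row_factory = dict_factory
conn.execute('CREATE TABLE t (a INTEGER, b TEXT)')
conn.executemany('INSERT INTO t VALUES (?, ?)', [(1, 'x'), (2, 'y')])

cursor = conn.execute('SELECT * FROM t')  # dict-yielding iterator
df = pd.DataFrame(list(cursor))           # list() is the copy to avoid
```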
Is there an inherent limitation preventing this? Otherwise, with a few pointers, I would be willing to try to contribute the patch myself.
Thanks,
David