Cannot produce DataFrame from dict-yielding iterator. #2193

Closed · dw opened this issue Nov 7, 2012 · 26 comments · Fixed by #21987

Labels: Enhancement, Internals (Related to non-user accessible pandas implementation)
@dw commented Nov 7, 2012

Hi there,

I have a codebase that (so far) consumes data series in the form of dict-yielding iterators, similar in shape to the list of dicts that DataFrame's constructor already accepts. Unfortunately, given the current implementation of that constructor, there seems to be no straightforward way to consume an iterator straight into a DataFrame.

For example, I have a row factory for sqlite3, written in C, that produces dicts. If pandas accepted an iterator, it would be possible to feed a series directly from speedy storage code into speedy data structure code. :)
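For illustration, a minimal sketch of the intended usage, with a plain Python dict row factory standing in for the C one (the database file and table name here are made up):

import sqlite3
import pandas as pd

def dict_factory(cursor, row):
    # stand-in for the C row factory: one dict per row
    return {col[0]: value for col, value in zip(cursor.description, row)}

conn = sqlite3.connect("example.db")  # hypothetical database
conn.row_factory = dict_factory
rows = conn.execute("SELECT * FROM prices")  # a dict-yielding iterator

# Desired: pass the iterator straight to the constructor...
# df = pd.DataFrame(rows)
# ...instead of having to materialize a list first:
df = pd.DataFrame(list(rows))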

Is there an inherent limitation preventing this? Otherwise, with a few pointers, I would be willing to try to contribute the patch myself.

Thanks,

David

@ghost commented Nov 9, 2012

To the best of my knowledge, DataFrames are not lazy.

For example, in the absence of an index arg to the df ctor, len(data) is used to determine the size of the index. Assuming you'll need all the data in memory anyway, I don't see how having a row generator will provide a win for you.

In theory, you might have been able to avoid having two copies of the data in memory at the same time by having a generator of rows, but that's not naturally supported right now afaict. You also can't fake it by appending rows one at a time, because you end up copying the array each time.
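To illustrate that last point with a rough sketch (using pd.concat, the modern spelling of appending): growing a frame one row at a time re-copies the data on every step, whereas building once from a list is a single pass.

import pandas as pd

rows = [{'a': i, 'b': i * i} for i in range(1000)]

# anti-pattern: each concat copies the accumulated arrays again,
# so the total work is quadratic in the number of rows
df = pd.DataFrame(rows[:1])
for row in rows[1:]:
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

# building once from the full list is a single construction pass
df = pd.DataFrame(rows)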

@wesm (Member) commented Jan 19, 2013

Could you please let me know what API you would want on this issue?

@ryanwitt commented Feb 1, 2013

perhaps something like DataFrame.from_records_lazy()?

@dw (Author) commented Feb 1, 2013

Sorry for the late reply. :)

It seems the DataFrame constructor is already pretty heavily overloaded; however, my initial desire would be for:

>>> pandas.DataFrame((x for x in range(0)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 417, in __init__
    raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!

to result in a valid frame instead. Looking at the code, it seems possible to add (yet another) fall-through for this case; however, ryanwitt's suggestion is probably saner, given that the more tests are added to that constructor, the more magical and unpredictable it will become.

The other problem is how to check for an iterable sanely: the options appear to be attempting iter(obj) (which would succeed for strings!) or checking hasattr(obj, 'next'). Again, it sounds like something that shouldn't be implicitly tested for in a constructor, so +1 for from_records_lazy() or perhaps just from_iter().
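For reference, a sketch of the kind of check being discussed; these helpers are hypothetical, not pandas code:

def _is_iterator(obj):
    # an iterator defines next()/__next__(); plain sequences and strings do not
    return hasattr(obj, '__next__') or hasattr(obj, 'next')

def _is_non_string_iterable(obj):
    # looser check: anything iter() accepts, except strings/bytes,
    # which would otherwise be iterated character by character
    if isinstance(obj, (str, bytes)):
        return False
    try:
        iter(obj)
        return True
    except TypeError:
        return False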

@snth (Contributor) commented Feb 14, 2013

+1 for from_iter()

I think something like this would be nice to have. For example, I have the following utility function, which imitates expand.grid() in R:

import itertools as it
import operator as op
import pandas as pd

def expand_grid(*args, **kwargs):
    columns = []
    lst = []
    if args:
        columns += xrange(len(args))
        lst += args
    if kwargs:
        columns += kwargs.iterkeys()
        lst += kwargs.itervalues()
    return pd.DataFrame(list(it.product(*lst)), columns=columns)

print expand_grid([0,1], [1,2,3])
print expand_grid(a=[0,1], b=[1,2,3])
print expand_grid([0,1], b=[1,2,3])

It would be nice if we didn't need the list() constructor in the last line of the function and could avoid creating a temporary copy of all the tuples.

@snth (Contributor) commented Feb 14, 2013

Since in my case the total number of rows can be computed upfront, I tried to see if I could get it to work with some duck typing. Alas, I got stuck when it calls out to pandas.lib; here's what I had gotten to by then:

def expand_grid3(*args, **kwargs):
    columns = []
    lst = []
    if args:
        columns += xrange(len(args))
        lst += args
    if kwargs:
        columns += kwargs.iterkeys()
        lst += kwargs.itervalues()
    class _proditer(list):
        # subclass list so pandas' isinstance(data, list) checks pass,
        # while the rows actually come lazily from itertools.product
        def __init__(self, l):
            self.N = reduce(op.mul, map(len, l), 1)
            self.iter_ = it.product(*l)
        def __iter__(self): return self
        def next(self): return next(self.iter_)
        def __len__(self): return self.N
        def __getitem__(self, item):
            # allow peeking at data[0] (pandas sniffs the row type from it)
            # without losing the first row from the underlying iterator
            if item == 0:
                value = next(self.iter_)
                self.iter_ = it.chain(iter([value]), self.iter_)
                return value
            else:
                raise NotImplementedError("item=={}".format(item))
    lst = _proditer(lst)
    return pd.DataFrame.from_records(lst, index=range(len(lst)), columns=columns)

but this fails with:

In [3]: expand_grid3([0,1], [1,2,3])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-f632bd646903> in <module>()
----> 1 expand_grid3([0,1], [1,2,3])

/home/tobias/code/dev/argon/scripts/expand_grid.py in expand_grid3(*args, **kwargs)
     68                 raise NotImplementedError("item=={}".format(item))
     69     lst = _proditer(lst)
---> 70     return pd.DataFrame.from_records(lst, index=range(len(lst)), columns=columns)
     71 
     72 print expand_grid3([0,1], [1,2,3])

/home/tobias/code/envs/mac/local/lib/python2.7/site-packages/pandas-0.9.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in from_records(cls, data, index, exclude, columns, names, coerce_float)
    844         else:
    845             sdict, columns = _to_sdict(data, columns,
--> 846                                        coerce_float=coerce_float)
    847 
    848         if exclude is None:

/home/tobias/code/envs/mac/local/lib/python2.7/site-packages/pandas-0.9.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _to_sdict(data, columns, coerce_float)
   4958         return {}, columns
   4959     if isinstance(data[0], (list, tuple)):
-> 4960         return _list_to_sdict(data, columns, coerce_float=coerce_float)
   4961     elif isinstance(data[0], dict):
   4962         return _list_of_dict_to_sdict(data, columns, coerce_float=coerce_float)

/home/tobias/code/envs/mac/local/lib/python2.7/site-packages/pandas-0.9.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _list_to_sdict(data, columns, coerce_float)
   4971 def _list_to_sdict(data, columns, coerce_float=False):
   4972     if len(data) > 0 and isinstance(data[0], tuple):
-> 4973         content = list(lib.to_object_array_tuples(data).T)
   4974     elif len(data) > 0:
   4975         # list of lists

TypeError: Argument 'rows' has incorrect type (expected list, got _proditer)

@ghost commented Feb 14, 2013

The form (x for x in xrange(10)) suggested by dw yields a generator, and accepting a generator that yields fixed-length sequences seems like the logical choice here:

In [46]: def data_source():
    ...:     for row_count in xrange(10):
    ...:         yield np.random.rand(5)
    ...: pd.DataFrame(list(data_source()))
    ...: 
Out[46]: 
          0         1         2         3         4
0  0.241975  0.336069  0.584298  0.990567  0.628024
1  0.491197  0.153000  0.069927  0.923961  0.321994
2  0.331411  0.135619  0.742957  0.917101  0.751598
3  0.259744  0.566237  0.854419  0.478699  0.322557
4  0.237514  0.725468  0.258899  0.106149  0.238813
5  0.703378  0.905303  0.200714  0.492119  0.050031
6  0.881858  0.199581  0.057018  0.855477  0.094684
7  0.006712  0.081865  0.359633  0.758901  0.587118
8  0.511695  0.788931  0.753998  0.159687  0.678947
9  0.267654  0.840976  0.278592  0.051802  0.704802

In [47]: pd.DataFrame(data_source())
---------------------------------------------------------------------------
PandasError                               Traceback (most recent call last)
<ipython-input-47-5667ec91c127> in <module>()
----> 1 pd.DataFrame(data_source())

/home/user1/src/pandas/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    445                                          copy=False)
    446             else:
--> 447                 raise PandasError('DataFrame constructor not properly called!')
    448 
    449         NDFrame.__init__(self, mgr)

PandasError: DataFrame constructor not properly called!

In [48]: type(data_source())
Out[48]: generator

@elyase commented Dec 17, 2013

+1, this would be nice to have. The API could be similar to numpy.fromiter, where, if you know the number of rows, memory can be preallocated. The main use would be to avoid having two copies of the same data in memory at the same time.
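For comparison, numpy.fromiter allocates the output once when count is given and fills it directly from the iterator:

import numpy as np

gen = (i * i for i in range(1000))

# with count known, no intermediate list of python objects is built
arr = np.fromiter(gen, dtype=np.int64, count=1000)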

@jreback (Contributor) commented Dec 17, 2013

You can't really preallocate this; instead, you allocate chunks and fill, appending as needed, then take a view of the final memory.

It could be done, but it's only necessary for really, really large structures.
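One possible reading of that, as a hypothetical sketch (not pandas code): grow a buffer geometrically, fill it row by row, and hand back a view of the filled portion.

import numpy as np

def collect_rows(rows, n_cols, dtype=np.float64):
    # hypothetical chunk-and-fill: double the buffer when it runs out,
    # then return a view of only the rows actually written
    buf = np.empty((1024, n_cols), dtype=dtype)
    n = 0
    for row in rows:
        if n == len(buf):
            buf = np.concatenate([buf, np.empty_like(buf)])  # double capacity
        buf[n] = row
        n += 1
    return buf[:n]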

@ghost commented Dec 28, 2013

@jreback, this just came to mind, and it seems like there's a possible memory win here after all. If the number of records is known, and the dtypes are inferred from the first record, we can allocate the numpy arrays and mutate them one .next() at a time. The dataframe ctor can then reuse the resulting arrays to form the df without copying. Peak memory is then the numpy array + epsilon, rather than the numpy array + the full dataset as a list of lists of python objects. That can be a significant diff given the memory overhead of a python object.
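A rough sketch of that idea as a hypothetical helper (not pandas internals), assuming the row count and a single dtype are known up front:

import numpy as np
import pandas as pd

def frame_from_row_iter(rows, n_rows, columns, dtype=np.float64):
    # allocate the columns once, then fill one row per next() call;
    # peak memory is the array plus a single row of python objects
    data = np.empty((n_rows, len(columns)), dtype=dtype)
    for i, row in enumerate(rows):
        data[i] = row
    # the constructor can reuse the ndarray without copying
    return pd.DataFrame(data, columns=columns, copy=False)

rows = ((i, i * 0.5) for i in range(1000))
df = frame_from_row_iter(rows, n_rows=1000, columns=['a', 'b'])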

0.14?

#1794
#2305

from_records seems to be doing the wrong thing memory-wise as well. Am I missing something obvious?

@jreback (Contributor) commented Dec 28, 2013

Yep, not that hard. You would have to intercept the generator as a separate case in init, then ultimately just create an ndarray there (init_ndarray).

@jreback (Contributor) commented Dec 28, 2013

FYI, from_records could be refactored in any event to be more integrated into the regular DataFrame ctor.

@jreback (Contributor) commented Dec 28, 2013

somewhat related (if from_records is refactored), #4464

@ghost commented Jan 16, 2014

Comparing memory footprint in a roundabout way:

In [2]: import pickle
   ...: l=arange(1000000)
   ...: pickle.dump(l,open("/tmp/py","wb"))
In [4]: np.save(open("/tmp/np","wb"),np.array(l))
In [5]: ll /tmp/np
-rw-rw-r--. 1 user1 8000080 Jan 16 15:45 /tmp/np
In [6]: ll /tmp/py
-rw-rw-r--. 1 user1 29340698 Jan 16 15:44 /tmp/py

So, roughly a 4x reduction in footprint by not materializing fully in memory.
@jreback, is that a valid measurement, or did I miss something?

@jreback (Contributor) commented Jan 16, 2014

I don't know why you think that is a valid measurement; they are 2 different formats. npz is a specific binary format, whereas pickle is a text-ified binary repr.

@ghost commented Jan 16, 2014

I was trying to measure the savings from eliminating the per-object overhead of python objects in favor of numpy arrays. np.save surely takes advantage of that to store things more compactly, illustrating the potential in-memory saving; pickle apparently does not.

Will look around for a better way to measure in-memory python footprint.

@ghost commented Jan 16, 2014

Second try:

In [27]: from pympler.asizeof import asizeof
    ...: asizeof([1,2],infer=True)-asizeof([2],infer=True)
Out[27]: 32

In [29]: np.save(open("/tmp/np1","wb"),np.array([1,2]))
    ...: np.save(open("/tmp/np2","wb"),np.array([1]))

In [30]: ll /tmp/np1
-rw-rw-r--. 1 user1 96 Jan 16 16:22 /tmp/np1

In [31]: ll /tmp/np2
-rw-rw-r--. 1 user1 88 Jan 16 16:22 /tmp/np2

In [32]: 96-88
Out[32]: 8

Python objects require 32 bytes to store an additional int64; a numpy array requires only 8 (64 bits), a 4x difference.

That's an added bonus on top of the basic win of not having an extra copy of the data in memory during construction: numpy arrays are far more efficient at storing the data in memory.

Is that a valid conclusion in your eyes?

@jreback (Contributor) commented Jan 16, 2014

Did you try memory_profiler (used via %memit)?

@ghost commented Jan 16, 2014

It monitors overall process memory; I don't trust it for detailed measurements.

@jreback (Contributor) commented Jan 16, 2014

Oh, I agree numpy arrays are more efficient, but the point is that they ARE starting as python objects, so I'm not sure this makes much difference in practice, as it should be chunked anyhow (THAT is how to save memory).

@ghost commented Jan 16, 2014

They are starting as python objects, but potentially only one row at a time; that's the iterator's business. If that's the case, then peak memory usage will be (for this int64 case, for example) O(x) instead of O(x + 4x).

Isn't that right?

Update: modulo the fact that O(x) = O(5x)... :) Far better constants is what I meant.

@jreback (Contributor) commented Jan 16, 2014

I agree... I just think the easier solution is to simply chunk the stream into, say, x rows, then create frames and concatenate at the end, rather than handle this one row at a time. That makes the code much simpler, with a guaranteed peak max memory usage.
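A sketch of that chunked approach as a hypothetical helper (not a pandas API):

import itertools
import pandas as pd

def frame_from_chunks(records, chunksize=100000, **frame_kwargs):
    # pull fixed-size chunks off the iterator, build a frame per chunk,
    # and concatenate once at the end; peak memory is bounded by one
    # chunk of python objects plus the accumulated frames
    frames = []
    while True:
        chunk = list(itertools.islice(records, chunksize))
        if not chunk:
            break
        frames.append(pd.DataFrame(chunk, **frame_kwargs))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(**frame_kwargs)

rows = ({'a': i, 'b': i * i} for i in range(10**6))
df = frame_from_chunks(rows, chunksize=10**5)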

@ghost commented Jan 16, 2014

Absolutely. I read your comments in the other thread, and it's a losing battle to support larger datasets by reducing memory footprint; chunking is a much more general solution. There are still reasonable cases where this would make a big difference for users, by eliminating the need to move to a more complex workflow.

A 5x reduction in max memory usage is worth doing, I think, even if it won't solve everyone's not-small-data problems.

@jreback (Contributor) commented Jan 16, 2014

Agreed... #5902 looks interesting. IMHO it's going to be non-trivial to do, but possible.

@ghost commented Jan 16, 2014

Yeah, though we discussed this recently before that, and @elyase made a comment to the same effect a little before the idea occurred to me.

@pkch commented Jul 9, 2016

What's the rationale for DataFrame() accepting a generator as an input but not an iterator? It converts the generator to a list anyway, so why not do the same with an iterator? (Note I'm not talking about performance, just convenience.)

def gen():
    for i in range(5):
        yield {'a': 1, 'b': 10}
pd.DataFrame(gen(), columns=['a', 'b']) # ok
pd.DataFrame(itertools.islice(gen(), 2), columns=['a', 'b']) # TypeError: data argument can't be an iterator 
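A workaround in the meantime is to materialize the slice first, at the cost of an extra copy (using gen() from the snippet above):

import itertools
pd.DataFrame(list(itertools.islice(gen(), 2)), columns=['a', 'b'])  # works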
