Cannot produce DataFrame from dict-yielding iterator. #2193

Closed · dw opened this issue Nov 7, 2012 · 26 comments · Fixed by #21987

Labels: Enhancement, Internals (Related to non-user accessible pandas implementation)
@dw commented Nov 7, 2012

Hi there,

I have a codebase that (so far) consumes data series in the form of dict-yielding iterators, similar in shape to the list of dicts that DataFrame's constructor already accepts. Unfortunately, given the current implementation of that constructor, there seems to be no straightforward way to consume an iterator straight into a DataFrame.

For example, I have a row factory for sqlite3, written in C, that produces dicts. If pandas accepted an iterator, it would be possible to feed a series directly from speedy storage code into speedy data structure code. :)
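For illustration, a minimal sketch of the intended usage, with a plain Python dict row factory standing in for the C one (the database file and table name here are made up):

import sqlite3
import pandas as pd

def dict_factory(cursor, row):
    # stand-in for the C row factory: one dict per row
    return {col[0]: value for col, value in zip(cursor.description, row)}

conn = sqlite3.connect("example.db")  # hypothetical database
conn.row_factory = dict_factory
rows = conn.execute("SELECT * FROM prices")  # a dict-yielding iterator

# Desired: pass the iterator straight to the constructor...
# df = pd.DataFrame(rows)
# ...instead of having to materialize a list first:
df = pd.DataFrame(list(rows))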

Is there an inherent limitation preventing this? Otherwise, with a few pointers, I would be willing to try to contribute the patch myself.

Thanks,

David

@ghost commented Nov 9, 2012

To the best of my knowledge, DataFrames are not lazy.

For example, in the absence of an index arg to the df ctor, len(data) is used to determine the size of the index. Assuming you'll need all the data in memory anyway, I don't see how having a row generator will provide a win for you.

In theory, you might have been able to avoid having two copies of the data in memory at the same time by having a generator of rows, but that's not naturally supported right now afaict. You also can't fake it by appending rows one at a time, because you end up copying the array each time.
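To illustrate that last point with a rough sketch (using pd.concat, the modern spelling of appending): growing a frame one row at a time re-copies the data on every step, whereas building once from a list is a single pass.

import pandas as pd

rows = [{'a': i, 'b': i * i} for i in range(1000)]

# anti-pattern: each concat copies the accumulated arrays again,
# so the total work is quadratic in the number of rows
df = pd.DataFrame(rows[:1])
for row in rows[1:]:
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

# building once from the full list is a single construction pass
df = pd.DataFrame(rows)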

@wesm (Member) commented Jan 19, 2013

Could you please let me know what API you would want on this issue?

@ryanwitt commented Feb 1, 2013

perhaps something like DataFrame.from_records_lazy()?

@dw (Author) commented Feb 1, 2013

Sorry for the late reply. :)

It seems the DataFrame constructor is already pretty heavily overloaded; however, my initial desire would be for:

>>> pandas.DataFrame((x for x in range(0)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 417, in __init__
    raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!

to result in a valid frame instead. Looking at the code, it seems possible to add (yet another) fall-through for this case; however, ryanwitt's suggestion is probably saner, given that the more tests are added to that constructor, the more magical and unpredictable it will become.

The other problem is how to check for an iterable sanely: the options appear to be attempting iter(obj) (which would succeed for strings!) or checking hasattr(obj, 'next'). Again, it sounds like something that shouldn't be implicitly tested for in a constructor, so +1 for from_records_lazy() or perhaps just from_iter().
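For reference, a sketch of the kind of check being discussed; these helpers are hypothetical, not pandas code:

def _is_iterator(obj):
    # an iterator defines next()/__next__(); plain sequences and strings do not
    return hasattr(obj, '__next__') or hasattr(obj, 'next')

def _is_non_string_iterable(obj):
    # looser check: anything iter() accepts, except strings/bytes,
    # which would otherwise be iterated character by character
    if isinstance(obj, (str, bytes)):
        return False
    try:
        iter(obj)
        return True
    except TypeError:
        return False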

@snth (Contributor) commented Feb 14, 2013

+1 for from_iter()

I think something like this would be nice to have. For example, I have the following utility function, which imitates expand.grid() in R:

import itertools as it
import operator as op
import pandas as pd

def expand_grid(*args, **kwargs):
    columns = []
    lst = []
    if args:
        columns += xrange(len(args))
        lst += args
    if kwargs:
        columns += kwargs.iterkeys()
        lst += kwargs.itervalues()
    return pd.DataFrame(list(it.product(*lst)), columns=columns)

print expand_grid([0,1], [1,2,3])
print expand_grid(a=[0,1], b=[1,2,3])
print expand_grid([0,1], b=[1,2,3])

It would be nice if we didn't need the list() constructor in the last line of the function and could avoid creating a temporary copy of all the tuples.

@snth (Contributor) commented Feb 14, 2013

Since in my case the total number of rows can be computed upfront, I tried to see if I could get it to work with some duck typing. Alas, I got stuck when it calls out to pandas.lib; here's what I had gotten to by then:

def expand_grid3(*args, **kwargs):
    columns = []
    lst = []
    if args:
        columns += xrange(len(args))
        lst += args
    if kwargs:
        columns += kwargs.iterkeys()
        lst += kwargs.itervalues()
    class _proditer(list):
        # subclass list so pandas' isinstance(data, list) checks pass,
        # while the rows actually come lazily from itertools.product
        def __init__(self, l):
            self.N = reduce(op.mul, map(len, l), 1)
            self.iter_ = it.product(*l)
        def __iter__(self): return self
        def next(self): return next(self.iter_)
        def __len__(self): return self.N
        def __getitem__(self, item):
            # allow peeking at data[0] (pandas sniffs the row type from it)
            # without losing the first row from the underlying iterator
            if item == 0:
                value = next(self.iter_)
                self.iter_ = it.chain(iter([value]), self.iter_)
                return value
            else:
                raise NotImplementedError("item=={}".format(item))
    lst = _proditer(lst)
    return pd.DataFrame.from_records(lst, index=range(len(lst)), columns=columns)

but this fails with:

In [3]: expand_grid3([0,1], [1,2,3])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-f632bd646903> in <module>()
----> 1 expand_grid3([0,1], [1,2,3])

/home/tobias/code/dev/argon/scripts/expand_grid.py in expand_grid3(*args, **kwargs)
     68                 raise NotImplementedError("item=={}".format(item))
     69     lst = _proditer(lst)
---> 70     return pd.DataFrame.from_records(lst, index=range(len(lst)), columns=columns)
     71 
     72 print expand_grid3([0,1], [1,2,3])

/home/tobias/code/envs/mac/local/lib/python2.7/site-packages/pandas-0.9.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in from_records(cls, data, index, exclude, columns, names, coerce_float)
    844         else:
    845             sdict, columns = _to_sdict(data, columns,
--> 846                                        coerce_float=coerce_float)
    847 
    848         if exclude is None:

/home/tobias/code/envs/mac/local/lib/python2.7/site-packages/pandas-0.9.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _to_sdict(data, columns, coerce_float)
   4958         return {}, columns
   4959     if isinstance(data[0], (list, tuple)):
-> 4960         return _list_to_sdict(data, columns, coerce_float=coerce_float)
   4961     elif isinstance(data[0], dict):
   4962         return _list_of_dict_to_sdict(data, columns, coerce_float=coerce_float)

/home/tobias/code/envs/mac/local/lib/python2.7/site-packages/pandas-0.9.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _list_to_sdict(data, columns, coerce_float)
   4971 def _list_to_sdict(data, columns, coerce_float=False):
   4972     if len(data) > 0 and isinstance(data[0], tuple):
-> 4973         content = list(lib.to_object_array_tuples(data).T)
   4974     elif len(data) > 0:
   4975         # list of lists

TypeError: Argument 'rows' has incorrect type (expected list, got _proditer)

@ghost commented Feb 14, 2013

The form (x for x in xrange(10)) suggested by dw yields a generator, and accepting a generator that yields fixed-length sequences seems like the logical choice here:

In [46]: def data_source():
    ...:     for row_count in xrange(10):
    ...:         yield np.random.rand(5)
    ...: pd.DataFrame(list(data_source()))
    ...: 
Out[46]: 
          0         1         2         3         4
0  0.241975  0.336069  0.584298  0.990567  0.628024
1  0.491197  0.153000  0.069927  0.923961  0.321994
2  0.331411  0.135619  0.742957  0.917101  0.751598
3  0.259744  0.566237  0.854419  0.478699  0.322557
4  0.237514  0.725468  0.258899  0.106149  0.238813
5  0.703378  0.905303  0.200714  0.492119  0.050031
6  0.881858  0.199581  0.057018  0.855477  0.094684
7  0.006712  0.081865  0.359633  0.758901  0.587118
8  0.511695  0.788931  0.753998  0.159687  0.678947
9  0.267654  0.840976  0.278592  0.051802  0.704802

In [47]: pd.DataFrame(data_source())
---------------------------------------------------------------------------
PandasError                               Traceback (most recent call last)
<ipython-input-47-5667ec91c127> in <module>()
----> 1 pd.DataFrame(data_source())

/home/user1/src/pandas/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    445                                          copy=False)
    446             else:
--> 447                 raise PandasError('DataFrame constructor not properly called!')
    448 
    449         NDFrame.__init__(self, mgr)

PandasError: DataFrame constructor not properly called!

In [48]: type(data_source())
Out[48]: generator

@elyase commented Dec 17, 2013

+1, this would be nice to have. The API could be similar to numpy.fromiter, where, if you know the number of rows, memory can be preallocated. The main use would be to avoid having two copies of the same data in memory at the same time.
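For comparison, numpy.fromiter allocates the output once when count is given and fills it directly from the iterator:

import numpy as np

gen = (i * i for i in range(1000))

# with count known, no intermediate list of python objects is built
arr = np.fromiter(gen, dtype=np.int64, count=1000)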

@jreback (Contributor) commented Dec 17, 2013

You can't really preallocate this; instead, you allocate chunks and fill, appending as needed, then take a view of the final memory.

It could be done, but it's only necessary for really, really large structures.
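One possible reading of that, as a hypothetical sketch (not pandas code): grow a buffer geometrically, fill it row by row, and hand back a view of the filled portion.

import numpy as np

def collect_rows(rows, n_cols, dtype=np.float64):
    # hypothetical chunk-and-fill: double the buffer when it runs out,
    # then return a view of only the rows actually written
    buf = np.empty((1024, n_cols), dtype=dtype)
    n = 0
    for row in rows:
        if n == len(buf):
            buf = np.concatenate([buf, np.empty_like(buf)])  # double capacity
        buf[n] = row
        n += 1
    return buf[:n]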

@ghost commented Dec 28, 2013

@jreback, this just came to mind, and it seems like there's a possible memory win here after all. If the number of records is known, and the dtypes are inferred from the first record, we can allocate the numpy arrays and mutate them one .next() at a time. The dataframe ctor can then reuse the resulting arrays to form the df without copying. Peak memory is then the numpy array + epsilon, rather than the numpy array + the full dataset as a list of lists of python objects. That can be a significant diff given the memory overhead of a python object.
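A rough sketch of that idea as a hypothetical helper (not pandas internals), assuming the row count and a single dtype are known up front:

import numpy as np
import pandas as pd

def frame_from_row_iter(rows, n_rows, columns, dtype=np.float64):
    # allocate the columns once, then fill one row per next() call;
    # peak memory is the array plus a single row of python objects
    data = np.empty((n_rows, len(columns)), dtype=dtype)
    for i, row in enumerate(rows):
        data[i] = row
    # the constructor can reuse the ndarray without copying
    return pd.DataFrame(data, columns=columns, copy=False)

rows = ((i, i * 0.5) for i in range(1000))
df = frame_from_row_iter(rows, n_rows=1000, columns=['a', 'b'])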

0.14?

#1794
#2305

from_records seems to be doing the wrong thing memory-wise as well. Am I missing something obvious?

@jreback (Contributor) commented Dec 28, 2013

Yep, not that hard. You would have to intercept the generator as a separate case in init, then ultimately just create an ndarray there (init_ndarray).

@jreback (Contributor) commented Dec 28, 2013

FYI, from_records could be refactored in any event to be more integrated into the regular DataFrame ctor.

@jreback (Contributor) commented Dec 28, 2013

somewhat related (if from_records is refactored), #4464

@ghost commented Jan 16, 2014

Comparing memory footprint in a roundabout way:

In [2]: import pickle
   ...: l=arange(1000000)
   ...: pickle.dump(l,open("/tmp/py","wb"))
In [4]: np.save(open("/tmp/np","wb"),np.array(l))
In [5]: ll /tmp/np
-rw-rw-r--. 1 user1 8000080 Jan 16 15:45 /tmp/np
In [6]: ll /tmp/py
-rw-rw-r--. 1 user1 29340698 Jan 16 15:44 /tmp/py

So, roughly a 4x reduction in footprint by not materializing fully in memory.
@jreback, is that a valid measurement, or did I miss something?

@jreback (Contributor) commented Jan 16, 2014

I don't know why you think that is a valid measurement; they are 2 different formats. npz is a specific binary format, whereas pickle is a text-ified binary repr.

@ghost commented Jan 16, 2014

I was trying to measure the savings from eliminating the per-object overhead of python objects in favor of numpy arrays. np.save surely takes advantage of that to store things more compactly, illustrating the potential in-memory saving; pickle apparently does not.

Will look around for a better way to measure in-memory python footprint.

@ghost commented Jan 16, 2014

Second try:

In [27]: from pympler.asizeof import asizeof
    ...: asizeof([1,2],infer=True)-asizeof([2],infer=True)
Out[27]: 32

In [29]: np.save(open("/tmp/np1","wb"),np.array([1,2]))
    ...: np.save(open("/tmp/np2","wb"),np.array([1]))

In [30]: ll /tmp/np1
-rw-rw-r--. 1 user1 96 Jan 16 16:22 /tmp/np1

In [31]: ll /tmp/np2
-rw-rw-r--. 1 user1 88 Jan 16 16:22 /tmp/np2

In [32]: 96-88
Out[32]: 8

Python objects require 32 bytes to store an additional int64; a numpy array requires only 8 (64 bits), a 4x difference.

That's an added bonus on top of the basic win of not having an extra copy of the data in memory during construction: numpy arrays are far more efficient at storing the data in memory.

Is that a valid conclusion in your eyes?

@jreback (Contributor) commented Jan 16, 2014

Did you try memory_profiler (used via %memit)?

@ghost commented Jan 16, 2014

It monitors overall process memory; I don't trust it for detailed measurements.

@jreback (Contributor) commented Jan 16, 2014

Oh, I agree numpy arrays are more efficient, but the point is that they ARE starting as python objects, so I'm not sure this makes much difference in practice, as it should be chunked anyhow (THAT is how to save memory).

@ghost commented Jan 16, 2014

They are starting as python objects, but potentially only one row at a time; that's the iterator's business. If that's the case, then peak memory usage will be (for this int64 case, for example) O(x) instead of O(x + 4x).

Isn't that right?

Update: modulo the fact that O(x) = O(5x)... :) Far better constants is what I meant.

@jreback (Contributor) commented Jan 16, 2014

I agree... I just think the easier solution is to simply chunk the stream into, say, x rows, then create frames and concatenate at the end, rather than handle this one row at a time. That makes the code much simpler, with a guaranteed peak max memory usage.
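A sketch of that chunked approach as a hypothetical helper (not a pandas API):

import itertools
import pandas as pd

def frame_from_chunks(records, chunksize=100000, **frame_kwargs):
    # pull fixed-size chunks off the iterator, build a frame per chunk,
    # and concatenate once at the end; peak memory is bounded by one
    # chunk of python objects plus the accumulated frames
    frames = []
    while True:
        chunk = list(itertools.islice(records, chunksize))
        if not chunk:
            break
        frames.append(pd.DataFrame(chunk, **frame_kwargs))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(**frame_kwargs)

rows = ({'a': i, 'b': i * i} for i in range(10**6))
df = frame_from_chunks(rows, chunksize=10**5)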

@ghost commented Jan 16, 2014

Absolutely. I read your comments in the other thread, and it's a losing battle to support larger datasets by reducing memory footprint; chunking is a much more general solution. There are still reasonable cases where this would make a big difference for users, by eliminating the need to move to a more complex workflow.

A 5x reduction in max memory usage is worth doing, I think, even if it won't solve everyone's not-small-data problems.

@jreback (Contributor) commented Jan 16, 2014

Agreed... #5902 looks interesting. IMHO it's going to be non-trivial to do, but possible.

@ghost commented Jan 16, 2014

Yeah, though we discussed this recently before that, and @elyase made a comment to the same effect a little before the idea occurred to me.

@pkch commented Jul 9, 2016

What's the rationale for DataFrame() accepting a generator as an input but not an iterator? It converts the generator to a list anyway, so why not do the same with an iterator? (Note I'm not talking about performance, just convenience.)

def gen():
    for i in range(5):
        yield {'a': 1, 'b': 10}
pd.DataFrame(gen(), columns=['a', 'b']) # ok
pd.DataFrame(itertools.islice(gen(), 2), columns=['a', 'b']) # TypeError: data argument can't be an iterator 
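A workaround in the meantime is to materialize the slice first, at the cost of an extra copy (using gen() from the snippet above):

import itertools
pd.DataFrame(list(itertools.islice(gen(), 2)), columns=['a', 'b'])  # works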
