
ER: compound dtypes - DataFrame constructor/astype #4464

Open · mamikonyan opened this issue Aug 5, 2013 · 38 comments
Labels: Constructors, Dtype Conversions, Enhancement

Comments

@mamikonyan

xref #9133, maybe allow a dict of dtypes to be passed as well

I'm trying to use the dtype argument in the DataFrame constructor to set the types of several columns, but I'm getting incorrect types. Everything works, however, when the dtypes come from the recarray itself.

In [61]:
data = [(1,1.2), (2,2.3)]
dtype = [('a','i4'),('b','f4')]
a = np.array(data, dtype=dtype)
pd.DataFrame(a).dtypes

Out [61]:
a      int32
b    float32
dtype: object

But if I use the dtype constructor argument, I get incorrect types:

In [65]:
pd.DataFrame(data, dtype=dtype).dtypes
Out [65]:
0    object
1    object
dtype: object
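For what it's worth, the working path from the first example doubles as a workaround: build the structured array with the compound dtype first, then hand that to the constructor. A minimal sketch:

import numpy as np
import pandas as pd

data = [(1, 1.2), (2, 2.3)]
dtype = [('a', 'i4'), ('b', 'f4')]

# Passing a structured array (rather than data and a compound dtype
# separately) yields the intended per-column types.
pd.DataFrame(np.array(data, dtype=dtype)).dtypes
# a      int32
# b    float32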

The astype() member function doesn't work either:

In [75]:
pd.DataFrame(data).astype(dtype)

Truncated Traceback (Use C-c C-x to view full TB):
c:\Anaconda\lib\site-packages\pandas\core\common.pyc in take_nd(arr, indexer,  axis, out, fill_value, mask_info, allow_fill)
    491         indexer = _ensure_int64(indexer)
    492         if not allow_fill:
--> 493             dtype, fill_value = arr.dtype, arr.dtype.type()
    494             mask_info = None, False
    495         else:

TypeError: function takes exactly 1 argument (0 given)

I'm using Pandas 0.11.0 from Anaconda.
Thanks in advance.

@cpcloud (Member) commented Aug 5, 2013

This will be fixed, but is there a case where you need to pass the dtype? I suppose if you have the array and the dtype separately you might want to just pass them in, but in that case you can construct the recarray and then pass it to DataFrame...

@jreback (Contributor) commented Aug 5, 2013

actually, this is not currently implemented

@jreback (Contributor) commented Aug 5, 2013

this should raise now; marking as a bug for that (it expects a single dtype, not a compound one)

@cpcloud (Member) commented Aug 5, 2013

@jreback so it should raise? what's wrong with accepting a compound dtype?

@jreback (Contributor) commented Aug 5, 2013

@cpcloud nothing wrong with accepting it, but it's a NotImplementedError (until it is implemented)

@mamikonyan (Author)

OK, thanks, guys.

@jreback (Contributor) commented Aug 5, 2013

@mamikonyan going to reopen... thanks for noticing this; we don't really have any tests for this

jreback reopened this Aug 5, 2013
@mamikonyan (Author)

So, what do you think about the second issue of astype() throwing?

@jreback (Contributor) commented Aug 5, 2013

Same issue; it's set up to deal with a single dtype (not a compound one). Its purpose is to coerce your data. What is your goal here?

@mamikonyan (Author)

I had a series where the elements were compound strings,

In [130]:
pd.Series(['A:1:3.14'])

Out [130]:
0    A:1:3.14
dtype: object

So I wanted to split them into a data frame with appropriate types, but I got this garbage:

In [131]:
pd.DataFrame(_.map(lambda s: s.split(':')).tolist(), dtype=[('a','S1'),('b','i4'),('c','f4')])
Out [131]:
             0            1                  2
0  (A, 0, 0.0)  (1, 0, 0.0)  (3, 3420462, 0.0)

I'll just have to do it column by column.

@jreback (Contributor) commented Aug 5, 2013

The split creates a Series whose elements are lists. The apply creates a frame from this. convert_objects with convert_numeric=True coerces strings to numbers (where it can):

In [1]: s = pd.Series(['A:1:3.14'])

In [7]: s.str.split(':').apply(Series).convert_objects(convert_numeric=True)
Out[7]: 
   0  1     2
0  A  1  3.14

In [8]: s.str.split(':').apply(Series).convert_objects(convert_numeric=True).dtypes
Out[8]: 
0     object
1      int64
2    float64
dtype: object
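As an aside, convert_objects was later deprecated; a roughly equivalent modern sketch, assuming a pandas recent enough to have str.split(expand=True) (0.16.1+) and pd.to_numeric (0.17+):

s = pd.Series(['A:1:3.14'])
df = s.str.split(':', expand=True)   # three object-dtype columns: 0, 1, 2
df[1] = pd.to_numeric(df[1])         # -> int64
df[2] = pd.to_numeric(df[2])         # -> float64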

@mamikonyan (Author)

Fair enough, thanks. You can't set the dtype, but at least it guesses correctly. I need precise control of the dtype because I write it out to an HDF5 file (with Pytables and df.to_records()) and I'd like to have the proper dtype at the start. Also, I didn't know about the str namespace. Thanks.
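A sketch of that workflow, assuming PyTables 3.x is installed (the file and node names here are illustrative): set the exact column dtypes one by one, then export the records.

import pandas as pd
import tables

df = pd.DataFrame({'a': [1, 2], 'b': [1.2, 2.3]})
df['a'] = df['a'].astype('i4')               # fix dtypes column by column
df['b'] = df['b'].astype('f4')

rec = df.to_records(index=False)             # structured array keeps those dtypes
with tables.open_file('out.h5', mode='w') as h5:
    h5.create_table('/', 'data', obj=rec)    # a plain PyTables table, no pandas metadata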

@jreback (Contributor) commented Aug 5, 2013

@mamikonyan

you can certainly change it if you like (just do it one by one)

are you not using HDFStore (which uses PyTables under the hood)? It will be significantly faster than using to_records, not to mention that it saves indices and almost all dtypes. See here: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables
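A minimal HDFStore sketch of that suggestion, assuming PyTables is installed and pandas 0.13+ for format='table'; 'store.h5' and the key 'df' are illustrative names:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [1.2, 2.3]})
df.to_hdf('store.h5', key='df', format='table')   # uses HDFStore under the hood
roundtrip = pd.read_hdf('store.h5', 'df')         # index and almost all dtypes survive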

@mamikonyan (Author)

I understand about HDFStore, and I use it to read files. Unfortunately, I don't like to use it for output because it creates pandas-specific HDF5 files that look quite incomprehensible to standard tools, e.g., h5ls(1). Also, I usually don't want to save details like the index. At least, I remember this being the case a few months ago; I'm not sure if there is a new way of creating clean HDF5 files.

@jreback (Contributor) commented Aug 5, 2013

Up to you. They actually are fully compatible HDF5 files, just with extra metadata (and the indices are just columns). If you are using PyTables, you should use its tools, ptdump and ptrepack, rather than a 'standard' HDF5 tool, which doesn't understand even the PyTables metadata.

@mamikonyan (Author)

Sure, but if you're sharing data with people who use other tools and languages to read your HDF5 files, then your metadata becomes extraneous garbage. Actually, if I could put in a feature request, I'd like to ask you guys to include an option to create plain HDF5 files. I think there is an option (I don't remember it right now) that makes slightly cleaner files, but even that puts extra stuff in.

In any case, I appreciate your help.

@jreback (Contributor) commented Sep 26, 2017

You are doing things in a very inefficient manner by storing lists.

@ehein6 commented Sep 26, 2017

I don't want to store lists. I want the DataFrame structure from the previous post. Do you know of a fast, idiomatic way to create that structure from the json I provided?

@jreback (Contributor) commented Sep 26, 2017

use read_json directly

@ehein6 commented Sep 27, 2017

Like this?

>>> print pd.read_json(json_data)
Traceback (most recent call last):
  File "example.py", line 23, in <module>
    print pd.read_json(json_data)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 322, in read_json
    encoding=encoding)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/common.py", line 210, in get_filepath_or_buffer
    raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type: <type 'list'>

It doesn't work because in my example code, json_data is an object, not a path or string. This is why I called json.dumps() before passing it to pd.read_json in my previous example. But that still gives the wrong output, with lists nested inside cells of the DataFrame.

Please give me a clear example of what you mean. Again, I want a fast, idiomatic way to go from this:

json_data = [
    {"day":"Monday", "temp":[32, 33, 34, 34], "humidity": [50, 60, 70, 60]},
    {"day":"Tuesday", "temp":[32, 33, 34, 34], "humidity": [50, 60, 70, 60]},
    {"day":"Wednesday", "humidity": [50, 60, 70, 60]},
    {"day":"Thursday", "temp":[32, 33, 34, 34], "humidity": [50, 60, 70, 60]},
    {"day":"Friday", "temp":[32, 33, 34, 34]},
]

to this:

         day  humidity  temp
0     Monday      50.0  32.0
1     Monday      60.0  33.0
2     Monday      70.0  34.0
3     Monday      60.0  34.0
0    Tuesday      50.0  32.0
1    Tuesday      60.0  33.0
2    Tuesday      70.0  34.0
3    Tuesday      60.0  34.0
0  Wednesday      50.0   NaN
1  Wednesday      60.0   NaN
2  Wednesday      70.0   NaN
3  Wednesday      60.0   NaN
0   Thursday      50.0  32.0
1   Thursday      60.0  33.0
2   Thursday      70.0  34.0
3   Thursday      60.0  34.0
0     Friday       NaN  32.0
1     Friday       NaN  33.0
2     Friday       NaN  34.0
3     Friday       NaN  34.0

@jreback (Contributor) commented Sep 28, 2017

In [29]: pd.concat([pd.DataFrame(l) for l in json_data])
Out[29]: 
         day  humidity  temp
0     Monday      50.0  32.0
1     Monday      60.0  33.0
2     Monday      70.0  34.0
3     Monday      60.0  34.0
0    Tuesday      50.0  32.0
1    Tuesday      60.0  33.0
2    Tuesday      70.0  34.0
3    Tuesday      60.0  34.0
0  Wednesday      50.0   NaN
1  Wednesday      60.0   NaN
2  Wednesday      70.0   NaN
3  Wednesday      60.0   NaN
0   Thursday      50.0  32.0
1   Thursday      60.0  33.0
2   Thursday      70.0  34.0
3   Thursday      60.0  34.0
0     Friday       NaN  32.0
1     Friday       NaN  33.0
2     Friday       NaN  34.0
3     Friday       NaN  34.0

@jreback (Contributor) commented Sep 28, 2017

DataFrame.from_records above has a bit of overhead. Your basic problem is that the data starts out in a Python structure (dicts of lists).

@ehein6 commented Sep 29, 2017

Right, it's the creation of DataFrames in a tight loop that causes the overhead, and unfortunately it looks like this can't be avoided without changing my input data structure. My original question was whether passing a compound dtype argument to this constructor would help a little bit, but this seems unlikely now. Thanks for taking a look!

@jreback (Contributor) commented Oct 1, 2017

@ehein6 you don't need to create a DataFrame in each iteration of the loop; just do it once at the end.
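A sketch of that advice against the json_data above, assuming each record's lists share a common length when present: accumulate plain dicts in the loop and construct a single DataFrame at the end.

import pandas as pd

rows = []
for rec in json_data:
    # length of the longest list-valued field in this record
    n = max(len(v) for v in rec.values() if isinstance(v, list))
    for i in range(n):
        rows.append({'day': rec['day'],
                     'temp': rec['temp'][i] if 'temp' in rec else None,
                     'humidity': rec['humidity'][i] if 'humidity' in rec else None})

df = pd.DataFrame(rows)   # one construction; the index is a fresh RangeIndex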

@ehein6 commented Oct 2, 2017

Your latest example is calling pd.DataFrame() inside a list comprehension. Aside from the syntax sugar, the code I originally posted is doing exactly the same thing: building up a list of DataFrames, then calling pd.concat() to join them together. If it can indeed be done without creating DataFrames in a loop, please show me how.

@jreback (Contributor) commented Oct 2, 2017

My example is not the same as yours; you are calling DataFrame.from_records, which is very different.

@ehein6 commented Oct 2, 2017

In this case they are identical. Using just pd.DataFrame() instead of pd.DataFrame.from_records() doesn't change the output. It doesn't change the performance on my full dataset, either:

Best of 20 runs:
pd.DataFrame.from_records(): 3.65 seconds
pd.DataFrame(): 3.65 seconds

Either way, both examples are still creating a DataFrame in a loop.

@Quetzalcohuatl
Got any ideas on how to fix this? I have a list of dictionaries, and I'm just doing pd.DataFrame(list_of_dicts), but I would prefer to specify the dtypes because I'm low on RAM.

Also, consider changing the title of this GitHub issue? Most people here and in the other linked issues are talking about passing a dictionary of dtypes, either during construction or after the frame is already defined. It's kind of a cryptic title.
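For what it's worth, later pandas versions (0.19+) accept a dict of per-column dtypes in astype, which at least fixes the dtypes right after construction; a minimal sketch with hypothetical column names:

import pandas as pd

list_of_dicts = [{'a': 1, 'b': 1.2}, {'a': 2, 'b': 2.3}]   # placeholder data
df = pd.DataFrame(list_of_dicts).astype({'a': 'int32', 'b': 'float32'})
# Note: the intermediate frame is still built at default 64-bit widths,
# so this lowers steady-state memory rather than peak memory.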

@Raphencoder
Any news or recommendations on this subject? It has been open for 10 years...
