Refactor from_records to load data in an efficient way #22025

datapythonista · 2018-07-23T14:03:53Z

Based on the SciPy sprint discussions and the discussions on related issues, it seems like from_records should be the pandas way to create a DataFrame from row-based data.

The current signature is:

def from_records(cls, data, index=None, exclude=None, columns=None, coerce_float=False, nrows=None):

And it currently supports input data as:

dict
numpy.ndarray
DataFrame
Iterable of list
Iterable of tuple
Iterable of dict

What I propose is to make from_records only work when data is an iterable of array-like (list, tuple, np.array...) or an iterable of dict, and to deprecate the other cases (dict and DataFrame). After searching GitHub repos and blogs, I couldn't find cases where it's used with a dict or a DataFrame, and IMO the DataFrame constructor is a better fit for those cases. This would make the code simpler.

Then, I'd add a new parameter dtypes expecting a list or a dict with the dtypes of the new DataFrame. With this, and for the case when data is a generator, and nrows and dtypes are specified, we wouldn't need to exhaust the generator and load it to Python structures. Meaning that we'd just need to allocate the DataFrame memory, and there wouldn't be any intermediate memory requirements.
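As a rough illustration of the idea above, here is a minimal sketch of filling preallocated, typed column arrays straight from a generator. The helper name `frame_from_rows` is hypothetical and is not part of pandas, and `dtypes` here only mirrors the parameter proposed in this issue. The sketch avoids building intermediate Python lists, but the DataFrame constructor may still copy the arrays, so it does not show a true zero-copy path.

```python
from itertools import islice

import numpy as np
import pandas as pd


def frame_from_rows(rows, columns, dtypes, nrows):
    """Hypothetical helper (not a pandas API): build a DataFrame from an
    iterable of row tuples without materialising the rows in a list."""
    # Allocate one typed array per column up front.
    arrays = {col: np.empty(nrows, dtype=dtypes[col]) for col in columns}
    filled = 0
    # Consume at most nrows rows, writing each value into its column array.
    for i, row in enumerate(islice(rows, nrows)):
        for col, value in zip(columns, row):
            arrays[col][i] = value
        filled = i + 1
    # If the iterable yielded fewer than nrows rows, trim the unused tail.
    if filled < nrows:
        arrays = {col: arr[:filled] for col, arr in arrays.items()}
    return pd.DataFrame(arrays, columns=columns)


def row_generator():
    for i in range(5):
        yield (i, i * 0.5)


df = frame_from_rows(
    row_generator(),
    columns=["a", "b"],
    dtypes={"a": "int64", "b": "float64"},
    nrows=3,
)
```

Trimming at the end is one way to cope with a generator that turns out to be shorter than `nrows`; a generator that is longer is simply not consumed past `nrows`.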

Related issues: #5902, #2305, #2193, #4464, #1794, #13818.

@wesm, @jreback, @jorisvandenbossche any comments? Are you ok with this approach and the proposed changes to the API?

jorisvandenbossche · 2018-07-23T15:22:44Z

Big +1 in general!

> What I propose is to make from_records only work when data is an iterable of array-like (list, tuple, np.array...) or an iterable of dict. And deprecate the other cases (dict and DataFrame).

The dict case is also kind of "wrong", as when seeing a dict as a "dict of records", I would expect a transposed result:

In [3]: pd.DataFrame.from_records({'a': [1, 2, 3], 'b': [3, 4, 5]})
Out[3]: 
   a  b
0  1  3
1  2  4
2  3  5

So +1 on deprecating that.
I am only less sure about deprecating the structured/record array case. Of course this works perfectly well with the normal DataFrame constructor, but currently it is the main "documented" use case of from_records (at least according to the docstring), so I would expect more people to use it. (It also gives some consistency with to_records.)

datapythonista · 2018-07-23T15:46:40Z

Good points. But I think I wasn't clear about numpy.array and record arrays. I'd keep them, as iterating over them returns array-like values, and I think that's the case from_records should address.

So, to clarify what I meant, I'd support:

Iterable of tuple, list or dict
2D numpy array, and record arrays
Anything else that when iterating over it returns a list-like or a dict

And I wouldn't support other cases, including:

DataFrame: what is it expected to do anyway? Return the same object, a copy, or be able to rename columns?
dict: Agree with Joris's points, and we already have DataFrame.from_dict, which covers the use case of a dict of rows (with orient='index')
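To make the dict point concrete, here is a small sketch contrasting the two readings of the same dict, using only the plain constructor and the existing `DataFrame.from_dict`. The variable names are illustrative.

```python
import pandas as pd

data = {"a": [1, 2, 3], "b": [3, 4, 5]}

# Keys as column labels: this is what the plain constructor does.
by_column = pd.DataFrame(data)

# Keys as row labels ("dict of records"): the transposed reading.
by_row = pd.DataFrame.from_dict(data, orient="index")
```

With `orient="index"` each key becomes a row label and the list positions become the columns, which is the transposed result one might expect from a "dict of records".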

So, I think we 100% agree.

jorisvandenbossche · 2018-07-23T16:05:31Z

Yep, and 2D numpy arrays and record arrays are also consistent with the 'iterable of array-like', as iterating through them gives you row by row (as opposed to DataFrame or dict).
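The difference in iteration behaviour can be checked directly. A short sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 3], [2, 4]])
rec = np.array([(1, 3.0), (2, 4.0)], dtype=[("a", "i8"), ("b", "f8")])
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mapping = {"a": [1, 2], "b": [3, 4]}

# Iterating a 2D array or a record array yields one row at a time.
first_arr_row = next(iter(arr))
first_rec_row = next(iter(rec))

# Iterating a DataFrame or a dict yields column labels / keys instead.
first_frame_item = next(iter(frame))
first_mapping_item = next(iter(mapping))

# A record array already works with from_records, using its field names.
from_rec = pd.DataFrame.from_records(rec)
```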

jreback · 2018-07-23T23:04:59Z

this seems ok - though would have to deprecate accepting a dict here

I am also not sure you 'can allocate the DataFrame memory', as you don't know the length of the data a priori if it's a generator

you are right that it is possible to avoid instantiating Python lists though (for iterables with a known length)
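The distinction between iterables with and without a known length can be sketched as follows. The helper `known_length` is hypothetical and only illustrates the check; it is not something pandas provides.

```python
from collections.abc import Sized


def known_length(rows):
    """Hypothetical helper: return len(rows) if it is defined, else None."""
    return len(rows) if isinstance(rows, Sized) else None


def row_generator():
    yield (1, 3)
    yield (2, 4)


list_length = known_length([(1, 3), (2, 4)])
generator_length = known_length(row_generator())
```

A list reports its length, so memory could be allocated up front; a generator does not, so the length would have to come from something like nrows.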

llawall · 2018-10-23T09:34:59Z

The case of accepting dict in from_records is also currently broken in my opinion. I filed #22708 (PR: #22687) to suggest making the ordering of columns consistent with the constructor taking a similar dict argument.

If from_records continues to accept dict then I think it should at least be made consistent, though on the whole I am +1 on the sentiment for a cleaner API by removing dict support in from_records.

jorisvandenbossche · 2018-10-26T13:28:00Z

@llawall given the discussion here, I would maybe rather do a PR to deprecate passing a dict in from_records, than the PR you have now changing its behaviour.

datapythonista added the IO Data, API Design, and Low-Memory labels Jul 23, 2018

datapythonista added this to the Contributions Welcome milestone Jul 23, 2018

datapythonista self-assigned this Jul 23, 2018

jbrockmendel added the Constructors label Jul 23, 2019

mroeschke added the Performance label and removed the Low-Memory label Apr 3, 2020

mroeschke removed the IO Data label May 2, 2020

mroeschke added the Deprecate and Refactor labels and removed the API Design label Jun 21, 2021

mroeschke removed the Deprecate label Jun 8, 2022

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

datapythonista removed their assignment Jan 18, 2024
