Refactor from_records to load data in an efficient way #22025

Open
datapythonista opened this issue Jul 23, 2018 · 6 comments
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Performance Memory or execution speed performance Refactor Internal refactoring of code

Comments

@datapythonista
Member

Based on the SciPy sprint discussions and the discussions on related issues, it seems like from_records should be the pandas way to create a DataFrame from row-based data.

The current signature is:

  • def from_records(cls, data, index=None, exclude=None, columns=None, coerce_float=False, nrows=None):

And it currently supports input data as:

  • dict
  • numpy.ndarray
  • DataFrame
  • Iterable of list
  • Iterable of tuple
  • Iterable of dict

What I propose is to make from_records only work when data is an iterable of array-like (list, tuple, np.array...) or an iterable of dict, and to deprecate the other cases (dict and DataFrame). After searching GitHub repos and blogs, I couldn't find cases where it's used with a dict or a DataFrame, and IMO the DataFrame constructor is a better way to handle those cases. This would make the code simpler.

Then, I'd add a new parameter dtypes expecting a list or a dict with the dtypes of the new DataFrame. With this, when data is a generator and both nrows and dtypes are specified, we wouldn't need to exhaust the generator and load it into Python structures. We'd just need to allocate the DataFrame memory, and there wouldn't be any intermediate memory requirements.
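The idea in the paragraph above can be sketched roughly as follows. This is only an illustration under stated assumptions: `frame_from_rows` is a hypothetical helper, not a pandas API, and the `dtypes` argument mirrors the proposed parameter, which does not exist in `from_records` today. The sketch avoids building an intermediate list of Python rows, although the final `DataFrame` constructor may still copy the buffers.

```python
from itertools import islice

import numpy as np
import pandas as pd


def frame_from_rows(rows, columns, dtypes, nrows):
    """Hypothetical helper (not part of pandas): fill preallocated columns."""
    # Allocate one typed buffer per column up front.
    buffers = {col: np.empty(nrows, dtype=dtypes[col]) for col in columns}
    filled = 0
    # islice stops after nrows items without materializing the iterator.
    for row in islice(rows, nrows):
        for col, value in zip(columns, row):
            buffers[col][filled] = value
        filled += 1
    # Trim the buffers in case the iterator yielded fewer than nrows rows.
    return pd.DataFrame(
        {col: buf[:filled] for col, buf in buffers.items()}, columns=columns
    )


def gen_rows(n):
    # Example generator of row tuples.
    for i in range(n):
        yield (i, i * 0.5)


df = frame_from_rows(
    gen_rows(5), ["a", "b"], {"a": "int64", "b": "float64"}, nrows=3
)
```

Here only the first three rows are consumed from the generator and written directly into typed arrays.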

Related issues: #5902, #2305, #2193, #4464, #1794, #13818.

@wesm, @jreback, @jorisvandenbossche any comments? Are you ok with this approach and the proposed changes to the API?

@datapythonista datapythonista added IO Data IO issues that don't fit into a more specific label API Design Low-Memory labels Jul 23, 2018
@datapythonista datapythonista added this to the Contributions Welcome milestone Jul 23, 2018
@datapythonista datapythonista self-assigned this Jul 23, 2018
@jorisvandenbossche
Member

Big +1 in general!

What I propose is to make from_records only work when data is an iterable of array-like (list, tuple, np.array...) or an iterable of dict. And deprecate the other cases (dict and DataFrame).

The dict case is also kind of "wrong", as when seeing a dict as a "dict of records", I would expect a transposed result:

In [3]: pd.DataFrame.from_records({'a': [1, 2, 3], 'b': [3, 4, 5]})
Out[3]: 
   a  b
0  1  3
1  2  4
2  3  5

So +1 on deprecating that.
I am only less sure about deprecating the structured/record array. Of course this also works perfectly well with the normal DataFrame constructor, but currently that is the main "documented" use case of from_records (at least according to the docstring), so I would expect more people to use it (and it also gives some consistency with to_records).

@datapythonista
Member Author

Good points. But I think I wasn't clear about numpy.array and record arrays. I'd keep them, as iterating over them returns array-like values. And I think that's the case from_records should address.

So, to clarify what I meant, I'd support:

  • Iterable of tuple, list or dict
  • 2D numpy array, and record arrays
  • Anything else that when iterating over it returns a list-like or a dict
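As a rough illustration of the inputs listed above, the following sketch passes a list of tuples, a list of dicts, and a NumPy structured array to `from_records`. The variable names and sample values are made up for the example.

```python
import numpy as np
import pandas as pd

# Row-oriented inputs: each element is one record.
tuples = [(1, "x"), (2, "y")]
dicts = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
# Structured array: iterating over it also yields one record per row.
records = np.array(tuples, dtype=[("a", "i8"), ("b", "U1")])

# Tuples carry no field names, so column names are passed explicitly.
from_tuples = pd.DataFrame.from_records(tuples, columns=["a", "b"])
# Dict keys and structured-array field names become the columns.
from_dicts = pd.DataFrame.from_records(dicts)
from_recarray = pd.DataFrame.from_records(records)
```

In all three cases each element of the input becomes one row of the result.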

And I wouldn't support other cases, including:

  • DataFrame: what is it expected to do anyway? return the same, a copy, or be able to rename columns?
  • dict: Agree with Joris's points, and we already have DataFrame.from_dict, which covers the use case of a dict of rows (with orient='index')
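For the dict case, a minimal sketch of the alternatives mentioned above: `DataFrame.from_dict` with `orient='index'` treats each dict value as a row, while the plain constructor treats each value as a column. The keys and column names below are invented for the example.

```python
import pandas as pd

rows = {"r1": [1, 2, 3], "r2": [3, 4, 5]}

# Dict of rows: keys become the index, values become rows.
by_row = pd.DataFrame.from_dict(rows, orient="index", columns=["x", "y", "z"])

# Dict of columns: keys become the column names.
by_col = pd.DataFrame(rows)
```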

So, I think we 100% agree.

@jorisvandenbossche
Member

Yep, and 2D numpy arrays and record arrays are also consistent with the 'iterable of array-like', as iterating through them gives you the data row by row (as opposed to DataFrame or dict).

@jreback
Contributor

jreback commented Jul 23, 2018

this seems ok - though would have to deprecate accepting a dict here

I am also not sure you ‘can allocate the DataFrame memory’, as you don’t know the length of the data a priori if it’s a generator

you are right that it is possible to avoid instantiating Python lists though (for iterables with a known length)
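The distinction drawn here can be sketched as a small check for whether the number of rows is known before iterating. `row_count` is a hypothetical helper, not something from pandas; it assumes that an explicit `nrows` takes precedence and that sized inputs support `len()`.

```python
def row_count(rows, nrows=None):
    """Hypothetical helper: return the number of rows if known up front."""
    if nrows is not None:
        # The caller stated how many rows to read.
        return nrows
    try:
        # Lists, tuples and arrays know their length.
        return len(rows)
    except TypeError:
        # Generators do not, so nothing can be preallocated.
        return None


sized = row_count([(1, 2), (3, 4)])
unsized = row_count(r for r in [(1, 2), (3, 4)])
explicit = row_count((r for r in [(1, 2), (3, 4)]), nrows=1)
```

Only when the result is not `None` could the columns be allocated ahead of time; otherwise the data would have to be collected or grown incrementally.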

@llawall

llawall commented Oct 23, 2018

The case of accepting dict in from_records is also currently broken in my opinion. I filed #22708 (PR: #22687) to suggest making the ordering of columns consistent with the constructor taking a similar dict argument.

If from_records continues to accept dict then I think it should at least be made consistent, though on the whole I am +1 on the sentiment for a cleaner API by removing dict support in from_records.

@jorisvandenbossche
Member

@llawall given the discussion here, I would maybe rather do a PR to deprecate passing a dict in from_records, than the PR you have now changing its behaviour.

@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 23, 2019
@mroeschke mroeschke added Performance Memory or execution speed performance and removed Low-Memory labels Apr 3, 2020
@mroeschke mroeschke removed the IO Data IO issues that don't fit into a more specific label label May 2, 2020
@mroeschke mroeschke added Deprecate Functionality to remove in pandas Refactor Internal refactoring of code and removed API Design labels Jun 21, 2021
@mroeschke mroeschke removed the Deprecate Functionality to remove in pandas label Jun 8, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@datapythonista datapythonista removed their assignment Jan 18, 2024

6 participants