-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor from_records to load data in an efficient way #22025
Comments
Big +1 in general!
The dict case is also kind of "wrong", as when seeing a dict as a "dict of records", I would expect a transposed result:
So +1 on deprecating that. |
Good points. But I think I wasn't clear with So, to clarify what I meant, I'd support:
And I wouldn't support other cases, including:
So, I think we 100% agree. |
Yep, and 2D numpy array and and record arrays are also consistent with the 'iterable of array-like', as iterating through them gives you row by row (as opposed to DataFrame or dict). |
this seems ok - though would have to deprecate accepting a dict here I am also not sure you ‘can allocate the Dataframe memory’ as you don’t know the length of the data apriori if it’s a generator you are right that it is possible to not instantiating to python lists though (for iterables with a known length) |
The case of accepting dict in from_records is also currently broken in my opinion. I filed #22708 (PR: #22687) to suggest making the ordering of columns consistent with the constructor taking a similar dict argument. If from_records continues to accept dict then I think it should at least be made consistent though on the whole I am +1 with the sentiment for a cleaner API by removing dict support in from_records. |
@llawall given the discussion here, I would maybe rather do a PR to deprecate passing a dict in |
Based on the SciPy sprint discussions, and the discussions on related issues, seems like
from_records
should be the pandas way to create aDataFrame
from row based data.The current signature is next:
def from_records(cls, data, index=None, exclude=None, columns=None, coerce_float=False, nrows=None):
And it currently supports input
data
as:What I propose is to make
from_records
only work whendata
is an iterable of array-like (list, tuple, np.array...) or an iterable of dict. And deprecate the other cases (dict and DataFrame). After searching on GitHub repos and blogs, couldn't find cases where it's used with dict or DataFrame, and IMO the DataFrame constructor is a better way for those cases. This would make the code simpler.Then, I'd add a new parameter
dtypes
expecting a list or a dict with the dtypes of the new DataFrame. With this, and for the case whendata
is a generator, andnrows
anddtypes
are specified, we wouldn't need to exhaust the generator and load it to Python structures. Meaning that we'd just need to allocate the DataFrame memory, and there wouldn't be any intermediate memory requirements.Related issues: #5902, #2305, #2193, #4464, #1794, #13818.
@wesm, @jreback, @jorisvandenbossche any comments? Are you ok with this approach and the proposed changes to the API?
The text was updated successfully, but these errors were encountered: