Using pd.DataFrame(tensor) is abnormally slow, you can make the following modifications #44616
Comments
I'm not sure how happy people would be adding pytorch as a dependency to pandas. It might be worth extending the documentation instead. Something along the lines of: "For best performance, iterable objects that can be efficiently converted to a NumPy array, such as a PyTorch Tensor, should be converted before passing them to pd.DataFrame."
I am not super familiar with pytorch, but I suppose they support the array interface? If we don't do that yet, I think we can certainly ensure to treat all objects like that as arrays instead of list-likes.
Yes, I think they support the array interface, and it is easy to convert between Tensor and NumPy. If the two dtypes are the same, memory is shared after the conversion. Judging from the test results above, it is indeed not appropriate to treat a tensor as list-like.
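A minimal sketch of the array interface being discussed, using a hypothetical `FakeTensor` stand-in (torch.Tensor exposes the same `__array__` protocol, which is what lets `np.asarray` convert it without Python-level iteration):

```python
import numpy as np

class FakeTensor:
    """Hypothetical tensor-like object exposing the NumPy array interface."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # np.asarray / np.array call this hook to obtain an ndarray directly.
        if dtype is None:
            return self._data
        return self._data.astype(dtype)

t = FakeTensor([[1, 2], [3, 4]])
arr = np.asarray(t)  # goes through __array__, no row-by-row conversion
print(arr.shape)     # (2, 2)
```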
Yes, I think it is appropriate to add such a comment, because it is likely that someone will directly use pd.DataFrame(tensor) to create a DataFrame, which raises no error but performs very poorly.
A PR is welcome.
IIRC from similar issues, checking for an …
Sorry, I don't understand what you mean. Could you describe it in more detail?
Never mind, that advice was wrong. Better advice: in frame.py L707-708 we check …
This won't work because tensors are not sequences. (See numpy/numpy#2776 (comment))
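This can be checked directly with a NumPy array, which behaves the same way tensors do here: it supports `len()` and indexing, but is not registered as a `collections.abc.Sequence` (the point made in the linked numpy discussion):

```python
import collections.abc
import numpy as np

arr = np.zeros((3, 2))

# ndarray is iterable and indexable, but not an abc.Sequence;
# torch.Tensor behaves the same, so a Sequence check won't catch it.
print(isinstance(arr, collections.abc.Sequence))          # False
print(hasattr(arr, "__len__"), hasattr(arr, "__getitem__"))  # True True
```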
This sounds reasonable!
I think on this line (Line 672 in ca81e6c) we would need to also catch "array-likes", so those are passed through to …
IIRC trying to add EAs to go through that path broke some stuff, but I'd be very happy to be wrong about this.
Found it in my notes. According to past-me, having EAs go through that branch on L672 broke 5 test_apply_series_on_date_time_index_aware_series tests, because PandasArray[object] went through the treat_as_nested paths. This motivated #43986.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
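The original snippet is not preserved in this capture. A sketch of an equivalent comparison using only NumPy and pandas (the report used a torch.Tensor; here `list(data)` stands in for the slow list-like path a tensor currently takes):

```python
import time
import numpy as np
import pandas as pd

data = np.random.rand(1000, 100)

start = time.perf_counter()
slow = pd.DataFrame(list(data))   # list of rows: nested/list-like path
t_slow = time.perf_counter() - start

start = time.perf_counter()
fast = pd.DataFrame(data)         # ndarray: fast constructor path
t_fast = time.perf_counter() - start

print(f"list-like path: {t_slow:.4f}s, ndarray path: {t_fast:.4f}s")
```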
Issue Description
Recently I noticed that using pd.DataFrame() to convert data of type torch.tensor to a pandas DataFrame is very slow, while converting the tensor to numpy first and then to a DataFrame is very fast. The test code is shown in the Reproducible Example.
The code prints as follows:
Then I read the source code and found that if the data passed to pd.DataFrame() is a tensor, it is processed as list-like (line 682 in https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py).
Most of the time is spent in the following three stages:
In the nested_data_to_arrays stage, a large number of data type conversions are involved: the row-list is converted to a column-list, and the data is read row by row. This takes a long time.
Admittedly, this usage may not be appropriate, but torch.tensor is now widely used, and it is inevitable that some users will pass it in directly like this, resulting in low efficiency. So could you add a comment at line 467 in frame.py, something like: "If data is a torch.tensor, convert it to numpy first (tensor.numpy())."
Or could I submit a PR? When the input is detected to be a tensor, perform the conversion first, and then fall through to the `elif isinstance(data, (np.ndarray, Series, Index))` branch.
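A user-side sketch of that idea, under stated assumptions: the helper name `to_frame_fast` is hypothetical, and `np.asarray` is used as a generic conversion because it goes through the object's `__array__` hook (for a real torch.Tensor, `tensor.numpy()` does the equivalent):

```python
import numpy as np
import pandas as pd

def to_frame_fast(data, **kwargs):
    """Hypothetical helper: route array-likes through np.asarray so they
    hit the ndarray branch of the DataFrame constructor instead of the
    slow list-like path."""
    if hasattr(data, "__array__") and not isinstance(
        data, (np.ndarray, pd.Series, pd.Index)
    ):
        data = np.asarray(data)
    return pd.DataFrame(data, **kwargs)

df = to_frame_fast(np.arange(6).reshape(2, 3))
print(df.shape)  # (2, 3)
```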
Looking forward to your reply ~
Installed Versions
pandas.__version__ == 1.3.4