Avoid dataframe deep copy while creating dataset #4596
Thanks very much! I think it will take a bit of investigation to test whether such a change can be made safely. Are you interested in working on this and submitting a pull request?
Hi @jameslamb, yes, I'm rather busy right now but I've already set a reminder to start working on this. It's a very small change anyway, but I understand we have to ensure it's safe. AFAICS it is, because if the code that follows modifies the dataframe in an undesirable way, then a fortiori it's a problem for the code not taking the copy.
Sounds great, thanks very much!
Thank you @memeplex, indeed, this copy could be safely removed. PR is very welcome.
I believe it's safe to add `deep=False`. In terms of memory savings, I ran the following script to test the memory usage, which mimics what we do in `_data_from_pandas`:

sample_script.py

```python
import argparse

import numpy as np
import pandas as pd

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--deep', action='store_true')
    args = parser.parse_args()

    N = 100_000
    X = np.random.rand(N, 100)
    df = pd.DataFrame(X)
    cat1 = np.random.choice(['a', 'b'], N)
    df['cat1'] = cat1
    df['cat2'] = np.random.choice(['a', 'b', 'c'], N)
    cat_cols = ['cat1', 'cat2']
    df[cat_cols] = df[cat_cols].astype('category')

    df2 = df.copy(deep=args.deep)
    df2[cat_cols] = df2[cat_cols].apply(lambda c: c.cat.codes)
    res = df2.astype('float64').values

    np.testing.assert_equal(df['cat1'], cat1)
    np.testing.assert_equal(df2['cat1'], df['cat1'].cat.codes.values)
```

I profiled this script with scalene and got the following results (only the relevant lines are shown):
We see that we skip one copy of the dataframe by setting `deep=False`. If other maintainers agree, I can make a PR adding the `deep=False` argument.
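As a quick illustration of why the shallow copy is enough here (a minimal sketch, independent of the script above): the shallow copy shares its buffers with the original frame, while replacing a column on the copy leaves the original untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(4.0)})
df["cat1"] = pd.Categorical(["a", "b", "a", "b"])

shallow = df.copy(deep=False)
deep = df.copy(deep=True)

# the shallow copy reuses the original buffers, the deep copy does not
assert np.shares_memory(shallow["x"].values, df["x"].values)
assert not np.shares_memory(deep["x"].values, df["x"].values)

# replacing a column on the shallow copy does not alter the original
shallow["cat1"] = shallow["cat1"].cat.codes
assert shallow["cat1"].tolist() == [0, 1, 0, 1]
assert df["cat1"].tolist() == ["a", "b", "a", "b"]
```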
@jmoralez Thanks a lot for the profiling! I'm +1.
Very nice work @jmoralez! I support this change, please put up a PR whenever you have time.
Isn't one more copy (a deep one) done here? If so, I think we can refactor the code to avoid making it.

LightGBM/python-package/lightgbm/basic.py, lines 539 to 540 at 1617a63
https://github.com/pandas-dev/pandas/blob/4bfe3d07b4858144c219b9346329027024102ab6/pandas/core/frame.py#L4964-L5092
You're right, I should have profiled the whole function. We have to set `copy=False` in the rename as well.
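For context, `DataFrame.rename` returns a new frame with remapped labels, and in classic (pre copy-on-write) pandas it also copies the underlying data by default, which is why the `copy` argument matters here. A minimal sketch of the labeling behavior:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 2)))   # integer column labels 0 and 1

renamed = df.rename(columns=str)     # new frame with stringified labels

assert list(renamed.columns) == ["0", "1"]
assert list(df.columns) == [0, 1]    # original labels are untouched
```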
Thanks for confirming my findings! Reopening this issue.
Maybe we can avoid making a copy completely? Even a shallow one. I think we can track feature names in a separate structure, like a raw list, so as not to alter the passed DataFrame.
TBH the fact that column names can be integers is very annoying, and we'd have to modify a couple of sections in the code.
That would be great to compare a shallow copy with a case without any copy.
I used memory_profiler this time because scalene wasn't measuring the allocations in the `rename` call.

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from memory_profiler import profile

N = 100_000
X = np.random.rand(N, 100)
df = pd.DataFrame(X)
df['cat1'] = np.random.choice(['a', 'b'], N)
df['cat2'] = np.random.choice(['a', 'b', 'c'], N)
cat_cols = ['cat1', 'cat2']
df[cat_cols] = df[cat_cols].astype('category')

decorated = profile(lgb.basic._data_from_pandas)
data = decorated(df, 'auto', 'auto', None)[0]
```

and running it with each of the following patches:
Option 2 (`rename` with `copy=False`):

```diff
@@ -537,7 +537,7 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical):
     if len(data.shape) != 2 or data.shape[0] < 1:
         raise ValueError('Input data must be 2 dimensional and non empty.')
     if feature_name == 'auto' or feature_name is None:
-        data = data.rename(columns=str)
+        data = data.rename(columns=str, copy=False)
     cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, pd_CategoricalDtype)]
     cat_cols_not_ordered = [col for col in cat_cols if not data[col].cat.ordered]
     if pandas_categorical is None:  # train dataset
```
Option 3 (track the feature names in a separate list):

```diff
@@ -537,7 +537,7 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical):
     if len(data.shape) != 2 or data.shape[0] < 1:
         raise ValueError('Input data must be 2 dimensional and non empty.')
     if feature_name == 'auto' or feature_name is None:
-        data = data.rename(columns=str)
+        feature_name = [str(col) for col in data.columns]
     cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, pd_CategoricalDtype)]
     cat_cols_not_ordered = [col for col in cat_cols if not data[col].cat.ordered]
     if pandas_categorical is None:  # train dataset
@@ -552,14 +552,10 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical):
         data = data.copy(deep=False)  # not alter origin DataFrame
         data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})
     if categorical_feature is not None:
-        if feature_name is None:
-            feature_name = list(data.columns)
         if categorical_feature == 'auto':  # use cat cols from DataFrame
-            categorical_feature = cat_cols_not_ordered
+            categorical_feature = [str(col) for col in cat_cols_not_ordered]
         else:  # use cat cols specified by user
             categorical_feature = list(categorical_feature)
-    if feature_name == 'auto':
-        feature_name = list(data.columns)
     _check_for_bad_pandas_dtypes(data.dtypes)
     df_dtypes = [dtype.type for dtype in data.dtypes]
     df_dtypes.append(np.float32)  # so that the target dtype considers floats
```
So we see that currently we're still making a deep copy because of the rename, and that options 2 and 3 are pretty much the same. However, I'd prefer 2 because it's easier; in 3 we'd have to keep track of everywhere the column names could be integers.
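For reference, the "raw list" idea behind option 3 can be sketched in isolation (`feature_name` here is just an illustrative variable): the stringified names live in a plain list, so the DataFrame itself is never renamed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 2))        # integer labels 0 and 1
df["cat1"] = pd.Categorical(["a", "b", "a", "b"])

# track feature names in a separate list instead of renaming the DataFrame
feature_name = [str(col) for col in df.columns]

assert feature_name == ["0", "1", "cat1"]
assert list(df.columns) == [0, 1, "cat1"]      # labels left untouched
```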
@jmoralez Thanks a lot for profiling again!
Well, that's fine with me.
This issue has been automatically locked since there has not been any recent activity since it was closed.
Summary

The Python API does the following while creating a dataset from a dataframe with categorical columns:

By default `copy` does a deep copy (it copies all data and indices), but AFAICS there is no need for that here, and it can trigger a memory-expensive operation.

Motivation

Avoid extra memory usage. The sequence of steps here will end up with three copies of the data:

Description

Use `copy(deep=False)`:
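A minimal, self-contained sketch of the proposed change (the real code lives in `_data_from_pandas` and handles more cases; only the copy semantics are shown here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num": np.arange(4.0),
                   "cat1": pd.Categorical(["a", "b", "a", "b"])})
cat_cols = ["cat1"]

# proposed: shallow copy instead of the default deep copy
data = df.copy(deep=False)  # does not duplicate the underlying buffers
data[cat_cols] = data[cat_cols].apply(lambda c: c.cat.codes)
values = data.astype("float64").values  # the only full materialization

assert df["cat1"].tolist() == ["a", "b", "a", "b"]  # caller's frame intact
assert values.shape == (4, 2)
```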