Even 512 GiB memory not enough for extract_features on 7895800 rows × 28 columns? #947
Just tried without the n_jobs parameter, which seems to utilise 50% of the available CPU by default. I'm using r6i.24xlarge at the moment; it comes with 96 vCPUs and 768 GiB of memory. I can confirm tsfresh is not utilising the CPU well: most of the time, CPU utilisation stays below 12.5%, and more than 87.5% of the CPU is always idle. Also, as you can see below, I have sufficient memory.
This line seems like the issue.
It took 7 hours on r6i.24xlarge [96 vCPU and 768 GiB memory]. Output: the extracted.csv file size is 20 GB for 7895800 rows × 28 columns. Hope that info helps someone. Thanks.
How many features were extracted? I'm facing the same problem: long time series data (only 3 ids) overflows memory on a 16 GB laptop.
Thanks @dsstex for the analysis and the posted numbers (and really sorry for the long delay).
The problem:
This is my code.
rolled.csv contains data that's been rolled using max_timeshift=96, min_timeshift=96. It contains 7895800 rows × 28 columns.
The input `df` before rolling had around 82500 rows. That resulted in 7895800 rows × 28 columns. After feature extraction, I'm expecting ~82500 rows × 21438 columns.
I have tested with 10 extracted rows. The size for 10 rows x 21438 columns is 3.5 MB.
So for 82500 rows, I presume I need ~ 30 GB disk space.
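For context, the extraction step above probably looked roughly like the following. This is only a sketch, not the original snippet: the CSV paths and the `column_id`/`column_sort` names are assumptions.

```python
import pandas as pd
from tsfresh import extract_features

# Read the already-rolled data (7895800 rows x 28 columns).
rolled = pd.read_csv("rolled.csv")

# Extract the full default feature set; "id" and "time" are assumed column names.
extracted = extract_features(
    rolled,
    column_id="id",
    column_sort="time",
)
extracted.to_csv("extracted.csv")
```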
I have tried to extract the features using an AWS EC2 r6i.16xlarge instance. It comes with 64 vCPUs and 512 GiB of memory. I have also added a 100 GB EBS gp3 volume. I thought that was enough.
The problems:
(1) Only 12.5% of the CPU got used; 87.5% was idle. Is it because of `n_jobs=multiprocessing.cpu_count()`? Do I have to use `n_jobs=multiprocessing.cpu_count() - 1` instead? (See the sketch below the list.)
(2) Feature extraction progressed until 75%. After that, the script got terminated due to lack of memory. Is 512 GiB of memory not enough?
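For reference on (1), both knobs are passed straight to `extract_features`; `chunksize` sets how many single (id, kind) time series one worker task handles. A sketch, with assumed column names:

```python
import multiprocessing
from tsfresh import extract_features

extracted = extract_features(
    rolled,
    column_id="id",                           # assumed column names
    column_sort="time",
    n_jobs=multiprocessing.cpu_count() - 1,   # leave one core for the parent process
    chunksize=10,                             # time series handled per worker task
)
```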
Anything else we need to know?:
Yes. It took my script 4 hours to progress to 75%. Since I'm using r6i.16xlarge, that's expensive in my case.
Since I'm using max_timeshift=96 and min_timeshift=96, during the prediction/inference stage I'll only have 96 rows to extract features from for a single prediction. So I'm wondering why 512 GiB and 4 hours are not enough for feature extraction, when it takes only 1 second to extract features for a single inference (96 rows).
If one window takes 1 second, then for 82500 windows on 64 vCPUs that's 82500 / 64 ≈ 1290 seconds (21.5 minutes). So I would expect anything under 30 minutes to be normal in my case.
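The single-inference case mentioned above would look something like this sketch (the column names and the window selection are assumptions):

```python
from tsfresh import extract_features

# Take the last 96 rows of the un-rolled frame as one window and extract features for it alone.
window = df.tail(96).assign(id=0)   # one id = one window
X_single = extract_features(
    window,
    column_id="id",
    column_sort="time",
    n_jobs=0,                        # no multiprocessing for a single window
    disable_progressbar=True,
)
```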
I could use `LocalDaskDistributor` (a sketch is below). However, according to this comment, it's not for production use.
Is there any way we can estimate the system requirements (e.g. memory) and time (e.g. based on vCPU count) from the input dataframe?
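Wiring up `LocalDaskDistributor` would look roughly like this, following the tsfresh documentation (the worker count is just an example, and column names are assumed):

```python
from tsfresh import extract_features
from tsfresh.utilities.distribution import LocalDaskDistributor

distributor = LocalDaskDistributor(n_workers=32)

extracted = extract_features(
    rolled,
    column_id="id",            # assumed column names
    column_sort="time",
    distributor=distributor,
)
```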
Environment: