Even 512 GiB memory not enough for extract_features on 7895800 rows × 28 columns? #947
Just tried without the n_jobs parameter, which seems to utilise 50% of the available CPU by default. I'm using r6i.24xlarge at the moment; it comes with 96 vCPUs and 768 GiB of memory. I can confirm tsfresh is not utilising the CPU well: most of the time, CPU utilisation stays below 12.5%, and more than 87.5% of the CPU is always idle. Also, as you can see below, I have sufficient memory.
This line seems like the issue.
It took 7 hours on r6i.24xlarge [96 vCPU and 768 GiB memory]. Output: the extracted.csv file size is 20 GB for 7895800 rows × 28 columns. Hope that info helps someone. Thanks.
How many features were extracted? I'm facing the same problem: long time series data (only 3 ids) overflows memory on a 16 GB laptop.
Thanks @dsstex for the analysis and the posted numbers (and really sorry for the long delay).
The problem:
This is my code.
rolled.csv contains data that's been rolled using max_timeshift=96, min_timeshift=96. It contains 7895800 rows × 28 columns.
The input `df` before rolling had around 82500 rows. That resulted in 7895800 rows × 28 columns. After feature extraction, I'm expecting ~82500 rows × 21438 columns.
I have tested with 10 extracted rows. The size for 10 rows x 21438 columns is 3.5 MB.
So for 82500 rows, I presume I need ~ 30 GB disk space.
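For context, the extraction step above probably looked roughly like the following. This is only a sketch, not the original snippet: the CSV paths and the `column_id`/`column_sort` names are assumptions.

```python
import pandas as pd
from tsfresh import extract_features

# Read the already-rolled data (7895800 rows x 28 columns).
rolled = pd.read_csv("rolled.csv")

# Extract the full default feature set; "id" and "time" are assumed column names.
extracted = extract_features(
    rolled,
    column_id="id",
    column_sort="time",
)
extracted.to_csv("extracted.csv")
```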
I have tried to extract the features using an AWS EC2 r6i.16xlarge instance. It comes with 64 vCPUs and 512 GiB of memory. I have also added a 100 GB EBS gp3 volume. I thought that was enough.
The problems:
(1) Only 12.5% of the CPU got used; 87.5% was idle. Is it because of `n_jobs=multiprocessing.cpu_count()`? Do I have to use `n_jobs=multiprocessing.cpu_count() - 1` instead? (See the sketch below the list.)
(2) Feature extraction progressed until 75%. After that, the script got terminated due to lack of memory. Is 512 GiB of memory not enough?
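For reference on (1), both knobs are passed straight to `extract_features`; `chunksize` sets how many single (id, kind) time series one worker task handles. A sketch, with assumed column names:

```python
import multiprocessing
from tsfresh import extract_features

extracted = extract_features(
    rolled,
    column_id="id",                           # assumed column names
    column_sort="time",
    n_jobs=multiprocessing.cpu_count() - 1,   # leave one core for the parent process
    chunksize=10,                             # time series handled per worker task
)
```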
Anything else we need to know?:
Yes. It took my script 4 hours to progress to 75%. Since I'm using r6i.16xlarge, that's expensive in my case.
Since I'm using max_timeshift=96 and min_timeshift=96, during the prediction/inference stage I'll only have 96 rows to extract features from for a single prediction. So I'm wondering why 512 GiB and 4 hours are not enough for feature extraction, when it takes only 1 second to extract features for a single inference (96 rows).
If one window takes 1 second, then for 82500 windows on 64 vCPUs that's 82500 / 64 ≈ 1290 seconds (21.5 minutes). So I would expect anything under 30 minutes to be normal in my case.
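The single-inference case mentioned above would look something like this sketch (the column names and the window selection are assumptions):

```python
from tsfresh import extract_features

# Take the last 96 rows of the un-rolled frame as one window and extract features for it alone.
window = df.tail(96).assign(id=0)   # one id = one window
X_single = extract_features(
    window,
    column_id="id",
    column_sort="time",
    n_jobs=0,                        # no multiprocessing for a single window
    disable_progressbar=True,
)
```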
I could use `LocalDaskDistributor` (a sketch is below). However, according to this comment, it's not for production use.
Is there any way we can estimate the system requirements (e.g. memory) and time (e.g. based on vCPU count) from the input dataframe?
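Wiring up `LocalDaskDistributor` would look roughly like this, following the tsfresh documentation (the worker count is just an example, and column names are assumed):

```python
from tsfresh import extract_features
from tsfresh.utilities.distribution import LocalDaskDistributor

distributor = LocalDaskDistributor(n_workers=32)

extracted = extract_features(
    rolled,
    column_id="id",            # assumed column names
    column_sort="time",
    distributor=distributor,
)
```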
Environment: