
ProfileReport extremely slow on a 40K-line file #76

Closed

MordicusEtCubitus opened this issue Dec 4, 2017 · 11 comments

@MordicusEtCubitus

MordicusEtCubitus commented Dec 4, 2017

I'm using a CSV file from open data.
The data is not perfectly clean, but it works without issue in pandas.
When I try to run pandas_profiling.ProfileReport() on it, Jupyter goes into a very, very long run, using a lot of CPU and never giving control back.

The file is not very big, only 36,700 records. I've attached it to this issue.
If you can help...
Thanks a lot, and also for your nice idea!

villes_france.csv.zip.

@conradoqg
Contributor

conradoqg commented Dec 31, 2017

Hey,

Two things are happening in this case:

  • The "check correlation" feature of this profiling tool is very heavy, because it checks correlation over all combinations of categorical features. If you deactivate that feature with pfr = pandas_profiling.ProfileReport(df, check_correlation=False), it will no longer crash from memory usage, but it will still throw an error, which I explain below.
  • This tool doesn't support multiple dtypes in the same column. To solve this, either clean the dataset beforehand or force the dtype: df = pd.read_csv("../examples/villes_france.csv", dtype={'dep': str, 'code insee': str}, encoding='UTF-8').

Working version of this case:

import pandas_profiling
import pandas as pd

df = pd.read_csv("../examples/villes_france.csv", dtype={'dep': str, 'code insee': str}, encoding='UTF-8')
pfr = pandas_profiling.ProfileReport(df, check_correlation=False)
pfr.to_file("/tmp/example.html")

I will look into whether there is anything I can do to solve those problems, or at least inform the user about what's happening.

@conradoqg
Contributor

conradoqg commented Jan 4, 2018

Hey,

With the above PR #82 merged, you can:

  • Use the parameter check_recoded=False, i.e. pandas_profiling.ProfileReport(df, check_recoded=False), to alleviate memory pressure while still getting the correlation report (see the sketch after this list). I think this is the most we can do right now; the recoded check is indeed memory-heavy.
  • You no longer need to force the dtype; ProfileReport will automatically ignore unsupported types.
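
A minimal sketch of the updated call, reusing the file from the earlier example (check_recoded=False is the only new part, added by PR #82):

import pandas_profiling
import pandas as pd

# Forcing dtypes is no longer required; unsupported columns are ignored automatically.
df = pd.read_csv("../examples/villes_france.csv", encoding='UTF-8')

# Keep the correlation report, but skip the memory-heavy recoded check.
pfr = pandas_profiling.ProfileReport(df, check_recoded=False)
pfr.to_file("/tmp/example.html")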

I think we can close this issue after those changes.

Best

@romainx
Contributor

romainx commented Jan 5, 2018

Hello,

Thanks for the update. I am closing the issue.
We will reopen it if needed.

@romainx romainx closed this as completed Jan 5, 2018
@RiteshBabel

I'm facing a ValueError:
Config parameter "check_correlation" does not exist.
after applying pandas_profiling.ProfileReport(df, check_correlation=False)

@MHUNCHO

MHUNCHO commented Oct 9, 2019

Also getting the error above.

@BryTyr

BryTyr commented Oct 29, 2019

From what I have found, you are getting that error because that setting only exists in the latest version of pandas_profiling, which has not yet been released on PyPI.

@neomatrix369

neomatrix369 commented Dec 12, 2019

Has anyone tried other means to speed up the execution, such as PySpark / Apache Spark? Or running it in the cloud on a much bigger machine? I would love to know more about these approaches. I'm only processing 1000+ rows, but it's taking ages to come back with a report. I can imagine what will happen when 4000+ rows are processed.

Any tips to speed it up?

@hitfuture

I just found pandas-profiling two days ago while doing data analysis on file systems I'm translating. I really liked the output on one of my CSV files, because I had been computing most of these statistics manually and this package made that much easier. Yesterday, I wrote a script to handle the 120 CSV files I'm processing: it runs df.profile_report() on each CSV file and exports the results to individual HTML files. I ran it last night; it ran for over 10 hours and only completed four of the files. Several of the files are over 500MB, and those would not run at all.

I updated the script to take a random sample of each CSV file, a technique I've used in the past.

import os
import random

import pandas as pd
import pandas_profiling  # registers the .profile_report() accessor on DataFrames


def sample_dataframe(filename, sample_size=50):
    # Efficient sampling approach found on Stack Overflow:
    # count the data rows, then skip a random subset of them while reading.
    n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
    file_size = os.stat(filename).st_size

    sample_size = min(sample_size, n)
    # Rows to skip; the 0-indexed header row is never included in the skip list.
    skip = sorted(random.sample(range(1, n + 1), n - sample_size))
    df = pd.read_csv(filename, skiprows=skip)
    return df, n, file_size


src_file = 'data/abc.csv'
out_file = 'abc_profile.html'
df, n, file_size = sample_dataframe(src_file, 99999)
profile = df.profile_report()
profile.to_file(output_file=out_file)

By updating the code this way, I was able to run it against over 120 CSV files with a maximum size of 800MB, and all of the profiles completed within 40 minutes. I highly recommend using data sampling for this kind of analysis work. It would be nice to add a capability to the profile report that correlates multiple random samples to come up with the best answer. In the future it would also be good to add Dask for faster performance on these data sets. For now, random sampling lets you control the size of the data frame that will run on your local computer. If you have a small desktop, sample a thousand rows and it will run fast for you.
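
For files that do fit comfortably in memory, a simpler sketch of the same idea is to let pandas do the sampling itself; this assumes the whole CSV can be loaded before sampling (the skiprows trick above avoids that), and the 100,000-row cap is just an illustrative choice:

import pandas as pd
import pandas_profiling  # registers the .profile_report() accessor

# Load the full file, then profile a random subset of rows.
df = pd.read_csv('data/abc.csv')
sample = df.sample(n=min(len(df), 100_000), random_state=42)  # fixed seed only for reproducibility

profile = sample.profile_report()
profile.to_file(output_file='abc_profile.html')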

Brett

@neomatrix369


Thanks for sharing this. I did something similar: I took my large dataset (5M rows by 57 columns), split it into much smaller batches of 1M rows each, and used a reduced profile, and I was able to profile much more quickly. The results from each batch can then be averaged, or the best one taken, or combined by any other selection method.

I couldn't do some of the additional checks like interactions, etc., but I'll figure that out next time.
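
A minimal sketch of that batching approach, assuming a recent pandas-profiling release that supports the minimal=True reduced profile, and a hypothetical input file big_dataset.csv:

import pandas as pd
from pandas_profiling import ProfileReport

# Read the large CSV in 1M-row batches and write one reduced profile per batch.
for i, chunk in enumerate(pd.read_csv('big_dataset.csv', chunksize=1_000_000)):
    # minimal=True skips the expensive parts (correlations, interactions, ...).
    profile = ProfileReport(chunk, title=f"Batch {i}", minimal=True)
    profile.to_file(f"profile_batch_{i}.html")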

@mmroden

mmroden commented Jul 2, 2020

I'm also finding that processing 10 million+ rows of data takes hours. Some of these statistics look like they finish quickly, and others look like they take hours. I'm also seeing that graph generation takes a significant chunk of time. Could there be documentation describing which statistics have algorithmic complexity greater than O(n)? I would think all of the correlations fall into that bucket, but I've disabled all of those and it still takes multiple hours to run.

In addition, I don't see multiple cores actually in use, despite using the default thread_pool setting of zero. I suspect that the GIL is preventing threads from being particularly useful; what kind of speedup could be expected from the inclusion of threads?

@shyamcody

So I am getting errors that check_recoded, check_correlation, and check_correlation_pearson do not exist. The correlations dictionary approach, with all five of the correlations set to False, also didn't work; it fails with an error saying correlations.pearson should be a collection instead of a boolean.
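
For reference, a sketch of disabling the correlations in the newer 2.x-style configuration, where each correlation entry is a mapping with a "calculate" key rather than a plain boolean (the exact set of correlation names may differ between releases, and the CSV path is just a placeholder):

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data.csv")  # placeholder path

# Each correlation gets its own config dict; "calculate": False turns it off.
profile = ProfileReport(
    df,
    correlations={
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
)
profile.to_file("report.html")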
