
ProfileReport extremely slow on a 40K-line file #76

Closed

MordicusEtCubitus opened this issue Dec 4, 2017 · 11 comments

@MordicusEtCubitus

MordicusEtCubitus commented Dec 4, 2017

I'm using a CSV file from open data.
The data is not perfectly clean, but it works without issue in pandas.
When I try to run pandas_profiling.ProfileReport() on it, Jupyter goes into a very, very long run, using a lot of CPU and never giving control back.

The file is not very big, only 36,700 records. I've attached it to this issue.
If you can help...
Thanks a lot, and also for your nice idea!

villes_france.csv.zip.

@conradoqg
Contributor

conradoqg commented Dec 31, 2017

Hey,

Two things are happening in this case:

  • The "check correlation" feature of this profiling tool is very heavy, because it checks correlation over all combinations of categorical features. If you deactivate that feature with pfr = pandas_profiling.ProfileReport(df, check_correlation=False), it will no longer crash from memory usage, but it will still throw an error, which I explain below.
  • This tool doesn't support multiple dtypes in the same column. To solve this, either clean the dataset beforehand or force the dtype: df = pd.read_csv("../examples/villes_france.csv", dtype={'dep': str, 'code insee': str}, encoding='UTF-8').

Working version of this case:

import pandas_profiling
import pandas as pd

df = pd.read_csv("../examples/villes_france.csv", dtype={'dep': str, 'code insee': str}, encoding='UTF-8')
pfr = pandas_profiling.ProfileReport(df, check_correlation=False)
pfr.to_file("/tmp/example.html")

I will look into whether there is anything I can do to solve those problems, or at least inform the user about what's happening.

@conradoqg
Contributor

conradoqg commented Jan 4, 2018

Hey,

With the above PR #82 merged, you can:

  • Use the parameter check_recoded=False, i.e. pandas_profiling.ProfileReport(df, check_recoded=False), to alleviate memory pressure while still getting the correlation report (see the sketch after this list). I think this is the most we can do right now; the recoded check is indeed memory-heavy.
  • You no longer need to force the dtype; ProfileReport will automatically ignore unsupported types.
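
A minimal sketch of the updated call, reusing the file from the earlier example (check_recoded=False is the only new part, added by PR #82):

import pandas_profiling
import pandas as pd

# Forcing dtypes is no longer required; unsupported columns are ignored automatically.
df = pd.read_csv("../examples/villes_france.csv", encoding='UTF-8')

# Keep the correlation report, but skip the memory-heavy recoded check.
pfr = pandas_profiling.ProfileReport(df, check_recoded=False)
pfr.to_file("/tmp/example.html")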

I think we can close this issue after those changes.

Best

@romainx
Contributor

romainx commented Jan 5, 2018

Hello,

Thanks for the update. I am closing the issue.
We will reopen it if needed.

@romainx romainx closed this as completed Jan 5, 2018
@RiteshBabel

I'm facing a ValueError:
Config parameter "check_correlation" does not exist.
after applying pandas_profiling.ProfileReport(df, check_correlation=False)

@MHUNCHO

MHUNCHO commented Oct 9, 2019

Also getting the error above.

@BryTyr

BryTyr commented Oct 29, 2019

From what I have found, you are getting that error because that setting only exists in the latest version of pandas_profiling, which has not yet been released on PyPI.

@neomatrix369

neomatrix369 commented Dec 12, 2019

Has anyone tried other means to speed up the execution, such as PySpark / Apache Spark? Or running it in the cloud on a much bigger machine? I would love to know more about these approaches. I'm only processing 1000+ rows, but it's taking ages to come back with a report. I can imagine what will happen when 4000+ rows are processed.

Any tips to speed it up?

@hitfuture

I just found pandas-profiling two days ago while doing data analysis on file systems I'm translating. I really liked the output on one of my CSV files, because I had been computing most of these statistics manually and this package made that much easier. Yesterday, I wrote a script to handle the 120 CSV files I'm processing: it runs df.profile_report() on each CSV file and exports the results to individual HTML files. I ran it last night; it ran for over 10 hours and only completed four of the files. Several of the files are over 500MB, and those would not run at all.

I updated the script to take a random sample of each CSV file, a technique I've used in the past.

import os
import random

import pandas as pd
import pandas_profiling  # registers the .profile_report() accessor on DataFrames


def sample_dataframe(filename, sample_size=50):
    # Efficient sampling approach found on Stack Overflow:
    # count the data rows, then skip a random subset of them while reading.
    n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
    file_size = os.stat(filename).st_size

    sample_size = min(sample_size, n)
    # Rows to skip; the 0-indexed header row is never included in the skip list.
    skip = sorted(random.sample(range(1, n + 1), n - sample_size))
    df = pd.read_csv(filename, skiprows=skip)
    return df, n, file_size


src_file = 'data/abc.csv'
out_file = 'abc_profile.html'
df, n, file_size = sample_dataframe(src_file, 99999)
profile = df.profile_report()
profile.to_file(output_file=out_file)

By updating the code this way, I was able to run it against over 120 CSV files with a maximum size of 800MB, and all of the profiles completed within 40 minutes. I highly recommend using data sampling for this kind of analysis work. It would be nice to add a capability to the profile report that correlates multiple random samples to come up with the best answer. In the future it would also be good to add Dask for faster performance on these data sets. For now, random sampling lets you control the size of the data frame that will run on your local computer. If you have a small desktop, sample a thousand rows and it will run fast for you.
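
For files that do fit comfortably in memory, a simpler sketch of the same idea is to let pandas do the sampling itself; this assumes the whole CSV can be loaded before sampling (the skiprows trick above avoids that), and the 100,000-row cap is just an illustrative choice:

import pandas as pd
import pandas_profiling  # registers the .profile_report() accessor

# Load the full file, then profile a random subset of rows.
df = pd.read_csv('data/abc.csv')
sample = df.sample(n=min(len(df), 100_000), random_state=42)  # fixed seed only for reproducibility

profile = sample.profile_report()
profile.to_file(output_file='abc_profile.html')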

Brett

@neomatrix369


Thanks for sharing this. I did something similar: I took my large dataset (5M rows by 57 columns), split it into much smaller batches of 1M rows each, and used a reduced profile, and I was able to profile much more quickly. The results from each batch can then be averaged, or the best one taken, or combined by any other selection method.

I couldn't do some of the additional checks like interactions, etc., but I'll figure that out next time.
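
A minimal sketch of that batching approach, assuming a recent pandas-profiling release that supports the minimal=True reduced profile, and a hypothetical input file big_dataset.csv:

import pandas as pd
from pandas_profiling import ProfileReport

# Read the large CSV in 1M-row batches and write one reduced profile per batch.
for i, chunk in enumerate(pd.read_csv('big_dataset.csv', chunksize=1_000_000)):
    # minimal=True skips the expensive parts (correlations, interactions, ...).
    profile = ProfileReport(chunk, title=f"Batch {i}", minimal=True)
    profile.to_file(f"profile_batch_{i}.html")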

@mmroden

mmroden commented Jul 2, 2020

I'm also finding that processing 10 million+ rows of data takes hours. Some of these statistics look like they finish quickly, and others look like they take hours. I'm also seeing that graph generation takes a significant chunk of time. Could there be documentation describing which statistics have algorithmic complexity greater than O(n)? I would think all of the correlations fall into that bucket, but I've disabled all of those and it still takes multiple hours to run.

In addition, I don't see multiple cores actually in use, despite using the default thread_pool setting of zero. I suspect that the GIL is preventing threads from being particularly useful; what kind of speedup could be expected from the inclusion of threads?

@shyamcody

So I am getting errors that check_recoded, check_correlation, and check_correlation_pearson do not exist. The correlations dictionary approach, with all five of the correlations set to False, also didn't work; it fails with an error saying correlations.pearson should be a collection instead of a boolean.
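
For reference, a sketch of disabling the correlations in the newer 2.x-style configuration, where each correlation entry is a mapping with a "calculate" key rather than a plain boolean (the exact set of correlation names may differ between releases, and the CSV path is just a placeholder):

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data.csv")  # placeholder path

# Each correlation gets its own config dict; "calculate": False turns it off.
profile = ProfileReport(
    df,
    correlations={
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
)
profile.to_file("report.html")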
