ProfileReport extremely slow on a 40K-line file #76
Comments
Hey, two things are happening in this case:
Working version of this case:
I will look into whether there is anything I can do to solve those problems, or at least inform the user about what's happening.
Hey, with the above PR #82 merged, you can:
I think we can close this issue after those changes. Best
Hello, thanks for the update. I'm closing the issue.
Facing a ValueError:
Also getting the error above.
From what I have found, you are getting that error because that setting is only in the latest version of pandas_profiling, which has not yet been released on PyPI.
Has anyone tried other means to speed up the execution, such as PySpark / Apache Spark, or running it in the cloud on a much bigger machine? I would love to know more about these approaches. I'm only processing 1000+ rows, but it's taking ages to come back with a report; I can imagine what will happen when 4000+ rows are processed. Any tips to speed it up?
I just found pandas-profiling two days ago while I was doing data analysis on file systems I'm translating. I really liked the output on one of my CSV files, because I had been doing most of these statistics manually and this package made that much easier. Yesterday, I wrote a script that would take the 120 CSV files I'm processing, run df.profile_report() on each one, and export the results to individual HTML files. I ran this last night; it ran for over 10 hours and only completed four of the files. Several of the files were over 500MB, and those would not run at all. I then updated the script to profile a random sample of each CSV file, a technique I've used in the past.
With that change, I was able to run it against over 120 CSV files with a maximum size of 800MB, and the process finished all of the profiles within 40 minutes. I highly recommend using data sampling for this kind of analysis work. It would be nice to add a capability to the profile report that could correlate multiple random samples to come up with the best answer. In the future, it would also be good to add Dask support for faster performance. For now, random sampling of the data lets you control the size of the data frame that will run on your local computer. If you have a small desktop, sample a thousand rows and it will run fast for you. Brett
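A minimal sketch of this sampling approach, assuming a recent pandas-profiling release; the directory, sample size, and output naming are illustrative, not the script described above:

```python
import glob
import pandas as pd
from pandas_profiling import ProfileReport

SAMPLE_ROWS = 10_000  # tune to what your machine handles comfortably

for path in glob.glob("csv_files/*.csv"):  # placeholder location for the CSV files
    df = pd.read_csv(path)
    # Profile a random sample rather than the full frame
    sample = df.sample(n=min(SAMPLE_ROWS, len(df)), random_state=42)
    report = ProfileReport(sample, title=path)
    report.to_file(path.replace(".csv", ".html"))
```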
Thanks for sharing this. I did something similar: I took my large dataset (5M rows by 57 columns), split it into much smaller batches of 1M rows each, and also used a reduced profile, and I was able to profile much more quickly. The results from each of these batches can then be averaged, or the best one taken, or any other selection method applied. I couldn't do some additional checks like interactions, etc., but I'll figure that out next time.
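A rough sketch of this batching idea, again assuming a recent pandas-profiling release; the batch size, file name, and the use of minimal mode as the "reduced profile" are assumptions, not the commenter's exact code:

```python
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

BATCH_SIZE = 1_000_000

df = pd.read_csv("large_dataset.csv")  # placeholder, e.g. ~5M rows x 57 columns
n_batches = max(1, len(df) // BATCH_SIZE)

for i, batch in enumerate(np.array_split(df, n_batches)):
    # Reduced profile: minimal mode skips the most expensive computations
    report = ProfileReport(batch, minimal=True)
    report.to_file(f"profile_batch_{i}.html")
```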
I'm also finding that processing 10 million+ rows of data takes hours. Some of these statistics appear to finish quickly, while others appear to take hours. I'm also seeing that graph generation takes a significant chunk of time. Could there be documentation describing which statistics have worse than O(n) algorithmic complexity? I would think that all of the correlations fall into that bucket, but I've disabled all of those and it still takes multiple hours to run. In addition, I don't see multiple cores actually in use, despite using the default for
So I am finding that check_recoded, check_correlation, and check_correlation_pearson no longer exist. Passing a correlations dictionary with the 5 correlations set to False also didn't work; it failed with an error saying correlations.pearson should be a collection instead of a boolean.
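For reference, a hedged sketch of the dictionary shape that recent pandas-profiling versions appear to expect: each correlation entry is a mapping rather than a plain boolean, which would explain the "should be a collection instead of a boolean" error. The exact keys and the file path below are assumptions, not confirmed in this thread:

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data.csv")  # placeholder

profile = ProfileReport(
    df,
    correlations={
        "pearson":  {"calculate": False},
        "spearman": {"calculate": False},
        "kendall":  {"calculate": False},
        "phi_k":    {"calculate": False},
        "cramers":  {"calculate": False},
    },
)
profile.to_file("report.html")
```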
I'm using a CSV file from opendata.
The data is not perfectly clean, but it works without issue in pandas.
When I try to run pandas_profiling.ProfileReport() on it, Jupyter goes into a very, very long run, using a lot of CPU and never returning control.
The file is not very big, only 36,700 records. I've attached it to this issue.
If you can help...
Thanks a lot, and also for your nice idea!
villes_france.csv.zip
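A minimal sketch of the call described in this report, assuming default read_csv options; only the attached file name comes from the issue:

```python
import pandas as pd
import pandas_profiling

# Load the attached dataset (path assumed after unzipping villes_france.csv.zip)
df = pd.read_csv("villes_france.csv")

# Generate the profile and write it out rather than rendering inline in Jupyter
report = pandas_profiling.ProfileReport(df)
report.to_file("villes_france_report.html")
```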