Memory error #227

Closed · kirancrazy99 opened this issue Jul 30, 2019 · 10 comments

Labels: bug 🐛 Something isn't working · getting started ☝ Straight-forward for beginning contributors

Comments

@kirancrazy99 commented Jul 30, 2019

Hi,

I've tried to load around 6 million rows from a database into a pandas DataFrame, but in the end it gives me a MemoryError. Is it not possible to do data profiling on larger datasets like this? If there is a solution, please let me know. I have also tried reading in chunks and concatenating them, but I get a MemoryError there too.

Please find more details below:

Using 16 GB RAM
dp = pandas_profiling.ProfileReport(df, check_correlation=False)
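The chunked attempt was along these lines (a sketch; the connection string, query, and chunk size are placeholders):

import pandas as pd
import sqlalchemy

# Placeholder connection; swap in your real database URL and query.
engine = sqlalchemy.create_engine("postgresql://user:pass@host/db")

# Reading in chunks bounds memory per read, but pd.concat still has to
# materialize all ~6 million rows at once, so the MemoryError persists.
chunks = pd.read_sql("SELECT * FROM source_table", engine, chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)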

@kirancrazy99 kirancrazy99 added the bug 🐛 Something isn't working label Jul 30, 2019
@sbrugman (Collaborator) commented Jul 30, 2019

The package should automatically detect the size of the dataframe and pick the best configuration. We are not there yet; in the meantime, you could try turning off some features (as mentioned in #222):

profile = pandas_profiling.ProfileReport(
    df,
    check_correlation_pearson=False,
    correlations={
        'pearson': False,
        'spearman': False,
        'kendall': False,
        'phi_k': False,
        'cramers': False,
        'recoded': False,
    },
    plot={'histogram': {'bayesian_blocks_bins': False}},
)
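Then render the report as usual (the keyword argument differs by version, roughly outputfile in 1.x versus output_file in 2.x, so check your installed version):

profile.to_file(output_file="report.html")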

@colby-vickerson

Any updates on using this with large dataframes?

@colby-vickerson

What about using a sample for larger datasets? Say your data is over 500,000 rows: sample only 500,000 rows to create the distribution analysis for each column. The summary at the top of the HTML would still have info on the total data, but the by-column analysis would only cover the sample population.
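A rough sketch of the idea (the 500,000 cutoff and names are just for illustration; df is the full DataFrame loaded earlier):

import pandas_profiling

SAMPLE_CAP = 500_000

# Per-column analysis runs on at most SAMPLE_CAP rows; the overview
# could still report len(df) for the full dataset.
sample = df.sample(n=SAMPLE_CAP, random_state=0) if len(df) > SAMPLE_CAP else df
profile = pandas_profiling.ProfileReport(sample)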

@aiqingtian

> What about using a sample for larger datasets? Say your data is over 500,000 rows: sample only 500,000 rows to create the distribution analysis for each column. The summary at the top of the HTML would still have info on the total data, but the by-column analysis would only cover the sample population.

There may be some practical issues with using a sample for large datasets. For example, for an ID column, some IDs may not be sampled at all, especially when those IDs make up only a small part of the whole dataset.
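One workaround, sketched below (not part of pandas-profiling; the function and names are hypothetical), is to keep at least one row per ID and fill the rest of the sample at random:

import pandas as pd

def stratified_sample(df: pd.DataFrame, key: str, n: int, seed: int = 0) -> pd.DataFrame:
    # Keep one representative row for every distinct value of `key`...
    heads = df.groupby(key, sort=False).head(1)
    # ...then top up to n rows with a random sample of the remainder.
    rest = df.drop(heads.index)
    take = min(max(n - len(heads), 0), len(rest))
    return pd.concat([heads, rest.sample(n=take, random_state=seed)])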

@colby-vickerson

Perhaps using Dask instead of pandas could also help with large data.
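For example (a rough sketch; pandas-profiling itself still expects a pandas DataFrame, so here Dask only does the out-of-core reading and downsampling, and the file path and fraction are placeholders):

import dask.dataframe as dd
import pandas_profiling

# Lazily read data that doesn't fit in memory, sample a fraction out of
# core, and materialize only the sample as a pandas DataFrame.
ddf = dd.read_csv("big_data-*.csv")
sample = ddf.sample(frac=0.05).compute()
profile = pandas_profiling.ProfileReport(sample)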

@hoangthienan95

Ditto on Dask. I have a huge cluster of 1,000 nodes that I could use to process the dataframe, but there's no way to connect it with a dask.distributed client.

@sbrugman (Collaborator) commented Sep 6, 2019

Thank you all for the discussion so far. Let's put some structure into solving this problem, so that effective collaboration is possible, as there is clearly demand to resolve this issue.

The first step should be to simulate a dataset of this magnitude for the supported types. This simulation stays constant while pandas-profiling evolves (we are currently working on a new version).
The simulation can be used to experiment and measure performance for large datasets. We can then start optimizing (e.g. via concurrency or other optimizations).

Is anyone interested in taking this up and providing a dataset simulation?

@sbrugman sbrugman added the getting started ☝ Straight-forward for beginning contributors label Sep 6, 2019
@dylanjcastillo

Hey @sbrugman, I think I can help with this.

You're thinking of generating a synthetic dataset using sklearn or numpy, right? Around what size?

I was thinking of 1 or 2 columns of each type and 100,000 rows. Does this make sense to you?
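For instance, something along these lines with numpy and pandas (a sketch; the column names and distributions are placeholders):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000

# One or two columns per supported type: numeric, categorical,
# boolean, datetime, and text.
df = pd.DataFrame({
    "num_normal": rng.normal(0, 1, n),
    "num_int": rng.integers(0, 1_000, n),
    "cat_small": rng.choice(["a", "b", "c"], n),
    "bool_flag": rng.integers(0, 2, n).astype(bool),
    "date": pd.Timestamp("2019-01-01") + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
    "text_id": [f"id_{i}" for i in range(n)],
})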

@MHUNCHO commented Nov 6, 2019

Really hope a solution can be found for scaling this to large datasets so this issue can be avoided.

@github-actions

Stale issue
