Memory error #227

Closed · kirancrazy99 opened this issue Jul 30, 2019 · 10 comments

Labels: bug 🐛 Something isn't working · getting started ☝ Straight-forward for beginning contributors

Comments

@kirancrazy99 commented Jul 30, 2019

Hi,

I've tried to load around 6 million rows from a database into a pandas DataFrame, but in the end it gives me a MemoryError. Is it not possible to do data profiling on larger datasets like this? If there is a solution, please let me know. I have also tried reading in chunks and concatenating them, but I get a MemoryError there too.

Please find more details below:

Using 16 GB RAM
dp = pandas_profiling.ProfileReport(df, check_correlation=False)
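The chunked attempt was along these lines (a sketch; the connection string, query, and chunk size are placeholders):

import pandas as pd
import sqlalchemy

# Placeholder connection; swap in your real database URL and query.
engine = sqlalchemy.create_engine("postgresql://user:pass@host/db")

# Reading in chunks bounds memory per read, but pd.concat still has to
# materialize all ~6 million rows at once, so the MemoryError persists.
chunks = pd.read_sql("SELECT * FROM source_table", engine, chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)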

@kirancrazy99 kirancrazy99 added the bug 🐛 Something isn't working label Jul 30, 2019
@sbrugman (Collaborator) commented Jul 30, 2019

The package should automatically detect the size of the dataframe and pick the best configuration. We are not there yet; in the meantime, you could try turning off some features (as mentioned in #222):

profile = pandas_profiling.ProfileReport(
    df,
    check_correlation_pearson=False,
    correlations={
        'pearson': False,
        'spearman': False,
        'kendall': False,
        'phi_k': False,
        'cramers': False,
        'recoded': False,
    },
    plot={'histogram': {'bayesian_blocks_bins': False}},
)
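Then render the report as usual (the keyword argument differs by version, roughly outputfile in 1.x versus output_file in 2.x, so check your installed version):

profile.to_file(output_file="report.html")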

@colby-vickerson

Any updates on using this with large dataframes?

@colby-vickerson

What about using a sample for larger datasets? Say your data is over 500,000 rows: sample only 500,000 rows to create the distribution analysis for each column. The summary at the top of the HTML would still have info on the total data, but the by-column analysis would only cover the sample population.
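A rough sketch of the idea (the 500,000 cutoff and names are just for illustration; df is the full DataFrame loaded earlier):

import pandas_profiling

SAMPLE_CAP = 500_000

# Per-column analysis runs on at most SAMPLE_CAP rows; the overview
# could still report len(df) for the full dataset.
sample = df.sample(n=SAMPLE_CAP, random_state=0) if len(df) > SAMPLE_CAP else df
profile = pandas_profiling.ProfileReport(sample)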

@aiqingtian

> What about using a sample for larger datasets? Say your data is over 500,000 rows: sample only 500,000 rows to create the distribution analysis for each column. The summary at the top of the HTML would still have info on the total data, but the by-column analysis would only cover the sample population.

There may be some practical issues with using a sample for large datasets. For example, for an ID column, some IDs may not be sampled at all, especially when those IDs make up only a small part of the whole dataset.
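One workaround, sketched below (not part of pandas-profiling; the function and names are hypothetical), is to keep at least one row per ID and fill the rest of the sample at random:

import pandas as pd

def stratified_sample(df: pd.DataFrame, key: str, n: int, seed: int = 0) -> pd.DataFrame:
    # Keep one representative row for every distinct value of `key`...
    heads = df.groupby(key, sort=False).head(1)
    # ...then top up to n rows with a random sample of the remainder.
    rest = df.drop(heads.index)
    take = min(max(n - len(heads), 0), len(rest))
    return pd.concat([heads, rest.sample(n=take, random_state=seed)])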

@colby-vickerson

Perhaps using Dask instead of pandas could also help with large data.
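For example (a rough sketch; pandas-profiling itself still expects a pandas DataFrame, so here Dask only does the out-of-core reading and downsampling, and the file path and fraction are placeholders):

import dask.dataframe as dd
import pandas_profiling

# Lazily read data that doesn't fit in memory, sample a fraction out of
# core, and materialize only the sample as a pandas DataFrame.
ddf = dd.read_csv("big_data-*.csv")
sample = ddf.sample(frac=0.05).compute()
profile = pandas_profiling.ProfileReport(sample)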

@hoangthienan95

Ditto on Dask. I have a huge cluster of 1,000 nodes that I could use to process the dataframe, but there's no way to connect it with a dask.distributed client.

@sbrugman (Collaborator) commented Sep 6, 2019

Thank you all for the discussion so far. Let's put some structure into solving this problem, so that effective collaboration is possible, as there is clearly demand to resolve this issue.

The first step should be to simulate a dataset of this magnitude for the supported types. This simulation stays constant while pandas-profiling evolves (we are currently working on a new version).
The simulation can be used to experiment and measure performance for large datasets. We can then start optimizing (e.g. via concurrency or other optimizations).

Is anyone interested in taking this up and providing a dataset simulation?

@sbrugman sbrugman added the getting started ☝ Straight-forward for beginning contributors label Sep 6, 2019
@dylanjcastillo

Hey @sbrugman, I think I can help with this.

You're thinking of generating a synthetic dataset using sklearn or numpy, right? Around what size?

I was thinking of 1 or 2 columns of each type and 100,000 rows. Does this make sense to you?
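For instance, something along these lines with numpy and pandas (a sketch; the column names and distributions are placeholders):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000

# One or two columns per supported type: numeric, categorical,
# boolean, datetime, and text.
df = pd.DataFrame({
    "num_normal": rng.normal(0, 1, n),
    "num_int": rng.integers(0, 1_000, n),
    "cat_small": rng.choice(["a", "b", "c"], n),
    "bool_flag": rng.integers(0, 2, n).astype(bool),
    "date": pd.Timestamp("2019-01-01") + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
    "text_id": [f"id_{i}" for i in range(n)],
})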

@MHUNCHO commented Nov 6, 2019

Really hope a solution can be found for scaling this to large datasets so this issue can be avoided.

@github-actions

Stale issue
