support another structural output(xml/json)? #224

huaiweicheng · 2019-07-25T05:55:05Z

I use pandas_profiling every day to check my data and get to know my new production data.
It greatly improves the efficiency of data quality checking. Thanks to the contributors for saving my life! Life is short, use Pandas Profiling!
However, I have found two problems that cannot be handled well right now.

1. When the dataset has many columns (>200), the generated HTML report cannot be opened in IE or Chrome; the browser crashes.
2. My most common scenario is comparing the data between two days.
For example, I have a dataframe with a column called A.
The missing rate of A is usually in the range of 25% to 35%. If the missing rate of A in newly produced data falls outside this range, I want to generate a warning.

Both problems require recording the statistics in a file format such as XML/JSON on a daily basis. With HTML, it is not convenient to extract the statistics.
However, I cannot find any output format in pandas_profiling other than HTML.
The HTML report is great, but sometimes it is too big to open. If I could store the statistics and choose some important columns to present, the report would contain only those columns and could be opened, since the HTML would be relatively small.
Besides, the analysis in problem 2 could be done by a higher-level program that compares the XML/JSON files generated on two days.
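
To make the second problem concrete, here is a minimal sketch of the kind of check a higher-level program could run on stored statistics. The layout of the statistics (a "variables" mapping with a "p_missing" fraction per column) and the function name check_missing_rate are assumptions for illustration, not an existing pandas_profiling output format.

```python
import json
import warnings


def check_missing_rate(description, column, low=0.25, high=0.35):
    """Return True if the column's missing rate lies within [low, high]."""
    # Assumed layout: {"variables": {column: {"p_missing": fraction}}}
    rate = description["variables"][column]["p_missing"]
    in_range = low <= rate <= high
    if not in_range:
        warnings.warn(
            f"Missing rate of {column!r} is {rate:.1%}, "
            f"outside the expected range {low:.0%} to {high:.0%}"
        )
    return in_range


# Hypothetical statistics for two daily snapshots, as they might look
# after json.load() on the stored files.
yesterday = json.loads('{"variables": {"A": {"p_missing": 0.30}}}')
today = json.loads('{"variables": {"A": {"p_missing": 0.48}}}')

print(check_missing_rate(yesterday, "A"))  # inside the range
print(check_missing_rate(today, "A"))      # outside the range, emits a warning
```

With the statistics stored as plain JSON, a check like this can run in a daily job without opening the HTML report at all.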

I think this proposal would make pandas_profiling even greater!

sbrugman · 2019-07-25T09:22:48Z

Some quick comments.

Regarding the first point:
#222 also pointed out that there are problems with datasets with more columns. The examples all have around 10-15 columns. I am planning to make the report adaptive to the number of columns.

Anyone can contribute by providing open datasets that are representative of larger numbers of columns (e.g. 40-100 and 100+).

Regarding the second point:
#198 and #173 propose a solution that could be helpful here.

Providing the output as xml/json is definitely an option. This is a great place for beginning contributors to start. Just call:

profile = df.profile_report()
profile.get_description()
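
As a rough sketch of what a contributor might do with that dictionary: the description can contain objects that the json module cannot serialize directly (for example pandas or NumPy values), so they need converting first. The helper to_jsonable and the FakeSeries stand-in below are hypothetical names for illustration, and the shape of the description dict is assumed rather than taken from the library.

```python
import json
import math


def to_jsonable(value):
    """Recursively convert a nested description into JSON-friendly types."""
    if isinstance(value, dict):
        return {str(key): to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_jsonable(item) for item in value]
    if isinstance(value, float) and math.isnan(value):
        return None  # NaN is not valid in strict JSON
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    # pandas objects expose to_dict(); NumPy arrays and scalars expose
    # tolist(). Duck typing keeps this sketch free of third-party imports.
    for method in ("to_dict", "tolist"):
        converter = getattr(value, method, None)
        if callable(converter):
            return to_jsonable(converter())
    return str(value)  # last resort for anything else


class FakeSeries:
    """Hypothetical stand-in for a pandas Series in this sketch."""

    def __init__(self, data):
        self._data = data

    def to_dict(self):
        return dict(self._data)


# Assumed shape; the real get_description() output may differ.
description = {
    "table": {"n": 100, "n_var": 1},
    "variables": {"A": FakeSeries({"p_missing": 0.3, "mean": float("nan")})},
}

text = json.dumps(to_jsonable(description), indent=2)
print(text)
```

Falling back to str() keeps the export from failing on unexpected types, at the cost of losing their structure.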

huaiweicheng · 2019-07-26T02:18:41Z

Thanks for the comments.
I looked at the related issues and see that the second point seems to bother other people as well.
I also ran the get_description method and looked at the output dict.

I have a rough idea; corrections and other suggestions are welcome.
The workflow would be: dataframe -> description file (xml/json) -> html.

We could recover the standalone HTML report from a single description file, generating the same HTML output as today.

profile1 = read_json('sample1.json')
profile1.to_file(output_file="sample1.html")

Furthermore, we could implement a compare API.

profile2 = read_json('sample2.json')
profile2.to_file(output_file="sample2.html")
diff_profile = profile1.diff(profile2)
diff_profile.to_file(output_file="diff_between_sample1_sample2.html")

The diff_profile could be a class similar to profile.

Alternatively, diff_profile could itself be an instance of the "profile" class. Then we would need to extend the API/methods of the current profile class to support list/array inputs and provide more complex comparison methods in this class.

Thus, we could recover the current HTML report from XML/JSON and also provide a more flexible tool for comparing datasets, such as training and testing datasets in ML.
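
The read_json and diff calls above are a proposal, not an existing API. Independent of how they would be implemented, the core comparison can be sketched on plain dictionaries loaded from two description files. The function name diff_descriptions, the "variables" layout, and the statistic names are assumptions for illustration.

```python
def diff_descriptions(old, new, stats=("p_missing", "mean")):
    """Return {column: {stat: (old_value, new_value, change)}} for shared columns."""
    result = {}
    # Only columns present in both snapshots can be compared.
    shared = old["variables"].keys() & new["variables"].keys()
    for column in sorted(shared):
        before, after = old["variables"][column], new["variables"][column]
        changes = {}
        for stat in stats:
            if stat in before and stat in after:
                changes[stat] = (before[stat], after[stat], after[stat] - before[stat])
        if changes:
            result[column] = changes
    return result


# Hypothetical contents of sample1.json and sample2.json after loading.
sample1 = {"variables": {"A": {"p_missing": 0.30, "mean": 10.0},
                         "B": {"p_missing": 0.00}}}
sample2 = {"variables": {"A": {"p_missing": 0.48, "mean": 12.5},
                         "C": {"p_missing": 0.10}}}

diff = diff_descriptions(sample1, sample2)
for column, changes in diff.items():
    for stat, (before, after, change) in changes.items():
        print(f"{column}.{stat}: {before} -> {after} ({change:+.2f})")
```

Columns that exist in only one snapshot are skipped here; a fuller comparison would probably report them as added or removed.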

marco-cardoso · 2019-09-20T03:16:29Z

Hello @sbrugman, I'd like to better understand how the method for generating XML/JSON output would work. I have some questions:

Is the idea just a method that works exactly like to_file() but saves JSON/XML, with the user specifying the output path?
Would the resulting file follow the structure of the dictionary returned by get_description()?

Any more details about this functionality would be appreciated.

Thanks for your attention.

@marco-cardoso

* Progress bar implementation
  - Feature as requested in #224
  - Test for #282
  - Many thanks @marco-cardoso for your initial implementation #225
  - Display no progress bar for disabled modules (e.g. individual correlations).
  - Update requirements, notebooks, docs, examples, linting
* Decouple notebooks and notebook tests. One test hangs on issue in nbval: computationalmodelling/nbval#136
* Disable missing plots in minimal mode
* Create additional demo with Chicago employees data
* Compartmentalize column sorting in describe module

- Progress bar added (#224)
- Character analysis for Text/NLP (#278)
- Themes: configuration and demo's (Orange, Dark)
- Tutorial on modifying the report's structure (#362; #281, #259, #253, #234). This jupyter notebook also demonstrates how to use the Kaggle api together with pandas-profiling.
- Toggle descriptions at correlations.

Deprecation:
- This is the last version to support Python 3.5.

Stability:
- The order of columns changed when sort="None" (#377, fixed).
- Pandas v1.0.X is not yet supported (#367, #366, #363, #353, pinned pandas to < 1)
- Improved mixed type detection (#351)
- Refactor of report structures.
- Correlations are more stable (e.g. Phi_k color scale now from 0-1, rows and columns with NaN values are dropped, #329).
- Distinct counts exclude NaNs.
- Fixed alerts in notebooks.

Other improvements:
- Warnings are now sorted.
- Links to Binder and Google Colab are added for notebooks (#349)
- The overview section is tabbed.

sbrugman · 2020-02-14T11:04:20Z

to_file('file.json') and .to_json() are now available, closing this issue.
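
For the large-report problem raised at the top of the thread, a JSON file written this way can be read back and trimmed to a few important columns. This is only a sketch: the top-level "variables" key is an assumption about the JSON layout, load_selected_columns is a hypothetical helper, and the file written below is a stand-in rather than real library output.

```python
import json
import tempfile
from pathlib import Path


def load_selected_columns(path, columns):
    """Load a JSON report and keep only the statistics of the given columns."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    variables = report.get("variables", {})  # assumed top-level key
    missing = [name for name in columns if name not in variables]
    if missing:
        raise KeyError(f"Columns not found in report: {missing}")
    return {name: variables[name] for name in columns}


# Stand-in for a file produced by to_file('file.json'); contents are hypothetical.
stand_in = {"variables": {"A": {"p_missing": 0.3},
                          "B": {"p_missing": 0.0},
                          "C": {"p_missing": 0.1}}}

with tempfile.TemporaryDirectory() as tmp:
    report_path = Path(tmp) / "file.json"
    report_path.write_text(json.dumps(stand_in), encoding="utf-8")
    selected = load_selected_columns(report_path, ["A", "C"])

print(sorted(selected))
```

Raising on unknown column names, instead of silently dropping them, makes typos in the column list visible early.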

huaiweicheng added the feature request 💬 Requests for new features label Jul 25, 2019

huaiweicheng changed the title ~~support another structural output?~~ support another structural output(xml/json)? Jul 25, 2019

sbrugman added the getting started ☝ Straight-forward for beginning contributors label Jul 25, 2019

sbrugman added a commit that referenced this issue Jan 21, 2020

Progress bar implementation

9805bca

- Feature as requested in #224 - Many thanks @marco-cardoso for your initial implementation #225

sbrugman mentioned this issue Jan 21, 2020

Progress bar implementation #345

Merged

sbrugman mentioned this issue Feb 14, 2020

Commit for pandas-profiling v2.5.0 #380

Merged

sbrugman closed this as completed Feb 14, 2020
