support another structural output(xml/json)? #224

Closed
huaiweicheng opened this issue Jul 25, 2019 · 4 comments
Labels
feature request 💬 Requests for new features getting started ☝ Straight-forward for beginning contributors

Comments

@huaiweicheng

huaiweicheng commented Jul 25, 2019

I use pandas_profiling every day to check my data and get to know my new production data.
It greatly improves the efficiency of data quality checking. Thanks to the contributors for saving my life! Life is short, use Pandas Profiling!
However, I have found two problems that cannot be handled well right now.

  1. When the dataset has many columns (>200), the generated html report cannot be opened in IE or Chrome; the browser crashes.
  2. The most common scenario is that I want to compare the data between two days.
    For example, I have a dataframe with a column called A.
    The missing rate of A is usually in the range of 25% to 35%. If the missing rate of A in newly produced data falls outside this range, I want to generate a warning.

Both problems require recording the statistics in a file format such as xml/json on a daily basis. With html, it is not convenient to extract the statistics.
However, I could not find any output format in pandas_profiling other than html.
The html is great, but sometimes the report is too big to open. If I could store the data and choose some important columns to present, a report containing only those columns could be generated, and it could be opened because the html would be relatively small.
Besides, the analysis in problem 2 could be done by a higher-level program that compares the xml/json files generated on two days.

I think this proposal will make pandas_profiling even greater!
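
The warning described in point 2 could be sketched roughly as follows, assuming each day's statistics are stored as JSON. The layout (`variables` -> column -> `p_missing`) and the function name `check_missing_rate` are assumptions made for illustration, not an existing pandas_profiling API.

```python
import json


def check_missing_rate(stats, column, low, high):
    """Return a warning string if the missing rate is outside [low, high], else None."""
    rate = stats["variables"][column]["p_missing"]
    if low <= rate <= high:
        return None
    return (
        f"WARNING: missing rate of {column!r} is {rate:.1%}, "
        f"expected {low:.0%} to {high:.0%}"
    )


# Hypothetical statistics for two days, as they might be loaded from JSON files.
day1 = json.loads('{"variables": {"A": {"p_missing": 0.30}}}')
day2 = json.loads('{"variables": {"A": {"p_missing": 0.42}}}')

for name, stats in [("day1", day1), ("day2", day2)]:
    message = check_missing_rate(stats, "A", low=0.25, high=0.35)
    print(name, message or "ok")
```

The expected range is passed in explicitly, so the same check can be reused for any column whose usual missing rate is known.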

huaiweicheng added the feature request 💬 Requests for new features label Jul 25, 2019
huaiweicheng changed the title from "support another structural output?" to "support another structural output(xml/json)?" Jul 25, 2019
@sbrugman
Collaborator

Some quick comments.

Regarding the first point:
#222 also pointed out that there are problems with datasets that have many columns. The examples all have around 10-15 columns. I am planning to make the report adapt to the number of columns.

Anyone can contribute by providing open datasets that are representative of larger numbers of columns (e.g. 40-100 and 100+).

Regarding the second point:
#198 and #173 propose a solution that could be helpful here.

Providing the output as xml/json is definitely an option. This is a great place for beginning contributors to start. Just call:

profile = df.profile_report()
profile.get_description()
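
As a rough sketch of how such a dictionary could be turned into JSON using only the standard library: the `description` dict below is a hand-written stand-in, not the real return value of `get_description()`, whose exact structure is not shown in this thread. The helper `encode_fallback` is likewise hypothetical; it handles values that `json` cannot encode directly.

```python
import json
from datetime import datetime


def encode_fallback(value):
    """Convert values that json cannot encode natively."""
    if isinstance(value, datetime):
        return value.isoformat()
    # Last resort: fall back to the string representation.
    return str(value)


# Stand-in for a description dictionary (structure is an assumption).
description = {
    "table": {"n": 1000, "n_var": 2, "date_start": datetime(2019, 7, 25)},
    "variables": {"A": {"p_missing": 0.3, "type": "NUM"}},
}

text = json.dumps(description, default=encode_fallback, indent=2)
restored = json.loads(text)
print(restored["table"]["date_start"])
```

Note that the round trip is lossy for non-JSON types: the datetime comes back as a string, so a reader has to know which fields to parse again.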

sbrugman added the getting started ☝ Straight-forward for beginning contributors label Jul 25, 2019
@huaiweicheng
Author

huaiweicheng commented Jul 26, 2019

Thanks for the comments.
I looked at the related issues and see that the second point bothers other people too.
I also ran the get_description method and looked at the output dict.

I have a rough idea; corrections and other ideas are welcome.
The workflow would be: dataframe -> description file (xml/json) -> html.

We could recover the standalone html report from a single description file. This would generate the same html output as today.

profile1 = read_json('sample1.json')
profile1.to_file(output_file="sample1.html")

Furthermore, we could implement a compare API.

profile2 = read_json('sample2.json')
profile2.to_file(output_file="sample2.html")
diff_profile = profile1.diff(profile2)
diff_profile.to_file(output_file="diff_between_sample1_sample2.html")

diff_profile could be a class similar to profile.

Alternatively, diff_profile could itself be a "profile" class. In that case we would need to extend the API of the current profile class to support list/array input and provide more complex comparison methods in this class.

This way, we could recover the current html report from xml/json, and also provide a more flexible tool for comparing datasets, such as training and test datasets in ML.
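
The comparison step proposed above could be prototyped on plain dictionaries before any `diff` method exists. This is only a sketch: `diff_descriptions` is a hypothetical helper, and the `variables` -> column -> statistic layout is an assumption about what the description files contain.

```python
def diff_descriptions(old, new):
    """Return {column: {stat: (old_value, new_value)}} for numeric stats that changed."""
    changes = {}
    old_vars = old.get("variables", {})
    new_vars = new.get("variables", {})
    # Only columns present in both descriptions are compared.
    for column in sorted(set(old_vars) & set(new_vars)):
        for stat, old_value in old_vars[column].items():
            new_value = new_vars[column].get(stat)
            both_numeric = isinstance(old_value, (int, float)) and isinstance(
                new_value, (int, float)
            )
            if both_numeric and old_value != new_value:
                changes.setdefault(column, {})[stat] = (old_value, new_value)
    return changes


# Hypothetical contents of sample1.json and sample2.json.
sample1 = {"variables": {"A": {"p_missing": 0.30, "mean": 5.0}, "B": {"p_missing": 0.0}}}
sample2 = {"variables": {"A": {"p_missing": 0.42, "mean": 5.0}, "B": {"p_missing": 0.0}}}

print(diff_descriptions(sample1, sample2))
```

Columns that exist in only one of the two files are ignored here; a fuller version would probably report them separately.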

@marco-cardoso
Contributor

marco-cardoso commented Sep 20, 2019

Hello @sbrugman, I'd like to better understand how the method that generates the XML/JSON output should work. I have some questions:

  1. Is the idea simply a method exactly like to_file(), but one that saves JSON/XML, where the user can specify the output path?
  2. Would the resulting file follow the structure of the dictionary returned by get_description()?

Any more details about this functionality would be appreciated.

Thanks for your attention.
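
One possible reading of question 1 is a writer that picks the format from the file extension. The sketch below is an assumption about how that could look, not the implementation that was later merged; `write_description` and `dict_to_xml` are hypothetical names, and the description dict is a hand-written stand-in.

```python
import json
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def dict_to_xml(tag, value):
    """Recursively convert a nested dict into an ElementTree element.

    Keys are used as tag names unchanged, so they must be valid XML names.
    """
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(dict_to_xml(str(key), child))
    else:
        element.text = str(value)
    return element


def write_description(description, output_file):
    """Write the description as JSON or XML, chosen by the file extension."""
    path = Path(output_file)
    suffix = path.suffix.lower()
    if suffix == ".json":
        path.write_text(json.dumps(description, indent=2, default=str), encoding="utf-8")
    elif suffix == ".xml":
        tree = ET.ElementTree(dict_to_xml("description", description))
        tree.write(str(path), encoding="utf-8", xml_declaration=True)
    else:
        raise ValueError(f"Unsupported extension: {suffix!r}")
    return path


description = {"variables": {"A": {"p_missing": 0.3}}}

with tempfile.TemporaryDirectory() as tmp:
    json_path = write_description(description, Path(tmp) / "sample1.json")
    loaded = json.loads(json_path.read_text(encoding="utf-8"))
    xml_path = write_description(description, Path(tmp) / "sample1.xml")
    root = ET.parse(str(xml_path)).getroot()
```

JSON maps onto the dictionary structure directly, while XML needs a convention for tag names and value types, which may be one reason to start with JSON.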

sbrugman added a commit that referenced this issue Jan 21, 2020
- Feature as requested in #224
- Many thanks @marco-cardoso for your initial implementation #225
sbrugman added a commit that referenced this issue Feb 2, 2020
* Progress bar implementation

- Feature as requested in #224
- Test for #282
- Many thanks @marco-cardoso for your initial implementation #225
- Display no progress bar for disabled modules (e.g. individual correlations).
- Update requirements, notebooks, docs, examples, linting

* Decouple notebooks and notebook tests. One test hangs on issue in nbval:
computationalmodelling/nbval#136

* Disable missing plots in minimal mode

* Create additional demo with Chicago employees data

* Compartmentalize column sorting in describe module
sbrugman added a commit that referenced this issue Feb 14, 2020
- Progress bar added (#224)
- Character analysis for Text/NLP (#278)
- Themes: configuration and demo's (Orange, Dark)
- Tutorial on modifying the report's structure (#362; #281, #259, #253, #234). This jupyter notebook also demonstrates how to use the Kaggle api together with pandas-profiling.
- Toggle descriptions at correlations.

Deprecation:

- This is the last version to support Python 3.5.

Stability:

- The order of columns changed when sort="None" (#377, fixed).
- Pandas v1.0.X is not yet supported (#367, #366, #363, #353, pinned pandas to < 1)
- Improved mixed type detection (#351)
- Refactor of report structures.
- Correlations are more stable (e.g. Phi_k color scale now from 0-1, rows and columns with NaN values are dropped, #329).
- Distinct counts exclude NaNs.
- Fixed alerts in notebooks.

Other improvements:

- Warnings are now sorted.
- Links to Binder and Google Colab are added for notebooks (#349)
- The overview section is tabbed.
sbrugman added a commit that referenced this issue Feb 14, 2020
* Commit for pandas-profiling v2.5.0
@sbrugman
Collaborator

to_file('file.json') and .to_json() are now available, closing this issue.
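
Once the statistics are available as JSON, the original request to keep only some important columns can be handled outside the library. A minimal sketch, assuming the JSON has a top-level `variables` object keyed by column name (an assumption, as is the helper name `select_columns`):

```python
import json


def select_columns(description, columns):
    """Return a shallow copy of the description restricted to the given columns."""
    wanted = set(columns)
    filtered = dict(description)
    filtered["variables"] = {
        name: stats
        for name, stats in description.get("variables", {}).items()
        if name in wanted
    }
    return filtered


# Hypothetical content of a stored JSON report.
report_json = (
    '{"table": {"n_var": 3}, "variables": {'
    '"A": {"p_missing": 0.3}, "B": {"p_missing": 0.0}, "C": {"p_missing": 0.1}}}'
)
description = json.loads(report_json)
subset = select_columns(description, ["A", "C"])
print(list(subset["variables"]))
```

The input dictionary is left unchanged, so the full statistics stay available for later comparisons.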

chanedwin pushed a commit to chanedwin/pandas-profiling that referenced this issue Oct 11, 2020