Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace DataFrames's default _repr_html_ (closes #76) #175

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

tcurvelo
Copy link
Contributor

@tcurvelo tcurvelo commented Oct 20, 2019

I added a subclass for DataFrame, in order to override its to_html(), allowing us to define some defaults styling, like the clickable URLs from #76 .

I changed my approach on this feature. The way I've tried previously doesn't work on new DataFramess created by common pandas functions (eg. df.head() df[df['url'].notna()]).
Now I'm replacing the default's _repr_html_() method from DataFrames.

Let me know your thougths on this one.

@tcurvelo tcurvelo requested a review from manycoding October 20, 2019 22:34
@tcurvelo tcurvelo changed the title Wrap DataFrames to allow custom styles (closes #76) Wrap DataFrame to allow styling it (closes #76) Oct 20, 2019
@codecov
Copy link

codecov bot commented Oct 20, 2019

Codecov Report

Merging #175 into master will increase coverage by 0.2%.
The diff coverage is 92.85%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #175     +/-   ##
=========================================
+ Coverage      81%   81.21%   +0.2%     
=========================================
  Files          24       25      +1     
  Lines        1606     1634     +28     
  Branches      279      281      +2     
=========================================
+ Hits         1301     1327     +26     
- Misses        251      252      +1     
- Partials       54       55      +1
Impacted Files Coverage Δ
src/arche/__init__.py 100% <100%> (ø) ⬆️
src/arche/tools/dataframe.py 91.66% <91.66%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 476a9dd...0655121. Read the comment docs.

Copy link
Contributor

@manycoding manycoding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome.

  1. Can you time it with a big enough dataset (>100k items) to see if there's a difference?
  2. What about _key which is df.index?

src/arche/readers/items.py Outdated Show resolved Hide resolved
@tcurvelo tcurvelo changed the title Wrap DataFrame to allow styling it (closes #76) Replace DataFrames's default _repr_html_ (closes #76) Oct 24, 2019
@tcurvelo tcurvelo changed the title Replace DataFrames's default _repr_html_ (closes #76) [WIP] Replace DataFrames's default _repr_html_ (closes #76) Oct 25, 2019
@tcurvelo
Copy link
Contributor Author

Here is a simple benchmark I did for measuring the runtime.
Below is the script I used. It loads a dataset of 100K+ items and prints its HTML representation. I forced it to display 100_000 lines instead of truncate them.

# render_links_benchmark.py
import time
import arche
import pandas as pd

df = pd.read_json("./327565_39_252_items.jl", lines=True)

with pd.option_context("display.min_rows", 100_000, "display.max_rows", 100_000):
    t = time.process_time()
    out = df._repr_html_()
    print(f"Time expended on `_repr_html_`: {time.process_time() - t}")
    print(f"Len: {len(out)}")

Executing it:

$ for branch in master clickable_urls; do git checkout $branch; ./render_links_benchmark.py; done
Already on 'master'
Time expended on `_repr_html_`: 164.999099792
Len: 276061183
Switched to branch 'clickable_urls'
Time expended on `_repr_html_`: 208.39284144
Len: 322665630

It turns out that, for that dataset, rendering links generates about 17% more data and takes about 27% longer to complete.

@tcurvelo tcurvelo changed the title [WIP] Replace DataFrames's default _repr_html_ (closes #76) Replace DataFrames's default _repr_html_ (closes #76) Nov 18, 2019
@tcurvelo tcurvelo requested a review from manycoding November 18, 2019 19:15
Copy link
Contributor

@manycoding manycoding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't care that much about full rendering performance - people wouldn't want to render it.
I checked normal cases myself https://jupyter.scrapinghub.com/user/v/lab/tree/shared/Experiments/clickable_urls.ipynb, there's no difference

render_links=render_links,
)
formatter.to_html(notebook=True)
return formatter.buf.getvalue()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I compared to the source, trying to find if there's an alternative to this hack and noticed this.

Why buf.getvalue() is returned instead of formatted as in the source https://github.com/pandas-dev/pandas/blob/c23649143781c658f792e8f7a5b4368ed01f719c/pandas/core/frame.py#L724?

@pytest.fixture()
def df_with_urls():
pd.set_option("display.notebook_repr_html", True)
data = {"col1": [1, 2], "col2": ["http://foo.com", "https://bar.com"]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add index here too, that's the main use case.

@@ -0,0 +1,36 @@
import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with test files naming - tests/tools/test_dataframe.py

return pd.DataFrame(data)


def test_df_has_clickable_urls(df_with_urls):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention we use is test_function\class_description.

Like test_df_repr_html_true

Does it look better? Mainly it's done for quick search.

assert "<a href=" not in html


def test_large_repr(df_with_urls):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what this test does. Could you please clarify?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants