-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace DataFrames's default _repr_html_
(closes #76)
#175
base: master
Are you sure you want to change the base?
Conversation
DataFrame
s to allow custom styles (closes #76)
Codecov Report
@@ Coverage Diff @@
## master #175 +/- ##
=========================================
+ Coverage 81% 81.21% +0.2%
=========================================
Files 24 25 +1
Lines 1606 1634 +28
Branches 279 281 +2
=========================================
+ Hits 1301 1327 +26
- Misses 251 252 +1
- Partials 54 55 +1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome.
- Can you time it with a big enough dataset (>100k items) to see if there's a difference?
- What about
_key
which is df.index?
69b9401
to
e3c106b
Compare
_repr_html_
(closes #76)
_repr_html_
(closes #76)_repr_html_
(closes #76)
Here is a simple benchmark I did for measuring the runtime. # render_links_benchmark.py
import time
import arche
import pandas as pd
df = pd.read_json("./327565_39_252_items.jl", lines=True)
with pd.option_context("display.min_rows", 100_000, "display.max_rows", 100_000):
t = time.process_time()
out = df._repr_html_()
print(f"Time expended on `_repr_html_`: {time.process_time() - t}")
print(f"Len: {len(out)}") Executing it: $ for branch in master clickable_urls; do git checkout $branch; ./render_links_benchmark.py; done
Already on 'master'
Time expended on `_repr_html_`: 164.999099792
Len: 276061183
Switched to branch 'clickable_urls'
Time expended on `_repr_html_`: 208.39284144
Len: 322665630 It turns out that, for that dataset, rendering links generates about 17% more data and takes about 27% longer to complete. |
_repr_html_
(closes #76)_repr_html_
(closes #76)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't care that much about full rendering performance - people wouldn't want to render it.
I checked normal cases myself https://jupyter.scrapinghub.com/user/v/lab/tree/shared/Experiments/clickable_urls.ipynb, there's no difference
render_links=render_links, | ||
) | ||
formatter.to_html(notebook=True) | ||
return formatter.buf.getvalue() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I compared to the source, trying to find if there's an alternative to this hack and noticed this.
Why buf.getvalue()
is returned instead of formatted as in the source https://github.com/pandas-dev/pandas/blob/c23649143781c658f792e8f7a5b4368ed01f719c/pandas/core/frame.py#L724?
@pytest.fixture() | ||
def df_with_urls(): | ||
pd.set_option("display.notebook_repr_html", True) | ||
data = {"col1": [1, 2], "col2": ["http://foo.com", "https://bar.com"]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add index here too, that's the main use case.
@@ -0,0 +1,36 @@ | |||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with test files naming - tests/tools/test_dataframe.py
return pd.DataFrame(data) | ||
|
||
|
||
def test_df_has_clickable_urls(df_with_urls): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convention we use is test_function\class_description
.
Like test_df_repr_html_true
Does it look better? Mainly it's done for quick search.
assert "<a href=" not in html | ||
|
||
|
||
def test_large_repr(df_with_urls): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what this test does. Could you please clarify?
I added a subclass for DataFrame, in order to override itsto_html()
, allowing us to define some defaults styling, like the clickable URLs from #76 .I changed my approach on this feature. The way I've tried previously doesn't work on new
DataFrames
s created by common pandas functions (eg.df.head()
df[df['url'].notna()]
).Now I'm replacing the default's
_repr_html_()
method fromDataFrames
.Let me know your thougths on this one.