test(python): add parametric tests for groupby_dynamic #9334
Conversation
Mind, I'm pretty ignorant of the problem domain here, but it looks good to me!

On purposely restricting strategies so they don't trigger known bugs: I think it makes sense here, given the sheer number of permutations. Ideally one would have the strategies cover all supposedly-valid input and slap an xfail on when needed, but that gets annoying fast when you have such a complex domain with multiple known bugs.

I did a diff on the two tests and it does look annoying enough to refactor, not that refactoring does much good when things have only been repeated once. Just to say, if there were a way to have these tests neatly refactored (i.e. a big `_test_against_pandas` function called in both tests, or a parametrized arg like `test_weekly: bool` in a single test), defo play around with it if you haven't already.
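The trade-off discussed above can be sketched with a toy property (a hypothetical example; the test names and the 2038 cutoff are illustrative stand-ins for a known bug, not the PR's actual tests):

```python
import datetime as dt

import pytest
from hypothesis import given, strategies as st

# Approach 1: restrict the strategy so known-buggy inputs are never generated.
safe_datetimes = st.datetimes(
    min_value=dt.datetime(1980, 1, 1),
    max_value=dt.datetime(2038, 1, 1),
)


@given(ts=safe_datetimes)
def test_property_restricted(ts: dt.datetime) -> None:
    # stand-in for the real property under test
    assert 1980 <= ts.year <= 2038


# Approach 2: cover the full domain and mark the known-bad slice xfail.
@pytest.mark.xfail(reason="hypothetical known bug past 2038", strict=False)
@given(ts=st.datetimes(min_value=dt.datetime(2038, 1, 2)))
def test_property_past_2038(ts: dt.datetime) -> None:
    assert ts.year >= 2038
```

Approach 1 keeps the test suite green but silently narrows coverage; approach 2 documents the gap explicitly, at the cost of extra test plumbing per known bug.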
```python
    data: st.DataObject,
) -> None:
    pl_every, pd_alias = every_alias
    assume(timezone in zoneinfo.available_timezones())
```
Could you instead construct the strategy for `timezone` to just be the available timezones? e.g. `timezone=st.sampled_from(zoneinfo.available_timezones())`

At the moment these tests are very bespoke, in that they don't use any of the existing polars hypothesis primitives/strategies; it would be nice if we could pull out the datetime-with-timezone generating strategy as a customisable one. Either I can have a crack at that, or I can leave it to one of you and then review? Either option is fine by me... ;)
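For reference, the suggested change might look like this (a minimal sketch with a hypothetical test name; note that `zoneinfo.available_timezones()` returns a set, while `st.sampled_from` expects an ordered collection, so sorting first keeps generation deterministic):

```python
import zoneinfo

from hypothesis import given, strategies as st

# sampled_from wants an ordered collection, so sort the set of zone names
timezones = st.sampled_from(sorted(zoneinfo.available_timezones()))


@given(timezone=timezones)
def test_timezone_always_valid(timezone: str) -> None:
    # every drawn value is a real IANA zone name, so no assume() is needed
    assert timezone in zoneinfo.available_timezones()
```

Compared with `assume(...)`, this never discards examples, so Hypothesis spends its whole budget on useful inputs.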
Thanks @honno! Indeed, it was possible to simplify a bit, thanks for suggesting that.

And thanks @alexander-beedie for taking a look! Which part are you suggesting to make a strategy for - creating a polars Series with Datetime dtype?
We can already do that easily enough, just not with customisable timezones - would be nice to improve that so we don't have to go bespoke once a timezone comes into play :)
> > Co-authored-by: honno <[email protected]>
Okay, to reconstruct the current way of generating the original dataframe:

```python
@st.composite
def timey_wimey_dataframes(draw: st.DrawFn) -> pl.DataFrame:
    datetimes = draw(
        st.lists(
            st.datetimes(
                min_value=dt.datetime(1980, 1, 1),
                # Can't currently go beyond 2038, see
                # https://github.com/pola-rs/polars/issues/9315
                max_value=dt.datetime(2038, 1, 1),
            ),
            min_size=1,
        )
    )
    timezone = draw(st.sampled_from(list(zoneinfo.available_timezones())))
    nrows = len(datetimes)
    values = draw(
        st.lists(st.floats(10, 20), min_size=nrows, max_size=nrows), label="values"
    )
    try:
        df = (
            pl.DataFrame({"ts": datetimes, "values": values})
            .sort("ts")
            .with_columns(
                pl.col("ts").dt.replace_time_zone("UTC").dt.convert_time_zone(timezone)
            )
        )
    except pl.exceptions.ComputeError as exp:
        assert "unable to parse time zone" in str(exp)  # sanity check
        reject()
    return df


strat = timey_wimey_dataframes()
```

Using the polars primitives instead, maybe something like this?

```python
strat = pl.testing.dataframes(
    cols=[
        pl.testing.column(
            name="ts",
            dtype=pl.Datetime(),
            strategy=st.datetimes(
                min_value=dt.datetime(1980, 1, 1),
                max_value=dt.datetime(2038, 1, 1),
                timezones=st.sampled_from(list(zoneinfo.available_timezones())),
            ),
        ),
        pl.testing.column(name="values", dtype=pl.Float64(), strategy=st.floats(-10, 10)),
    ]
).map(lambda df: df.sort("ts"))
```

Hopefully this is a good start @MarcoGorelli! (I haven't actually tried this...)
cool, I've given this a go - thanks @honno! any further suggestions?
```python
    number: int,
) -> None:
    nrows = len(time_series)
    values = pl.Series(
```
I wonder if `testing.parametric.series` can be utilised here?

Haven't tried this myself yet, but something like:

```diff
-    values = pl.Series(
+    values = data.draw(pl.testing.parametric.series(strategy=st.floats(10, 20)), label="values")
```
(In case you missed it, you can annotate draws with labels, which can make debugging easier for more complicated tests. No biggie tho.)
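To illustrate the labelling point, interactive draws via `st.data()` look like this (a minimal sketch; `test_labelled_draws` is a hypothetical name, not from the PR):

```python
from hypothesis import given, strategies as st


@given(data=st.data())
def test_labelled_draws(data: st.DataObject) -> None:
    # on failure, this draw is reported with its label, e.g.
    # "Draw 1 (values): [...]", rather than an anonymous "Draw 1"
    values = data.draw(st.lists(st.floats(10, 20), min_size=1), label="values")
    assert all(10 <= v <= 20 for v in values)
```

With several draws in one test, labels make the falsifying example much easier to map back to the code.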
@honno: Interesting; do you think it would be worth putting any labels into the underlying primitives themselves?
> @honno: Interesting; do you think it would be worth putting any labels into the underlying primitives themselves?
I think I'm correct in saying that there's no real point putting labels inside of `@st.composite` functions, as Hypothesis would only report the overall thing being drawn at the test level, i.e. `test_foo(data): data.draw(dataframes(...), label="only the final df is reported")`. Labelling seems more useful anyway when a bespoke thing is drawn in a test method.
Also, it's really cool that these primitives exist!
Looking at the tests I think a few more native params might be useful too; some generic `min_value` & `max_value` args, for example... Don't hesitate to request anything you think would be a time-saver / useful :)
not sure I see what you mean exactly, could you clarify what you're suggesting I change please?
Oh! That’s more of a note to myself than you, sorry… Just thinking I can extend the primitives a bit to further simplify common cases 🤣
On the usage of Hypothesis, everything LGTM!

Already noted to Marco before that I would name strategy-factories (e.g. `@st.composite`-decorated functions that return `st.SearchStrategy` objects) after the thing they generate (preferably in plural), like `dataframes()` or `series()`. But I see the existing convention in `polars.testing.parametric.strategies` is to name these with the `strategy_` prefix, so no biggie.
thanks! @alexander-beedie any objections to getting this in? it would add 20 seconds or so to CI, but alongside the unit tests (which this is absolutely not a replacement for) would increase confidence in future refactors of `groupby_dynamic`.

EDIT: marking as draft for now, gonna see if I can reduce the runtime whilst keeping the extra safety it provides
closing for now to clear the queue, will get back to this when I get a chance
Thanks @honno for helping write these - any more suggestions?

This has already helped uncover:

~~The two tests presented here take about 20s to execute on my very mediocre 4-core travelling laptop~~ the 3 tests here take about 15s on my laptop