Reduce execution time of Python ORC tests #14776

vuule · 2024-01-17T23:01:40Z

Description

Reduced the size of the excessively large tests while preserving code coverage.
Also fixed a few tests so that they provide better coverage (their original intent was unclear).

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

vuule · 2024-01-18T19:41:11Z

python/cudf/cudf/tests/test_orc.py

-    fail_df["col"][500000] = None
+    # Generate a boolean column longer than a single row group
+    fail_df = cudf.DataFrame({"col": gen_rand_series("bool", 20000)})
+    # Invalidate a row in the first row group


The old comment was incorrect: the test file had a single stripe.

vuule · 2024-01-18T19:43:57Z

python/cudf/cudf/tests/test_orc.py

-    fail_df = cudf.DataFrame({"col": gen_rand_series("bool", 600000)})
-    # Invalidate the first row in the second stripe to break encoding
-    fail_df["col"][500000] = None
+    # Generate a boolean column longer than a single row group


Modified this test to match the checks we actually perform on bool columns: every row group except the last one in each stripe must have a number of valid elements divisible by 8. The row group size is 10k, so a single null in the first row group fails this check and the writer should throw.
I have no idea what I meant by the original comments; they don't match the code at all 🤷‍♂️
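
The check described above can be sketched in plain Python. This is only an illustration of the rule, not the libcudf implementation: the name `bool_column_is_writable` is hypothetical, the column is modeled as a list of validity flags, and the whole column is assumed to fit in a single stripe.

```python
ROW_GROUP_SIZE = 10_000  # row group size quoted in the comment above


def bool_column_is_writable(validity, row_group_size=ROW_GROUP_SIZE):
    """Return True if every row group except the last has a valid count divisible by 8.

    Hypothetical helper: `validity` is a sequence of booleans (True = non-null),
    and the column is assumed to live in a single stripe.
    """
    row_groups = [
        validity[start:start + row_group_size]
        for start in range(0, len(validity), row_group_size)
    ]
    # The last row group of the stripe is exempt from the check.
    return all(sum(group) % 8 == 0 for group in row_groups[:-1])


# 20000 rows span two row groups of 10000 rows each.
all_valid = [True] * 20_000
null_in_first_group = [False] + [True] * 19_999
null_in_last_group = [True] * 19_999 + [False]

print(bool_column_is_writable(all_valid))            # 10000 % 8 == 0
print(bool_column_is_writable(null_in_first_group))  # 9999 % 8 != 0
print(bool_column_is_writable(null_in_last_group))   # last group is exempt
```

Under these assumptions, only the column with a null in the first row group is rejected, which mirrors the case the updated test exercises.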

vuule · 2024-01-18T19:44:53Z

python/cudf/cudf/tests/test_orc.py

@@ -1130,7 +1131,7 @@ def test_pyspark_struct(datadir):
    assert_eq(pdf, gdf)


-def gen_map_buff(size=10000):
+def gen_map_buff(size):


The default value was unused.

bdice

This seems like a reasonable set of changes. I have a couple of questions about the underlying issues mentioned in the comments of these tests.

bdice · 2024-02-07T00:06:40Z

python/cudf/cudf/tests/test_orc.py

@@ -604,13 +604,13 @@ def normalized_equals(value1, value2):


 @pytest.mark.parametrize("stats_freq", ["STRIPE", "ROWGROUP"])
-@pytest.mark.parametrize("nrows", [1, 100, 6000000])
+@pytest.mark.parametrize("nrows", [1, 100, 100000])
 def test_orc_write_statistics(tmpdir, datadir, nrows, stats_freq):
    from pyarrow import orc

    supported_stat_types = supported_numpy_dtypes + ["str"]
    # Can't write random bool columns until issue #6763 is fixed


@vuule Do you think we should consider putting #6763 back on the queue of things to do? Seems like a bug worth fixing.

It's sad that we don't fully support bool columns, but we haven't had any users ask for this (that I know of).
If there's demand, I'll gladly add it to the ~~pile~~backlog. Not sure if the issue conveys this, but it's not a trivial feature.

Can we explicitly disable writing bool columns? This seems like we're writing bad data, and silent corruption isn't something I feel comfortable waiting for users to discover and report.

Already done #7261 ;)

Ahhhhh. That totally changes my perspective. Can you update the comments to say something like "Writing bool columns that exceed one row group is disabled in libcudf until #6763 is fixed"?

Also we should update this test to check that an error is raised in this case, rather than removing the column!

This test is for writing statistics, so we really want it to write a table with multiple stripes and verify the written statistics. Letting a test case throw does not contribute to this.
We do have a separate test for throwing with bool columns (as opposed to silent corruption).

Thanks. Apologies, I should have looked first. I have no other concerns.

bdice · 2024-02-07T00:08:19Z

python/cudf/cudf/tests/test_orc.py

 def test_orc_write_statistics(tmpdir, datadir, nrows, stats_freq):
    from pyarrow import orc

    supported_stat_types = supported_numpy_dtypes + ["str"]
    # Can't write random bool columns until issue #6763 is fixed
-    if nrows == 6000000:
+    if nrows == 100000:


Why does this work for nrows=1 or nrows=100?

We can write a single row group of random bools, just not multiple row groups (at least not in a way that avoids issues with other readers). So anything below 10k rows is fine. I know this is very hacky :(
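
One way to tie the condition to the row group size rather than to a specific `nrows` value is sketched below. This is only an illustration: `dtypes_for_stats_test` and `ROW_GROUP_SIZE` are hypothetical names that do not appear in the test file, and the dtype list is a stand-in for `supported_stat_types`.

```python
ROW_GROUP_SIZE = 10_000  # assumed from the discussion above


def dtypes_for_stats_test(nrows, dtypes, row_group_size=ROW_GROUP_SIZE):
    """Drop "bool" when the table would span more than one row group.

    Hypothetical helper: a table with at most `row_group_size` rows fits in
    a single row group, so random bool columns can be kept in that case.
    """
    if nrows > row_group_size:
        return [d for d in dtypes if d != "bool"]
    return list(dtypes)


example_dtypes = ["int32", "float64", "bool", "str"]  # stand-in list
for nrows in (1, 100, 100_000):
    print(nrows, dtypes_for_stats_test(nrows, example_dtypes))
```

With the parametrized values in this PR, only the 100000-row case loses the bool column, which matches the behavior of the existing `if nrows == 100000` branch.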

davidwendt · 2024-02-07T00:35:41Z

So how much faster is it now?

vuule · 2024-02-07T04:35:02Z

> So how much faster is it now?

Down from 114 s to 55 s on my system, so roughly 2x faster.

python/cudf/cudf/tests/test_orc.py

Co-authored-by: Bradley Dice <[email protected]>

vuule · 2024-02-08T19:47:48Z

/merge

vuule added 2 commits January 17, 2024 14:59

add stripe size support to chunked orc writer

b1f60d8

reduce py ORC tests

ac7ba75

vuule added the tests (Unit testing for project), cuIO (cuIO issue), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels Jan 17, 2024

vuule self-assigned this Jan 17, 2024

github-actions bot added the Python (Affects Python cuDF API) label Jan 17, 2024

Merge branch 'branch-24.02' into impr-orc-py-tests

af56dd4

vuule commented Jan 18, 2024

vuule changed the base branch from branch-24.02 to branch-24.04 February 6, 2024 17:47

vuule added 2 commits February 6, 2024 09:48

Merge branch 'branch-24.04' into impr-orc-py-tests

dc021d7

merge fix

0d4de48

vuule marked this pull request as ready for review February 7, 2024 00:01

vuule requested a review from a team as a code owner February 7, 2024 00:01

vuule requested review from vyasr and bdice February 7, 2024 00:01

bdice approved these changes Feb 7, 2024

Update comment

1f9b727

bdice approved these changes Feb 7, 2024

bdice reviewed Feb 7, 2024

vuule and others added 2 commits February 7, 2024 11:44

Update another comment

fc3a005

Co-authored-by: Bradley Dice <[email protected]>

Merge branch 'branch-24.04' into impr-orc-py-tests

eedfc0d

vuule added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label Feb 7, 2024

rapids-bot bot merged commit c3cf7c6 into rapidsai:branch-24.04 Feb 8, 2024
69 checks passed

vuule deleted the impr-orc-py-tests branch February 8, 2024 19:47
