Improve the test data for pylibcudf I/O tests #16247

lithomas1 · 2024-07-10T23:21:21Z

Description

Don't just use random integers for every data type.

Decided not to use hypothesis since I don't think there's a good way to re-use the table across calls
(and I would like to keep the runtime of pylibcudf tests down).

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

lithomas1 · 2024-07-10T23:22:46Z

python/cudf/cudf/pylibcudf_tests/io/test_json.py

@@ -182,20 +182,6 @@ def test_read_json_basic(
        source_or_sink, pa_table, lines=lines, compression=compression_type
    )

-    request.applymarker(


Now that the test data is better formed, this isn't crashing anymore.

I think there might still be an issue in libcudf's JSON reader, though.
(I'll open a follow-up issue if I can still reproduce it, but it's a little hard to reproduce.)

lithomas1 · 2024-07-10T23:30:13Z

python/cudf/cudf/pylibcudf_tests/conftest.py

+        # Generate random ASCII strings
+        strs = []
+        for _ in range(length):
+            chrs = np.random.randint(33, 128, length)


I didn't start from 0 since 0-31 are ASCII control characters (and 32 is the space character), and those can interfere with some of the text formats like CSV/JSON.
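A minimal sketch of the idea, not the code in this PR: it draws code points from 33 to 126 so that control characters (0-31), the space (32), and DEL (127) never appear. The helper name `random_ascii_strings`, its parameters, and the use of a seeded generator are assumptions made for illustration.

```python
import numpy as np


def random_ascii_strings(count, max_len, seed=0):
    """Hypothetical helper: return `count` random printable ASCII strings."""
    rng = np.random.default_rng(seed)
    strings = []
    for _ in range(count):
        # Each string gets its own random length in [0, max_len].
        n = int(rng.integers(0, max_len + 1))
        # The upper bound is exclusive, so this draws code points 33..126.
        codes = rng.integers(33, 127, size=n).tolist()
        strings.append("".join(chr(c) for c in codes))
    return strings
```

Note that this range still contains characters such as quotes, commas, and backslashes, so it continues to exercise escaping in the CSV/JSON writers.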

brandon-b-miller · 2024-07-11T15:24:21Z

python/cudf/cudf/pylibcudf_tests/conftest.py

+    """
+    if pa_type == pa.int64():
+        half = length // 2
+        negs = np.random.randint(-length, 0, half, dtype=np.int64)


I think it's best practice to use a generator with a seed for reproducibility rather than using np.random.randint.
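One way this suggestion could look, as a sketch only: `np.random.default_rng(seed)` returns a `Generator` whose `integers` method takes the same half-open bounds as `np.random.randint`. The helper name `make_int64_column`, the default seed, and the non-negative half are assumptions; only the negative half appears in the diff above.

```python
import numpy as np


def make_int64_column(length, seed=42):
    """Hypothetical helper: half negative, half non-negative int64 values.

    Assumes length >= 1, since the bounds below are derived from it.
    """
    rng = np.random.default_rng(seed)  # seeded, so every run sees the same data
    half = length // 2
    negs = rng.integers(-length, 0, size=half, dtype=np.int64)
    # Assumed counterpart to `negs`; not taken from the PR.
    pos = rng.integers(0, length, size=length - half, dtype=np.int64)
    return np.concatenate([negs, pos])
```

Passing the seed explicitly also keeps the test data independent of NumPy's global random state, which other tests may modify.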

lithomas1 · 2024-07-11T19:47:47Z

python/cudf/cudf/pylibcudf_tests/conftest.py

+@pytest.fixture(
+    params=set(CompressionType).difference(unsupported_text_compression_types)
+)
+def text_compression_type(request):


Added this since most formats don't support every compression type, so it probably makes sense to split the fixture into text (CSV/JSON) and binary (ORC/Parquet) variants.
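A rough sketch of how the parameter list could be derived. The `CompressionType` enum below is a stand-in with illustrative members, not the real pylibcudf enum, and the contents of `unsupported_text_compression_types` are likewise made up for the example.

```python
from enum import IntEnum


class CompressionType(IntEnum):
    """Stand-in enum; the members are illustrative only."""
    NONE = 0
    GZIP = 1
    SNAPPY = 2
    ZSTD = 3


# Hypothetical set of types that the text formats cannot handle.
unsupported_text_compression_types = {
    CompressionType.SNAPPY,
    CompressionType.ZSTD,
}

# Same set difference as in the fixture, sorted to give a stable order.
text_compression_params = sorted(
    set(CompressionType).difference(unsupported_text_compression_types)
)
```

The resulting list could then be passed as `params=` to `@pytest.fixture`. Sorting is optional, but it makes the parametrization order independent of set iteration order, which may matter if tests are collected by several workers.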

lithomas1 · 2024-07-11T19:50:05Z

OK, comments should be addressed now. I also brought over some of my changes from the CSV PR
(mainly just pulling out the utilities that I added for testing the JSON reader).

brandon-b-miller

LGTM

lithomas1 · 2024-07-12T15:12:14Z

/merge

Improve the test data for pylibcudf I/O tests

2a705e2

lithomas1 added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Jul 10, 2024

lithomas1 requested a review from a team as a code owner July 10, 2024 23:21

lithomas1 requested review from mroeschke and Matt711 July 10, 2024 23:21

github-actions bot added the Python (Affects Python cuDF API.) label Jul 10, 2024

brandon-b-miller reviewed Jul 11, 2024

lithomas1 added 2 commits July 11, 2024 19:31

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-tests

c4204ca

address comments and bring over some more changes

c88deca

lithomas1 requested a review from brandon-b-miller July 12, 2024 14:36

brandon-b-miller approved these changes Jul 12, 2024

rapids-bot bot merged commit 1ff7461 into rapidsai:branch-24.08 Jul 12, 2024
79 checks passed

lithomas1 deleted the pylibcudf-io-tests branch July 12, 2024 15:12
