[REVIEW] Drop force_nullable_schema from chunked parquet writer #12996

Merged

Conversation

galipremsagar
Contributor

@galipremsagar galipremsagar commented Mar 22, 2023

Description

force_nullable_schema was introduced in #12952. Strangely, only after it was merged into branch-23.04 did the following pytest failure start occurring locally:

(cudfdev) pgali@dt07:/nvme/0/pgali/cudf$ pytest python/dask_cudf/dask_cudf/io/tests/test_parquet.py::test_cudf_list_struct_write
====================================================================================== test session starts =======================================================================================
platform linux -- Python 3.10.9, pytest-7.2.2, pluggy-1.0.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /nvme/0/pgali/cudf/python/dask_cudf
plugins: cases-3.6.14, anyio-3.6.2, benchmark-4.0.0, xdist-3.2.1, hypothesis-6.70.0, cov-4.0.0
collected 1 item                                                                                                                                                                                 

python/dask_cudf/dask_cudf/io/tests/test_parquet.py F                                                                                                                                      [100%]

============================================================================================ FAILURES ============================================================================================
__________________________________________________________________________________ test_cudf_list_struct_write ___________________________________________________________________________________

tmpdir = local('/tmp/pytest-of-pgali/pytest-84/test_cudf_list_struct_write0')

    def test_cudf_list_struct_write(tmpdir):
        df = cudf.DataFrame(
            {
                "a": [1, 2, 3],
                "b": [[[1, 2]], [[2, 3]], None],
                "c": [[[["a", "z"]]], [[["b", "d", "e"]]], None],
            }
        )
        df["d"] = df.to_struct()
    
        ddf = dask_cudf.from_cudf(df, 3)
        temp_file = str(tmpdir.join("list_struct.parquet"))
    
>       ddf.to_parquet(temp_file)

python/dask_cudf/dask_cudf/io/tests/test_parquet.py:493: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../envs/cudfdev/lib/python3.10/contextlib.py:79: in inner
    return func(*args, **kwds)
python/dask_cudf/dask_cudf/core.py:252: in to_parquet
    return to_parquet(self, path, *args, **kwargs)
../envs/cudfdev/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py:1061: in to_parquet
    out = out.compute(**compute_kwargs)
../envs/cudfdev/lib/python3.10/site-packages/dask/base.py:314: in compute
    (result,) = compute(self, traverse=False, **kwargs)
../envs/cudfdev/lib/python3.10/site-packages/dask/base.py:599: in compute
    results = schedule(dsk, keys, **kwargs)
../envs/cudfdev/lib/python3.10/site-packages/dask/threaded.py:89: in get
    results = get_async(
../envs/cudfdev/lib/python3.10/site-packages/dask/local.py:511: in get_async
    raise_exception(exc, tb)
../envs/cudfdev/lib/python3.10/site-packages/dask/local.py:319: in reraise
    raise exc
../envs/cudfdev/lib/python3.10/site-packages/dask/local.py:224: in execute_task
    result = _execute_task(task, data)
../envs/cudfdev/lib/python3.10/site-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../envs/cudfdev/lib/python3.10/site-packages/dask/optimization.py:990: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
../envs/cudfdev/lib/python3.10/site-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
../envs/cudfdev/lib/python3.10/site-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../envs/cudfdev/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py:171: in __call__
    return self.engine.write_partition(
python/dask_cudf/dask_cudf/io/parquet.py:349: in write_partition
    md = df.to_parquet(
../envs/cudfdev/lib/python3.10/site-packages/cudf/core/dataframe.py:6322: in to_parquet
    return parquet.to_parquet(
../envs/cudfdev/lib/python3.10/contextlib.py:79: in inner
    return func(*args, **kwds)
../envs/cudfdev/lib/python3.10/site-packages/cudf/io/parquet.py:783: in to_parquet
    return _write_parquet(
../envs/cudfdev/lib/python3.10/contextlib.py:79: in inner
    return func(*args, **kwds)
../envs/cudfdev/lib/python3.10/site-packages/cudf/io/parquet.py:105: in _write_parquet
    write_parquet_res = libparquet.write_parquet(
../envs/cudfdev/lib/python3.10/contextlib.py:79: in inner
    return func(*args, **kwds)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   RuntimeError: CUDF failure at: /nvme/0/pgali/cudf/cpp/src/io/parquet/writer_impl.cu:513: Mismatch in metadata prescribed nullability and input column nullability. Metadata for nullable input column cannot prescribe nullability = false

parquet.pyx:432: RuntimeError
==================================================================================== short test summary info =====================================================================================
FAILED python/dask_cudf/dask_cudf/io/tests/test_parquet.py::test_cudf_list_struct_write - RuntimeError: CUDF failure at: /nvme/0/pgali/cudf/cpp/src/io/parquet/writer_impl.cu:513: Mismatch in metadata prescribed nullability and input column nullability. Metadata for nullable inpu...
======================================================================================= 1 failed in 3.90s ========================================================================================

This PR fixes the issue by dropping force_nullable_schema from the chunked parquet writer.
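
The error message in the traceback describes a consistency check between the nullability that the metadata prescribes and the nullability of the input column. Below is a minimal pure-Python sketch of that rule, assuming only what the message states. The function name `check_prescribed_nullability` is hypothetical and this is not the libcudf code:

```python
from typing import Optional


def check_prescribed_nullability(col_nullable: bool, prescribed: Optional[bool]) -> None:
    """Toy model of the check implied by the RuntimeError above.

    Hypothetical helper, not the libcudf implementation: it only encodes
    the rule stated in the message, namely that metadata for a nullable
    input column cannot prescribe nullability = False.
    """
    if col_nullable and prescribed is False:
        raise RuntimeError(
            "Mismatch in metadata prescribed nullability and input column nullability."
        )


# A non-nullable column may be prescribed either way; no exception is raised.
check_prescribed_nullability(col_nullable=False, prescribed=True)
check_prescribed_nullability(col_nullable=False, prescribed=False)

# A nullable column with nullability prescribed as False is rejected.
try:
    check_prescribed_nullability(col_nullable=True, prescribed=False)
except RuntimeError as exc:
    print(f"rejected: {exc}")
```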

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar galipremsagar added bug Something isn't working 4 - Needs cuDF (Python) Reviewer non-breaking Non-breaking change labels Mar 22, 2023
@galipremsagar galipremsagar requested a review from a team as a code owner March 22, 2023 23:50
@galipremsagar galipremsagar self-assigned this Mar 22, 2023
@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 22, 2023
@@ -698,7 +698,14 @@ cdef _set_col_metadata(
    column_in_metadata& col_meta,
    bool force_nullable_schema
):
    col_meta.set_nullability(force_nullable_schema or col.nullable)
    if force_nullable_schema or not col.nullable:
Contributor

I’m not sure if I agree with the comment below. Here is what I expected:

  1. Col is nullable and forced nullable. This will be written as nullable by default so no action is needed.
  2. Col is nullable and not forced nullable. This will be written as nullable by default so no action is needed.
  3. Col is not nullable and forced nullable. The call to tell C++ to write nulls is needed.
  4. Col is not nullable and not forced nullable. No action is needed.

So can we write this?

if force_nullable_schema and not col.nullable:
    col_meta.set_nullability(True)
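
The four cases above can be enumerated in a short sketch. It is plain Python rather than the Cython source, and `needs_explicit_nullable` is a hypothetical name for the proposed condition:

```python
from itertools import product


def needs_explicit_nullable(col_nullable: bool, force_nullable_schema: bool) -> bool:
    """Proposed condition: act only when a non-nullable column is forced nullable."""
    return force_nullable_schema and not col_nullable


# Enumerate cases 1-4 in the order listed above:
# (nullable, forced), (nullable, not forced),
# (not nullable, forced), (not nullable, not forced).
cases = list(product([True, False], [True, False]))
for number, (col_nullable, force) in enumerate(cases, start=1):
    action = (
        "set_nullability(True)"
        if needs_explicit_nullable(col_nullable, force)
        else "no action"
    )
    print(f"case {number}: nullable={col_nullable}, forced={force} -> {action}")
```

Under this proposal only case 3 triggers a call.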

Contributor Author

All are right, except this:

  4. Col is not nullable and not forced nullable. No action is needed.

We do need action for this case in the chunked parquet writer, because if _nullability isn't defined, libcudf defaults to writing a nullable schema. So when someone asks for force_nullable_schema=False and the column has no nulls, the current logic yields a non-nullable schema, as the user requested.
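
This point can be sketched with a toy model. It assumes, as described above, that leaving `_nullability` unset makes the writer fall back to a nullable schema. `ToyColumnMeta` and `written_nullable` are hypothetical names, not the libcudf API:

```python
from typing import Optional


class ToyColumnMeta:
    """Hypothetical stand-in for column metadata with an optional nullability."""

    def __init__(self) -> None:
        self._nullability: Optional[bool] = None  # unset by default

    def set_nullability(self, value: bool) -> None:
        self._nullability = value

    def written_nullable(self) -> bool:
        # Assumption from the comment above: an unset value means "nullable".
        return True if self._nullability is None else self._nullability


# Case 4: the column is not nullable and force_nullable_schema=False.
untouched = ToyColumnMeta()      # nobody called set_nullability
explicit = ToyColumnMeta()
explicit.set_nullability(False)  # explicit action for case 4

print(untouched.written_nullable())  # True: the schema comes out nullable
print(explicit.written_nullable())   # False: non-nullable, as requested
```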

Contributor

Okay. The conditional can still be simplified if only cases 3 and 4 (with non-nullable columns) need action:

Suggested change
- if force_nullable_schema or not col.nullable:
+ if not col.nullable:
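
A quick check of how the two conditionals differ, written as plain Python with hypothetical function names. They disagree only for case 1 (nullable and forced nullable), where the column would be written as nullable by default anyway:

```python
from itertools import product


def original_condition(col_nullable: bool, force_nullable_schema: bool) -> bool:
    return force_nullable_schema or not col_nullable


def suggested_condition(col_nullable: bool, force_nullable_schema: bool) -> bool:
    # The suggestion ignores the flag and looks only at the column.
    return not col_nullable


# Collect the (col_nullable, force_nullable_schema) inputs where they differ.
differences = [
    (col_nullable, force)
    for col_nullable, force in product([True, False], repeat=2)
    if original_condition(col_nullable, force) != suggested_condition(col_nullable, force)
]
print(differences)  # [(True, True)]
```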

Contributor Author

Logically I feel this is right, but it seems like something else is going on too. Investigating...

Contributor Author

After some investigation, @vuule found a bug on the libcudf parquet writer side for a sliced struct column. We decided to scope force_nullable_schema down to the single writer only. Dropping its support in the chunked parquet writer therefore keeps the bug from surfacing. We don't want to introduce wide-ranging changes on the parquet writer side to address the actual problem, so this PR should now be ready for review.

@galipremsagar galipremsagar requested a review from bdice March 24, 2023 17:32
@galipremsagar galipremsagar changed the title [REVIEW] Fix force_nullable_schema interaction with libcudf API [REVIEW] Drop force_nullable_schema from chunked parquet writer Mar 24, 2023
Contributor

@bdice bdice left a comment

Well, that was short-lived. The changes seem okay. Thanks for describing the problem in further detail here: #12996

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs cuDF (Python) Reviewer labels Mar 24, 2023
@galipremsagar
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 4c4fdd2 into rapidsai:branch-23.04 Mar 24, 2023
@vuule
Contributor

vuule commented Mar 24, 2023

> Well, that was short-lived. The changes seem okay. Thanks for describing the problem in further detail here: #12996

Hey, we still got the relevant use case supported :)
