[BUG] test_iceberg_parquet_read_round_trip FAILED "TypeError: object of type 'NoneType' has no len()" #6718

Closed
NvTimLiu opened this issue Oct 7, 2022 · 7 comments · Fixed by rapidsai/cudf#11910 or #6783
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

@NvTimLiu
Collaborator

NvTimLiu commented Oct 7, 2022

Describe the bug

iceberg_test.py::test_iceberg_parquet_read_round_trip[COALESCING-[Byte, Short, Integer, ...

TypeError: object of type 'NoneType' has no len()

spark_tmp_table_factory = <conftest.TmpTableFactory object at 0x7f93403b2670>
data_gens = [Byte, Short, Integer, Long, Float, Double, ...]
reader_type = 'COALESCING'

    @iceberg
    @ignore_order(local=True) # Iceberg plans with a thread pool and is not deterministic in file ordering
    @pytest.mark.parametrize("data_gens", iceberg_gens_list, ids=idfn)
    @pytest.mark.parametrize('reader_type', rapids_reader_types)
    def test_iceberg_parquet_read_round_trip(spark_tmp_table_factory, data_gens, reader_type):
        gen_list = [('_c' + str(i), gen) for i, gen in enumerate(data_gens)]
        table = spark_tmp_table_factory.get()
        tmpview = spark_tmp_table_factory.get()
        def setup_iceberg_table(spark):
            df = gen_df(spark, gen_list)
            df.createOrReplaceTempView(tmpview)
            spark.sql("CREATE TABLE {} USING ICEBERG AS SELECT * FROM {}".format(table, tmpview))
        with_cpu_session(setup_iceberg_table)
>       assert_gpu_and_cpu_are_equal_collect(
            lambda spark : spark.sql("SELECT * FROM {}".format(table)),
            conf={'spark.rapids.sql.format.parquet.reader.type': reader_type})

../../src/main/python/iceberg_test.py:88:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../src/main/python/asserts.py:548: in assert_gpu_and_cpu_are_equal_collect
    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
../../src/main/python/asserts.py:479: in _assert_gpu_and_cpu_are_equal
    assert_equal(from_cpu, from_gpu)
../../src/main/python/asserts.py:106: in assert_equal
    _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
../../src/main/python/asserts.py:42: in _assert_equal
    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
../../src/main/python/asserts.py:35: in _assert_equal
    _assert_equal(cpu[field], gpu[field], float_check, path + [field])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cpu = Row(child0=[110, -70, 109, -17, 97, 0, -66], child1=108, child2=-4.712884395247157e+25, child3=Decimal('3306845829.53'))
gpu = None
float_check = <function get_float_check.<locals>.<lambda> at 0x7f9336140550>
path = [0, '_c19']

    def _assert_equal(cpu, gpu, float_check, path):
        t = type(cpu)
        if (t is Row):
>           assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
E           TypeError: object of type 'NoneType' has no len()

../../src/main/python/asserts.py:31: TypeError

**Steps/Code to reproduce bug**
Please provide a list of steps or a code sample to reproduce the issue.
Avoid posting private or sensitive data.
@NvTimLiu added the "bug" and "? - Needs Triage" labels Oct 7, 2022
@tgravescs
Collaborator

I can reproduce this locally on 22.12 but wasn't able to on 22.10.

@tgravescs added the "P0" label Oct 7, 2022
@tgravescs
Collaborator

The GPU output seems to be missing a column entry on the first row of output. The GPU has _c19 as None, while the CPU has it as: _c19=Row(child0=[110, -70, 109, -17, 97, 0, -66], child1=108, child2=-4.712884395247157e+25, child3=Decimal('3306845829.53')).
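
The TypeError in the traceback comes from calling len() on the GPU value, which is None at this path. As a rough sketch only, a None-aware comparison could report the mismatch at the failing path instead of raising. Here `compare_rows` is a hypothetical helper, not the function in asserts.py, and plain dicts stand in for pyspark Row objects:

```python
def compare_rows(cpu, gpu, path=()):
    """Hypothetical helper: return a list of mismatch descriptions."""
    def mismatch():
        return ["{}: CPU={!r} GPU={!r}".format(list(path), cpu, gpu)]

    # Report a value missing on one side instead of failing on len(None).
    if cpu is None or gpu is None:
        return [] if cpu is gpu else mismatch()
    if isinstance(cpu, dict):
        # Assumption: dicts stand in for pyspark Row objects in this sketch.
        if not isinstance(gpu, dict) or cpu.keys() != gpu.keys():
            return mismatch()
        found = []
        for key in cpu:
            found.extend(compare_rows(cpu[key], gpu[key], path + (key,)))
        return found
    if isinstance(cpu, list):
        if not isinstance(gpu, list) or len(cpu) != len(gpu):
            return mismatch()
        found = []
        for index, (c, g) in enumerate(zip(cpu, gpu)):
            found.extend(compare_rows(c, g, path + (index,)))
        return found
    return [] if cpu == gpu else mismatch()


# Made-up data shaped like the failing row: the GPU side has None for _c19.
cpu_rows = [{"_c19": {"child0": [110, -70, 109], "child1": 108}}]
gpu_rows = [{"_c19": None}]
print(compare_rows(cpu_rows, gpu_rows))
```

With input shaped like this failure, the single reported mismatch is located at path [0, '_c19'], matching the path shown in the traceback.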

@tgravescs
Collaborator

cudf from 10/4 works, so something must have changed there.

@tgravescs
Collaborator

It fails with the spark-rapids-jni jar from 10/6, so it is likely something that went in on the 4th or 5th.

@tgravescs
Collaborator

Note this is happening when the data type is:
StructGen([['child0', ArrayGen(float_gen)]])

And it only happens when you select enough data to make the coalescing kick in. It also only happens with Iceberg; reading the raw Parquet files with the coalescing reader works fine.

With Iceberg, a ton of these columns come back as null instead of the actual values.
Selecting the exact same Iceberg table with 22.10 works fine.

@tgravescs
Collaborator

I finally got a Parquet file that reproduces this and sent it to the cudf folks.

@tgravescs self-assigned this Oct 10, 2022
@tgravescs removed the "? - Needs Triage" label Oct 10, 2022
@tgravescs
Collaborator

Going to xfail the test temporarily.
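
For reference, a temporary xfail in pytest can look roughly like the sketch below. The test name and body are placeholders, and the reason string pointing at this issue is an assumption about how the marker might be worded, not the actual change made to iceberg_test.py:

```python
import pytest


# Placeholder test: the real test is parametrized and compares GPU and CPU
# results. The reason string is an assumed wording that links back to the issue.
@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/6718")
def test_placeholder_round_trip():
    gpu_value = None  # stands in for the unexpected None from the GPU read
    assert gpu_value is not None
```

With the marker in place, pytest reports the failure as xfailed rather than failed, so the rest of the suite stays green until the fix lands and the marker is removed.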

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Oct 13, 2022
Fixes NVIDIA/spark-rapids#6718

#11752 recently introduced a bug where an insufficient check for whether an input column contained repetition information could cause incorrect results for column hierarchies with structs at the root.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Jim Brennan (https://github.com/jbrennan333)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #11910
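
To illustrate the kind of check the commit message describes, here is a hedged sketch, not the cudf code. It assumes a toy `Column` type and shows how a check that only looks at the root column can miss repetition information when a list is nested under a struct, as in StructGen([['child0', ArrayGen(float_gen)]]). Both function names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    """Toy column hierarchy node (an assumption made for this illustration)."""
    kind: str  # "struct", "list", or "leaf"
    children: List["Column"] = field(default_factory=list)


def root_only_has_repetition(col: Column) -> bool:
    # Insufficient: only inspects the root of the hierarchy.
    return col.kind == "list"


def has_repetition(col: Column) -> bool:
    # Walks the whole hierarchy: any list at any depth implies repetition.
    return col.kind == "list" or any(has_repetition(c) for c in col.children)


# A struct at the root with a list child, like struct<child0: list<float>>.
struct_of_list = Column("struct", [Column("list", [Column("leaf")])])
print(root_only_has_repetition(struct_of_list), has_repetition(struct_of_list))
```

In Parquet terms, a leaf carries repetition levels whenever any ancestor on its path is repeated, so a struct root alone does not rule them out.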