Wrap scalar generation into spark session in integration test #9405

Merged (7 commits) on Oct 18, 2023

thirtiseven
Collaborator

@thirtiseven thirtiseven commented Oct 9, 2023

Fixes #9404

When spark.sql.legacy.allowNegativeScaleOfDecimal is unset, calling f.lit on a negative-scale decimal such as Decimal(34, -5) reports the following error:

pyspark.sql.utils.AnalysisException: decimal can only support precision up to 38

This config is part of the integration tests' default config, but the defaults are only applied when with_spark_session is called. In some edge cases, when CI runs the integration tests in parallel, f.lit can be called before any with_spark_session call.

This PR adds the negative scale config before calling f.lit when generating scalars.

Also fixed the cache_repr of TimestampGen to add the new parameter.
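
The ordering problem described above can be shown with a toy model. This sketch does not use PySpark or the real integration test helpers: `session_conf`, `with_session`, and `lit_decimal` are hypothetical stand-ins. The only behaviour taken from the description is that the default config is applied inside the session wrapper and that building the literal fails while the config is unset.

```python
# Toy model only: none of these names are the real spark-rapids test helpers.
NEGATIVE_SCALE_KEY = "spark.sql.legacy.allowNegativeScaleOfDecimal"

# Stand-in for the live Spark session conf; starts empty like a fresh worker.
session_conf = {}

# Stand-in for the integration tests' default config.
DEFAULT_CONF = {NEGATIVE_SCALE_KEY: "true"}


def with_session(func):
    """Apply the default config, then run func (mimics with_spark_session)."""
    session_conf.update(DEFAULT_CONF)
    return func()


def lit_decimal(precision, scale):
    """Mimic f.lit on a negative-scale decimal: fail if the config is unset."""
    if scale < 0 and session_conf.get(NEGATIVE_SCALE_KEY) != "true":
        raise ValueError("decimal can only support precision up to 38")
    return ("lit", precision, scale)


def scalar_outside_session():
    # The literal is built before any session wrapper has applied the config.
    return lit_decimal(34, -5)


def scalar_inside_session():
    # The literal is built inside the wrapper, after the config is applied.
    return with_session(lambda: lit_decimal(34, -5))
```

In this model `scalar_outside_session()` only fails when it runs before any wrapped call, because the config stays set afterwards. That is the same order dependence the description attributes to parallel CI runs.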

@thirtiseven
Collaborator Author

build

@res-life
Collaborator

res-life commented Oct 9, 2023

LGTM

@revans2
Collaborator

revans2 commented Oct 9, 2023

What is the exception that actually triggered this? This is essentially the reason that we turned it on by default everywhere.

'spark.sql.legacy.allowNegativeScaleOfDecimal': 'true',

I just want to understand why the default settings are not applying.

@thirtiseven
Collaborator Author

@revans2 For example, in the following case from issue #9404:

@pytest.mark.parametrize('data_gen', [DecimalGen(34, -5)], ids=idfn)
def test_greatest1(data_gen):
    num_cols = 20
    s1 = gen_scalar(data_gen, force_no_nulls=not isinstance(data_gen, NullGen))
    # we want lots of nulls
    gen = StructGen([('_c' + str(x), data_gen.copy_special_case(None, weight=100.0))
        for x in range(0, num_cols)], nullable=False)
    command_args = [f.col('_c' + str(x)) for x in range(0, num_cols)]
    command_args.append(s1)
    data_type = data_gen.data_type
    assert_gpu_and_cpu_are_equal_collect(
            lambda spark : gen_df(spark, gen).select(
                f.greatest(*command_args)))

If we run this case individually, it will fail because gen_scalar calls f.lit before assert_gpu_and_cpu_are_equal_collect sets the config for the first time.

In the pre-merge CI job, there are 4 xdist agents running in parallel, and the cases are assigned to the agents round robin. So if this case happens to be the first one assigned to a particular agent, the CI will fail.

We believe that this is the reason why #9288's pre-merge keeps failing.
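
A small simulation makes the "first case on an agent" condition concrete. It is a sketch under stated assumptions: pytest-xdist's actual scheduling is not modelled, the plain round-robin assignment follows the comment above, the test names other than `test_greatest1` are made up, and every other test is assumed to set the config as a side effect of running.

```python
def round_robin(tests, num_workers):
    """Assign tests to workers in order, one at a time, wrapping around."""
    queues = [[] for _ in range(num_workers)]
    for index, test in enumerate(tests):
        queues[index % num_workers].append(test)
    return queues


def fails_on_first_slot(tests, num_workers, fragile):
    """True when the fragile test is the first test that some worker runs."""
    return any(q and q[0] == fragile for q in round_robin(tests, num_workers))


# Hypothetical collection orders; only the position of test_greatest1 differs.
safe_order = ["test_a", "test_b", "test_c", "test_d", "test_greatest1", "test_e"]
unlucky_order = ["test_a", "test_b", "test_greatest1", "test_c", "test_d", "test_e"]

safe = fails_on_first_slot(safe_order, 4, "test_greatest1")
unlucky = fails_on_first_slot(unlucky_order, 4, "test_greatest1")
```

With 4 workers, `test_greatest1` at index 4 lands second on worker 0 and passes in this model. At index 2 it is the first test on worker 2, which is the failing case.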

@jlowe
Contributor

jlowe commented Oct 9, 2023

Seems to me the issue is that one or more tests are generating data outside of the normal Spark session context that sets up the configs properly. The fix would be to move the data generation into the dataframe callback that is currently a lambda.

Personally, I'm not a fan of the current PR approach where data_gen can silently smash config values and leave them in that smashed state. Can be very surprising behavior and annoying to track down if it bites someone explicitly trying to test without that config setting.

@thirtiseven
Collaborator Author

@jlowe Ok, I moved the gen_scalars into a spark session.

src = _gen_scalars_common(data_gen, count, seed=seed)
data_type = src.data_type
return (_mark_as_lit(src.gen(force_no_nulls=force_no_nulls), data_type) for i in range(0, count))
return with_cpu_session(lambda spark: gen_scalars_help(data_gen=data_gen,
Collaborator

@revans2 revans2 Oct 9, 2023

Putting this in a cpu_session fixes the current problem, but it adds a new one. If gen_scalars is called from inside a with_*_session it will have other problems. with_spark_session calls reset_spark_session_conf which does more than just reset the conf. It clears out the catalog too with no way to get the original config or catalog back after it exits. That means with_gpu_session -> gen_scalars will result in the query running on the CPU after the gen_scalars.
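
The nesting problem can be sketched with a toy model. Nothing here is the real framework code: `session_conf`, `reset_conf`, and the wrappers are hypothetical stand-ins, and `spark.rapids.sql.enabled` is used only as an illustrative flag for "run on the GPU". The single behaviour assumed from the comment above is that entering a session resets the conf and nothing is restored on exit.

```python
# Toy model only: stand-ins for the session helpers, not the real ones.
GPU_KEY = "spark.rapids.sql.enabled"  # illustrative flag for GPU execution

session_conf = {}


def reset_conf():
    """Drop every setting; the previous values are not saved anywhere."""
    session_conf.clear()


def with_session(func, conf):
    """Reset, apply conf, run func. Nothing is restored when func returns."""
    reset_conf()
    session_conf.update(conf)
    return func()


def with_cpu_session(func):
    return with_session(func, {GPU_KEY: "false"})


def with_gpu_session(func):
    return with_session(func, {GPU_KEY: "true"})


def gen_scalar_wrapped():
    """A scalar generator that opens its own CPU session internally."""
    return with_cpu_session(lambda: 42)


def query_under_gpu():
    """Record the GPU flag before and after generating a scalar."""
    before = session_conf[GPU_KEY]
    gen_scalar_wrapped()
    after = session_conf[GPU_KEY]
    return before, after


result = with_gpu_session(query_under_gpu)
```

Here `result` is `("true", "false")`: after the nested call the outer GPU session is left with the CPU setting, so the rest of the query would run on the CPU.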

I see a few ways to properly fix this.

  1. We set spark.sql.legacy.allowNegativeScaleOfDecimal when launching Spark and have the test framework throw an exception if it is not set. Then we remove references to it in all of the tests for consistency. Then we file a follow-on issue to fix with_spark_session so that it does not allow nesting and throws an exception if it is nested.
  2. We fix with_spark_session to throw an exception if it is ever nested and do what you are doing today, plus update the docs for it to be clear that it can never be called from within a with_spark_session.
  3. We fix the test to call gen_scalar from within a with_spark_session and add a doc fix for gen_scalar to indicate that negative scale decimals can have problems if called from outside of a with_spark_session block. Then we file a follow-on issue to fix with_spark_session so that it does not allow nesting and throws an exception if it is nested.

I personally prefer option 1, but I am fine with option 2 or 3. Talking to @jlowe, he really prefers option 3. The main difference between option 3 and option 2 for me is really about the amount of code that needs to change. If we just fix the one test and add some docs, that feels like a really small change. If we have to fix nesting, etc., that feels a bit larger, but it is something we need to do either way, and it would mean all tests that use gen_scalar would be able to deal with all decimal values properly.
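
All three options end with with_spark_session rejecting nested calls. One possible shape for such a guard is sketched below. It is an assumption, not the change that was made in the follow-on issue: `with_session` and `_state` are hypothetical names, and a thread-local flag is just one way to track whether a session is active.

```python
import threading

# Per-thread marker for "a session callback is currently running".
_state = threading.local()


def with_session(func):
    """Run func inside a session; raise if called from inside another one."""
    if getattr(_state, "active", False):
        raise RuntimeError("with_session must not be nested")
    _state.active = True
    try:
        return func()
    finally:
        # Always clear the marker so a failed callback does not block later calls.
        _state.active = False


def nested_call():
    """Try to open a session from inside a session callback."""
    return with_session(lambda: with_session(lambda: "inner"))
```

A flat call such as `with_session(lambda: "ok")` returns normally, while `nested_call()` raises `RuntimeError`. That turns the silent conf clobbering into an immediate, visible failure.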

Contributor

I'm not a fan of 2. It's again surprising behavior (who would expect it to spawn a Spark session?). I'm fine with either 1 or 3, and even with 1, I still think we should fix the test(s). We should be putting all data generation inside a spark session context of some kind.

Collaborator Author

@thirtiseven thirtiseven Oct 10, 2023

Thanks, updated code to option 3.

Now I wrap all scalar generation in a with_cpu_session, regardless of whether it calls f.lit or uses DecimalGen. Not sure whether we want to move only the cases that can fail into Spark sessions.

Collaborator Author

Follow-on issue: #9412

@thirtiseven
Collaborator Author

build

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Collaborator Author

build

Contributor

@jlowe jlowe left a comment

PR headline should be updated to reflect the new approach.

integration_tests/README.md Outdated Show resolved Hide resolved
integration_tests/README.md Outdated Show resolved Hide resolved
integration_tests/src/main/python/conditionals_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/ast_test.py Outdated Show resolved Hide resolved
thirtiseven and others added 2 commits October 11, 2023 09:19
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven thirtiseven changed the title Add negative scale config before calling f.lit in integration test Warp scalar generation into spark session in integration test Oct 11, 2023
@thirtiseven
Collaborator Author

@jlowe Thanks for the review, all done.

@thirtiseven
Collaborator Author

build

jlowe
jlowe previously approved these changes Oct 11, 2023
@@ -67,8 +67,8 @@ def test_concat_double_list_with_lit(dg):

@pytest.mark.parametrize('data_gen', non_nested_array_gens, ids=idfn)
def test_concat_list_with_lit(data_gen):
lit_col1 = f.lit(gen_scalar(data_gen)).cast(data_gen.data_type)
lit_col2 = f.lit(gen_scalar(data_gen)).cast(data_gen.data_type)
lit_col1 = f.lit(with_cpu_session(lambda spark: gen_scalar(data_gen))).cast(data_gen.data_type)
Collaborator

This PR is intended to put every f.lit call into with_cpu_session: not only the f.lit inside gen_scalar, but also the f.lit calls in other places. Maybe change to the following:

with_cpu_session(lambda spark: f.lit(gen_scalar(data_gen)).cast(data_gen.data_type))

Collaborator

Please check f.lit in other places in this PR.

Collaborator Author

Good catch, checked and fixed.

@thirtiseven
Collaborator Author

build

@tgravescs tgravescs changed the title Warp scalar generation into spark session in integration test Wrap scalar generation into spark session in integration test Oct 11, 2023
@thirtiseven thirtiseven requested a review from revans2 October 12, 2023 01:34
@thirtiseven thirtiseven self-assigned this Oct 12, 2023
@thirtiseven
Collaborator Author

Hi @revans2, please take another look, thanks.

@thirtiseven thirtiseven changed the base branch from branch-23.10 to branch-23.12 October 13, 2023 05:14
@revans2 revans2 merged commit 6334ece into NVIDIA:branch-23.12 Oct 18, 2023
@sameerz sameerz added the test Only impacts tests label Oct 23, 2023
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Spark reports a decimal error when create lit scalar when generate Decimal(34, -5) data.
5 participants