Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-730238: The SQL simplifier may generate a bad SQL when renaming a column to an existing column #658

Closed
sfc-gh-yixie opened this issue Jan 20, 2023 · 1 comment
Assignees
Labels
bug Something isn't working needs triage Initial RCA is required

Comments

@sfc-gh-yixie
Copy link
Collaborator

sfc-gh-yixie commented Jan 20, 2023

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    Python 3.8.11 (default, Aug 6 2021, 08:56:27)
    [Clang 10.0.0 ]

  2. What operating system and processor architecture are you using?

    macOS-10.16-x86_64-i386-64bit

  3. What are the component versions in the environment (pip freeze)?

    snowflake-snowpark-python 1.0.0

  4. What did you do?

In the following code, we update column a with the value of column c. The column names of df3 are wrong.

    session.sql_simplifier_enabled = True
    df1 = session.create_dataframe([[1, 2, 3]], schema=["a", "b", "c"])
    df2 = df1.withColumn("a", df1["c"])
    print("df2 columns: ", df2.columns)  # correct, ['B', 'C', 'A']
    df3 = df2.withColumn("b", F.sql_expr("1"))
    print("df3 columns:", df3.columns)  # should be ['C', 'A', 'B']. But output is ['A', 'A', 'B']

In the following join and renaming case, exception snowflake.snowpark.exceptions.SnowparkSQLAmbiguousJoinException: (1303): The reference to the column 'A' is ambiguous... are thrown.

    session.sql_simplifier_enabled = True
    df1 = session.create_dataframe([[1]], schema=["a"])
    df2 = session.create_dataframe([[1]], schema=["a"])
    df3 = df1.join(df2, df1["a"] == df2["a"])
    df4 = df3.select(df1["a"].as_("a"))
    df1_converted = df1.select(df1["a"])
    df5 = df1_converted.join(df4, df1_converted["a"] == df4["a"])
    df5.collect()

Both cases have the same root cause. There is a bug dealing with renaming column names. snowpark-python automatically renames dataframe columns when the two joined dataframes have overlapping column names. The joining case above falls into the same code branch.

The problem is fixed in snowflake-snowpark-python 1.1.0 by PR #649

@sfc-gh-yixie sfc-gh-yixie added bug Something isn't working needs triage Initial RCA is required labels Jan 20, 2023
@github-actions github-actions bot changed the title SNOW-697912: The SQL simplifier may generate a bad SQL when renaming a column to an existing column SNOW-730238: SNOW-697912: The SQL simplifier may generate a bad SQL when renaming a column to an existing column Jan 20, 2023
@sfc-gh-yixie sfc-gh-yixie self-assigned this Jan 20, 2023
@sfc-gh-yixie
Copy link
Collaborator Author

sfc-gh-yixie commented Jan 20, 2023

It's fixed with PR #649 . To be released with snowflake-snowpark-python 1.1.0.

@sfc-gh-yixie sfc-gh-yixie changed the title SNOW-730238: SNOW-697912: The SQL simplifier may generate a bad SQL when renaming a column to an existing column SNOW-730238: The SQL simplifier may generate a bad SQL when renaming a column to an existing column Jan 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Initial RCA is required
Projects
None yet
Development

No branches or pull requests

2 participants