Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1865926: Infer schema for StructType columns from nested Rows #2805

Merged

Conversation

sfc-gh-jrose
Copy link
Contributor

@sfc-gh-jrose sfc-gh-jrose commented Dec 20, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1865926

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

    With the new semantic mode enabled structured MapType and ArrayType are already used by default. This PR makes it so that it's possible to express StructType objects in a create_dataframe call by passing them as Row objects similar to how pyspark would handle such data.

sfc-gh-jrose and others added 25 commits December 6, 2024 13:44
)

1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.

   Fixes SNOW-1852779

2. Fill out the following pre-review checklist:

- [ ] I am adding a new automated test(s) to verify correctness of my
new code
- [ ] If this test skips Local Testing mode, I'm requesting review from
@snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
- [ ] If this is a new feature/behavior, I'm adding the Local Testing
parity changes.
- [x] I acknowledge that I have ensured my changes to be thread-safe.
Follow the link for more information: [Thread-safe Developer
Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

Fixed AST encoding for Column `in_`, `asc`, and `desc`. Removed an
unused entity and renamed `sp_column_in__seq` to `sp_column_in`. Changed
the `nulls_first` optional boolean param to be an enum.
Copy link

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jrose sfc-gh-jrose requested a review from a team December 20, 2024 22:56
@sfc-gh-jrose sfc-gh-jrose added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label Dec 20, 2024
Base automatically changed from jrose_snow_1829870_structured_by_default to main January 4, 2025 00:05
@sfc-gh-jrose sfc-gh-jrose marked this pull request as ready for review January 4, 2025 00:07
@sfc-gh-jrose sfc-gh-jrose requested review from a team as code owners January 4, 2025 00:07
Comment on lines +445 to +446
elif isinstance(obj, Row) and context._should_use_structured_type_semantics():
return infer_schema(obj)
Copy link
Contributor

@sfc-gh-aling sfc-gh-aling Jan 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a chance that the given datatype is not StructType while users still input a Row as data? if so what would happen, do we error out?

or Row data always auto inferred as StructType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what that scenario would look like. Can you give an example?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I didn't make it clear, actually I had two questions:

  1. when the infer schema logic would be triggered for Row values -- is it only when schema is not explicitly set?

  2. for my own learning purposes, will following Row input + MapType be a valid input?

    struct = Row(f1="v1", f2=2)
    df = structured_type_session.create_dataframe(
        [
            (struct),
        ],
        schema=[StructureType(MapType(xxx)],
    )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes, the schema is only inferred if no explicitly set.
  2. Today this example would give an error like this:
>>> struct = Row(f1="v1", f2=2)
>>> df = session.create_dataframe([(struct,),], schema=StructType([StructField("test", MapType())]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/snowflake/snowpark/_internal/utils.py", line 960, in func_call_wrapper
    return func(*args, **kwargs)
  File "src/snowflake/snowpark/session.py", line 3318, in create_dataframe
    raise TypeError(
TypeError: Cannot cast <class 'snowflake.snowpark.row.Row'>(Row(f1='v1', f2=2)) to MapType(StringType(), StringType()).

Currently createDataFrame does not know how to handle casting Rows to Maps. You could get it to work by calling Row.as_dict if you wanted it to be a MapType.

After this change it's improved slightly:

# Inferred
>>> df = session.create_dataframe([(struct),])
>>> df.schema
StructType([StructField('F1', StringType(), nullable=False), StructField('F2', LongType(), nullable=False)])
 
# or explicit
>>> df = session.create_dataframe([(struct),], schema=StructType([StructField('F1', StringType(), nullable=False), StructField('F2', LongType(), nullable=False)]))

Without this change the explicit schema still works.

@sfc-gh-jrose sfc-gh-jrose merged commit 7494363 into main Jan 7, 2025
42 checks passed
@sfc-gh-jrose sfc-gh-jrose deleted the jrose_snow_1865926_create_dataframe_default_structured branch January 7, 2025 18:22
@github-actions github-actions bot locked and limited conversation to collaborators Jan 7, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants