SNOW-1865926: Infer schema for StructType columns from nested Rows #2805
Conversation
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes SNOW-1852779

2. Fill out the following pre-review checklist:
   - [ ] I am adding a new automated test(s) to verify correctness of my new code
     - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

   Fixed AST encoding for Column `in_`, `asc`, and `desc`. Removed an unused entity and renamed `sp_column_in__seq` to `sp_column_in`. Changed the `nulls_first` optional boolean param to be an enum.
Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing
elif isinstance(obj, Row) and context._should_use_structured_type_semantics():
    return infer_schema(obj)
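A minimal sketch (not from the PR) of the observable effect of this branch, based on the inferred schema shown further down in this thread; the exact types are illustrative:

```python
from snowflake.snowpark import Row

row = Row(f1="v1", f2=2)
# With structured-type semantics enabled, this branch means the schema for a Row value
# is now inferred field by field, e.g. roughly
#   StructType([StructField('F1', StringType()), StructField('F2', LongType())])
# instead of falling through to the generic handling.
```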
Is there a chance that the given datatype is not StructType while users still input a Row as data? If so, what would happen, do we error out?
Or is Row data always auto-inferred as StructType?
I'm not sure what that scenario would look like. Can you give an example?
Sorry I didn't make it clear, actually I had two questions:
- When would the infer schema logic be triggered for Row values -- is it only when the schema is not explicitly set?
- For my own learning purposes, would the following Row input + MapType be a valid input?
struct = Row(f1="v1", f2=2)
df = structured_type_session.create_dataframe(
    [
        (struct),
    ],
    schema=StructType([MapType(xxx)]),
)
- Yes, the schema is only inferred if it is not explicitly set.
- Today this example would give an error like this:
>>> struct = Row(f1="v1", f2=2)
>>> df = session.create_dataframe([(struct,),], schema=StructType([StructField("test", MapType())]))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/snowflake/snowpark/_internal/utils.py", line 960, in func_call_wrapper
return func(*args, **kwargs)
File "src/snowflake/snowpark/session.py", line 3318, in create_dataframe
raise TypeError(
TypeError: Cannot cast <class 'snowflake.snowpark.row.Row'>(Row(f1='v1', f2=2)) to MapType(StringType(), StringType()).
Currently createDataFrame does not know how to handle casting Rows to Maps. You could get it to work by calling Row.as_dict if you wanted it to be a MapType.
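As an aside, a minimal sketch of the Row.as_dict workaround mentioned above (untested; it assumes an existing `session` and uses string values so they match the map's key/value types):

```python
from snowflake.snowpark import Row
from snowflake.snowpark.types import MapType, StringType, StructField, StructType

struct = Row(f1="v1", f2="2")
# Converting the Row to a plain dict lets create_dataframe cast the value to a MapType column.
df = session.create_dataframe(
    [(struct.as_dict(),)],
    schema=StructType([StructField("test", MapType(StringType(), StringType()))]),
)
```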
After this change it's improved slightly:
# Inferred
>>> df = session.create_dataframe([(struct),])
>>> df.schema
StructType([StructField('F1', StringType(), nullable=False), StructField('F2', LongType(), nullable=False)])
# or explicit
>>> df = session.create_dataframe([(struct),], schema=StructType([StructField('F1', StringType(), nullable=False), StructField('F2', LongType(), nullable=False)]))
Without this change the explicit schema still works.
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes SNOW-1865926

2. Fill out the following pre-review checklist:

3. Please describe how your code solves the related issue.
   With the new semantic mode enabled, structured MapType and ArrayType are already used by default. This PR makes it possible to express StructType values in a create_dataframe call by passing them as Row objects, similar to how PySpark handles such data.
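A hedged sketch of the behavior described above (illustrative only; `session`, the column names, and the data are assumptions, and it presumes the structured-type semantics mode is enabled):

```python
from snowflake.snowpark import Row

# A nested Row supplies the value for what becomes a StructType column,
# similar to how PySpark treats Row values.
address = Row(street="1 Main St", zip=12345)
df = session.create_dataframe([("Alice", address)])
# With this change, df.schema should contain a nested StructType for the address
# column; an explicit schema with a matching nested StructType also works.
```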