Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1829870: Allow structured types to be enabled by default #2727

Merged
merged 24 commits into from
Jan 4, 2025

Conversation

sfc-gh-jrose
Copy link
Contributor

@sfc-gh-jrose sfc-gh-jrose commented Dec 6, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1829870

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

    This PR addresses a few misc issue involved with enabling structured types by default.

  • Change to_sql to support structured types. There were some edge cases where it was generating incorrect sql.
  • When the structured type semantic mode is enabled structured types are assumed to be structured and to have semi-structured types you have to set the element types to None for ArrayType, MapType, and StructType.
  • In cases where StringType was being passed to init for those types None is now passed by default allowing the type to decide if it should default to StringType.
  • When the new mode is enabled types will prefer to be structured if element typing is available.

Before this change the in order to have a structured schema you would have to set the structured parameter to True on each type that was expected to be structured:

structured_schema = StructType([
    StructField("array", ArrayType(IntegerType(), structured=True)),
    StructField("map", MapType(IntegerType(), IntegerType(), structured=True)),
    StructField("struct", StructType([StructField("field", IntegerType())], structured=True)),
])

semi_structured_schema = StructType([
    StructField("array", ArrayType(IntegerType())),
    StructField("map", MapType(IntegerType(), IntegerType())),
    StructField("struct", StructType([StructField("field", IntegerType())])),
])

also_semi_structured_schema = StructType([
    StructField("array", ArrayType()),
    StructField("map", MapType()),
    StructField("struct", StructType()),
])

# Both of the semi-structured schemas actually materialize to this:
StructType([
	StructField('ARRAY', ArrayType(StringType()), nullable=True),
	StructField('MAP', MapType(StringType(), StringType()), nullable=True),
	StructField('STRUCT', MapType(StringType(), StringType()), nullable=True)
])

After this change the second case above will instead be inferred to be structured which should be more understandable to users. We've seen confusion in the past with the inner types being converted to StringType being non-inuitive.

After the change the definitions will look like this:

structured_schema = StructType([
    StructField("array", ArrayType(IntegerType())),
    StructField("map", MapType(IntegerType(), IntegerType())),
    StructField("struct", StructType([StructField("field", IntegerType())])),
])

semi_structured_schema = StructType([
    StructField("array", ArrayType()),
    StructField("map", MapType()),
    StructField("struct", StructType()),
])

# Semi structured actually materializes to what it is set to.

@sfc-gh-jrose sfc-gh-jrose added DO-NOT-MERGE NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md labels Dec 6, 2024
Copy link

github-actions bot commented Dec 6, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

Copy link

github-actions bot commented Dec 6, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jrose sfc-gh-jrose requested a review from a team December 6, 2024 21:57
@sfc-gh-jrose sfc-gh-jrose marked this pull request as ready for review December 16, 2024 18:56
@sfc-gh-jrose sfc-gh-jrose requested review from a team as code owners December 16, 2024 18:56
Comment on lines +343 to +344
self.element_type = element_type
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can element_type be None here? What does it mean for the column type to be so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If element_type is None than this is a semi-structured array column and could contain anything.

sfc-gh-vbudati and others added 3 commits December 16, 2024 15:33
)

1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.

   Fixes SNOW-1852779

2. Fill out the following pre-review checklist:

- [ ] I am adding a new automated test(s) to verify correctness of my
new code
- [ ] If this test skips Local Testing mode, I'm requesting review from
@snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
- [ ] If this is a new feature/behavior, I'm adding the Local Testing
parity changes.
- [x] I acknowledge that I have ensured my changes to be thread-safe.
Follow the link for more information: [Thread-safe Developer
Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

Fixed AST encoding for Column `in_`, `asc`, and `desc`. Removed an
unused entity and renamed `sp_column_in__seq` to `sp_column_in`. Changed
the `nulls_first` optional boolean param to be an enum.
Copy link

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jrose sfc-gh-jrose requested a review from a team December 17, 2024 00:10
@sfc-gh-jrose sfc-gh-jrose merged commit 888cec5 into main Jan 4, 2025
42 checks passed
@sfc-gh-jrose sfc-gh-jrose deleted the jrose_snow_1829870_structured_by_default branch January 4, 2025 00:05
@github-actions github-actions bot locked and limited conversation to collaborators Jan 4, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants