StructuredDatasetTransformerEngine should derive default protocol from raw output prefix #1107

Merged
merged 8 commits into master
Jul 21, 2022

@wild-endeavor wild-endeavor commented Jul 20, 2022

TL;DR

Instead of relying on the default protocol in most cases, let's infer the protocol from the raw output prefix.

One downside of this change is that the encoders/decoders we provide no longer set the default storage format either. Fortunately, they all use the Parquet format, which is already the default supplied in various places when no format is given. We may need to add a function to the transformer engine API in the future just to set the default format.
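
To make the idea concrete, here is a minimal sketch of deriving a protocol from a raw output prefix and falling back to Parquet when no format is given. It is not the flytekit implementation: the helper names `protocol_from_prefix` and `resolve_format`, the `"file"` label for local paths, and the example prefixes are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# The PR states that the bundled encoders/decoders all use Parquet.
DEFAULT_FORMAT = "parquet"
# Assumption: label used in this sketch for plain local paths.
LOCAL_PROTOCOL = "file"


def protocol_from_prefix(raw_output_prefix: str) -> str:
    """Return the URI scheme of the prefix, or LOCAL_PROTOCOL for plain paths.

    Hypothetical helper, not a flytekit function.
    """
    scheme = urlparse(raw_output_prefix).scheme
    # An empty scheme means a plain path; a one-letter scheme is most
    # likely a Windows drive letter, so both are treated as local.
    if len(scheme) <= 1:
        return LOCAL_PROTOCOL
    return scheme


def resolve_format(requested: str = "") -> str:
    """Fall back to the default storage format when none is requested."""
    return requested or DEFAULT_FORMAT


if __name__ == "__main__":
    for prefix in ("s3://my-bucket/raw", "gs://my-bucket/raw", "/tmp/flyte/raw"):
        print(prefix, "->", protocol_from_prefix(prefix), resolve_format())
```

With this approach, a handler registered for a given dataframe type would be looked up by the protocol implied by where the data is actually being written, rather than by a single process-wide default.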

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

  • Update the Spark and Polars encoders/decoders so that they also do not set the default. Both use the Parquet storage format, so this should be safe.

Tracking Issue

flyteorg/flyte#2684

Signed-off-by: Yee Hing Tong <[email protected]>
@wild-endeavor wild-endeavor changed the title wip StructuredDatasetTransformerEngine should derive default protocol from raw output prefix Jul 21, 2022
@wild-endeavor wild-endeavor marked this pull request as ready for review July 21, 2022 17:16

codecov bot commented Jul 21, 2022

Codecov Report

Merging #1107 (c404075) into master (c5a9468) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1107      +/-   ##
==========================================
+ Coverage   86.93%   86.95%   +0.02%     
==========================================
  Files         275      276       +1     
  Lines       25448    25492      +44     
  Branches     2862     2865       +3     
==========================================
+ Hits        22123    22167      +44     
  Misses       2847     2847              
  Partials      478      478              
| Impacted Files | Coverage Δ |
|---|---|
| flytekit/core/data_persistence.py | 75.23% <100.00%> (+0.35%) ⬆️ |
| flytekit/types/structured/basic_dfs.py | 100.00% <100.00%> (ø) |
| flytekit/types/structured/structured_dataset.py | 93.04% <100.00%> (+0.14%) ⬆️ |
| tests/flytekit/unit/core/test_data_persistence.py | 100.00% <100.00%> (ø) |
| tests/flytekit/unit/core/test_dataclass.py | 100.00% <100.00%> (ø) |
| ...ests/flytekit/unit/core/test_structured_dataset.py | 99.19% <100.00%> (+0.05%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c5a9468...c404075. Read the comment docs.

@wild-endeavor wild-endeavor merged commit c6e7237 into master Jul 21, 2022
wild-endeavor added a commit that referenced this pull request Aug 2, 2022
wild-endeavor added a commit that referenced this pull request Aug 2, 2022