Annotated StructuredDataset: support nested_types
#2252
@gitgraghu, @dylanwilder
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@             Coverage Diff              @@
##           master    #2252       +/-   ##
============================================
+ Coverage   83.04%   97.40%   +14.36%
============================================
  Files         324        9      -315
  Lines       24861      231    -24630
  Branches     3547        0     -3547
============================================
- Hits        20645      225    -20420
+ Misses       3591        6     -3585
+ Partials      625        0      -625
@austin362667 it doesn't work for me. Try the example below:
from typing import Annotated

import pandas as pd
from flytekit import StructuredDataset, kwtypes, task, workflow, ImageSpec

data = {
    'company': 'XYZ pvt ltd',
    'location': 'London',
    'info': {
        'president': 'Rakesh Kapoor',
        'contacts': {
            'email': '[email protected]',
            'tel': '9876543210'
        }
    }
}

# MyDataset = Annotated[StructuredDataset, kwtypes(company=str)]
MyDataset = Annotated[StructuredDataset, kwtypes(info={"president": str})]


@task
def create_bq_table() -> StructuredDataset:
    # max_level=0 keeps the nested 'info' dict in a single column
    df = pd.json_normalize(data, max_level=0)
    print("dataframe: \n", df)
    return StructuredDataset(dataframe=df)
    # return StructuredDataset(
    #     dataframe=df, uri="bq://dogfood-gcp-dataplane:dataset.nested_type"
    # )


@task
def print_table(sd: MyDataset) -> pd.DataFrame:
    t = sd.open(pd.DataFrame).all()
    print(t)
    return t


@workflow
def wf():
    sd = create_bq_table()
    print_table(sd=sd)
Force-pushed from 9c1dc41 to 664c32a
@pingsutw Thank you for providing these useful use cases.
Force-pushed from bab8b55 to 4d95e19
@gitgraghu @dylanwilder It worked when I passed it in my own
from typing import Annotated
from dataclasses import dataclass

import pandas as pd
from flytekit import StructuredDataset, kwtypes, task, workflow, ImageSpec

flytekit_dev_version = "https://github.com/austin362667/flytekit.git@90a19fc51d1b0eb77b020140810883a317432675"
image = ImageSpec(
    packages=[
        "pandas",
        "google-cloud-bigquery",
        "google-cloud-bigquery-storage",
        f"git+{flytekit_dev_version}",
        f"git+{flytekit_dev_version}#subdirectory=plugins/flytekit-bigquery",
    ],
    apt_packages=["git"],
    files=["./keys/gcp-service-account.json"],
    env={"GOOGLE_APPLICATION_CREDENTIALS": "./gcp-service-account.json"},
    platform="linux/arm64",
    registry="localhost:30000",
)

data = [
    {
        'company': 'XYZ pvt ltd',
        'location': 'London',
        'info': {
            'president': 'Rakesh Kapoor',
            'contacts': {
                'email': '[email protected]',
                'tel': '9876543210'
            }
        }
    },
    {
        'company': 'ABC pvt ltd',
        'location': 'USA',
        'info': {
            'president': 'Kapoor Rakesh',
            'contacts': {
                'email': '[email protected]',
                'tel': '0123456789'
            }
        }
    }
]


@dataclass
class ContactsField():
    # email: str
    tel: str


@dataclass
class InfoField():
    # president: str
    contacts: ContactsField


@dataclass
class CompanyField():
    company: str
    # location: str
    # info: InfoField


# MyArgDataset = Annotated[StructuredDataset, kwtypes(company=str)]
# MyDictDataset = Annotated[StructuredDataset, kwtypes(info={"president": str})]
# MyDataClassDataset = Annotated[StructuredDataset, kwtypes(info=InfoField)]
# MyDataClassDataset = Annotated[StructuredDataset, kwtypes(CompanyField)]
# MyDictDataset = Annotated[StructuredDataset, kwtypes(info={"president": str, "contacts":{"email":str}})]
MyDictDataset = Annotated[StructuredDataset, kwtypes(info={"contacts":{"tel":str}})]
MyDataClassDataset = Annotated[StructuredDataset, kwtypes(info=kwtypes(contacts=ContactsField))]


@task(container_image=image)
def create_bq_table() -> StructuredDataset:
    df = pd.json_normalize(data, max_level=0)
    print("original dataframe: \n", df)
    # return StructuredDataset(dataframe=df)
    return StructuredDataset(
        dataframe=df,
        # uri="gs://flyte_austin362667_bucket/nested_types"
        uri="bq://flyte-austin362667-gcp:dataset.nested_type"
    )


@task(container_image=image)
def print_table_by_dict(sd: MyDictDataset) -> pd.DataFrame:
    t = sd.open(pd.DataFrame).all()
    print("MyDictDataset dataframe: \n", t)
    return t


@task(container_image=image)
def print_table_by_dataclass(sd: MyDataClassDataset) -> pd.DataFrame:
    t = sd.open(pd.DataFrame).all()
    print("MyDataClassDataset dataframe: \n", t)
    return t


@workflow
def wf():
    sd = create_bq_table()
    print_table_by_dict(sd=sd)
    print_table_by_dataclass(sd=sd)
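For readers following along, here is a rough sketch of why the dict form and the dataclass form in the example above can describe the same nested column. It is not the PR's code: dotted_column_types is a hypothetical helper, and the assumption (taken from the PR description) is that nested column specs end up as dotted keys. It only uses the standard library.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Dict


@dataclass
class ContactsField:
    tel: str


@dataclass
class InfoField:
    contacts: ContactsField


def dotted_column_types(spec: Any, prefix: str = "") -> Dict[str, type]:
    """Hypothetical helper (not part of flytekit): map a nested dict or
    dataclass type spec to {"a.b.c": leaf_type}."""
    if is_dataclass(spec):
        # Turn a dataclass into a {field_name: field_type} dict first.
        spec = {f.name: f.type for f in fields(spec)}
    flat: Dict[str, type] = {}
    for name, typ in spec.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(typ, dict) or is_dataclass(typ):
            # Recurse into nested dicts and nested dataclasses.
            flat.update(dotted_column_types(typ, path))
        else:
            flat[path] = typ
    return flat


dict_spec = {"info": {"contacts": {"tel": str}}}
dataclass_spec = {"info": InfoField}
print(dotted_column_types(dict_spec))
print(dotted_column_types(dataclass_spec))
```

Both calls should produce the same single dotted entry for info.contacts.tel, which is the equivalence the two annotated dataset types above rely on.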
Force-pushed from 4d95e19 to 90a19fc
Signed-off-by: Austin Liu <[email protected]> wip Signed-off-by: Austin Liu <[email protected]> fmt Signed-off-by: Austin Liu <[email protected]> fix Signed-off-by: Austin Liu <[email protected]> fix Signed-off-by: Austin Liu <[email protected]> fix Signed-off-by: Austin Liu <[email protected]>
Force-pushed from f5cd70d to d2e0821
Force-pushed from d2e0821 to 8c8cad4
Signed-off-by: Austin Liu <[email protected]> fmt Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Force-pushed from 8c8cad4 to bf892f7
my_cols = kwtypes(Name=str, Age=int)
my_dataclass_cols = kwtypes(MyCols)
my_dict_cols = kwtypes({"Name": str, "Age": int})
fields = [("Name", pa.string()), ("Age", pa.int32())]
arrow_schema = pa.schema(fields)
pd_df = pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [20, 22]})
Perhaps add more nested dataframes to cover extreme test cases.
Could we just add your example to the unit tests? We can add a new file (test_structured_dataset_workflow_with_nested_type.py) to tests/flytekit/unit/types/structured_dataset.
done
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
pd_df = pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [20, 22]})


class MockBQEncodingHandlers(StructuredDatasetEncoder):
Could we move this class to ./types/structured_dataset/conftest.py, and add a fixture decorator to it?
)


class MockBQDecodingHandlers(StructuredDatasetDecoder):
ditto
I'd better just remove the useless duplicated code (MockBQEncodingHandlers, MockBQDecodingHandlers).
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
LGTM
Signed-off-by: Austin Liu <[email protected]> Signed-off-by: Kevin Su <[email protected]> Co-authored-by: Kevin Su <[email protected]>
Signed-off-by: Austin Liu <[email protected]> Signed-off-by: Kevin Su <[email protected]> Co-authored-by: Kevin Su <[email protected]>
Signed-off-by: Austin Liu <[email protected]> Signed-off-by: Kevin Su <[email protected]> Co-authored-by: Kevin Su <[email protected]> Signed-off-by: Jan Fiedler <[email protected]>
Tracking issue
flyteorg/flyte#4241
Why are the changes needed?
Currently, StructuredDatasets only support flat schemas.
This PR aims to support nested types given in the form of a dict/json, a dataclass, or named args/kwargs.
What changes were proposed in this pull request?
A flatten_dict() tool function in structured_dataset.py.
kwtypes() to pass in types. Check comments.
After we get a list of SUPPORTED_TYPES, we select them by a series of keys joined by "." (a nested dict is flattened to {'a.b.c.d': 'vvv', 'e.f': 'www'}), for example "c.d".
How was this patch tested?
Please take a look at the screenshots and examples below.
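The flattening step named under "What changes were proposed" can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the signature of flatten_dict here (parent_key, sep) and the nested input dict are assumptions chosen so that the result matches the flattened dict quoted in the description.

```python
from typing import Any, Dict


def flatten_dict(nested: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Sketch only: collapse nested dicts into one level, joining keys with `sep`.
    The real helper in structured_dataset.py may differ in signature and behavior."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        path = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Descend into nested dicts, carrying the dotted prefix along.
            flat.update(flatten_dict(value, path, sep))
        else:
            flat[path] = value
    return flat


# Assumed input; the PR description only shows the flattened result.
nested = {"a": {"b": {"c": {"d": "vvv"}}}, "e": {"f": "www"}}
print(flatten_dict(nested))
```

With dotted keys in hand, selecting a nested column becomes a plain dictionary lookup on the joined path. Note that this sketch does not handle keys that themselves contain the separator.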
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Docs link