Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Can't use int types on tasks, flyte casts to float between tasks #4505

Open
2 tasks done
jiwidi opened this issue Nov 30, 2023 · 30 comments
Open
2 tasks done

[BUG] Can't use int types on tasks, flyte casts to float between tasks #4505

jiwidi opened this issue Nov 30, 2023 · 30 comments
Assignees
Labels
backlogged For internal use. Reserved for contributor team workflow. bug Something isn't working flytekit FlyteKit Python related issue

Comments

@jiwidi
Copy link

jiwidi commented Nov 30, 2023

Describe the bug

I'm trying to use the int type along my workflow but it keeps being parsed as float and complaining about it when passing it downstream to other tasks. I even try to cast it within a task but it casts it back to float.

Expected behavior

For objects within tasks to remain typed as int or flyte to cast them properly (complains about not being able to cast a float with no decimals to int, which is possible)

Additional context to reproduce

I have this main.py file containing my workflow:

from flytekit import task, workflow
import os
import yaml

@task
def read_yaml(yaml_path: str) -> dict:
    with open(yaml_path, "r") as file:
        config = yaml.safe_load(file)
        config['id'] = int(config['id'])
        return config

@task
def check_int(x: int) -> int:
    return x

@workflow
def workflow(yaml_path: str) -> int:
    config = read_yaml(yaml_path=yaml_path)
    return check_int(x=config['id'])

And a config.yaml where the int should be read from:

id: !!int 1

When I run this with the command

pyflyte run main.py workflow --yaml_path config.yaml

I get the error:

Running Execution on local.
2023-11-30 15:36:01,527455 ERROR    {"asctime": "2023-11-30 15:36:01,527", "name": "flytekit", "levelname": "ERROR", "message": "Failed to convert inputs of task 'main.check_int':\n  Error  base_task.py:621
                                    converting input 'x' at position 0:\n  Cannot convert literal <FlyteLiteral scalar { primitive { float_value: 1 } }> to <class 'int'>"}
Failed with Unknown Exception <class 'flytekit.core.type_engine.TypeTransformerFailedError'> Reason: Encountered error while executing workflow 'main.workflow':
  Error encountered while executing 'workflow':
  Failed to convert inputs of task 'main.check_int':
  Error converting input 'x' at position 0:
  Cannot convert literal <FlyteLiteral scalar { primitive { float_value: 1 } }> to <class 'int'>
Encountered error while executing workflow 'main.workflow':
  Error encountered while executing 'workflow':
  Failed to convert inputs of task 'main.check_int':
  Error converting input 'x' at position 0:
  Cannot convert literal <FlyteLiteral scalar { primitive { float_value: 1 } }> to <class 'int'>

The yaml is typed to be int and i even cast it again within the read task but it still gets casted to float.

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@jiwidi jiwidi added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Nov 30, 2023
Copy link

dosubot bot commented Nov 30, 2023

🤖

Hello @jiwidi,

Nice to meet you! I'm Dosu, an AI bot here to assist you with bugs, answer your questions, and guide you as a contributor while we wait for a human maintainer. I'm here to help, so don't hesitate to ask if you have any further questions or issues!

I understand that you're having an issue with integer types being cast to float in your Flyte workflow. This seems to be a known issue that has been reported in the past. Specifically, there are two similar issues that have been resolved:

  1. Dataclass int field becomes float when passed to task
  2. int values in Optional[int] dataclass fields are converted to floats when passed into Flyte tasks

Both of these issues were addressed by the following pull requests:

  1. Fix the type of optional[int] in nested dataclass
  2. Fix the type of optional[int] in dataclass

Based on the error message you provided, it seems that the integer value is being interpreted as a float when it's passed to the check_int task. This could be due to the way the read_yaml task is returning the value.

In Flyte, the type conversion process is handled by the AreTypesCastable function, which checks if the source type can be cast to the target type. According to the source code, integers should not be castable to floats and vice versa.

Given this, it's possible that there might be a bug in Flyte causing this issue. However, it's also possible that the issue might be due to the way the read_yaml task is returning the value.

To help narrow down the issue, could you please provide the following additional information:

  1. The version of Flyte you're using.
  2. The version of the Flytekit Python SDK you're using.
  3. Any additional context or code that might help reproduce the issue.

This information will help us better understand the issue and provide a more accurate solution. Thank you for your patience and cooperation.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@eapolinario eapolinario added flytekit FlyteKit Python related issue backlogged For internal use. Reserved for contributor team workflow. and removed untriaged This issues has not yet been looked at by the Maintainers labels Nov 30, 2023
@jasonlai1218
Copy link
Contributor

I would like to work on this, can you assign me, please?

@jasonlai1218
Copy link
Contributor

#take

@jiwidi
Copy link
Author

jiwidi commented Jan 9, 2024

@jasonlai1218 hey! did you had the time to look on this?

@jiwidi
Copy link
Author

jiwidi commented Jan 18, 2024

@eapolinario @jasonlai1218 hello?

@jasonlai1218 jasonlai1218 removed their assignment Jan 25, 2024
@jasonlai1218
Copy link
Contributor

Sorry, I'm busy with the company's work recently and don't have time to look on this

@fellhorn
Copy link
Contributor

fellhorn commented Feb 8, 2024

Please note that the flytekit type engine serializes untyped dicts (dict[str, Any]) like the one in your example to a json string.
json does not know about different number types and only uses float.

You may either use a dataclass or an univariate dictionary to keep the int type. No matter what, you will need to provide additional hints to the type engine. I can't figure out on its own how to serialize the untyped dictionary correctly.

Here an example for an univariate dict:

from flytekit import workflow, task

@task
def a() -> dict[str, int]:
    return {
        "a": 1,
    }

@task
def b(inp: dict[str, int]):
    print(inp)
    print(type(inp["a"]))


@workflow
def test():
    result = a()
    b(inp=result)

output of task b:

{'a': 1}
<class 'int'>

@Future-Outlier Future-Outlier self-assigned this Feb 15, 2024
@jiwidi
Copy link
Author

jiwidi commented Feb 20, 2024

Please note that the flytekit type engine serializes untyped dicts (dict[str, Any]) like the one in your example to a json string. json does not know about different number types and only uses float.

You may either use a dataclass or an univariate dictionary to keep the int type. No matter what, you will need to provide additional hints to the type engine. I can't figure out on its own how to serialize the untyped dictionary correctly.

Here an example for an univariate dict:

from flytekit import workflow, task

@task
def a() -> dict[str, int]:
    return {
        "a": 1,
    }

@task
def b(inp: dict[str, int]):
    print(inp)
    print(type(inp["a"]))


@workflow
def test():
    result = a()
    b(inp=result)

output of task b:

{'a': 1}
<class 'int'>

so dictionaries in flyte are strict to keep same value type for all keys otherwise the typing fails? I cant use dict[str, int] because some other keys have string values :/

@ai-rnatour
Copy link

We're seeing a similar issue - in our case we have dataclasses that may contain arbitrary JSON user input, and Flyte keeps converting ints to floats every time we pass the object between tasks. It would be nice if Flyte respected the runtime type in this case given that both integers and floats are valid JSON.

@fellhorn
Copy link
Contributor

We're seeing a similar issue - in our case we have dataclasses that may contain arbitrary JSON user input, and Flyte keeps converting ints to floats every time we pass the object between tasks. It would be nice if Flyte respected the runtime type in this case given that both integers and floats are valid JSON.

JSON does not have distinct types for integers and floating-point values. Therefore, the presence or absence of a decimal point is not enough to distinguish between integers and non-integers. For example, 1 and 1.0 are two ways to represent the same value in JSON.

Source

There is no way to distinguish between integers and floats in pure vanilla JSON without adding an additional schema, which in the end also means providing additional type information.

@fellhorn
Copy link
Contributor

so dictionaries in flyte are strict to keep same value type for all keys otherwise the typing fails? I cant use dict[str, int] because some other keys have string values :/

You can also use dataclasses, which I would usually recommend here. Generally you will need to provide some more type information at registration time of the workflow. A dict which can be anything at runtime is not handled in a pure python way in Flyte and probably causes confusion for python developers (especially as workflows behave differently when executed locally or remote). I am not sure why they are serialized as JSON. Maybe to support other languages but python or to allow accessing elements from the backend written in Go.

You can take a look at the overview how Flyte serializes certain python datatypes here.

Serializing the dict by pickling it would keep all python types but make it less versatile. See the Flyte documentation about this:

Pickle can only be used to send objects between the exact same version of Python, and we strongly recommend to use python type that flyte support or register a custom transformer

If you really want to use pickle and keep the untyped dicts, you could probably wrap them in a python class to enforce the Flyte Pickle type transformer. See https://docs.flyte.org/projects/cookbook/en/v0.3.66/auto/core/type_system/flyte_pickle.html

As I said, I don't think this is the best way to solve the problem but it should work.

@jiwidi
Copy link
Author

jiwidi commented Feb 23, 2024

We're seeing a similar issue - in our case we have dataclasses that may contain arbitrary JSON user input, and Flyte keeps converting ints to floats every time we pass the object between tasks. It would be nice if Flyte respected the runtime type in this case given that both integers and floats are valid JSON.

JSON does not have distinct types for integers and floating-point values. Therefore, the presence or absence of a decimal point is not enough to distinguish between integers and non-integers. For example, 1 and 1.0 are two ways to represent the same value in JSON.

Source

There is no way to distinguish between integers and floats in pure vanilla JSON without adding an additional schema, which in the end also means providing additional type information.

I dont think you need to provide that additional typing information to make a multi type dictionary work. Flyte could definitely implement something in its code to look at types, infer and consequently load afterwards.

You are correct with those json limitations but I've seen them overcome by libraries multiple times in the pasts. This is not an excuse for flyte to not fix this issue.

@fellhorn
Copy link
Contributor

fellhorn commented Feb 23, 2024

I noticed that Union is also a valid univariate type for the default dict type transformer. Thus this also works:

from typing import Union

from flytekit import workflow, task

@task
def a() -> dict[str, Union[str, int, float]]:
    return {"str": "string", "int": 1, "float": 1.0}

@task
def b(inp: dict[str, Union[str, int, float]]):
    print(inp)
    print(type(inp["float"]))

@workflow
def main():
    b(inp=a())

Assuming you only use simple types within your dict, this might solve your problem without having to use dataclasses?

@ai-rnatour
Copy link

We're seeing a similar issue - in our case we have dataclasses that may contain arbitrary JSON user input, and Flyte keeps converting ints to floats every time we pass the object between tasks. It would be nice if Flyte respected the runtime type in this case given that both integers and floats are valid JSON.

JSON does not have distinct types for integers and floating-point values. Therefore, the presence or absence of a decimal point is not enough to distinguish between integers and non-integers. For example, 1 and 1.0 are two ways to represent the same value in JSON.

Source

There is no way to distinguish between integers and floats in pure vanilla JSON without adding an additional schema, which in the end also means providing additional type information.

Point taken - but is there a way to inject this type information at runtime? In our case we are passing this user input into a map_task inside a dynamic, so we could inspect the runtime type and provide that information. The task running within the map_task must be able to handle arbitrary JSON but invokes some C++ code that is sensitive to floats vs integer input, so we can't have all out numeric data converted to floats.

@ai-rnatour
Copy link

I noticed that Union is also a valid univariate type for the default dict type transformer. Thus this also works:

from typing import Union

from flytekit import workflow, task

@task
def a() -> dict[str, Union[str, int, float]]:
    return {"str": "string", "int": 1, "float": 1.0}

@task
def b(inp: dict[str, Union[str, int, float]]):
    print(inp)
    print(type(inp["float"]))

@workflow
def main():
    b(inp=a())

Assuming you only use simple types within your dict, this might solve your problem without having to use dataclasses?

Using a dataclass with a union type still shows z converted to a float when passed an int:

@dataclass
class Foo(DataClassJsonMixin):
    x: int
    y: float
    z: int | float


@task
def print_foo(*, foo: Foo) -> Foo:
    print(foo)
    print(f"x={type(foo.x)}")
    print(f"y={type(foo.y)}")
    print(f"z={type(foo.z)}")
    return foo


@dynamic
def _print_foo_dynamic(*, foo: Foo) -> Foo:
    return print_foo(foo=foo)


@workflow
def print_foo_wf(*, foo: Foo) -> Foo:
    return _print_foo_dynamic(foo=foo)

@ai-rnatour
Copy link

ai-rnatour commented Feb 29, 2024

Picking this work back up, this is something else I've noticed. In the following code:

import marshmallow as mm

@dataclass
class Foo(DataClassJsonMixin):
    x: int
    y: float
    z: Any = field(metadata=config(mm_field=mm.fields.Integer()))


@task
def print_foo(*, foo: Foo) -> Foo:
    print(foo)
    print(f"x={type(foo.x)}")
    print(f"y={type(foo.y)}")
    print(f"z={type(foo.z)}")
    return foo


@dynamic
def _print_foo_dynamic(*, foo: Foo) -> Foo:
    return print_foo(foo=foo)


@workflow
def print_foo_wf(*, foo: Foo) -> Foo:
    return _print_foo_dynamic(foo=foo)

The Flyte UI will show that the type of z is an int and that the entire dataclass is a well-defined struct, however when examining the task logs the runtime type of z will still be cast to a float when the workflow input is an integer.

@ai-rnatour
Copy link

ai-rnatour commented Mar 1, 2024

I found a quick fix that seems to be working for us, any reason why it would be a bad idea to do this?

from flytekit.types.pickle import FlytePickle

@dataclass
class Foo(DataClassJsonMixin, FlytePickle):
    x: int
    y: float
    z: int | float


@task
def print_foo(*, foo: Foo) -> Foo:
    print(foo)
    print(f"x={type(foo.x)}")
    print(f"y={type(foo.y)}")
    print(f"z={type(foo.z)}")
    return foo


@dynamic
def _print_foo_dynamic(*, foo: Foo) -> Foo:
    return print_foo(foo=foo)


@workflow
def print_foo_wf(*, foo: Foo) -> Foo:
    return _print_foo_dynamic(foo=foo)

Appears like this allows us to use JSON dataclasses as we currently are, but forces Flyte to pickle them without any other code changes.

@davidmirror-ops
Copy link
Contributor

cc @wild-endeavor / @EngHabu

@gvashishtha
Copy link
Contributor

While FlytePickle did get us unblocked, it causes flyte caching to break.

@gvashishtha
Copy link
Contributor

gvashishtha commented Apr 16, 2024

If I switch to using the mashumaro mixin as recommended in the flyte docs then x always gets cast to an int (even if I pass in a float). I think this is a bug?

from flytekit import task, workflow, dynamic
from mashumaro.mixins.json import DataClassJSONMixin
from dataclasses import dataclass

@dataclass
class Foo(DataClassJSONMixin):
    x: float | int


@task
def print_foo(*, foo: Foo) -> Foo:
    print(foo)
    print(f"x={type(foo.x)}")
    return foo


@dynamic
def _print_foo_dynamic(*, foo: Foo) -> Foo:
    return print_foo(foo=foo)


@workflow
def print_foo_wf(*, foo: Foo) -> Foo:
    return _print_foo_dynamic(foo=foo)

Fairly confident this is a flyte issue because it doesn't happen with just mashumaro:

>>> from mashumaro.mixins.json import DataClassJSONMixin
>>> from dataclasses import dataclass
>>> from typing import Any
>>> 
>>> @dataclass
... class Foo(DataClassJSONMixin):
...     x: float | int
... 
>>> test = Foo(5)
>>> test1 = Foo.from_json(test.to_json())
>>> test1
Foo(x=5)
>>> type(test1.x)
<class 'int'>

@granthamtaylor
Copy link

granthamtaylor commented Apr 16, 2024

If you really, really want to be able to store the type of the value, you could use a more complex data class that also stores the type of the value as well:

from dataclasses import dataclass
from mashumaro.mixins.json import DataClassJSONMixin

@dataclass
class MyOutput(DataClassJSONMixin):
    
    _value: float
    _is_int: bool
    
    @classmethod
    def from_value(cls, value: float|int) -> "MyOutput":
        
        assert isinstance(value, (float, int)), \
            f"value must be of type int or float, not {type(value)}"
        
        if isinstance(value, int):
            return cls(_is_int =True, _value=float(value))
        else:
            return cls(_is_int =False, _value=value)
    
    @property
    def value(self) -> float|int:
        
        if self._is_int:
            return int(self._value)
        else:
            return float(self._value)
        
    def __repr__(self):
        
        _type = "int" if self._is_int else "float"
        return f"{self.__class__.__name__}({self.value}: {_type})"


my_output_instance = MyOutput.from_value(3.1232)
serialized = my_output_instance.to_json()
deserialized = MyOutput.from_json(serialized)

print(deserialized) # MyOutput(3.1232: float)

my_output_instance = MyOutput.from_value(33)
serialized = my_output_instance.to_json()
deserialized = MyOutput.from_json(serialized)

print(deserialized) # MyOutput(33: int)

I do frequently use such @classmethods with my data classes to automate complex initiation logic such that I am able to cleanly use the data classes within my tasks.

@gvashishtha
Copy link
Contributor

grantham, did you see my last comment? We are using DataClassJSONMixin, but we still see this bug

@ai-rnatour
Copy link

If you really, really want to be able to store the type of the value, you could use a more complex data class that also stores the type of the value as well:

from dataclasses import dataclass
from mashumaro.mixins.json import DataClassJSONMixin

@dataclass
class MyOutput(DataClassJSONMixin):
    
    _value: float
    _is_int: bool
    
    @classmethod
    def from_value(cls, value: float|int) -> "MyOutput":
        
        assert isinstance(value, (float, int)), \
            f"value must be of type int or float, not {type(value)}"
        
        if isinstance(value, int):
            return cls(_is_int =True, _value=float(value))
        else:
            return cls(_is_int =False, _value=value)
    
    @property
    def value(self) -> float|int:
        
        if self._is_int:
            return int(self._value)
        else:
            return float(self._value)
        
    def __repr__(self):
        
        _type = "int" if self._is_int else "float"
        return f"{self.__class__.__name__}({self.value}: {_type})"


my_output_instance = MyOutput.from_value(3.1232)
serialized = my_output_instance.to_json()
deserialized = MyOutput.from_json(serialized)

print(deserialized) # MyOutput(3.1232: float)

my_output_instance = MyOutput.from_value(33)
serialized = my_output_instance.to_json()
deserialized = MyOutput.from_json(serialized)

print(deserialized) # MyOutput(33: int)

I do frequently use such @classmethods with my data classes to automate complex initiation logic such that I am able to cleanly use the data classes within my tasks.

We've looked at similar approaches as well as custom encodings to force the input to look like an arbitrary string but this all seems like an unnecessary burden.

Fundamentally, if I type a dataclass field as Any:

@dataclass
class Foo(DataClassJSONMixin):
    x: Any

or as a numeric union type:

@dataclass
class Foo(DataClassJSONMixin):
    x: int | float

Flyte should not cast integer inputs to floats. It is no longer following the schema I've provided.

In our case, a simpleis_int is not satisfactory because we need to handle complex nested data that could contain integers as well as floats. We can provide that detailed type information at runtime, but there's no documentation for how to communicate this to Flyte.

When typed as Any, I would expect Flyte to not make any type assumptions and instead respect whatever the serialized JSON of my field is. When typed as int | float, I would expect Flyte to respect either an integer or floating point number. This is how dataclasses-json works too.

@granthamtaylor
Copy link

I do not believe you understand the functionality of MyOutput as it is implemented above. In the above example, it will automatically coerce those floats back to integers upon deserialization, thus still conforming to its original datatype:

my_output_instance = MyOutput.from_value(33)
serialized = my_output_instance.to_json()
deserialized = MyOutput.from_json(serialized)

print(deserialized) # MyOutput(33: int)

You may also return list[MyOutput] and/or dict[str, MyOutput] using the data class I had previous described. Additionally, you may create other custom data classes around MyOutput as well. The serialization / deserialization operations will work as expected

However, if you are literally trying to return arbitrarily nested dictionaries / lists of floats and/or ints, the idea of having a task that returns an output like dict[str, list[float|int] | float | int] | list[dict[str, list[float | int] | float | int] | float | int seems quite dangerous. Having worked with KFP for several years and Flyte for the last few months, I very much appreciate their strict type enforcement. They force me to avoid bad patterns.

If you are working with complex, arbitrarily nested data, you might consider writing it to an actual file (flytekit.types.file.FlyteFile). Data formats such as avro works quite well with arbitrarily nested data. Avro can also differentiate between floats and ints.

@ai-rnatour
Copy link

I do not believe you understand the functionality of MyOutput as it is implemented above. In the above example, it will automatically coerce those floats back to integers upon deserialization, thus still conforming to its original datatype:

my_output_instance = MyOutput.from_value(33)
serialized = my_output_instance.to_json()
deserialized = MyOutput.from_json(serialized)

print(deserialized) # MyOutput(33: int)

You may also return list[MyOutput] and/or dict[str, MyOutput] using the data class I had previous described. Additionally, you may create other custom data classes around MyOutput as well. The serialization / deserialization operations will work as expected

However, if you are literally trying to return arbitrarily nested dictionaries / lists of floats and/or ints, the idea of having a task that returns an output like dict[str, list[float|int] | float | int] | list[dict[str, list[float | int] | float | int] | float | int seems quite dangerous. Having worked with KFP for several years and Flyte for the last few months, I very much appreciate their strict type enforcement. They force me to avoid bad patterns.

If you are working with complex, arbitrarily nested data, you might consider writing it to an actual file (flytekit.types.file.FlyteFile). Data formats such as avro works quite well with arbitrarily nested data. Avro can also differentiate between floats and ints.

I do understand the dataclass you wrote - as I said we have a similar approach in mind - but consider the difference in maintainability between what you have above and simply writing the union type int | float (which dataclasses-json already serializes and deserializes correctly as expected). Or having to hit an external service like S3 to persist one small part of our input in a bespoke way.

Our pipeline and CLI tooling handles the transformations on this arbitrary JSON input as well as the business logic around our strongly typed inputs quite well with dataclasses-json, except for this one hiccup where Flyte coerces all our numeric types to floats whenever we expect to handle an int | float.

My point is not that it is impossible for us to hack around this limitation: it's that this is not a good UX and not at all an intuitive way to handle a type annotated as int | float.

@kumare3
Copy link
Contributor

kumare3 commented Apr 17, 2024

@jiwidi / @gvashishtha / @fellhorn / @ai-rnatour
All of you folks are hitting correct and distinct points. Let me help understand Flyte's type system, some limitations, some solutions and also disambiguate various questions

** Caching offloaded, pickle, file, directory, dataframes with custom hashmethods **
Flyte caching uses shallow references. this is for speed. Hashing an entire directory contents can be super expensive and not worth it. But, users can override this behaviour by providing their own custom hash using a hashmethod.
@gvashishtha If the problem is simply caching - even for pickle you can use a custom HashMethod, Only a bad example is found in our docs.

def hash_pandas_dataframe(df: pd.DataFrame) -> str:
    return str(pd.util.hash_pandas_object(df))

# Type hint to use for pandas dataframes in flyte. This adds information for how to hash them so they can be cached.
CacheablePandasDataFrameT = Annotated[pd.DataFrame, HashMethod(hash_pandas_dataframe)]

For various things we will consider producer (one that produces the data object) and consumer (one that consumes the data object). Note they can be different languages, different machines etc. So Flyte uses Protocol buffers to transfer the data,

** dict passing **
dicts are untyped. Problem is Flyte cannot determine it within its type system. So ideally it should use the escape hatch of using Pickle!
Also protobuf Maps (equal to dicts) can only be str: Any value.
We should probably just revert back to using Pickle for dict type when origin key type is not str
But we have a weird error
Good News: Not a backwards incompatible change - lets fix this. Please help / contribute

** dataclasses and ints **
Again dataclasses are passed using protobuf structs (json repr) If you note sadly only number value possible is double. So the way Flytekit handles this, is that it uses the receiver type to actually identify the type and force cast into it.

So if you have a dataclass

class Foo:
   x: int

It can be easily forced to an int.

But now consider if you have a type

class Foo:
   x: int | Foo

The data gets transferred as a double/float value. And on the receiver side it is ambiguous to identify the type.
Infact Mashumuro will also not be able to correctly convert.

>>> print(Foo.from_json('{"x": 1.0}'))
Foo(x=1.0)

As it can only use the widest type.

And as @fellhorn correctly pointed this sticks to sadness that is JSON schema.

A solution might be to add RunTime type identification. But this is very expensive and may have significant overhead at runtime. One solution is to avoid union types where types are ambiguous - ints/floats/doubles

@Fatal1ty
Copy link

Fatal1ty commented Apr 17, 2024

But now consider if you have a type

class Foo:
   x: int | Foo

The data gets transferred as a double/float value. And on the receiver side it is ambiguous to identify the type. Infact Mashumuro will also not be able to correctly convert.

>>> print(Foo.from_json('{"x": 1.0}'))
Foo(x=1.0)

I think you have a mistake here. Here "x" can't become a float value unless you override serialization strategy for int:

from dataclasses import dataclass
from mashumaro.mixins.json import DataClassJSONMixin

@dataclass
class Foo(DataClassJSONMixin):
   x: int

print(Foo.from_json('{"x": 1.0}'))  # Foo(x=1)

Anyway, if you need to differentiate between integers and floats during deserialization from JSON, you can override the strategy globally. I can see that flytekit is deserializing JSON data into a python object in two ways:

  1. using JSONDecoder from mashumaro for dataclasses without from_json method (i.e. without DataClassJSONMixin)
  2. using existing from_json classmethod

In the first case you need a dialect:

from typing import Any, TypeVar, Type, Callable

from dataclasses import dataclass

from mashumaro.dialect import Dialect
from mashumaro.codecs.json import JSONDecoder

T = TypeVar("T", bound=Type)


def exact_type_unpacker(typ: T) -> Callable[[Any], T]:
    def unpack_value(value: Any) -> T:
        if isinstance(value, typ):
            return value
        raise ValueError(f"{value!r} is not of type {typ}")

    return unpack_value


class FlyteKitDialect(Dialect):
    serialization_strategy = {
        int: {"deserialize": exact_type_unpacker(int)},
        float: {"deserialize": exact_type_unpacker(float)},
    }


@dataclass
class Foo:
    x: int | float
    y: float | int


decoder = JSONDecoder(Foo, default_dialect=FlyteKitDialect)
print(decoder.decode('{"x": 42, "y": 42}'))  # Foo(x=42, y=42)
print(decoder.decode('{"x": 42.0, "y": 42.0}'))  # Foo(x=42.0, y=42.0)

decoder = JSONDecoder(int | float, default_dialect=FlyteKitDialect)
print(decoder.decode("42"))  # 42
print(decoder.decode("42.0"))  # 42.0

decoder = JSONDecoder(float | int, default_dialect=FlyteKitDialect)
print(decoder.decode("42"))  # 42
print(decoder.decode("42.0"))  # 42.0

In the second case this decoder can also be used but you would need to more clearly distinguish between from_json from mashumaro and from_json from others (simple issubclass check will work) if you want to continue using other json decoders with a similar interface.

If for some reason you don't want to use JSONDecoder for dataclasses with or without DataClassJSONMixin (from mashumaro), the flytekit end user would need to ensure that their dataclasses have the same rules as in FlyteKitDialect in one of the following ways:

I would personally prefer to use JSONDecoder or ORJSONDEcoder for any data type regardless of the presence of the mixin because it removes the burden of configuration from the user.


And one more thing. If you've bitten by casting to the first variant type in int | float or float | int during serialization, this is no longer the case in mashumaro 3.13. Before 3.13 version it is still possible to register pass_through serialization strategy for int and float in JSONEncoder here. Or just modify FlyteKitDialect and use it:

class FlyteKitDialect(Dialect):
    serialization_strategy = {
        int: {
            "deserialize": exact_type_unpacker(int),
            "serialize": pass_through,
        },
        float: {
            "deserialize": exact_type_unpacker(float),
            "serialize": pass_through,
        },
    }

encoder = JSONEncoder(Foo, default_dialect=FlyteKitDialect)

The same configuration burden will fall on the end user if this encoder is not used to serialize dataclasses with mixins.

@kumare3
Copy link
Contributor

kumare3 commented May 3, 2024

Folks I just made an issue that might solve all these problems check it out and please comment

@jiwidi
Copy link
Author

jiwidi commented May 22, 2024

somehow related issue i found now with pydantic objects

#5409

@eapolinario
Copy link
Contributor

@jiwidi , starting with flytekit 1.14.0 you should now be able to use ints in untyped dictionaries. In fact, we added better support for a bunch of json-like structures, e.g. pydantic and dataclasses, in that release. Can you give it a try?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlogged For internal use. Reserved for contributor team workflow. bug Something isn't working flytekit FlyteKit Python related issue
Projects
None yet
Development

No branches or pull requests