[Plugin] TypeTransformer for TensorFlow tf.data.Dataset #3038
Comments
@dennisobrien, thanks for creating the issue! Have a few questions:
cc: @cosmicBboy |
Curious about the answers to @samhita-alla's questions above. My additional question:
It would also help @dennisobrien if you could come up with some pseudocode snippets for how
The main thing I'm interested in is how
@task
def t1(...) -> tf.data.Dataset:
    dataset = tf.data.Dataset.range(100)
    return (
        dataset
        ...  # a bunch of transformations, potentially with local functions
    )

@task
def t2(dataset: tf.data.Dataset):
    ...
Suppose the Flyte TypeTransformer takes the output of On the flipside, when |
Sorry for the delay in responding.
The
If I understand this correctly the flow would go
I think this would work, but it would have the same effect as using
I think using I don't know if there is a way around this -- I haven't had the need to serialize/deserialize a dataset before using it, so I've really only researched it while thinking about using it with Flyte. Serializing/Deserializing a
I have only used |
Um, gotcha. I think that's expected cause the compute is serialized, too. And I think that's okay as long as we let the user know about it and the compute is supported by serialization.
I agree. I was referring to handling
@task
def produce_record(...) -> TFRecordFile:
    return tf.data.Dataset(...) |
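To make the TFRecord idea above a bit more concrete, here is a minimal sketch, independent of Flyte, of materializing a dataset into a TFRecord file and reading it back. It assumes every element is a single tensor of a known dtype, and `write_tfrecord` / `read_tfrecord` are hypothetical helper names, not part of TensorFlow or flytekit. How flytekit's `TFRecordFile` type would wrap such a file is not shown here.

```python
import os
import tempfile

import tensorflow as tf


def write_tfrecord(dataset: tf.data.Dataset, path: str) -> None:
    """Iterate the dataset (which runs its pipeline) and write one record per element."""
    with tf.io.TFRecordWriter(path) as writer:
        for element in dataset:
            # serialize_tensor returns a scalar string tensor; .numpy() yields bytes
            writer.write(tf.io.serialize_tensor(element).numpy())


def read_tfrecord(path: str, dtype: tf.DType) -> tf.data.Dataset:
    """Rebuild a dataset of tensors from the serialized records."""
    return tf.data.TFRecordDataset(path).map(
        lambda raw: tf.io.parse_tensor(raw, out_type=dtype)
    )


# The map step uses a local Python function; it is executed while writing.
source = tf.data.Dataset.range(5).map(lambda x: x * x)

with tempfile.TemporaryDirectory() as tmp:
    record_path = os.path.join(tmp, "data.tfrecord")
    write_tfrecord(source, record_path)
    squares = [int(x.numpy()) for x in read_tfrecord(record_path, tf.int64)]
```

Note that the file holds only the computed elements; the original `map` step is not stored, which is the "computing the pipeline" side effect discussed in this thread.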
Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏 |
Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏 |
Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. |
In the documentation for Distributed Tensorflow Training some tasks use |
@nightscape the type isn't available yet, so the data will be pickled. |
@nightscape, contributions are welcome. Here's some documentation on how to extend Flyte's types to cover custom types. |
Motivation: Why do you think this is important?
The tf.data.Dataset object encapsulates data as well as a preprocessing pipeline. It can be used in model fit, predict, and evaluate methods. It is widely used in TensorFlow tutorials and documentation and is considered a best practice when creating pipelines that saturate GPU resources.
Goal: What should the final outcome look like, ideally?
Flyte tasks should be able to pass tf.data.Dataset objects as parameters and accept them as return types.
Describe alternatives you've considered
There are caveats to passing tf.data.Dataset objects between tasks. Since a tf.data.Dataset object can have steps in the pipelines that use local Python functions (e.g., a map or filter step), there doesn't seem to be a way to serialize the object without effectively "computing" the graph pipeline. There are times this could be beneficial (doing an expensive preprocessing pipeline once can free up the CPU during training) but this could also be confusing to the Flyte end user.
So while adding a type transformer for tf.data.Dataset is certainly possible, it's still a good question if Flyte should actually support it given all the caveats. The alternative to consider here is to not support tf.data.Dataset. This seems like a question for the core Flyte team.
Propose: Link/Inline OR Additional context
There are at least three main ways to serialize/deserialize tf.data.Dataset objects.
These are probably in order of least complex to most complex. But determining the method of serialization/deserialization is an open question.
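As one illustration of what serializing and deserializing a dataset can look like (not necessarily one of the options this issue had in mind), here is a minimal sketch using `Dataset.save` and `tf.data.Dataset.load`. It assumes TensorFlow 2.10 or later; earlier releases expose similar functions under `tf.data.experimental`.

```python
import tempfile

import tensorflow as tf

# A small pipeline with a map step backed by a local Python function.
pipeline = tf.data.Dataset.range(5).map(lambda x: x * 2)

with tempfile.TemporaryDirectory() as tmp:
    # save() iterates the dataset, so the map step is computed here.
    pipeline.save(tmp)
    # load() reads the stored elements back; it does not restore the map step itself.
    restored = tf.data.Dataset.load(tmp)
    # Iterate while the directory still exists, since loading is lazy.
    restored_values = [int(x) for x in restored.as_numpy_iterator()]

# Compare as a sorted list; this sketch makes no claim about element order after load.
same_elements = sorted(restored_values) == [0, 2, 4, 6, 8]
```

The round trip preserves the elements but stores computed results rather than the pipeline definition, which matches the caveat described in the alternatives section.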
Some additional links:
tf.data.Dataset as a deep copy without having the side-effect of "computing" the pipeline.
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?