-
Notifications
You must be signed in to change notification settings - Fork 680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core Feature] Support Modin data types #1294
Comments
Thank you for opening your first issue here! 🛠 |
Hi @anton-malakhov firstly, thank you for opening this issue. I have been following modin and does seems really promising. When we started with the 2020 update of flytekit (this is the version that you see right now), we made a decision that Flytekit core itself will be very minimal, lightweight and devoid of major dependencies. We broke this rule for pandas, because we wanted a simplistic way to support schema types. We would like to keep the core this way. But, I recommend that we add, Modin as a flytekit plugin. I guess the problem with this approach might be that you have to actually This can be overcome by using pkgutil Currently flytekitplugins follow a convention of namespace packages, but are not automatically loaded. This is done to avoid loading of packages that are not directly relevant to the current execution unit (ensure that we do not incur performance penalties). But, for data plugins and some type plugins, we would like them to be loaded automatically as soon as the module is installed. For doing this, the proposal is to use explicit entrypoint specification in the plugin package that wants implicit loading. Thus most task type plugins that are auto loaded when they are explicitly imported by the user will continue to work as they are, but for implicitly loading a package, the plugin author will write an entrypoint that looks like the following setup(
...
entry_points={'flytekitplugins: 's3_data = flytekitplugins-data-s3'},
...
) Correspondingly, we will add the following entrypoint loader to flytekit.init from importlib.metadata import entry_points
discovered_plugins = entry_points(group='myapp.plugins')
for p in discovered_plugins:
p.load() This solution is explained in the Data Persistence plugin specification for Flytekit as here This is being tackled in #809 @anton-malakhov would you be open to contributing Modin as a |
Also cc @cosmicBboy, @wild-endeavor we should probably update pandera plugins setup.py to be auto-loaded? - https://github.com/flyteorg/flytekit/tree/master/plugins/flytekit-pandera. Samething for GE plugin @samhita-alla ? |
thank you @kumare3 for the fast and full answer! We are looking into what we can contribute, no promises though yet. |
@anton-malakhov absolutely. We are working on the data persistence and auto load systems. No hurry and any contributions are super welcome. Also please join our slack channel. The community would love to pitch in help wherever we can |
Motivation
Modin is an emerging drop-in replacement or rather extension of Pandas. But it is not directly supported by Flyte, which is asking to register a new transformer. That disrupts out-of-the box experience for data scientists. Can we integrate Modin as a first-class citizen like Pandas?
Goal
Support Modin data types with pre-packaged transformers.
Alternatives
Let users to write transformers for Modin themselves.
Additional context
This is the test program which is broken by absence of corresponding transformers:
https://gist.github.com/anton-malakhov/35c27cba59a8db16ffc0c6482c2ef714
The text was updated successfully, but these errors were encountered: