Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core Feature] Support Modin data types #1294

Closed
anton-malakhov opened this issue Jul 28, 2021 · 5 comments
Closed

[Core Feature] Support Modin data types #1294

anton-malakhov opened this issue Jul 28, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@anton-malakhov
Copy link
Contributor

Motivation
Modin is an emerging drop-in replacement or rather extension of Pandas. But it is not directly supported by Flyte, which is asking to register a new transformer. That disrupts out-of-the box experience for data scientists. Can we integrate Modin as a first-class citizen like Pandas?

Goal
Support Modin data types with pre-packaged transformers.

Alternatives
Let users to write transformers for Modin themselves.

Additional context
This is the test program which is broken by absence of corresponding transformers:
https://gist.github.com/anton-malakhov/35c27cba59a8db16ffc0c6482c2ef714

@anton-malakhov anton-malakhov added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Jul 28, 2021
@welcome
Copy link

welcome bot commented Jul 28, 2021

Thank you for opening your first issue here! 🛠

@kumare3
Copy link
Contributor

kumare3 commented Jul 29, 2021

Hi @anton-malakhov firstly, thank you for opening this issue. I have been following modin and does seems really promising. When we started with the 2020 update of flytekit (this is the version that you see right now), we made a decision that Flytekit core itself will be very minimal, lightweight and devoid of major dependencies. We broke this rule for pandas, because we wanted a simplistic way to support schema types. We would like to keep the core this way.

But, I recommend that we add, Modin as a flytekit plugin. I guess the problem with this approach might be that you have to actually shadow import the transformer.

This can be overcome by using pkgutil

Currently flytekitplugins follow a convention of namespace packages, but are not automatically loaded. This is done to avoid loading of packages that are not directly relevant to the current execution unit (ensure that we do not incur performance penalties). But, for data plugins and some type plugins, we would like them to be loaded automatically as soon as the module is installed. For doing this, the proposal is to use explicit entrypoint specification in the plugin package that wants implicit loading. Thus most task type plugins that are auto loaded when they are explicitly imported by the user will continue to work as they are, but for implicitly loading a package, the plugin author will write an entrypoint that looks like the following

setup(
    ...
    entry_points={'flytekitplugins: 's3_data = flytekitplugins-data-s3'},
    ...
)

Correspondingly, we will add the following entrypoint loader to flytekit.init

from importlib.metadata import entry_points

discovered_plugins = entry_points(group='myapp.plugins')
for p in discovered_plugins:
   p.load()

This solution is explained in the Data Persistence plugin specification for Flytekit as here

This is being tackled in #809

cc @wild-endeavor

@anton-malakhov would you be open to contributing Modin as a flytekitplugins-modin plugin?

@kumare3
Copy link
Contributor

kumare3 commented Jul 29, 2021

Also cc @cosmicBboy, @wild-endeavor we should probably update pandera plugins setup.py to be auto-loaded? - https://github.com/flyteorg/flytekit/tree/master/plugins/flytekit-pandera. Samething for GE plugin @samhita-alla ?

@anton-malakhov
Copy link
Contributor Author

thank you @kumare3 for the fast and full answer! We are looking into what we can contribute, no promises though yet.

@kumare3
Copy link
Contributor

kumare3 commented Jul 29, 2021

@anton-malakhov absolutely. We are working on the data persistence and auto load systems. No hurry and any contributions are super welcome.

Also please join our slack channel. The community would love to pitch in help wherever we can

@kumare3 kumare3 removed the untriaged This issues has not yet been looked at by the Maintainers label Jul 31, 2021
@Tat-V Tat-V mentioned this issue Sep 3, 2021
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants