Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[kedro-datasets] fsspec mixin #4314

Closed
fdroessler opened this issue May 7, 2023 · 6 comments
Closed

[kedro-datasets] fsspec mixin #4314

fdroessler opened this issue May 7, 2023 · 6 comments
Labels
Community Issue/PR opened by the open-source community

Comments

@fdroessler
Copy link
Contributor

fdroessler commented May 7, 2023

Description and context

I am looking into integrating the relatively new azureml-fsspec bindings to allow loading (and potentially saving) of datasets directly from an azure machine learning studio datastore. This is slightly different to blob storage or adlfs and just got their 1.0.0 release recently. Due to their implementation of fsspec, when initialising the file system, one needs to pass the uri of the dataset that is to be read. This is not necessary in any other fsspec implementation and I'm unsure if it is a feature or a but, this I am still investigating.

However, as part of this I figured that changing something in the fsspec-filesystem initialisation, which is used in the majority of file-based datasets, would result in the need to copy and paste the changed code into all of the datasets that use that functionality. This is potentially error prone and could lead to accidental mistakes further down the line.

One solution I could come up with, and this certainly is not necessarily the right one, was providing a FileSystemMixin class. This could be used to provide the fsspec-filesystem capability in all datasets that need it while allowing for modifications if necessary. I guess one could use a function that returns the file system as well but I always liked the Mixin pattern for some reason. Also this would need further work in the datasets __init__ function that I have not thought through.

Happy to contribute a PR if this is of interest.

Possible Implementation

A minimal working example that I used to try out the azureml-fsspec. The if self._protocol == "azureml": ... can be ignored but would be necessary to support azureml-fsspec in their current implementation.

class FileSystemMixin:
    def __init__(self, protocol: str, filepath: str, credentials: dict = None, fs_args: dict = None):
        _fs_args = deepcopy(fs_args) or {}
        _credentials = deepcopy(credentials) or {}

        if protocol == "file":
            _fs_args.setdefault("auto_mkdir", True)

        self._protocol = protocol
        self._storage_options = {**_credentials, **_fs_args}
        if self._protocol == "azureml":
            self._fs = fsspec.filesystem(self._protocol, **self._storage_options, uri=filepath)
        else:    
            self._fs = fsspec.filesystem(self._protocol, **self._storage_options)

    def exists(self, path):
        return self._fs.exists(path)

    def glob(self, path):
        return self._fs.glob(path)

Modified pandas.CSVDataset omitting docstrings for brevity.

class CSVDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame], FileSystemMixin):
    
    DEFAULT_LOAD_ARGS = {}  # type: Dict[str, Any]
    DEFAULT_SAVE_ARGS = {"index": False}  # type: Dict[str, Any]

    def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None,
        version: Version = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
    ) -> None:
        protocol, path = get_protocol_and_path(filepath, version)
        FileSystemMixin.__init__(
            self,
            protocol,
            filepath,
            credentials,
            fs_args
        )
        AbstractVersionedDataSet.__init__(
            self,
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self.exists,
            glob_function=self.glob,
        )

        # Handle default load and save arguments
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

        if "storage_options" in self._save_args or "storage_options" in self._load_args:
            logger.warning(
                "Dropping 'storage_options' for %s, "
                "please specify them under 'fs_args' or 'credentials'.",
                self._filepath,
            )
            self._save_args.pop("storage_options", None)
            self._load_args.pop("storage_options", None)

Possible Alternatives

Not tried but I guess something like this could work too, although I guess this would need some additional definitions in the Dataset __init__ such as : self._protocol etc for use in load or save etc.

def create_fs(protocol: str, filepath: str, credentials: dict = None, fs_args: dict = None):
        fs_args = deepcopy(fs_args) or {}
        credentials = deepcopy(credentials) or {}

        if protocol == "file":
            fs_args.setdefault("auto_mkdir", True)

        if protocol == "azureml":
            fs = fsspec.filesystem(protocol, **credentials, **fs_args, uri=filepath)
        else:    
            fs = fsspec.filesystem(protocol, **credentials, **fs_args)
        return fs
@astrojuanlu
Copy link
Member

Hi @fdroessler, thanks a lot for your interest in Kedro!

Sharing some thoughts here: Personally I'm not a big fan of mixins and multiple inheritance. If we went for a composition-over-inheritance approach maybe we could have something like an FSSpecAdapter (bad name, bikesheddable) so that exists_function and glob_function is not part of the parent class, but of a new .adapter property.

Beware that the engineering team might have a different opinion regarding this mixin. But more importantly, it's up to debate whether this relatively small refactor is worth taking in the face of #1778

Waiting for other colleagues to chime in.

@fdroessler
Copy link
Contributor Author

Thanks @astrojuanlu, I was not aware of this major refactor. I guess this is some ways off though or do you have more details on the timeline?
In case there is no point in refactoring either through multiple inheritance or composition, maybe a PR to allow (at least read) functionality for azureml fsspec would be considered?

@noklam
Copy link
Contributor

noklam commented May 15, 2023

@fdroessler https://github.com/kedro-org/kedro/milestone/12

Can you see this? I believe we open access for everyone but let me know if this is not the case.

We are aware of the datasets problem, it's not quite flexible as you can see the fsspec wasn't part of the Abstract class contract, but we almost used it everywhere.

Can you show an example why you need to refactor all datasets? In the past we have added new protocol and it's usually quite straight forward to do so, let's see if we can do something about it as this refactoring is not likely happen very soon.

@AhdraMeraliQB AhdraMeraliQB added the Community Issue/PR opened by the open-source community label May 17, 2023
@fdroessler
Copy link
Contributor Author

@noklam Yes I can see that, thanks for pointing that out. The problem with azureml-fsspec is that they did not implement it similar to adfs or adl ... but such that you have to provide the url to the storage when instantiating the fs. Might be a bug but not sure tbh.

This leads to the following modifications on my side:

Also note that azureml-fsspec just supports reading at this point, they have an option to "upload" but that has to go through a separate method on the fs instance.

add azureml to cloud protocols:
CLOUD_PROTOCOLS = ("s3", "s3n", "s3a", "gcs", "gs", "adl", "abfs", "abfss", "gdrive", "azureml")

and then:

        if self._protocol == "azureml":
            self._fs = fsspec.filesystem(self._protocol, **self._storage_options, uri=filepath)
        else:    
            self._fs = fsspec.filesystem(self._protocol, **self._storage_options)

This is what I would have to add to all the datasets in order to support access to them for azureml. This got me thinking on how to simplify modifications like this and avoid code redundancy. However, at the time of first writing I was not aware of the major refactor of datasets.

Nonetheless it would be great to have something added to support azureml-fsspec if possible. This does not have to be multiple inheritance or composition at this stage, I'm happy to add the code snippet in all the relevant datasets if necessary.

Might also be that the read-only restriction is too limited for kedro 🤷 however I am working also to add some functionality to kedro-azureml plugin to better support remote and local execution.

@deepyaman
Copy link
Member

@fdroessler Without thinking about it more, I generally like the idea of splitting some of these functionalities into mix-ins, as you propose. There's some precedent for that (see #15), even though the DefaultArgumentsMixIn specifically seems to have been disappeared in 0.16.0.

Haven't consider the adapters proposal from @astrojuanlu at all; need to learn design patterns again to give good insight into mix-in vs. adapter vs. something else. 😅

@astrojuanlu
Copy link
Member

I think excessive copy-pasting in datasets is still an issue (see loooong discussion in #1936)

And yet, I stand by what I said in other places: I think inheritance is mostly bad, and multiple inheritance is borderline evil and difficult to reason about.

For now, I'm moving this to Discussions.

@astrojuanlu astrojuanlu transferred this issue from kedro-org/kedro-plugins Nov 8, 2024
@kedro-org kedro-org locked and limited conversation to collaborators Nov 8, 2024
@astrojuanlu astrojuanlu converted this issue into discussion #4315 Nov 8, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
Community Issue/PR opened by the open-source community
Projects
None yet
Development

No branches or pull requests

5 participants