Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Expose better API for file I/O in DataModules #1298

Closed
marrrcin opened this issue Apr 19, 2022 · 4 comments · Fixed by #1387
Closed

Expose better API for file I/O in DataModules #1298

marrrcin opened this issue Apr 19, 2022 · 4 comments · Fixed by #1387
Assignees
Labels
enhancement New feature or request help wanted Extra attention is needed
Milestone

Comments

@marrrcin
Copy link

🚀 Feature

Expose better API for file I/O handling that will allow easier usage of fsspec / external file systems.

Motivation

We have a use case where training needs to stream data from cloud storage (S3 or GCS). Usually, we use fsspec for that.

In the current version of lightning-flash we're using ImageClassificationData.from_data_frame where the input DataFrame contains a list of paths to files stored in the cloud (with s3:// or gcs:// prefixes). Although it's possible to override the behaviour of the ImageClassificationFilesInput to handle loading from external file systems, it seems like it's API was not designed to do so, especially because flash.image.data.image_loader cannot be easily replaced (e.g. via kwargs) in the flash.image.data.ImageFilesInput.load_sample. Right now I need to override 2 methods: resolver (via constructor) and load_sample (via inheritance), like in the example below.

Pitch

Introduce parameters to high level API of the *FilesInput classes that will allow to specify actual file loading instead of relying on built-in Python's open and os.path APIs, assuming that everything is local.
This will make cloud training deployments easier.

Alternatives

  • force the user to always have full dataset locally (or on mounted volume)
  • always use fsspec for file I/O across the framework (might be too large change for such simple case)
  • leave the API as is, leading to more code on the project level

Current solution

class FSSpecMixin:
    @lru_cache()
    def _get_fs(self, protocol: str):
        return fsspec.filesystem(protocol)

    def get_fs(self, protocol: str, **storage_options) -> AbstractFileSystem:
        if storage_options:
            return fsspec.filesystem(protocol, **storage_options)
        else:
            return self._get_fs(protocol)

    def get_fs_for_path(self, filepath: str, **storage_options) -> AbstractFileSystem:
        protocol, _ = get_protocol_and_path(filepath)
        return self.get_fs(protocol, **storage_options)

class ImageFSSpecInput(ImageFilesInput, FSSpecMixin):
    def _load_image_using_fspec(self, filepath: str):
        with self.get_fs_for_path(filepath).open(filepath, "rb") as f:
            return Image.open(f).convert("RGB")

    def load_sample(self, sample: Dict[str, Any]) -> Dict[str, Any]:

        # HERE YOU CANNOT PLUG-IN THE FSSPEC LOGIC WITHOUT COPYING THE CODE FROM THE PARENT CLASS
        # DUE TO THE flash.image.data.image_loader BEING "HARD-CODED"

        filepath = sample[DataKeys.INPUT]
        sample[DataKeys.INPUT] = self._load_image_using_fspec(filepath)

        sample = ImageInput.load_sample(self, sample) # <--- NOTE THE CALL TO EXPLICIT PARENT CLASS
        
        sample[DataKeys.METADATA]["filepath"] = filepath
        return sample

class ImageClassificationDataFrameInputFSSpec(
    ImageClassificationDataFrameInput, ImageFSSpecInput
):
    def load_data(
        self,
        data_frame: pd.DataFrame,
        input_key: str,
        target_keys: Optional[Union[str, List[str]]] = None,
        root: Optional[PATH_TYPE] = None,
        resolver: Optional[Callable[[Optional[PATH_TYPE], Any], PATH_TYPE]] = None,
        target_formatter: Optional[TargetFormatter] = None,
    ) -> List[Dict[str, Any]]:
        def _fspec_resolver(_: Any, filepath: str) -> str:
            if not self.get_fs_for_path(filepath).exists(filepath):
                raise ValueError(f"File {filepath} does not exist")
            return filepath

        return super().load_data(
            data_frame, input_key, target_keys, root, _fspec_resolver, target_formatter
        )

Usage:

datamodule = ImageClassificationData.from_data_frame(input_cls=ImageClassificationDataFrameInputFSSpec, **kwargs)

Alternative nasty workaround

One can also monkey-patch the flash.image.data.image_loader using from unittest.mock import patch in a context manager, but I don't want such code in production environment.

Additional context

If there is a better entrypoint for overriding the file I/O in DataModules, I would be glad to learn about it. Thanks!

@marrrcin marrrcin added enhancement New feature or request help wanted Extra attention is needed labels Apr 19, 2022
@ethanwharris
Copy link
Collaborator

Hi @marrrcin thanks for the feature request! This is definitely a usecase we should support. I think there are two things we should do:

  1. switch to using fsspec for all file handling within flash (this has a big upside and not really any downside as far as I can tell since this is the behaviour in lightning anyway)
  2. expose a *_loader kwarg and just have the current loaders as defaults. I think this is probably the cleanest way to allow full customization but happy to hear other suggestions

Let me know if you agree or have some other ideas 😃

@ethanwharris ethanwharris added this to the 0.8.0 milestone Apr 19, 2022
@marrrcin
Copy link
Author

Thanks for your response @ethanwharris .

  1. If you could switch the flash to use fsspec that would be great and it would be definitely the cleanest as well as the most transparent (to the end-users) solution.

  2. There is this *_loader but also resolver, so in general there will be more and more kwargs to pass to flash's APIs.

@marrrcin
Copy link
Author

marrrcin commented Jun 7, 2022

@ethanwharris - is it possible to provide some timeline for this?

@ethanwharris
Copy link
Collaborator

Hey @marrrcin sorry for the delay in getting back to you, we got caught up in some other things. We're targeting this for our 0.8 release which we're looking to make in the next few weeks 😃

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants