Expose better API for file I/O in DataModules #1298

marrrcin · 2022-04-19T10:00:47Z

🚀 Feature

Expose better API for file I/O handling that will allow easier usage of fsspec / external file systems.

Motivation

We have a use case where training needs to stream data from cloud storage (S3 or GCS). Usually, we use fsspec for that.

In the current version of lightning-flash we're using ImageClassificationData.from_data_frame where the input DataFrame contains a list of paths to files stored in the cloud (with s3:// or gcs:// prefixes). Although it's possible to override the behaviour of the ImageClassificationFilesInput to handle loading from external file systems, it seems like it's API was not designed to do so, especially because flash.image.data.image_loader cannot be easily replaced (e.g. via kwargs) in the flash.image.data.ImageFilesInput.load_sample. Right now I need to override 2 methods: resolver (via constructor) and load_sample (via inheritance), like in the example below.

Pitch

Introduce parameters to high level API of the *FilesInput classes that will allow to specify actual file loading instead of relying on built-in Python's open and os.path APIs, assuming that everything is local.
This will make cloud training deployments easier.

Alternatives

force the user to always have full dataset locally (or on mounted volume)
always use fsspec for file I/O across the framework (might be too large change for such simple case)
leave the API as is, leading to more code on the project level

Current solution

class FSSpecMixin:
    @lru_cache()
    def _get_fs(self, protocol: str):
        return fsspec.filesystem(protocol)

    def get_fs(self, protocol: str, **storage_options) -> AbstractFileSystem:
        if storage_options:
            return fsspec.filesystem(protocol, **storage_options)
        else:
            return self._get_fs(protocol)

    def get_fs_for_path(self, filepath: str, **storage_options) -> AbstractFileSystem:
        protocol, _ = get_protocol_and_path(filepath)
        return self.get_fs(protocol, **storage_options)

class ImageFSSpecInput(ImageFilesInput, FSSpecMixin):
    def _load_image_using_fspec(self, filepath: str):
        with self.get_fs_for_path(filepath).open(filepath, "rb") as f:
            return Image.open(f).convert("RGB")

    def load_sample(self, sample: Dict[str, Any]) -> Dict[str, Any]:

        # HERE YOU CANNOT PLUG-IN THE FSSPEC LOGIC WITHOUT COPYING THE CODE FROM THE PARENT CLASS
        # DUE TO THE flash.image.data.image_loader BEING "HARD-CODED"

        filepath = sample[DataKeys.INPUT]
        sample[DataKeys.INPUT] = self._load_image_using_fspec(filepath)

        sample = ImageInput.load_sample(self, sample) # <--- NOTE THE CALL TO EXPLICIT PARENT CLASS
        
        sample[DataKeys.METADATA]["filepath"] = filepath
        return sample

class ImageClassificationDataFrameInputFSSpec(
    ImageClassificationDataFrameInput, ImageFSSpecInput
):
    def load_data(
        self,
        data_frame: pd.DataFrame,
        input_key: str,
        target_keys: Optional[Union[str, List[str]]] = None,
        root: Optional[PATH_TYPE] = None,
        resolver: Optional[Callable[[Optional[PATH_TYPE], Any], PATH_TYPE]] = None,
        target_formatter: Optional[TargetFormatter] = None,
    ) -> List[Dict[str, Any]]:
        def _fspec_resolver(_: Any, filepath: str) -> str:
            if not self.get_fs_for_path(filepath).exists(filepath):
                raise ValueError(f"File {filepath} does not exist")
            return filepath

        return super().load_data(
            data_frame, input_key, target_keys, root, _fspec_resolver, target_formatter
        )

Usage:

datamodule = ImageClassificationData.from_data_frame(input_cls=ImageClassificationDataFrameInputFSSpec, **kwargs)

Alternative nasty workaround

One can also monkey-patch the flash.image.data.image_loader using from unittest.mock import patch in a context manager, but I don't want such code in production environment.

Additional context

If there is a better entrypoint for overriding the file I/O in DataModules, I would be glad to learn about it. Thanks!

The text was updated successfully, but these errors were encountered:

ethanwharris · 2022-04-19T10:34:55Z

Hi @marrrcin thanks for the feature request! This is definitely a usecase we should support. I think there are two things we should do:

switch to using fsspec for all file handling within flash (this has a big upside and not really any downside as far as I can tell since this is the behaviour in lightning anyway)
expose a *_loader kwarg and just have the current loaders as defaults. I think this is probably the cleanest way to allow full customization but happy to hear other suggestions

Let me know if you agree or have some other ideas 😃

marrrcin · 2022-04-19T10:56:03Z

Thanks for your response @ethanwharris .

If you could switch the flash to use fsspec that would be great and it would be definitely the cleanest as well as the most transparent (to the end-users) solution.
There is this *_loader but also resolver, so in general there will be more and more kwargs to pass to flash's APIs.

marrrcin · 2022-06-07T06:15:17Z

@ethanwharris - is it possible to provide some timeline for this?

ethanwharris · 2022-06-29T13:21:54Z

Hey @marrrcin sorry for the delay in getting back to you, we got caught up in some other things. We're targeting this for our 0.8 release which we're looking to make in the next few weeks 😃

marrrcin added enhancement New feature or request help wanted Extra attention is needed labels Apr 19, 2022

ethanwharris added this to the 0.8.0 milestone Apr 19, 2022

ethanwharris self-assigned this Jul 11, 2022

ethanwharris mentioned this issue Jul 13, 2022

Refactor file loading to use fsspec #1387

Merged

8 tasks

ethanwharris closed this as completed in #1387 Jul 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose better API for file I/O in DataModules #1298

Expose better API for file I/O in DataModules #1298

marrrcin commented Apr 19, 2022

ethanwharris commented Apr 19, 2022

marrrcin commented Apr 19, 2022

marrrcin commented Jun 7, 2022

ethanwharris commented Jun 29, 2022

Expose better API for file I/O in DataModules #1298

Expose better API for file I/O in DataModules #1298

Comments

marrrcin commented Apr 19, 2022

🚀 Feature

Motivation

Pitch

Alternatives

Current solution

Alternative nasty workaround

Additional context

ethanwharris commented Apr 19, 2022

marrrcin commented Apr 19, 2022

marrrcin commented Jun 7, 2022

ethanwharris commented Jun 29, 2022