This repository has been archived by the owner on Oct 9, 2023. It is now read-only.
Expose better API for file I/O handling that will allow easier usage of fsspec / external file systems.
Motivation
We have a use case where training needs to stream data from cloud storage (S3 or GCS). Usually, we use fsspec for that.
In the current version of lightning-flash we're using ImageClassificationData.from_data_frame, where the input DataFrame contains a list of paths to files stored in the cloud (with s3:// or gcs:// prefixes). Although it's possible to override the behaviour of ImageClassificationFilesInput to handle loading from external file systems, its API does not seem to have been designed for that, especially because flash.image.data.image_loader cannot be easily replaced (e.g. via kwargs) in flash.image.data.ImageFilesInput.load_sample. Right now I need to override two things: resolver (via constructor) and load_sample (via inheritance), as in the example below.
Pitch
Introduce parameters to the high-level API of the *FilesInput classes that let the user specify how files are actually loaded, instead of relying on Python's built-in open and os.path APIs, which assume that everything is local.
This will make cloud training deployments easier.
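To make the pitch concrete, here is a minimal sketch of what a pluggable loader could look like. It does not use Flash's real classes: FilesInputSketch, default_loader, the loader parameter name, and the fake bucket are all hypothetical stand-ins, and the dictionary keys only mimic DataKeys.INPUT and DataKeys.METADATA.

```python
from typing import Any, Callable, Dict


def default_loader(filepath: str) -> bytes:
    """Default behaviour: read the file from the local file system."""
    with open(filepath, "rb") as f:
        return f.read()


class FilesInputSketch:
    """Hypothetical stand-in for a *FilesInput class with an injectable loader."""

    def __init__(self, loader: Callable[[str], Any] = default_loader) -> None:
        # The loader is a plain callable, so users can swap in fsspec-based I/O.
        self.loader = loader

    def load_sample(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        filepath = sample["input"]
        sample["input"] = self.loader(filepath)
        sample["metadata"] = {"filepath": filepath}
        return sample


# A dictionary standing in for a remote store such as S3 or GCS (assumption).
FAKE_BUCKET = {"s3://bucket/cat.png": b"fake-image-bytes"}


def fake_remote_loader(filepath: str) -> bytes:
    """Illustrative loader; a real one would open the path with fsspec."""
    return FAKE_BUCKET[filepath]


remote_input = FilesInputSketch(loader=fake_remote_loader)
loaded = remote_input.load_sample({"input": "s3://bucket/cat.png"})
```

With this shape, no subclassing is needed: the local default stays in place and the remote case only requires passing a different callable.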
Alternatives
force the user to always have the full dataset locally (or on a mounted volume)
always use fsspec for file I/O across the framework (might be too large a change for such a simple case)
leave the API as is, leading to more code at the project level
Current solution
from functools import lru_cache
from typing import Any, Callable, Dict, List, Optional, Union

import fsspec
import pandas as pd
from fsspec.spec import AbstractFileSystem
from PIL import Image

# The Flash names used below (ImageFilesInput, ImageInput, ImageClassificationDataFrameInput,
# DataKeys, PATH_TYPE, TargetFormatter) and the get_protocol_and_path helper must also be
# imported; the original snippet does not show where they come from.


class FSSpecMixin:
    @lru_cache()
    def _get_fs(self, protocol: str):
        # Cached: one file system instance per protocol when no options are given.
        return fsspec.filesystem(protocol)

    def get_fs(self, protocol: str, **storage_options) -> AbstractFileSystem:
        if storage_options:
            return fsspec.filesystem(protocol, **storage_options)
        else:
            return self._get_fs(protocol)

    def get_fs_for_path(self, filepath: str, **storage_options) -> AbstractFileSystem:
        protocol, _ = get_protocol_and_path(filepath)
        return self.get_fs(protocol, **storage_options)


class ImageFSSpecInput(ImageFilesInput, FSSpecMixin):
    def _load_image_using_fspec(self, filepath: str):
        with self.get_fs_for_path(filepath).open(filepath, "rb") as f:
            return Image.open(f).convert("RGB")

    def load_sample(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        # HERE YOU CANNOT PLUG-IN THE FSSPEC LOGIC WITHOUT COPYING THE CODE FROM THE PARENT CLASS
        # DUE TO THE flash.image.data.image_loader BEING "HARD-CODED"
        filepath = sample[DataKeys.INPUT]
        sample[DataKeys.INPUT] = self._load_image_using_fspec(filepath)
        sample = ImageInput.load_sample(self, sample)  # <--- NOTE THE CALL TO EXPLICIT PARENT CLASS
        sample[DataKeys.METADATA]["filepath"] = filepath
        return sample


class ImageClassificationDataFrameInputFSSpec(
    ImageClassificationDataFrameInput, ImageFSSpecInput
):
    def load_data(
        self,
        data_frame: pd.DataFrame,
        input_key: str,
        target_keys: Optional[Union[str, List[str]]] = None,
        root: Optional[PATH_TYPE] = None,
        resolver: Optional[Callable[[Optional[PATH_TYPE], Any], PATH_TYPE]] = None,
        target_formatter: Optional[TargetFormatter] = None,
    ) -> List[Dict[str, Any]]:
        # The resolver argument is ignored; the fsspec-aware resolver below replaces it.
        def _fspec_resolver(_: Any, filepath: str) -> str:
            if not self.get_fs_for_path(filepath).exists(filepath):
                raise ValueError(f"File {filepath} does not exist")
            return filepath

        return super().load_data(
            data_frame, input_key, target_keys, root, _fspec_resolver, target_formatter
        )
Alternative nasty workaround
One can also monkey-patch flash.image.data.image_loader with unittest.mock.patch in a context manager, but I don't want such code in a production environment.
Additional context
If there is a better entrypoint for overriding the file I/O in DataModules, I would be glad to learn about it. Thanks!
Hi @marrrcin, thanks for the feature request! This is definitely a use case we should support. I think there are two things we should do:
switch to using fsspec for all file handling within flash (this has a big upside and not really any downside as far as I can tell since this is the behaviour in lightning anyway)
expose a *_loader kwarg and just have the current loaders as defaults. I think this is probably the cleanest way to allow full customization but happy to hear other suggestions
Let me know if you agree or have some other ideas 😃
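As a rough sketch of the first point, fsspec.open picks the backend from the URL prefix, so a single helper can serve local and remote paths alike. The read_bytes helper is a hypothetical name, and fsspec's built-in memory:// file system is used here only as a stand-in for s3:// or gcs://, which need extra packages and credentials.

```python
import os
import tempfile

import fsspec  # third-party dependency: pip install fsspec


def read_bytes(urlpath: str, **storage_options) -> bytes:
    """Read a file from any fsspec-supported backend, chosen by URL prefix."""
    with fsspec.open(urlpath, mode="rb", **storage_options) as f:
        return f.read()


# Write a "remote" object into fsspec's in-memory file system.
with fsspec.open("memory://bucket/cat.bin", mode="wb") as f:
    f.write(b"remote-bytes")

# Write an ordinary local file and read it back through the same helper.
with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "dog.bin")
    with open(local_path, "wb") as f:
        f.write(b"local-bytes")
    local_bytes = read_bytes(local_path)

remote_bytes = read_bytes("memory://bucket/cat.bin")
```

Paths without a protocol prefix are treated as local files, so existing local workflows would keep working unchanged.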
If you could switch flash to use fsspec, that would be great; it would definitely be the cleanest and the most transparent (to end users) solution.
With *_loader in addition to resolver, there will be more and more kwargs to pass to flash's APIs.
Hey @marrrcin sorry for the delay in getting back to you, we got caught up in some other things. We're targeting this for our 0.8 release which we're looking to make in the next few weeks 😃