
Network loading API: support for binary streams #401

Closed
sylvlecl opened this issue May 23, 2022 · 13 comments

@sylvlecl
Contributor

sylvlecl commented May 23, 2022

  • Do you want to request a feature or report a bug?

Feature.

  • What is the current behavior?

We have 2 loading methods:

  • load: takes a file path as argument
  • load_from_string: takes the content of a file, as a string, as argument; it does not support bytes input.

So, in particular, in-memory "blobs" cannot be provided as arguments.
Streaming content is not supported either.

  • What is the expected behavior?

As proposed in #144, we should support byte streams (file objects) as arguments to load.

Type checking and runtime behaviour
After some digging, it seems there is no standard way of type checking for file objects, neither at runtime nor at type-check time.
We have typing.BinaryIO and various classes in the io module, but typing.BinaryIO exposes many more methods than necessary:
we could go for a lighter protocol.
As an example, the pandas library seems to perform type checking using a union of many types (see comment below), and at runtime only checks for the presence of a read method.

--> a good, mixed approach could be to define a simple protocol with at least a read method, and check for its presence at runtime, as sketched below.
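
A minimal sketch of that approach (SupportsRead and the load signature are illustrative, not the actual pypowsybl API):

from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsRead(Protocol):
    """Lightweight file-like protocol: only a read() method is required."""
    def read(self, size: int = -1) -> bytes: ...

def load(source):
    # runtime_checkable lets isinstance verify that read() is present,
    # so the runtime check mirrors the static one.
    if not isinstance(source, SupportsRead):
        raise TypeError(f"expected a file-like object, got {type(source).__name__}")
    data = source.read()
    ...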

String input handling

We should deprecate the load_from_string method:
users can provide an in-memory buffer instead, with io.BytesIO for example, as shown below.
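
For instance (the call at the end is the proposed, not yet existing, signature):

import io

# content previously passed to load_from_string
xml_content = "<network ...>"
buffer = io.BytesIO(xml_content.encode("utf-8"))
# then: network = Network.load(buffer)  # proposed: load accepting a binary buffer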

One question remains: do we allow only binary IO as input (such as io.BytesIO), or also text IO (such as io.StringIO)?
The latter could be handy for text formats.
If we want to distinguish between the two, there is no standard way... For example, pandas ends up looking for a 'b' character in "mode", and also checking the actual class of the object (against a predefined set of classes including io.TextIOWrapper etc.).
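
A sketch of that kind of heuristic (the function name is illustrative):

import io

def is_binary_file_object(obj) -> bool:
    # pandas-style heuristic: real files expose a 'mode' attribute;
    # otherwise fall back on the known text-IO base class.
    mode = getattr(obj, "mode", None)
    if isinstance(mode, str):
        return "b" in mode
    return not isinstance(obj, io.TextIOBase)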

  • What is the motivation / use case for changing the behavior?

A use case is being able to load a network from in-memory zip content.

@sylvlecl
Contributor Author

For reference, pandas uses these unions for typing file inputs:

# filenames and file-like-objects
from io import BufferedIOBase, RawIOBase, TextIOBase, TextIOWrapper
from mmap import mmap
from os import PathLike
from typing import IO, AnyStr, Union

Buffer = Union[IO[AnyStr], RawIOBase, BufferedIOBase, TextIOBase, TextIOWrapper, mmap]
FileOrBuffer = Union[str, Buffer[AnyStr]]
FilePathOrBuffer = Union["PathLike[str]", FileOrBuffer[AnyStr]]

@sylvlecl sylvlecl self-assigned this May 24, 2022
@Haigutus

Haigutus commented May 31, 2022

  • I would propose to support only io.BytesIO
  • The same should be done for export
  • On the packaging of files, I would propose that the function take in a list of binary objects and find all XMLs independent of packaging

Network.load_bytes([object1, object2, objectN])

where each object could be an XML file, a zip file with a single XML inside, or a zip file with multiple XMLs inside

@Haigutus

Any updates on this?

@Haigutus

It seems a change is needed first on the Java side of things.

Currently:
filePath -> Network.load -> DataSource -> Importer find -> Import

Proposal to change the Network.load implementation:

  1. if filePath: use the current solution
  2. if not filePath and DataSource:
    DataSource -> Network.load -> Importer find -> Import

Then on the Python side of things one could implement creation of a DataSource from file-like objects,
so either
DataSource(open(filePath))
or
DataSource(io.BytesIO(...))

This would help integrations in any language, as it would remove the main filesystem dependency from the source code.

https://github.com/powsybl/powsybl-core/blob/8b2850d803738d471c76b1f6ac7c8903af70e3db/commons/src/main/java/com/powsybl/commons/datasource/DataSourceUtil.java#L65
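
A hypothetical Python-side shape for such a wrapper (the DataSource class and file name here are illustrative, not an existing pypowsybl API):

import io

class DataSource:
    # Illustrative wrapper: hold any binary file-like object so the
    # loading pipeline no longer touches the filesystem directly.
    def __init__(self, stream):
        if not hasattr(stream, "read"):
            raise TypeError("DataSource expects a binary file-like object")
        self._stream = stream

    def open_stream(self):
        return self._stream

ds_from_file = DataSource(open("network.xiidm", "rb"))
ds_from_memory = DataSource(io.BytesIO(b"<iidm ...>"))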

@geofjamg
Member

geofjamg commented May 2, 2023

An easy solution could be to fully load the byte stream on the Python side and then transfer it as is to the Java side, but it would not be memory efficient. The right way (and the hardest!) is to somehow connect Python BytesIO.read to Java InputStream.read, to be able to really stream the bytes from end to end.
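
A minimal sketch of the streaming idea, with the Java consumer stubbed out as a plain callable (illustrative only):

CHUNK_SIZE = 64 * 1024

def pump(source, sink):
    """Forward bytes chunk by chunk instead of materializing the whole
    stream; in the real bridge, sink would feed Java's InputStream.read."""
    while True:
        chunk = source.read(CHUNK_SIZE)
        if not chunk:
            break
        sink(chunk)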

@geofjamg
Member

geofjamg commented May 2, 2023

  • I would propose to support only io.BytesIO
  • The same should be done for export
  • On the packaging of files, I would propose that the function take in a list of binary objects and find all XMLs independent of packaging

Network.load_bytes([object1, object2, objectN])

where each object could be an XML file, a zip file with a single XML inside, or a zip file with multiple XMLs inside

If we want to support this, we first have to detect whether the byte stream is a zip or not (using the magic number), and then we will be able to map it to the right DataSource implementation on the Java side.
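
A sketch of that detection on the Python side (the function name is illustrative; it assumes a seekable buffer such as io.BytesIO):

import io

ZIP_MAGIC = b"PK\x03\x04"  # first four bytes of a zip archive

def looks_like_zip(buffer: io.BytesIO) -> bool:
    # Peek at the leading bytes, then rewind so the caller
    # sees the stream untouched.
    pos = buffer.tell()
    head = buffer.read(len(ZIP_MAGIC))
    buffer.seek(pos)
    return head == ZIP_MAGIC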

@Haigutus

Haigutus commented May 3, 2023

@obrix
Member

obrix commented May 25, 2023

FYI, following discussions with @geofjamg, I am working to support io.BytesIO as an input to load a network. This means the content has to be fully loaded in memory, but it should work with ASCII-based network descriptions and also with a zip file held as a memory blob (this requires a pull request on powsybl-core to add an InMemoryZipFileDataSource, as the existing ZipFileDataSource reloads from a file).
I'll try to investigate a completely streamed pipeline, but zip support will probably be an issue there.

@sylvlecl sylvlecl removed their assignment Jun 6, 2023
@geofjamg
Member

@Haigutus the last release, 0.24, supports BytesIO for network loading. You can provide load_from_binary_buffers with a list of BytesIO objects containing, for instance, individually zipped profiles and boundaries. Let us know if it solves your issue.
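
A usage sketch based on that description (file names are illustrative, and the module path pypowsybl.network is assumed):

import io
import pypowsybl.network as pn

# Read each zipped profile into an in-memory binary buffer.
buffers = []
for path in ("profiles.zip", "boundaries.zip"):
    with open(path, "rb") as f:
        buffers.append(io.BytesIO(f.read()))

network = pn.load_from_binary_buffers(buffers)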

@Haigutus

Haigutus commented Aug 3, 2023

@geofjamg I can confirm the new API works for import; we have been using/testing it for a week and have not encountered any issues, thank you. As for export, should we make another ticket, or will that also be handled under this one?

@geofjamg
Member

geofjamg commented Aug 3, 2023

We can leave this issue open until export is done.

@Haigutus

@geofjamg do you have any information on when the export to binary buffers could be expected?

@Haigutus

Both import and export are implemented, thanks. This issue can be closed.
