Support for hybrid data #308

Closed
4 tasks
blythed opened this issue Jun 20, 2023 · 1 comment

@blythed

blythed commented Jun 20, 2023

Why

In training and elsewhere, it should be possible to load large items (for example, images) separately on the client side, to make data loading more memory efficient.

How

Add the ability to download data from the DB and then add the larger blobs into it post hoc, loaded from disk.

What

  • Add configuration allowing images to be downloaded to disk, rather than to DB
  • Modify Downloader adding file saving option
  • Modify SuperDuperCursor
  • Add "out of memory" option to QueryDataset

This should be facilitated by #422.
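
As a rough illustration of the "out of memory" item in the list above, here is a minimal sketch of a dataset that keeps only metadata and file paths in memory and reads each blob from disk when the item is requested. `LazyFileDataset`, the `path` key, and the `decode` callable are hypothetical names for this sketch; none of them is the existing `QueryDataset` API.

```python
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List


class LazyFileDataset:
    """Hypothetical dataset that keeps only metadata and file paths in memory."""

    def __init__(self, records: List[Dict[str, Any]], decode: Callable[[bytes], Any]):
        self.records = records
        self.decode = decode

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        record = self.records[index]
        # The blob is read from disk only now, when the item is requested.
        blob = Path(record["path"]).read_bytes()
        # Return a new dict so the in-memory record never holds the blob.
        return {**record, "data": self.decode(blob)}


# Usage: three small files stand in for images downloaded to disk.
root = Path(tempfile.mkdtemp())
records = []
for i in range(3):
    path = root / f"item_{i}.bin"
    path.write_bytes(bytes([i]) * 4)
    records.append({"label": i, "path": str(path)})

dataset = LazyFileDataset(records, decode=lambda blob: list(blob))
item = dataset[1]
```

Because `__getitem__` returns a fresh dict, the stored records stay small no matter how many items have been read.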

@blythed

blythed commented Jul 14, 2023

Currently, data is saved in the DB like this:

{"_content": {
  "bytes": b"...",
  "uri": "...",
  "encoder": "image"
}}

As a configurable option, save this instead:

{"_content": {
  "local_uri": "file://... OR s3://",
  "uri": "https://...",
  "encoder": "image"
}}

Even so, when we run db.execute(collection.find_one()), the data would still come back loaded.
This is somewhat like the functionality of EvaDB.

The idea: in either case the user experience is just that of running queries, but the DB may serve them in a hybrid way.
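
To make the two storage layouts concrete, here is a minimal sketch of turning an inline record into the hybrid form by writing the blob to disk and storing a `local_uri` in its place. The helper name `externalize_content`, the `root` parameter, the hash-based file naming, and the example URL are all assumptions for this sketch, not part of the project.

```python
import hashlib
import tempfile
from pathlib import Path


def externalize_content(record: dict, root: Path) -> dict:
    """Return a copy of `record` whose blob lives on disk instead of inline.

    Hypothetical helper: the name, the `root` directory, and the file
    naming scheme are assumptions made for illustration.
    """
    content = dict(record["_content"])  # shallow copy; the input is not mutated
    blob = content.pop("bytes")
    # Name the file after a hash of its content, so identical blobs share a file.
    path = root / hashlib.sha256(blob).hexdigest()
    path.write_bytes(blob)
    content["local_uri"] = path.resolve().as_uri()  # e.g. "file:///..."
    return {**record, "_content": content}


root = Path(tempfile.mkdtemp())
inline = {
    "_content": {
        "bytes": b"\x89PNG...",
        "uri": "https://example.com/cat.png",  # placeholder URL
        "encoder": "image",
    }
}
hybrid = externalize_content(inline, root)
```

Only the `file://` case is covered here; an `s3://` target would need its own writer.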

Key modules:

  • datalayer/base/downloads.py
  • datalayer/core/documents.py

Currently, Document decodes from the blob stored in the DB. Alternatively, the content could be a reference.

@classmethod
def _decode(cls, r: t.Dict, encoders: t.Dict):
    # Leaf case: a dict with '_content' holds an encoded blob.
    if isinstance(r, dict) and '_content' in r:
        encoder = encoders[r['_content']['encoder']]
        try:
            return encoder.decode(r['_content']['bytes'])
        except KeyError:
            # No inline bytes: return the record undecoded.
            return r
    elif isinstance(r, list):
        return [cls._decode(x, encoders) for x in r]
    elif isinstance(r, dict):
        # Recurse into nested documents, decoding in place.
        for k in r:
            r[k] = cls._decode(r[k], encoders)
    return r
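
One way the reference-based alternative could look is sketched below: the decoder prefers inline `bytes` and otherwise reads the blob named by `local_uri`. This is a standalone function, not the actual `Document._decode`. `load_local` and `Utf8Encoder` are hypothetical, only `file://` URIs are handled, and unlike the method above it builds new dicts instead of mutating the input.

```python
import tempfile
import typing as t
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def load_local(local_uri: str) -> bytes:
    """Read a blob from a file:// URI (hypothetical helper; s3:// is not handled)."""
    parsed = urlparse(local_uri)
    if parsed.scheme != "file":
        raise NotImplementedError(f"unsupported scheme: {parsed.scheme!r}")
    return Path(url2pathname(parsed.path)).read_bytes()


def decode(r: t.Any, encoders: t.Dict[str, t.Any]) -> t.Any:
    if isinstance(r, dict) and "_content" in r:
        content = r["_content"]
        encoder = encoders[content["encoder"]]
        if "bytes" in content:  # inline blob, as stored today
            return encoder.decode(content["bytes"])
        if "local_uri" in content:  # hybrid record: blob lives outside the DB
            return encoder.decode(load_local(content["local_uri"]))
        return r
    if isinstance(r, list):
        return [decode(x, encoders) for x in r]
    if isinstance(r, dict):
        return {k: decode(v, encoders) for k, v in r.items()}
    return r


class Utf8Encoder:
    """Toy encoder used only for this example."""

    @staticmethod
    def decode(blob: bytes) -> str:
        return blob.decode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "greeting.txt"
    path.write_bytes(b"hello")
    doc = {
        "inline": {"_content": {"bytes": b"hi", "encoder": "text"}},
        "on_disk": {"_content": {"local_uri": path.as_uri(), "encoder": "text"}},
        "n": 1,
    }
    decoded = decode(doc, {"text": Utf8Encoder})
```

Both record shapes decode to the same kind of value, so the caller does not need to know where the blob was stored.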
