Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AEP: Add a schema to ORM classes #40

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Jan 18, 2024

AiiDA's Python API provides an object relational mapper (ORM) that abstracts the various entities that can be stored inside the provenance graph and the relationships between them. In most use cases, users use this ORM directly in Python to construct new instances of entities and retrieve existing ones, in order to get access to their data and manipulate it.

A current shortcoming of the ORM is that it is not possible to programmatically introspect the schema of each entity: that is to say, what data each entity stores. This makes it difficult for external applications to provide interfaces to create and or retrieve entity instances. It also makes it difficult to take the data outside of the Python environment since the data would have to be serialized. However, without a well-defined schema, doing this without an ad-hoc solution is practically impossible.

Clear data schemas for all ORM entities would enable the creation of external applications to work with the data stored in AiiDA's provenance graph. A typical example use case would be a web API whose interface, to create and return ORM entities from an AiiDA profile, would be dynamically generated by programmatically introspecting the schema of all ORM entities stored within it. Currently, the interface has to be manually generated for each ORM entity, and data (de)serialization has to be implemented.

I have already implemented the proposal, which can be found here: https://github.com/sphuber/aiida-core/tree/feature/orm-pydantic-models
All tests pass, so it seems that these changes would be perfectly backwards compatible. There were a few changes in the existing API that were necessary, but these should also be backwards compatible. Essentially these were ORM class constructors not allowing to specify all fields directly, such as extras, or the frontend/backend ORM classes didn't implement properties for all fields.

@chrisjsewell
Copy link
Member

Heya cheers,

My main question though would be; what about the object repository of data classes?
This proposal seems to rely on every data class being JSON serializable, i.e. only data classes that do not use the object repository?
To my mind, how to serialise/deserialise data classes with repository objects has always been the blocker for this, but I don't see any solutions here?

Also, the from_model method seems to assume that every data class has an __init__ with parameters that directly map to the ORM fields, which is not the case for all data classes?

@sphuber
Copy link
Contributor Author

sphuber commented Jan 18, 2024

To my mind, how to serialise/deserialise data classes with repository objects has always been the blocker for this, but I don't see any solutions here?

Good point, I had thought about this indeed, but kind of lost sight of this the last day when working towards this AEP. I will see if I can come up with a proposal to solve it.

Also, the from_model method seems to assume that every data class has an init with parameters that directly map to the ORM fields, which is not the case for all data classes?

True, and I already have corrected this for some. But I would say that this is a "mistake" of the ORM constructor. For example, the Group has a time field, but the constructor didn't allow to set it. It had to be set through Group.set_time method. But this seems unnecessary. I think the constructor should allow setting all fields in the schema. So if they are lacking in some classes, we can simply add them.

@sphuber
Copy link
Contributor Author

sphuber commented Jan 19, 2024

Alright, I have a potential solution. It is possible to represent repository content simply as a dictionary of bytes with the key being the relative filepath in the repo. Given that only the Node class has a file repo, I simply updated the Node.Model with the following field:

repository_content: Optional[dict[str, bytes]] = MetadataField(
    None,
    description='Dictionary of file repository content',
    orm_to_model=lambda node: node.base.repository.serialize_content(),
)

The NodeRepository.serialize_content method is implemented as follows:

def serialize_content(self) -> dict[str, bytes]:
    """Serialize the content of the repository content into a JSON-serializable format.

    :return: dictionary with the content metadata.
    """
    serialized = {}

    for dirpath, _, filenames in self.walk():
        for filename in filenames:
            filepath = dirpath / filename
            serialized[str(filepath)] = self.get_object_content(filepath, mode='rb')

    return serialized

The final modification required was to override Node.from_model:

@classmethod
def from_model(cls, model: Model) -> 'Node':
    """Return an entity instance from an instance of its model."""
    fields = cls.model_to_orm_field_values(model)

    repository_content = fields.pop('repository_content', {})
    node = cls(**fields)

    for filepath, content in repository_content.items():
        node.base.repository.put_object_from_bytes(content, filepath)

    return node

With these changes, it is now possible to create data nodes using the simple REST API endpoint I implemented in this AEP:

$ curl - X POST http: //127.0.0.1:8000/data/ -H 'Content-Type: application/json' -d '{"user": 1, "repository_content": {"file.txt": "some-content"}}'
{
    "pk": 157205,
    "uuid": "2c078850-f7ce-4d2d-a05f-fc87cc2ca256",
    "node_type": "data.Data.",
    "process_type": null,
    "repository_metadata": {
        "o": {
            "file.txt": {
                "k": "0a8cac771ca188eacc57e2c96c31f5611925c5ecedccb16b8c236d6c0d325112"
            }
        }
    },
    "ctime": "2024-01-19T15:20:26.577566+01:00",
    "mtime": "2024-01-19T15:20:26.669225+01:00",
    "label": "",
    "description": "",
    "extras": {
        "_aiida_hash": "274711a31a2eafaf55b6fe57f9eda272f6c00c7fbf0bdebd6d4dd72dc3596123"
    },
    "computer": null,
    "user": 1,
    "repository_content": {
        "file.txt": "some-content"
    },
    "source": null
}

Two big questions come to mind straight away:

  • This approach is reading all repo content into memory. This may be problematic for big files. I think it is possible for web APIs to maybe stream big files in chunks. I am not sure how REST APIs typically handle this, and if we could build support in for this. This is definitely worth looking into, but I think the current solution is already valuable for many use cases, if we make the caveat of big files clear to users.
  • The Node base class now defines repository_content as a field. However, subclasses, may define specific files that would be defined. For example, SinglefileData limits the files written to a single one, and the constructor has a dedicated argument for this. Same goes for ArrayData where the arrays argument in the constructor takes a numpy array or dictionary of arrays that are written as binary files to the repo. Should the model for these classes explicitly define these fields? The attributes of the Node class are currently not added as a field, because the current design (actually design choice of @chrisjsewell from this PR ✨ NEW: Add orm.Entity.fields interface for QueryBuilder aiida-core#5088) states that the subclasses should add explicit fields for the attributes that they know they (can) store.

I think that we cannot go the road of the attributes and not include the repository_content field on the Node base class. This is because certain node subclasses can store an arbitrary number of files that cannot be define a priori. Example is FolderData where it is impossible to define exactly what files will or can be stored. The equivalent for data is kind of the Dict class, so this problem is not as clear for the attributes field. Although I still think there is something to be said for allowing the Data base class to define the attributes field, because you can just store a Data instance with arbitrary attributes and it would be nice to get those attributes when serializing the data instance, which is currently not the case.

Taking the SinglefileData as example, I implemented the following:

class Model(Data.Model):
    content: bytes = MetadataField(
        description='The file content.', model_to_orm=lambda content: io.BytesIO(content)
    )
    filename: t.Optional[str] = MetadataField(None, description='The filename. Defaults to `file.txt`.')

This works just fine, except for a few modifications to the SinglefileData class. Unfortunately, the original desiner chose file as the main constructor argument, which is a protected keyword. Since we need to implement that constructor argument as a property, that would be subideal. As an alternative, I added the constructor argument content and added that as a property.

Another point is that when serializing a SinglefileData like this, the content would be returned twice. Once as the content field, and once in the repository_content. This is not ideal and we really want to remove the inherited repository_content field in this case. Unfortunately, this seems to no longer be supported in pydantic v2, even though it was in pydantic v1. See this discussion: pydantic/pydantic#2686
There is a dirty workaround for now, by adding the following after the SinglefileData definition:

del SinglefileData.Model.__fields__['repository_content']

This works for SinglefileData and the model now no longer has the repository_content field. Unfortunately, this doesn't seem to work for subclasses of SinglefileData like CifData which is very annoying.

As another test, I added support for ArrayData and the following seems to work:

class ArrayData(Data):

    class Model(Data.Model):
        model_config = ConfigDict(arbitrary_types_allowed=True)
        arrays: Optional[dict[str, ndarray]] = MetadataField(None, description='The dictionary of numpy arrays.')

Demonstration of functionality:

In [1]: from aiida.orm import ArrayData

In [2]: import numpy as np

In [3]: array = ArrayData(arrays=np.array([1, 2, 3]))

In [4]: array.to_model()
Out[4] Model(pk=None, uuid='47fb4799-e9f7-4a83-a9f5-580abb6250b7', node_type='data.core.array.ArrayData.', process_type=None, repository_metadata={}, ctime=datetime.datetime(2024, 1, 19, 15, 45, 45, 38034, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600), 'CET')), mtime=None, label='', description='', extras={}, computer=None, user=1, repository_content={'default.npy': b"\x93NUMPY\x01\x00v\x00{'descr': '<i8', 'fortran_order': False, 'shape': (3,), }                                                            \n\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00"}, source=None, arrays={'default': array([1, 2, 3])})

Note that here once again we would want to remove the inherited repository_content field.

@chrisjsewell
Copy link
Member

chrisjsewell commented Jan 19, 2024

But I would say that this is a "mistake" of the ORM constructor.
I think the constructor should allow setting all fields in the schema. So if they are lacking in some classes, we can simply add them.

We can't change aiida plugins though (with custom data classes) 😬, and since this constraint was never stipulated, probably most of them will be broken. Plus just in general

  1. These will probably be breaking changes
  2. We would need to think of a way to check for this new constraint of the __init__ signatures

A question would be; can plugin developer subclass the from_serialized classmethod, to do their own thing, or is meant to remain fixed?

It is possible to represent repository content simply as a dictionary of bytes with the key being the relative filepath in the repo.
The final modification required was to override Node.from_model
With these changes, it is now possible to create data nodes using the simple REST API endpoint I implemented in this AEP

I think you are missing a step here, in interchanging binary with strings without any explicit conversion?
For example, what does it look like to serialize ArrayData to legitimate JSON that can be used with REST

Although I still think there is something to be said for allowing the Data base class to define the attributes field

I feel, you should never be storing arbitrary things in attributes, because that kinda defeats the purpose of having data subclasses 😅
Storing anything in attributes (or in the object repository) is more of an implementation detail of the Dataclass, and should not need to be accessed by users. This is one of the purposes of #5088

I think ideally repository_metadata should not be directly exposed either; this will be very fragile if you start trying to change repository objects, which then invalidates their hash keys

I think it is possible for web APIs to maybe stream big files in chunks. I am not sure how REST APIs typically handle this, and if we could build support in for this.

We already do: https://github.com/aiidateam/aiida-restapi/blob/3f14ea2ac88e06ce9ce8d4cfb02ec16fea0463f7/aiida_restapi/routers/nodes.py#L68

I think the current solution is already valuable for many use cases, if we make the caveat of big files clear to users.

Yeh, not to say that it can't be useful 👍, but it is certainly a major caveat, that should be carefully considered.
I would just suggest being careful to not introduce anything that makes it even more difficult to implement a "full" solution at a later date.

@sphuber
Copy link
Contributor Author

sphuber commented Jan 19, 2024

We can't change aiida plugins though (with custom data classes) 😬, and since this constraint was never stipulated, probably most of them will be broken. Plus just in general

1. These will probably be breaking changes
2. We would need to think of a way to check for this new constraint of the `__init__` signatures

Fair enough. First off, the number of data plugins out there is quite limited. Additionally, I am not saying that we are promising that this new functionality will work with all plugins out there, out of the box. I also don't think we have to, because all of this functionality is additive and existing functionality remains completely unaffected. Even if no external data plugins would be compatible with the new functionality, just having this for all core ORM classes would be a huge win.

That being said...

A question would be; can plugin developer subclass the from_serialized classmethod, to do their own thing, or is meant to remain fixed?

yes, if their constructor does not accept all the fields defined by the model, and they cannot or don't want to change the constructor signature, they can simply override the from_model method on their data plugin class. The requirement of fields mapping on constructor argument is just such that the from_model method defined on the Entity base class can work. This is just a convenience and they can simply override this.

Still, I think that in most cases, the fields of the model not lining up with the constructor of the class is indicative of a design mistake. There are valid exceptions, for example, where certain fields are derived from other files. Let's say field a is required for the constructor, but b is not, because b is derived from a and set automatically in the constructor. The implementation allows for this though, and field b can simply set exclude_to_orm=True.

I think you are missing a step here, in interchanging binary with strings without any explicit conversion?
For example, what does it look like to serialize ArrayData to legitimate JSON that can be used with REST

Not quite sure what you mean with this. Do you mean that it assumes the encoding. But you are right, the way I do it now is incorrect. It works for the examples I added which add simple text files, but for binary data it is probably not correct. pydantic supports bytes fields and will dump it to bytes quite happily, but then it is not JSON serializable. I think the typical approach is to base64 encode the content to send it over the wire. It seems this approach should be perfectly acceptable for our case, shouldn't it?

I changed the Node implementation to the following:

class Node(Entity['BackendNode', NodeCollection], metaclass=AbstractNodeMeta):

    class Model(Entity.Model):
        repository_content: Optional[dict[str, bytes]] = MetadataField(
            None,
            description='Dictionary of file repository content. Keys are relative filepaths and values are binary file '
                'contents encoded as base64.',
            orm_to_model=lambda node: {key: base64.encodebytes(content) for key, content in node.base.repository.serialize_content().items()},
        )

    @classmethod
    def from_model(cls, model: Model) -> 'Node':
        """Return an entity instance from an instance of its model."""
        fields = cls.model_to_orm_field_values(model)

        repository_content = fields.pop('repository_content', {})
        node = cls(**fields)

        for filepath, encoded in repository_content.items():
            node.base.repository.put_object_from_bytes(base64.decodebytes(encoded), filepath)

        return node

So the byte content of the repo is base64 encoded when going from ORM to the pydantic model, which will allow it to be sent over the wite. The from_model then first base64 decodes the payload, before writing the decoded bytes to the repo. This seems to work just fine. When passing file content to the endpoint, the caller will just have to make sure to base64 encode the bytes:

$ echo 'test' | base64
dGVzdAo=
$ curl -X POST http: //127.0.0.1:8000/data/ -H 'Content-Type: application/json' -d '{"user": 1, "repository_content": {"file.txt": "dGVzdAo="}}'
{
    "pk": 157208,
    "repository_metadata": {
        "o": {
            "file.txt": {
                "k": "f2ca1bb6c7e907d06dafe4687e579fce76b37e4e93b7605022da52e6ccc26fd2"
            }
        }
    },
    "repository_content": {
        "file.txt": "dGVzdAo=\n"
    },
}
$ verdi node repo cat 157208 file.txt
test

Something similar can be implemented for the ArrayData class:

class ArrayData(Data):

    class Model(Data.Model):
        model_config = ConfigDict(arbitrary_types_allowed=True)
        arrays: Optional[dict[str, bytes]] = MetadataField(
            None,
            description='The dictionary of numpy arrays.',
            orm_to_model=lambda node: ArrayData.save_arrays(node.arrays),
            model_to_orm=lambda value: ArrayData.load_arrays(value)
        )

    @staticmethod
    def save_arrays(arrays: dict[str, np.ndarray]) -> dict[str, bytes]:
        results = {}

        for key, array in arrays.items():
            stream = io.BytesIO()
            np.save(stream, array)
            stream.seek(0)
            results[key] = base64.encodebytes(stream.read())

        return results

    @staticmethod
    def load_arrays(arrays: dict[str, bytes]) -> dict[str, np.ndarray]:
        results = {}

        for key, encoded in arrays.items():
            stream = io.BytesIO(base64.decodebytes(encoded))
            stream.seek(0)
            results[key] = np.load(stream)

        return results

used as follows:

In [1]: import numpy as np
        from aiida.orm import ArrayData

In [2]: a = ArrayData(np.array([[1,2], [3,4]]))

In [3]: a.serialize()
Out[3]: 
{'pk': None,
 'uuid': '820ed501-55b7-4440-9eea-f22681d1805e',
 'node_type': 'data.core.array.ArrayData.',
 'process_type': None,
 'repository_metadata': {},
 'ctime': datetime.datetime(2024, 1, 19, 20, 28, 36, 772718, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600), 'CET')),
 'mtime': None,
 'label': '',
 'description': '',
 'extras': {},
 'computer': None,
 'user': 1,
 'repository_content': None,
 'source': None,
 'arrays': {'default': b'k05VTVBZAQB2AHsnZGVzY3InOiAnPGk4JywgJ2ZvcnRyYW5fb3JkZXInOiBGYWxzZSwgJ3NoYXBl\nJzogKDIsIDIpLCB9ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg\nICAgICAgICAgICAgIAoBAAAAAAAAAAIAAAAAAAAAAwAAAAAAAAAEAAAAAAAAAA==\n'}}

In [4]: r = ArrayData.from_serialized(**a.serialize())

In [5]: r.get_array()
Out[5]: 
array([[1, 2],
       [3, 4]])

Storing anything in attributes (or in the object repository) is more of an implementation detail of the Dataclass, and should not need to be accessed by users. This is one of the purposes of #5088

Fair enough. I also kept that as is and didn't add the attributes field on the Node.Model. Do you agree though that the repo is different? If for just the FolderData and nothing else.

I think ideally repository_metadata should not be directly exposed either; this will be very fragile if you start trying to change repository objects, which then invalidates their hash keys

I agree that it should not be possible to set this directly, because that would lead to inconsistencies. But I think it is very useful to be able to retrieve from an existing node. But in the current implementation, its field defines exclude_to_orm=True so it isn't possible to manually define it. So I think we are ok here.

We already do: https://github.com/aiidateam/aiida-restapi/blob/3f14ea2ac88e06ce9ce8d4cfb02ec16fea0463f7/aiida_restapi/routers/nodes.py#L68

I guess the multi-part file upload is indeed the typical approach there. But here it is just implemented for an explicit endpoint for SinglefileData with just a single file. Not sure if this could easily be generalized to an arbitrary number of files, for example if a FolderData needs to be created.

I would just suggest being careful to not introduce anything that makes it even more difficult to implement a "full" solution at a later date.

Do you have an idea what such a full solution might look like? If there is a better solution, I would also prefer that.

AiiDA's Python API provides an object relational mapper (ORM) that
abstracts the various entities that can be stored inside the provenance
graph and the relationships between them. In most use cases, users use
this ORM directly in Python to construct new instances of entities and
retrieve existing ones, in order to get access to their data and
manipulate it.

A current shortcoming of the ORM is that it is not possible to
programmatically introspect the schema of each entity: that is to say,
what data each entity stores. This makes it difficult for external
applications to provide interfaces to create and or retrieve entity
instances. It also makes it difficult to take the data outside of the
Python environment since the data would have to be serialized. However,
without a well-defined schema, doing this without an ad-hoc solution is
practically impossible.

Clear data schemas for all ORM entities would enable the creation of
external applications to work with the data stored in AiiDA's provenance
graph. A typical example use case would be a web API whose interface, to
create and return ORM entities from an AiiDA profile, would be
dynamically generated by programmatically introspecting the schema of
all ORM entities stored within it. Currently, the interface has to be
manually generated for each ORM entity, and data (de)serialization has
to be implemented.
@giovannipizzi
Copy link
Member

Hi, quick comment, and maybe there are multiple usecases and not all can be covered with the same technical solution.
Do we really need to serialise all inside a single JSON? As mentioned above, this will have the big limitation that all has to fit in memory (but this clearly can be not true e.g. with multi-GB files stored on the repo, e.g. the output of a MD simulation or similar.

For the API point of view, I would suggest not to have the files base64-encoded in the responses, but just links to other links to download them. This would allow direct streaming of files over HTTP that can be downloaded in chunks etc with standard clients. This could either be obtained by providing links instead of content for every file, or (if a standard URL is decided) one could just return the list of folders/files in the repo, and the client would use the info to get the file (e.g. something like <BASE_RESTAPI_URL>/nodes/<NODE_IDENTIFIER>/files/<FILE_PATH_AND_NAME_IN_REPO>).
Optionally one can also provide a 'bulk download' option that e.g. returns a zip or a tar of all files in the repo of a given node.

If we need to serialise to file, instead, (i.e. not for the REST API), also in this case I would think to an alternative format than JSON. E.g. a .zip wherein a JSON file is still used for the attributes, but files are stored as files, similar to the .aiida files.

Maybe there is however another usecase I'm missing?

@sphuber
Copy link
Contributor Author

sphuber commented Apr 3, 2024

Fair point about memory usage, for large files it certainly won't be a viable option and a solution is needed that allows for streaming the file when downloading. For retrieving node files this won't be that much of a problem. We can always add additional endpoints for this purpose.

What is more tricky I think is the node creation end point. There the files have to be sent with the rest of the data because the call needs to create the node in one go. It cannot first create the node in one call and then add files, because it will already have been stored and be immutable. Here, having the option of passing the file content as serialized strings works quite nicely as long as they are not too big. Not sure yet how to approach the case where files are too big for memory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants