Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ NEW: Add orm.Entity.fields interface for QueryBuilder #5088

Closed
wants to merge 1 commit into from

Conversation

chrisjsewell
Copy link
Member

@chrisjsewell chrisjsewell commented Aug 19, 2021

The QueryBuilder currently has a conceptual problem, in that it essentially exposes the backend database to the user. A classic example of this is that, on the front end, we use Node.pk, but when using the QueryBuilder, one has to use the "backend" name id (always confusing for new users) (#2577, #2196).
Additionally, there is currently no "programmatic" way to infer what can be retrieved from the database for each ORM entity, which is something very desirable for the REST API (aiidateam/aiida-restapi#1).

In this PR, I propose a new API (mainly in aiida/orm/fields.py), for:

  1. "decorating" ORM properties as database fields, Adding a __qb_fields__ class attribute containing QbField instances (and the QbAttrField subclass)
  2. "gathering" these instances into a new Entity.fields class attribute, then
  3. allowing these fields to be used when constructing QueryBuilder instances.

The API is backward compatible and looks like this (also see tests/orm/test_fields.py):

On any ORM entity, it will inherit parent fields, then if you want to add extra fields, you can add a __fields__ class attribute or directly decorate methods:

from typing import Optional
from aiida.orm import Data, QbAttrField

class NewData(Data):
   
     __fields__ = (
          QbAttrField("new_key", dtype=str, doc="A new key"),
     )

This class will now have a fields class attribute, which gives you the mapping of ORM properties to database fields:

In [1]: NewData.fields
Out[1]: 
{'ctime': 'QbField(ctime) -> datetime',
 'description': 'QbField(description) -> str',
 'extras': 'QbField(extras.*) -> Dict[str, Any]',
 'label': 'QbField(label) -> str',
 'mtime': 'QbField(mtime) -> datetime',
 'new_key': 'QbAttrField(attributes.new_key) -> str',
 'node_type': 'QbField(node_type) -> str',
 'pk': 'QbField(id) -> int',
 'repository_metadata': 'QbField(repository_metadata) -> Dict[str, Any]',
 'source': 'QbAttrField(attributes.source.*) -> Union[dict, NoneType]',
 'user_pk': 'QbField(user_id) -> int',
 'uuid': 'QbField(uuid) -> str'}
In [2]: NewData.fields.new_key
Out[2]: QbAttrField('new_key', dtype=str)

(you can see the fields for all aiida nodes in tests/orm/test_fields)

Within the QueryBuilder you can then directly use these fields for the project, filters and order_by values e.g.

In [4]: qb = QueryBuilder().append(Node, project=[Node.fields.pk, Node.fields.mtime], filters={Node.fields.pk: 1})
In [5]: qb.order_by({'node_1': Node.fields.mtime})

Out[5]: QueryBuilder(path=[{'entity_type': '', 'orm_base': 'node', 'tag': 'node_1', 'joining_keyword': None, 'joining_value': None, 'edge_tag': None, 'outerjoin': False}], filters={'node_1': {'node_type': {'like': '%'}, 'id': 1}}, project={'node_1': [{'id': {}}, {'mtime': {}}]}, project_map={'node_1': {'id': 'pk'}}, order_by=[{'node_1': [{'mtime': {'order': 'asc'}}]}], limit=None, offset=None, distinct=False)

# to get all fields
In [6]: QueryBuilder().append(Node, project=Node.fields)
Out[6]: QueryBuilder(path=[{'entity_type': '', 'orm_base': 'node', 'tag': 'node_1', 'joining_keyword': None, 'joining_value': None, 'edge_tag': None, 'outerjoin': False}], filters={'node_1': {'node_type': {'like': '%'}}}, project={'node_1': [{'ctime': {}, 'description': {}, 'extras': {}, 'label': {}, 'mtime': {}, 'node_type': {}, 'id': {}, 'repository_metadata': {}, 'user_id': {}, 'uuid': {}}]}, project_map={'node_1': {'id': 'pk', 'user_id': 'user_pk'}}, order_by=[], limit=None, offset=None, distinct=False)

# you can also use field comparitors to construct filters
In [7]: QueryBuilder().append(Node, project=Node.fields.pk, filters=(Node.fields.pk <= 2) & (Node.fields.pk.in_([2, 3])))
Out[7]: QueryBuilder(path=[{'entity_type': '', 'orm_base': 'node', 'tag': 'node_1', 'joining_keyword': None, 'joining_value': None, 'edge_tag': None, 'outerjoin': False}], filters={'node_1': {'node_type': {'like': '%'}, 'id': {'and': [{'<=': 2}, {'in': {2, 3}}]}}}, project={'node_1': [{'id': {}}]}, project_map={'node_1': {'id': 'pk'}}, order_by=[], limit=None, offset=None, distinct=False)

# fields can be designated as subscribtable to index into them
In [8]: QueryBuilder().append(Dict, project=Dict.fields.dict["subkey"])
Out[8]: QueryBuilder(path=[{'entity_type': 'data.core.dict.Dict.', 'orm_base': 'node', 'tag': 'Dict_1', 'joining_keyword': None, 'joining_value': None, 'edge_tag': None, 'outerjoin': False}], filters={'Dict_1': {'node_type': {'like': 'data.core.dict.%'}}}, project={'Dict_1': [{'attributes.subkey': {}}]}, project_map={'Dict_1': {'attributes.subkey': 'dict.subkey'}}, order_by=[], limit=None, offset=None, distinct=False)

then run the query as normal.

If you run qb.dict()/qb.iterdict(), it will correctly give you the front end projection keys, e.g.

In [1]: QueryBuilder().append(Dict, project=[Dict.fields.pk, Dict.fields.dict["a"]]).dict()
Out[1]: 
[{'Dict_1': {'pk': 15, 'dict.a': 1}},
 {'Dict_1': {'pk': 7, 'dict.a': 1}},
 {'Dict_1': {'pk': 3, 'dict.a': None}}]

Probably the main limitation at present, is that the output by qb.dict()/qb.iterdict() will still return the backend keys, e.g. id not pk (a4ee061)

Some other things that come to mind:

  • For e.g. Dict.fields.dict we would ideally like to be able to do e.g. Dict.fields.dict.key1 (use subscriptable=True when defining field)
  • Some fields are missing for joined ids e.g. Node.user_id (added foreign_key field kwarg)
  • Composite fields, e.g. in StructureData you have pbc -> [attributes.pbc1, attributes.pbc2, attributes.pbc3] in the database (can now use __qb_fields__)
  • Field post-conversions (after having been returned from the database ORM), e.g. list to numpy array (possible, but not adding to this PR)
  • You have to use some tyep: ignore[misc] on the decorators currently, due to Decorated property not supported python/mypy#1362
  • Obviously, also need to add tests (added tests/orm/test_fields.py)

Definitely love to have some feedback on this cc @sphuber, @giovannipizzi, @ltalirz, @mbercx, @ramirezfranciscof

@chrisjsewell chrisjsewell marked this pull request as ready for review August 19, 2021 05:00
@chrisjsewell chrisjsewell marked this pull request as draft August 19, 2021 05:01
@codecov
Copy link

codecov bot commented Aug 19, 2021

Codecov Report

Merging #5088 (c830af3) into develop (ea0f447) will decrease coverage by 0.64%.
The diff coverage is 87.65%.

❗ Current head c830af3 differs from pull request most recent head 9631a91. Consider uploading reports for the commit 9631a91 to get more accurate results

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #5088      +/-   ##
===========================================
- Coverage    82.12%   81.48%   -0.63%     
===========================================
  Files          533      531       -2     
  Lines        38510    37348    -1162     
===========================================
- Hits         31624    30431    -1193     
- Misses        6886     6917      +31     
Flag Coverage Δ
django 76.98% <87.65%> (-0.22%) ⬇️
sqlalchemy 75.98% <87.65%> (-0.53%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/orm/utils/node.py 91.18% <60.00%> (ø)
aiida/orm/querybuilder.py 84.63% <80.00%> (-0.37%) ⬇️
aiida/orm/fields.py 83.22% <83.22%> (ø)
aiida/orm/__init__.py 100.00% <100.00%> (ø)
aiida/orm/authinfos.py 83.83% <100.00%> (+0.50%) ⬆️
aiida/orm/comments.py 91.38% <100.00%> (+0.31%) ⬆️
aiida/orm/computers.py 81.57% <100.00%> (-0.60%) ⬇️
aiida/orm/entities.py 96.72% <100.00%> (+0.98%) ⬆️
aiida/orm/groups.py 94.17% <100.00%> (+0.05%) ⬆️
aiida/orm/implementation/querybuilder.py 94.24% <100.00%> (+4.04%) ⬆️
... and 247 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da5f9b0...9631a91. Read the comment docs.

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Aug 19, 2021

another thought would be in some way to tie the fields to the backend, e.g. Node.fields.name would return a dynamic field name to use for querying based on the backend. This is certainly possible, and could be added at a later date without changing the user API

@ltalirz
Copy link
Member

ltalirz commented Aug 19, 2021

thanks @chrisjsewell!
I definitely agree that a route for accessing database properties of any given node programmatically through the frontend API is badly needed.

One question that comes to mind is how much tangible benefit the separation between frontend and backend will really still bring once we've dropped one of the two backends. But I guess that's still a bit further down the line and more controversial, so by adding this mapping explicitly and making it available for programmatic use I think we are making a step in the right direction.

Thinking about how to reuse this in e.g. the REST API, and your open question:

  • For e.g. Dict.fields.dict we would ideally like to be able to do e.g. Dict.fields.dict.key1

If I see correctly, you are using python's type hints to write the "schema" here, which I guess might be limiting when you want to represent more elaborate constraints than data types, such as specific field names in a dictionary that are optional; allowed value ranges (which could also be very interesting to reuse in the verdi CLI) as well as documentation of the fields.

In the aiida-restapi package we use pydantic for this - do you think something like this would make sense here as well?
Or maybe type hints can also do this - you're the expert - in that case: even better.

@chrisjsewell
Copy link
Member Author

Another, another thought would be, perhaps you could also extend this to specifying repository attributes, e.g.

@has_fields
class NewData(orm.Data):
   @property
   @field_method("path/to/file", in_repo=True)
   def y(self) -> str:
      return self.get_object_content("path/to/file")

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Aug 19, 2021

thanks @ltalirz

you are using python's type hints to write the "schema" here, which I guess might be limiting when you want to represent more elaborate constraints than data types, such as ... allowed value ranges

Well I guess here you could just add additional key-words to field_method, e.g.

   @field_method("attributes.x", validate=dict(min=0, max=100))
   def x(self) -> Optional[int]:
      return self.get_attribute("x", None)

In the aiida-restapi package we use pydantic for this - do you think something like this would make sense here as well?

I feel the drawback of pydantic is that maybe it does not work well with sub-classing; if we do not use it directly (e.g. for EntityFields) we can certainly take inspiration from it
Note my API takes some inspiration from https://github.com/python-attrs/attrs and https://docs.python.org/3/library/dataclasses.html

@ltalirz
Copy link
Member

ltalirz commented Aug 19, 2021

thanks, I also recently came across attrs in a different context - looks pretty cool.

I guess what I'm saying is that when making this change it would be great if you could include with it one small code example of how e.g. the REST API (or the verdi cli) could take advantage of this interface, removing the duplicated schema information that currently resides there (explicitly in the case of the new aiida-restapi; implicitly in old REST API & the verdi cli).

Just to make sure that we're achieving this goal - or if we are not, to be aware of what's still missing and how we can get there.

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Aug 19, 2021

I also recently came across attrs in a different context - looks pretty cool.

Yeh I use it all over the place, its great, e.g. https://github.com/executablebooks/MyST-Parser/blob/master/myst_parser/main.py#L31

Just to make sure that we're achieving this goal - or if we are not, to be aware of what's still missing and how we can get there.

Yep fair. I don't really want to start rewriting the old REST API though lol, do you have any CLI commands in mind where this could be useful?

@chrisjsewell
Copy link
Member Author

(in 74f39f2 I have removed the need for using has_fields, and introduced __fields__, will update initial comment)

@@ -134,6 +144,23 @@ def process_class(self) -> Type['Process']:

return process_class

@property # type: ignore
@field_method()
def process_type(self) -> Optional[str]:
Copy link
Member Author

@chrisjsewell chrisjsewell Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think process_type should be removed from the base Node class, since obviously it is just a process specific field and an implementation detail that it is also contained in the database table for all nodes.

(I added it here so that it only shows up in fields of ProcessNode subclasses, not all nodes)

Copy link
Member Author

@chrisjsewell chrisjsewell Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same for computer, that is only relevant to code and process nodes)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, computer should stay for all nodes I think, some other nodes use it (specifically: RemoteData). Default = None.

Copy link
Contributor

@sphuber sphuber Aug 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

computer should definitely stay on the Node level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted this change as no longer necessary

@chrisjsewell chrisjsewell marked this pull request as ready for review August 20, 2021 12:20
@chrisjsewell
Copy link
Member Author

chrisjsewell commented Aug 20, 2021

Ok added just about all the fields for all the node types.
It does make it so much easier, e.g. with verdi shell tab-completion, to find what fields you can use in queries:

image

image

image

@giovannipizzi
Copy link
Member

Thanks @chrisjsewell, in general I think this was a very important missing feature.

Just a few quick questions:

  • to double check: the QueryBuilder interface is backward compatible, meaning you can still use strings to project etc., or now one has to use the new syntax only?
  • Once this is merged, would the plan to deprecate the old syntax at some point, or not? Or at least to prioritise one over the other in 1) the documentation, and 2) the usage in the code itself (e.g. all queries in the cmdline)
  • Is the limitation you mention ("qb.dict()/qb.iterdict() will still return the backend keys, e.g. id not pk") hard to overcome? I imagine that, in order to go to a fully machine-usable querybuilder, knowing in a reliable way which keys will be returned when one uses .dict() is important?

@chrisjsewell
Copy link
Member Author

to double check: the QueryBuilder interface is backward compatible, meaning you can still use strings to project etc., or now one has to use the new syntax only?

Yes, fully backward compatible

nce this is merged, would the plan to deprecate the old syntax at some point, or not? Or at least to prioritise one over the other in 1) the documentation, and 2) the usage in the code itself (e.g. all queries in the cmdline)

certainly to prioritise this syntax, then, once people have had a good chance to play around with it, I would indeed consider deprecating

Is the limitation you mention ("qb.dict()/qb.iterdict() will still return the backend keys, e.g. id not pk") hard to overcome? I imagine that, in order to go to a fully machine-usable querybuilder, knowing in a reliable way which keys will be returned when one uses .dict() is important?

I think with #5093 there is definitely a way forward with this

@chrisjsewell chrisjsewell force-pushed the entity-fields branch 2 times, most recently from b326479 to e20114e Compare August 25, 2021 10:20
@chrisjsewell
Copy link
Member Author

chrisjsewell commented Aug 25, 2021

Probably the main limitation at present, is that the output by qb.dict()/qb.iterdict() will still return the backend keys, e.g. id not pk

Alright now, after #5093, I know what the heck is going on with the QueryBuilder lol, it was pretty easy to add this (a4ee061).

So now you can do e.g.:

In [1]: QueryBuilder().append(Dict, project=[Dict.fields.pk, Dict.fields.dict["a"]]).dict()
Out[1]: 
[{'Dict_1': {'pk': 15, 'dict.a': 1}},
 {'Dict_1': {'pk': 7, 'dict.a': 1}},
 {'Dict_1': {'pk': 5, 'dict.a': 1}},
 {'Dict_1': {'pk': 4, 'dict.a': None}},
 {'Dict_1': {'pk': 3, 'dict.a': None}}]

🎉

(again, this is fully back compatible, i.e. if you projected just the "id" string, that is what you will get back)

@chrisjsewell
Copy link
Member Author

ok, presuming the tests pass, this is now good to go IMO

@chrisjsewell
Copy link
Member Author

Obviously it should also be documented but, so as not to make this PR any more complex, I will do that in a separate PR once this API is agreed on and merged

@giovannipizzi
Copy link
Member

Thanks @chrisjsewell !
One question, what is Dict.fields.dict? Is this just an alias for Dict.fields.attributes? Do other data types have similar aliases?

While I'm OK to keep the API flexible, I'm wondering if it's more confusing or more clear to make this mapping automatically - at some point, users need to know that properties are stored either as files or as attributes (or extras), and use those (I don't see this as a schema detail, that indeed should be visible, but as an "ontological" aspect of how we represent node content in AiiDA). And in practice only attributes (and extras) are efficiently queryable, not content in the files.
Or maybe you just meant Dict.fields.attributes?

Also (this is for my clarity), Dict_1 is the automatic tag since none was specified, and if you specify a tag explicitly, this is what will appear in the .dict() output, right?

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Aug 26, 2021

Also (this is for my clarity), Dict_1 is the automatic tag since none was specified, and if you specify a tag explicitly, this is what will appear in the .dict() output, right?

yes

One question, what is Dict.fields.dict? Is this just an alias for Dict.fields.attributes? Do other data types have similar aliases?

yes; it is an alias and also set to subscriptable=True, i.e. it allows Dict.fields.dict["key"].
Note, in the string reporesentation, the .* indicates it is subscriptable:

In [1]: str(Dict.fields.dict)
Out[1]: 'OrmField(attributes.*) -> dict'
In [2]: str(Dict.fields.extras)
Out[2]: 'OrmField(extras.*) -> dict'

While I'm OK to keep the API flexible, I'm wondering if it's more confusing or more clear to make this mapping automatically - at some point, users need to know that properties are stored either as files or as attributes (or extras), and use those (I don't see this as a schema detail, that indeed should be visible, but as an "ontological" aspect of how we represent node content in AiiDA). And in practice only attributes (and extras) are efficiently queryable, not content in the files.
Or maybe you just meant Dict.fields.attributes?

So this is actually a broader point that I was planning to make in response to #4976 (and I plant to open a PR for programmatically/dynamically setting __dir__ on Node subclasses):

Firstly, I would say there are broadly three kinds of users of the code (that can overlap):

  1. a developer of aiida-core
  2. a developer of a plugin
  3. a user of a plugin

Here I am talking about (3).

IMO they should never need to know about the concept of attributes on a Node subclass; ontologically, they should know about three things:

  • keys that relate to "queryable" fields, that are read-only once the node is stored
  • keys that relate to "queryable" fields, that are read/write always (i.e. extras)
  • keys that relate to "non-queryable" fields

Take Int for example:

When you tab complete on it, you get 92 options!

In [1]: print(sorted([a for a in Int().__dir__() if not a.startswith('_')]))
['Collection', 'add_comment', 'add_incoming', 'attributes', 'attributes_items', 'attributes_keys', 'backend', 'backend_entity', 'check_mutability', 'class_node_type', 'clear_attributes', 'clear_extras', 'clear_hash', 'clone', 'computer', 'convert', 'creator', 'ctime', 'delete_attribute', 'delete_attribute_many', 'delete_extra', 'delete_extra_many', 'delete_object', 'description', 'erase', 'export', 'extras', 'extras_items', 'extras_keys', 'from_backend_entity', 'get', 'get_all_same_nodes', 'get_attribute', 'get_attribute_many', 'get_cache_source', 'get_comment', 'get_comments', 'get_description', 'get_export_formats', 'get_extra', 'get_extra_many', 'get_hash', 'get_incoming', 'get_object_content', 'get_outgoing', 'get_stored_link_triples', 'has_cached_links', 'id', 'importfile', 'importstring', 'init_from_backend', 'initialize', 'is_created_from_cache', 'is_stored', 'is_valid_cache', 'label', 'list_object_names', 'list_objects', 'logger', 'mtime', 'new', 'node_type', 'objects', 'open', 'pk', 'process_type', 'put_object_from_file', 'put_object_from_filelike', 'put_object_from_tree', 'rehash', 'remove_comment', 'repository_metadata', 'repository_serialize', 'reset_attributes', 'reset_extras', 'set_attribute', 'set_attribute_many', 'set_extra', 'set_extra_many', 'set_source', 'source', 'store', 'store_all', 'update_comment', 'user', 'uuid', 'validate_incoming', 'validate_outgoing', 'validate_storability', 'value', 'verify_are_parents_stored', 'walk']

Firstly, this is not user-friendly but, more conceptually, a user should never be directly setting/getting attributes or objects (i.e. files) on an Int; these are implementation details

(w.r.t to tab completion, the naming of methods is also not ideal, e.g. if I want to deal with extras, then really I just want to do Int().extra<TAB> to see all the available methods)

You could roughly reduce this to at least 53:

In [1]: print(sorted([a for a in Int().__dir__() if not a.startswith('_')]))
['add_comment', 'add_incoming', 'class_node_type', 'clear_extras', 'clear_hash', 'clone', 'convert', 'ctime', 'delete_extra', 'delete_extra_many', 'description', 'erase', 'export', 'extras', 'extras_items', 'extras_keys', 'get_all_same_nodes', 'get_cache_source', 'get_comment', 'get_comments', 'get_description', 'get_export_formats', 'get_extra', 'get_extra_many', 'get_hash', 'get_incoming', 'get_outgoing', 'get_stored_link_triples', 'has_cached_links', 'is_created_from_cache', 'is_stored', 'label', 'mtime', 'new', 'node_type', 'pk', 'rehash', 'remove_comment', 'reset_extras', 'set_extra', 'set_extra_many', 'set_source', 'source', 'store', 'store_all', 'update_comment', 'user', 'uuid', 'validate_incoming', 'validate_outgoing', 'validate_storability', 'value', 'verify_are_parents_stored']

(and this could be dynamically reduced even more when the node is stored)

Analogously, for querying, a user shouldn't need to "know"/care the value is stored as an attribute, they just want to get the value, i.e. they should not be exposed to:

In[1]: QueryBuilder().append(Int, project="attribute.value").dict()
Out[1]:
[{'Int_1': {'attribute.value': 1}}]

they just want to do:

In[1]: QueryBuilder().append(Int, project=Int.fields.value).dict()
Out[1]:
[{'Int_1': {'value': 1}}]

and then this is "magnified" on other types, that have a lot more thing stored as keys of attributes

In [1]: Code.fields
Out[1]: 
{'append_text': 'OrmField(attributes.append_text) -> Union[str, NoneType]',
 'ctime': 'OrmField(ctime) -> datetime',
 'description': 'OrmField(description) -> str',
 'extras': 'OrmField(extras.*) -> dict',
 'input_plugin': 'OrmField(attributes.input_plugin) -> Union[str, NoneType]',
 'is_local': 'OrmField(attributes.is_local) -> Union[bool, NoneType]',
 'label': 'OrmField(label) -> str',
 'local_executable': 'OrmField(attributes.local_executable) -> Union[str, '
                     'NoneType]',
 'mtime': 'OrmField(mtime) -> datetime',
 'node_type': 'OrmField(node_type) -> str',
 'pk': 'OrmField(id) -> int',
 'prepend_text': 'OrmField(attributes.prepend_text) -> Union[str, NoneType]',
 'remote_exec_path': 'OrmField(attributes.remote_exec_path) -> Union[str, '
                     'NoneType]',
 'repository_metadata': 'OrmField(repository_metadata) -> Dict',
 'source': 'OrmField(attributes.source) -> Union[dict, NoneType]',
 'user_id': 'OrmField(user_id) -> int -> user',
 'uuid': 'OrmField(uuid) -> str'}

In [2]: CalcJobNode.fields
Out[2]: 
{'ctime': 'OrmField(ctime) -> datetime',
 'description': 'OrmField(description) -> str',
 'exception': 'OrmField(attributes.exception) -> Union[str, NoneType]',
 'exit_message': 'OrmField(attributes.exit_message) -> Union[str, NoneType]',
 'exit_status': 'OrmField(attributes.exit_status) -> Union[int, NoneType]',
 'extras': 'OrmField(extras.*) -> dict',
 'job_state': 'OrmField(attributes.state) -> Union[str, NoneType]',
 'label': 'OrmField(label) -> str',
 'mtime': 'OrmField(mtime) -> datetime',
 'node_type': 'OrmField(node_type) -> str',
 'paused': 'OrmField(attributes.paused) -> bool',
 'pk': 'OrmField(id) -> int',
 'process_label': 'OrmField(attributes.process_label) -> Union[str, NoneType]',
 'process_state': 'OrmField(attributes.process_state) -> Union[str, NoneType]',
 'process_status': 'OrmField(attributes.process_status) -> Union[str, '
                   'NoneType]',
 'process_type': 'OrmField(process_type) -> Union[str, NoneType]',
 'repository_metadata': 'OrmField(repository_metadata) -> Dict',
 'scheduler_state': 'OrmField(attributes.scheduler_state) -> str',
 'sealed': 'OrmField(attributes.sealed) -> bool',
 'user_id': 'OrmField(user_id) -> int -> user',
 'uuid': 'OrmField(uuid) -> str'}

A final note here; the fact that we use an attributes JSONB field to store most of the data for Node subclasses is a compromise we have made, to allow re-use of the same DB table for all Node subclasses. A likely unavoidable compromise, but a compromise nonetheless.
Obviously, if you wanted to make the most efficient implementation for storing e.g. a ProcessNode and its subclasses, you would make most of these attribute fields into actual standalone fields (and set indexes on some).

Or maybe you just meant Dict.fields.attributes?

So coming back to your original question: no I don't meant attributes; that is an implementation detail, I mean dict. That being said dict is a bit of a weird one, because for all other "base types" (Int, Float, List, etc) we use value to represent the field of the value.
I could change this field to value, but then there would be a disconnect between the class attribute (Data().dict) and the query field (Data.fields.value)

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chrisjsewell . I think the concept proposed here is very useful and badly needed indeed. I went through the changes and most of it looks fine, with some minor comments here and there, but I also still have some more generic questions about the design and the implications.

With the current design, we are effectively coupling the front-end directly to implementation details in the backend. By having to specify the database column in the front-end, the separation that we had will be gone. I am not saying that this would necessarily be a deal-breaker, since anyway we might be getting rid of the need of a separation if we're getting rid of Django, but I wonder if it is good and necessary in principle. What we really need in the front-end for the user is what you summarized very succinctly elsewhere:

keys that relate to "queryable" fields, that are read-only once the node is stored
keys that relate to "queryable" fields, that are read/write always (i.e. extras)
keys that relate to "non-queryable" fields

I would maybe add "queryable" but "private" fields that AiiDA stores in the database (hash etc.). This is an effective implementation-free description of what is stored in the repository and what in the database. Could we not move the concept of OrmFields to the backend layer and have the front-end orm.Entity automatically detect the defined fields on the BackendEntity that it contains. This would allow us to later extend the concept of fields in the front-end Entity with non-queryable fields, which could be data stored in the repository.

A second question is about consistency. There are two methods of defining a field to an ORM class: by explicitly declaring it in __fields__ or by decorating a class method or property. This is mostly a problem of historicity and having to keep backwards compatibility, but if we were to design the ORM from scratch, wouldn't we ideally just define the fields explicitly through __fields__ and have those generate properties dynamically. This would save a lot of code typing and would guarantee a uniform interface. My question is now that if you agree with this, and if so, if we should maybe already try to take this approach and drop the decorator. We can then also deprecate the methods that happen to not follow the naming convention of the automatically and dynamically named getters. There can of course still be custom methods if they need to perform additional logic to just retrieving a database value and returning it.

I think this approach would also prevent inconsistencies in to what fields are marked as fields, and which aren't. Reading through the code, I think there are quite a few entities that do not have certain fields defined, even though they probably should. Computer.user is one of them for example. Of course we could always thoroughly go through the code before merging to make sure everything is there, but I was wondering if the other approach may prevent the potential for missing fields.

@@ -77,6 +77,8 @@ class QueryDictType(TypedDict):
# mapping: tag -> [] -> field -> 'func' -> 'max' | 'min' | 'count'
# 'cast' -> 'b' | 'd' | 'f' | 'i' | 'j' | 't'
project: Dict[str, List[Dict[str, Dict[str, Any]]]]
# mapping: tag -> field -> return key for iterdict method
project_map: Dict[str, Dict[str, str]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You define this here but don't actually use it I believe. Yet in other files (e.g., aiida.orm.querybuilder), you could use it but then just literally write Dict[str, Dict[str, str]] again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrisjsewell what's the plan here? I see a thumbs up but not an implementation.

aiida/orm/nodes/data/structure.py Outdated Show resolved Hide resolved
Comment on lines 355 to 379
@field_method(f'attributes.{SCHEDULER_STATE_KEY}', key='scheduler_state', dtype=str)
def get_scheduler_state(self) -> Optional['JobState']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice here that the dtype does not necessarily have to correspond to the return type of the method that it decorates. Makes sense since the method can operate on the raw database value before returning it as is the case here. I was just wondering if there could be a potential problem here. I guess in the end it boils down to the choice of defining an entities fields through its methods. I guess this is mostly just a convenience thing, because you also have the possibility to define the OrmFields directly and manually in the fields attribute. I am wondering if there is a downside to having both, especially since the decoration of method can have this discrepancy in types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh, I like the idea of having the field “close” to the method, as it feels like it will make things like refactoring easier, but indeed the special case discrepancies are not ideal. I guess you will still have this problem if, as you mention, we want to look to auto generate the methods from the fields

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 35 to 36
SCHEDULER_STATE_KEY = 'scheduler_state'
CALC_JOB_STATE_KEY = 'state'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make all class constants of CalcJobNode module constants, as you did for ProcessNode, just for consistency?

Copy link
Member Author

@chrisjsewell chrisjsewell Dec 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I realised this change was not necessary, so have reverted it (for ProcessNode and CalcJobNode)

aiida/orm/fields.py Outdated Show resolved Hide resolved
aiida/orm/fields.py Outdated Show resolved Hide resolved
@chrisjsewell
Copy link
Member Author

Thanks for the feedback @sphuber
some quick mobile points:
(now in a McDonald’s waiting for a train from Zagreb to Munich lol)

With the current design, we are effectively coupling the front-end directly to implementation details in the backend.
Could we not move the concept of OrmFields to the backend layer and have the front-end orm.Entity

I certainly agree we want to push good abstractions 👍. Obviously the query builder was actually abstracted in the first place, in the sense you request/receive directly backend columns. I’ll have a think if there is a better abstraction, but a key thing here is that I would like this API to be available to be available to plugin developers, to define their own fields on added data nodes, which I think would not be possible if moved to the backend

but if we were to design the ORM from scratch, wouldn't we ideally just define the fields explicitly through fields

yeh interesting, possibly. With the current implementation, I was sort of thinking along the lines of how sqlalchemy works; having field attributes, which auto-generate the table attribute. What you suggest would be, I guess an inversion of this. The only issue I foresee is with static analysis and instance attribute auto completion, since then all these would be dynamically generated

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Dec 15, 2021

Ok, in be43aec I have removed the field_method decorator method, instead leaving only the single __qb_fields__ class attribute mechanism for specifying fields.
As @sphuber said, it's over-complex to have two mechanisms for specifying these fields.

I have also renamed __fields__ -> __qb_fields__ and OrmField -> QbField, to make it clear that these are fields that can be returned from the QueryBuilder.
Lastly, I have added the QbAttrField subclass, as a shorthand for specifying keys in the attributes field. This is what data subclasses use to denote data they store in the attributes.

@chrisjsewell chrisjsewell force-pushed the entity-fields branch 2 times, most recently from 48f157d to 9631a91 Compare February 17, 2022 21:12
@chrisjsewell
Copy link
Member Author

@sphuber this is all rebased and "simplified" (see #5088 (comment)) if you want to have time to have another look

Outstanding:

  • whether to uncomment "JSONB only" methods on QbField (or have some way to activate only relevant methods)
  • A sphinx directive for displaying all the available QueryBuilder fields, per node type, in the documentation

@edan-bainglass
Copy link
Member

edan-bainglass commented Dec 24, 2023

Reviewing this PR. Tests passing where it is currently. I'm now incrementally moving (rebasing) it towards v2.5.

I've made it as far as #6141. A few minor adjustments along the way. Will discuss these at a later time.

I'm now running into an issue with #6134. Specifically, the new test_iterall_persistence test in test_querybuilder.py passes but raises a logging error. See my comments at #6134.

@edan-bainglass
Copy link
Member

edan-bainglass commented Dec 30, 2023

Okay. Rebased on main (e1255ce3). All tests pass other than the #6134 issue that needs review. @sphuber @chrisjsewell ready for your review.

There has been a good deal of development merged since Chris has last given this attention (I presume). Though my rebasing covered some, I suspect I may have missed a thing or two (or more). One thing that comes to mind is if any core data types have been added that require __qb_fields__ attention.

A few other things that I don't quite understand:

  • __qb_fields__ in ProcessNode and CalcJobNode, but not WorkChainNode
  • QbField and QbAttrField, but not QbExtraField

@sphuber
Copy link
Contributor

sphuber commented Jan 3, 2024

Thanks @edan-bainglass . You are not yet up-to-date with main. Note that a few weeks ago (after the release) the aiida directory was moved to src/aiida. Maybe it is just that, that you haven't moved the changes in yet. If you need help fixing it, let me know

@edan-bainglass
Copy link
Member

edan-bainglass commented Jan 3, 2024

Huh 🤔 I was sure I included the move to src (between v2.5.0 and main). Maybe I didn't push? I'll check in a bit. Out replenishing stock 🛒

Back from 🛒

Oh right. I'm doing all of this on my own fork (of Chris's fork). There I am caught up to main. And there the issue occurs. However, per @sphuber (see comments on #6134), since the issue occurs during post-testing teardown, it can safely be ignored for now.

So, with that out of the way, what is the policy? Should I PR to Chris's fork, or just push my rebasing directly to his repo? Also, there are still comments on the PR that I am awaiting response. @chrisjsewell, if you can please address these, that'd be super 🙏

@sphuber
Copy link
Contributor

sphuber commented Jan 3, 2024

So, with that out of the way, what is the policy? Should I PR to Chris's fork, or just push my rebasing directly to his repo? Also, there are still comments on the PR that I am awaiting response. @chrisjsewell, if you can please address these, that'd be super 🙏

There is no real existing policy. Depends on what @chrisjsewell prefers. But if he is no longer interested or doesn't have the time, which I can understand, maybe we can continue with your fork. We close this PR and you open a new one and continue there. That might be easiest

@edan-bainglass
Copy link
Member

Sure. If @chrisjsewell has no objections, we can proceed as you suggest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants