Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix threading issue when accessing user instance in _get_codes of ComputationalResourceWidgets #543

Merged
merged 4 commits into from
Jan 22, 2024

Conversation

unkcpz
Copy link
Member

@unkcpz unkcpz commented Dec 14, 2023

fixes aiidalab/aiidalab-qe#582

The _get_codes method is used in the multi-threading case by QEapp, and causes the issue. Since we assume one AiiDAlab app instance is bound with the default user, there is no need to restrict it again in the code queries.
This fix will not change any actual behavior.

Interestingly, the fix also reveals an incorrect test which should fail but somehow not.

@unkcpz unkcpz marked this pull request as draft December 14, 2023 11:01
Copy link

codecov bot commented Dec 14, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1acd3ec) 87.12% compared to head (e446b58) 87.12%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #543   +/-   ##
=======================================
  Coverage   87.12%   87.12%           
=======================================
  Files          27       27           
  Lines        4643     4646    +3     
=======================================
+ Hits         4045     4048    +3     
  Misses        598      598           
Flag Coverage Δ
python-3.10 87.12% <100.00%> (+<0.01%) ⬆️
python-3.8 87.16% <100.00%> (+<0.01%) ⬆️
python-3.9 87.16% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@unkcpz unkcpz marked this pull request as ready for review December 14, 2023 11:16
@unkcpz unkcpz requested a review from superstar54 December 14, 2023 11:16
tests/test_computational_resources.py Outdated Show resolved Hide resolved
aiidalab_widgets_base/computational_resources.py Outdated Show resolved Hide resolved
aiidalab_widgets_base/computational_resources.py Outdated Show resolved Hide resolved
@unkcpz unkcpz force-pushed the fix/xx/threading-issue-for-user-query-in-code-setup branch from 6ced438 to 98c52a4 Compare December 14, 2023 12:51
@unkcpz unkcpz force-pushed the fix/xx/threading-issue-for-user-query-in-code-setup branch from 98c52a4 to ffe8117 Compare December 14, 2023 12:54
@unkcpz unkcpz requested a review from danielhollas December 14, 2023 12:57
Copy link
Contributor

@danielhollas danielhollas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very confused. How do your proposed changes fix the issue that you saw in the QeApp?? What is the root cause here?

@unkcpz
Copy link
Member Author

unkcpz commented Dec 15, 2023

This is the part of traceback that lead me to the change, if you remember the change we made from migrate aiida-core v1->v2, that the aiida nodes in the traits are replaced by uuid, the issue aiidalab/aiidalab-qe#582 has the same source of problem.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jovyan/apps/aiidalab-qe/src/aiidalab_qe/common/setup_codes.py", line 235, in _refresh_installed
    self.set_trait("installed", True)
  File "/opt/conda/lib/python3.9/site-packages/traitlets/traitlets.py", line 1742, in set_trait
    getattr(cls, name).set(self, value)
  File "/opt/conda/lib/python3.9/site-packages/traitlets/traitlets.py", line 721, in set
    obj._notify_trait(self.name, old_value, new_value)
  File "/opt/conda/lib/python3.9/site-packages/traitlets/traitlets.py", line 1505, in _notify_trait
    self.notify_change(
  File "/opt/conda/lib/python3.9/site-packages/ipywidgets/widgets/widget.py", line 694, in notify_change
    super(Widget, self).notify_change(change)
  File "/opt/conda/lib/python3.9/site-packages/traitlets/traitlets.py", line 1517, in notify_change
    return self._notify_observers(change)
  File "/opt/conda/lib/python3.9/site-packages/traitlets/traitlets.py", line 1564, in _notify_observers
    c(event)
  File "/home/jovyan/apps/aiidalab-qe/src/aiidalab_qe/app/submission/__init__.py", line 202, in _auto_select_code
    code_widget.refresh()

The self.set_trait("installed", True) is called from QESetupWidget which installs the QE code is run in another threading to not block the app. Inside the threading, it triggers the trait change because it will run a refresh_codes of computational resource widget after the codes are installed to update the list of dropdown menus.

So if AiiDA still not threading safe, we need to implement a rule for all the widgets development in AiiDAlab that the node should have the local scope not cross the functions to avoid the problem. If the changes I made here are correct, we now know two proper design principles to follow:

  1. Use UUID as trait when developing the new widget.
  2. The user node which seems have a wider scope need to query and use inside the function. (pinning @sphuber for comment, I think this might be a change can happened in aiida-core to provide an API to get the user by query in a function scope session instead like we originally have here which uses user = orm.User.collection.get_default())

@danielhollas danielhollas changed the title Without using user restriction in _get_codes of ComputationalResourceWidgets Fix threading issue when accessing user instance in _get_codes of ComputationalResourceWidgets Dec 15, 2023
@danielhollas
Copy link
Contributor

@sphuber would you mind taking a look if the changes here make sense? I am not super familiar in how aiida-core handles User instances.

@danielhollas danielhollas requested a review from sphuber January 12, 2024 16:55
@sphuber
Copy link

sphuber commented Jan 15, 2024

@sphuber would you mind taking a look if the changes here make sense? I am not super familiar in how aiida-core handles User instances.

I had a look but I don't understand what the changes to fetching the default user should do. First of, User.objects is deprecated, and User.collection should be used instead. So the PR changes to use the deprecated version. The current use is correct. To get the default user, User.collection.get_default() is correct. The only tricky part here is that this default is cached on the StorageBackend. So if the default user is changed, this needs to be reset through StorageBackend.reset_default_user. This is done automatically through Manager.set_default_user_email but not if you go directly through Profile.set_default_user_email.

If you want to manually query for a user given an email, I would recommend User.collection.get(email='email'). It is concise and closest in syntax to User.colelction.get_default()

@unkcpz
Copy link
Member Author

unkcpz commented Jan 15, 2024

@sphuber thanks! Yes, User.collection.get(email='email') looks better, I'll give it a try.

@unkcpz
Copy link
Member Author

unkcpz commented Jan 15, 2024

Tested, works good! I update the PR. @danielhollas

@danielhollas
Copy link
Contributor

@unkcpz just to double-check. This PR fixes the threading issue that you saw in QeApp and reported at aiidalab/aiidalab-qe#582

(tl;dr, you were seeing sqlalchemy.exc.InvalidRequestError: Instance '<DbUser at 0x7f7f1d3d96d0>' is not persistent within this Session)

The current use is correct. To get the default user, User.collection.get_default() is correct. The only tricky part here is that this default is cached on the StorageBackend.

Oh, the caching then explains our threading issues! That's a really unfortunate side-effect, but now I understand why the fix in this PR works. Even though we were calling get_default inside the function and thought it thus should be thread-safe, it was in fact returning the same object.

One can verify this in verdi shell:

user = User.collection.get_default()
user2 = User.collection.get_default()
id(user) == id(user2)
> True

whereas load_node behaves differently

node = load_node(117419)
node2 = load_node(117419)
id(u) == id(u2)
> False

@unkcpz can you check if we use User.collection.get_default anywhere else in AWB?

I also wonder if User.collection.get_default() could get smarter and use per-thread cache? @sphuber
Are there any other instances of caching like this that we should be aware of?

@sphuber
Copy link

sphuber commented Jan 18, 2024

I also wonder if User.collection.get_default() could get smarter and use per-thread cache? @sphuber
Are there any other instances of caching like this that we should be aware of?

AiiDA has never been thread-safe, so I am not sure if we should start making off the cuff fixes for just this. There are plenty of other singletons in the code, such as the loaded Profile, Config and Manager to name just a few.

@danielhollas
Copy link
Contributor

AiiDA has never been thread-safe, so I am not sure if we should start making off the cuff fixes for just this. There are plenty of other singletons in the code, such as the loaded Profile, Config and Manager to name just a few.

That's fair, although I'd argue that there is still a slight difference here. User.collection.get_default() return the user node orm instance (not sure about the terminology here). We now understand well that AiiDA nodes themselved are not threadsafe. However, the fact that some methods may return the same node when called twice is a gotcha we didn't know about, and I don't see a good way guarding against this. So I am more interested in cases like this, and less worried about Profile, Config, Manager thing (but I might be underestimating the difficulties there). I wonder if at least this behaviour should be documented in the docstring somehow?

In general, I think the issue we have in AWB is that we have a coupling between the code that handles the frontend events (i.e. user clicking a ipywidgets button), which generally come from different threads, and backend code that communicates with the DB. This coupling makes it hard to see where the footguns lie.

@danielhollas
Copy link
Contributor

@unkcpz can you check if we use User.collection.get_default anywhere else in AWB?

@unkcpz there are two other instances of get_default() that need to be fixed in get_computers and configure_computers.

@sphuber thanks! Yes, User.collection.get(email='email') looks better, I'll give it a try.

I am not sure this is the correct solution here, feels more like a hack. We really want the default user. What if there are two users with the same email (is that possible?).

@sphuber
Copy link

sphuber commented Jan 19, 2024

I wonder if at least this behaviour should be documented in the docstring somehow?

Sure, that's fair. It is at least documented in the StorageBackend.default_user docstring which it just forwards to:

class StorageBackend(abc.ABC):

    @property
    def default_user(self) -> Optional['User']:
        """Return the default user for the profile, if it has been created.

        This is cached, since it is a frequently used operation, for creating other entities.
        """

I am not sure this is the correct solution here, feels more like a hack. We really want the default user. What if there are two users with the same email (is that possible?).

That is what StorageBackend.default_user does though. The email is currently defined as unique on the database model. The User class is a bit of an exception in the ORM, as it is the only class to not have a UUID but it uses the email instead. See this issue aiidateam/aiida-core#6174 for a discussion on pro's and cons. Looks like we are not going to change this for now

@sphuber
Copy link

sphuber commented Jan 19, 2024

In general, I think the issue we have in AWB is that we have a coupling between the code that handles the frontend events (i.e. user clicking a ipywidgets button), which generally come from different threads, and backend code that communicates with the DB. This coupling makes it hard to see where the footguns lie.

I guess that is indeed because you are directly using the Python API. If there were a full-fledged web API (e.g. REST API) that exposed all necessary functionality, then it might be a better solution to use that instead. There is the REST API implemented in aiida-core, but it is too limited. The aiida-restapi has made headway in improving it, but the main feature lacking is still to replace/represent the ORM API. Incidentally, I just opened an AEP that would solve this problem (at least to a large extent). I am planning on implementing a web API that exposed most of the important parts of the Python API.

@danielhollas
Copy link
Contributor

That is what StorageBackend.default_user does though. The email is currently defined as unique on the database model. The User class is a bit of an exception in the ORM, as it is the only class to not have a UUID but it uses the email instead. See this issue aiidateam/aiida-core#6174 for a discussion on pro's and cons. Looks like we are not going to change this for now

Oh, perfect, that's good to know. In that case I am happy with this PR as along as the other uses are fixed as well.

but the main feature lacking is still to replace/represent the ORM API. Incidentally, I just aiidateam/AEP#40 that would solve this problem (at least to a large extent). I am planning on implementing a web API that exposed most of the important parts of the Python API.

That is very exciting, thanks!

@unkcpz
Copy link
Member Author

unkcpz commented Jan 19, 2024

Thanks for the discussion!

In general, I think the issue we have in AWB is that we have a coupling between the code that handles the frontend events (i.e. user clicking a ipywidgets button)

Yes, in the long term this is the way we should go. We did some test before with the workflow viewer and it is definitely doable. But it requires also some test on how AiiDAlab handle the restapi service etc. and effort of changing things from using aiida node directly to restapi call little by little for AWB and other apps.

Incidentally, I just aiidateam/AEP#40 that would solve this problem

Cool! Looking forward to it. In AiiDAlab, we can start test on moving to restapi in parallel.

@unkcpz
Copy link
Member Author

unkcpz commented Jan 19, 2024

@danielhollas the other two cases are also changed, thanks a lot for the careful review!

@unkcpz unkcpz requested a review from danielhollas January 19, 2024 22:49
@unkcpz unkcpz merged commit 9134305 into master Jan 22, 2024
15 checks passed
@unkcpz unkcpz deleted the fix/xx/threading-issue-for-user-query-in-code-setup branch January 22, 2024 14:05
@unkcpz unkcpz added this to the v2.1.0 milestone Jan 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

User node query not persistent in the Session because it happened in other thread
3 participants