-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NEW: Mechanism to lock profile access within AiiDA (v2) #5270
Conversation
From now on aiida will keep track of all processes that request access to the profile by saving their PIDs inside: $AIIDA_PATH/access/profile_name/tracked/<process_id>.pid Before returning control to the client, it will also check that there is no files of the form: $AIIDA_PATH/access/profile_name/locking/<process_id>.pid As this would indicate that the profile is being locked by a process that requires exclusive access for safety of its operations. For a process to request such access it will first have to check that there are no active processes using the profile, so it will look at all files in the `tracked` folder and compare those to the currently running processes in the system to check that none is actually active (it will also delete those outdated tracking files in the process). The design is as follows: - A ProcessData class was defined to store the information relevant to the processes. - It can be initialized either with a PID (of a process to be looked among those currently running in the system) or with a filepath (where the data of a previous process was stored, typically either in the `locking` or the `tracked` folders). If none of these are provided, it will load the info of the currently running process that is calling the execution. - The class also has a method to write the information to a given filepath (typically in the `locking` or the `tracked` folders). - An AccessManager class to control the access to a profile. - It can be initialized with a profile to which the class will control the access to (by default it loads the one currenly in use). - It has a couple of internal methods to distribute and modularize the different tasks, but the most important externally are: - profile_locking_context: a context that can be called to work with the profile locked. It will raise LockedProfileError if the profile is already locked or LockingProfileError if the profile is being accessed by other processes. - record_process_access: a method to record the accessing to the profile. It is now being called in `load_backend_environment` to make sure every process that loads the backend gets recorded. It will raise LockedProfileError if the profile is already locked.
@giovannipizzi I tried to give the best explanation I could in the description of the PR, you can leave your feedback here or let me know if you want to discuss via zoom. This is the base skeleton prototype: I tried to make it as robust and structurally organized as the final one would be, but it still needs things like all adding all the tests and improving on the error messages and that kind of polish that I can do after you tell me if this would be ok. |
Thanks @ramirezfranciscof for the prototype. I have given it a first read and since this is a draft, before doing a standard review going line by line, a more high-level discussion on interface and implementation would be better to start with. I realize that some of the code that you added is not yet being used, but is something that would potentially be used in the future. For example the whole code of acquiring a lock. However, I have the feeling that there is quite a lot of code and methods that are not really necessary and are making it a bit difficult to understand how the class works and should be used. Also I am not sure if the Taking your concept of the mechanism, I think the following should be the public API of the class: class ProfileAccessManager:
def __init__(self, profile):
"""Class that manages access and locks to the given profile.
:param profile: the profile whose access to manage.
"""
@contextlib.contextmanager
def lock(self):
"""Request a lock on the profile for exclusive access.
This context manager should be used if exclusive access to the profile is required. Access will be granted if
the profile is currently not in use, nor locked by another process. During the context, the profile will be
locked, which will be lifted automatically as soon as the context exits.
:raises LockingProfileError: if there are currently active processes using the profile.
:raises LockedProfileError: if there currently already is a lock on the profile.
"""
def clear_locks(self):
"""Clear all locks on this profile.
.. warning:: This should only be used if the profile is currently still incorrectly locked because the lock was
not automatically released after the ``lock`` contextmanager exited its scope.
"""
def request_access(self):
"""Request access to the profile.
:raises LockedProfileError: if the profile is locked.
""" All other methods should probably be protected, as they should not have to be called from the outside. Now I am not saying that you should take this exactly, but I wanted to use this in order to raise some points in comparison to your design:
Let me know what you think. I would try to simplify the code as much as possible and remove anything that is not strictly necessary. |
Hey @sphuber , thanks for the feedback.
The purpose of the Re-the internals:
Yes, that is a good idea, I agree the Although re the default: I am a bit baffled by this (and also, in a sense, of how you guys reacted at this in the last meeting). The access manager is not the sensitive or dangerous code, the worst it do just by itself is lock the access in a way it can solve and you might need to go and manually delete a file. The actual dangerous operation is the one performed by whatever process is trying to acquire a lock. For example, the maintenance, which can actually be called directly by users and loads a default backend. The locking is trying to add a protective layer to that.
I think we agree on this, that is basically what I meant in the commit message / OP:
I don't have
Originally I did it with two folders because the number of access records that could accumulate might be high (which in principle only need to be cleaned when you want to acquire a lock) and we may want to keep the checks on the lock (which need to be done every time you load the profile) super quick. I personally think it is cleaner that way, but if you have any concrete practical reason why it would actually be better to have all in the same folder I wouldn't be against it. However, I should also say that when discussing this with @giovannipizzi he made significant emphasis in keeping these separated, so he might have stronger reasons.
Mmm, this I don't think I agree with this. You are trying to minimize the chance of the race condition clashing, but the way I am doing it I am actually eliminating it (at least for the "dangerous" clash where both processes think they have a lock, see at the end). First, I wouldn't describe my approach as "checking twice", since actually the first check is not really necessary but I do it to prevent unnecessary operations. The only relevant check is the last one: if I (1) first create the profile locking file and then (2) check if it is still the only one before returning control, there is no way for two processes to (think they managed to) lock the profile at the same time. The first process to reach (2) will have already created its own locking file, so even if that one moves on to the The most "conflicting" scenario is when two processes both arrive at the after-check, both see each other's file, and then both delete the file and raise thinking another process acquire the lock while neither did. But this should be harmless: I'm basically exchanging the risk, however small, of two locking attempts to think they succeeded (which could lead to database corruption) for the risk of two locking attempts raising (which should be solved by running again or checking why you have two simultaneous processes competing so fiercely for locks). Is there something wrong in this reasoning? Am I missing any case? |
aiida/backends/managers/access.py
Outdated
for filename in os.listdir(basepath): | ||
if not filename.startswith(self._temp_prefix): | ||
filepath = basepath / filename | ||
process_data = ProcessData(filepath=filepath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the content is corrupt? The safest is to assume there is a lock anyway, I think? In this case ProcessData will raise and the exception needs to be dealt with here. Maybe it's better to assume that the PID is part of the filename and infer it from there anyway, and set to None
the rest of the information that would have been returned (and print "UNKNOWN" if we need to list e.g. the command or crime of that process).
aiida/backends/managers/access.py
Outdated
filepath = self._path_to_tracked / f'{pid}.pid' | ||
try: | ||
os.remove(filepath) | ||
except OSError as exc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check the error number to make sure it's a file not found; if e.g. you get a 'cannot delete/write protected' you should raise anyway? check errno
see here using values from the corresponding package
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or more simply catch FileNotFoundError
as you do before
aiida/backends/managers/access.py
Outdated
process_obj = psutil.Process(pid) | ||
process_cmd = process_obj.cmdline() | ||
process_ctm = time.localtime(process_obj.create_time()) | ||
process_ctm = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(process_obj.create_time())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use the ISO format? (there's a method to get it directly)
aiida/backends/managers/access.py
Outdated
self._pid = pid | ||
process_obj = psutil.Process(pid) | ||
process_cmd = process_obj.cmdline() | ||
process_ctm = time.localtime(process_obj.create_time()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is useless as it's replaced by the next one? Also, I would call it with a more descriptive name (process_ctime)
aiida/backends/managers/access.py
Outdated
|
||
def _read_from_file(self, filepath): | ||
"""Reads the info of the process from a file (filepath)""" | ||
with open(filepath, 'r') as json_file: # pylint: disable=unspecified-encoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deal with an appropriate exception to "corrupt" content
aiida/backends/managers/access.py
Outdated
with open(filepath, 'w') as json_file: # pylint: disable=unspecified-encoding | ||
json.dump(json_data, json_file) | ||
|
||
def _read_from_file(self, filepath): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just move this code inside the __init__
? You don't want that this is called after the __init__
, I think
aiida/backends/managers/access.py
Outdated
self._data = json_data['data'] | ||
|
||
def __eq__(self, other): | ||
"""Define equivalency""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More extensive comment (also to clarify Sebastiaan's comment)
We'll also need some stress testing where we call the locking profile multiple times, we open many processes concurrently and at some point lock, etc. Finally, regarding the folder: definitely the lock file(s) it should be in a different folder than the tracked ones. [EDIT/ADD]: also, the idea of using both |
aiida/backends/managers/access.py
Outdated
process_datadicts = {} | ||
|
||
for filename in os.listdir(basepath): | ||
if not filename.startswith(self._temp_prefix): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually also check that the file ends with .pid
(this should also be a class variable, reused everywhere and not hardcoded). This avoids that if there are other files these create problems (e.g. files ending in ~
if someone edited with and editor, files like .gitignore
for some reason put there, ...)
Why does it absolutely have to be in different folders? Why can't you have
Adding the command doesn't solve the problem though, you could start the same process with the same PID if the OS reassigns it. Sure, you might reduce the chances a bit, but how often do we have PID turn around anyway? Do other tools include additional information like this when working with PIDs? To me this seems like using SHA256 but also preemptively trying to protect against clashes. It is clear that it is a possibility, but you are not supposed to try and guard against it because it is super unlikely. Also regarding the |
I see; this would also work. However currently you have to loop over all files in that folder, that can be potentially many if users have been running hundreds of
Indeed, I thought to this. But in that (very rare) case, the "same process" would be a verdi process! So it's actually OK that it's still considered to be there for our locking purposes.
I would say this is quite common. You start AiiDA as one of the first things after a reboot, you will end up using a low-value integer. You reboot the computer, you start some other software, those will get the same low-value integers so it's not so rare. If these are daemons or long-running processes, the user will have issues using AiiDA unless they do some non-obvious operations (remove all PID files); letting users do it might actually be dangerous because they will probably delete all of them, i.e. they might delete also those of Note: this is even more common than usual if e.g. AiiDA inside docker (e.g. for AiiDAlab) since processes always start back from 1 when you restart the container; there is a high probability that they exchange the order and overlap (even just running The implementation takes care of all of this in a way that I see as robust (i.e., I still don't see a usecase in which you think this might either fail its intended purpose, i.e. allow a locking process to run even if there is another [correction: I see an edge case: a user starts a
This is a good idea. As I commented, we should really limit the content of the file, because otherwise we need to start thinking to all cases of invalid content and how to deal with them. @ramirezfranciscof what do you think? The content of the PID file could simply be the string of the |
Note: probably @ramirezfranciscof didn't use the file ctime because in his original implementation, this was part of the comparison to check if the process was the same therefore it was crucial that this was the ctime of the process, not of the creation of the file that can happen milliseconds or even seconds later if the computer is slow. So it had to be the time from |
It introduces the class `ProfileAccessManager`, used to keep track and control what system processes access the aiida profiles. It has the following public methods: - `request_access`: to be called every time the profile is loaded (for example, inside of `load_backend_environment` in the module `aiida.backends.manager:BackendManager`). A file will be created with the process ID as its name so as to keep track of who is accessing (or has accessed) the profile. - `lock`: this context manager makes sure the process calling it is the only one accessing the profile. It does so by checking that there are not tracking files that correspond to currently running processes and by creating a locking file that will prevent other processes from accessing the profile. - `is_active` / `is_locked`: for clients to easily check if a profile is being used or locked by running processes (respectively). I would not recommend to use these as guards before calling `request_access` or `lock` since running conditions are still possible, it is better to use `try ... except ...` instead. - `clear_locks`: this removes any current lock by force-deleting the locking file. This will not stop any process that was locking the profile if it is still running: that process will still "think" it has exclusive access to that profile, which could result in data corruption. This method should be used with extreme care. Co-authored-by: Sebastiaan Huber <[email protected]>
Codecov Report
@@ Coverage Diff @@
## develop #5270 +/- ##
===========================================
+ Coverage 82.07% 82.11% +0.04%
===========================================
Files 533 534 +1
Lines 38307 38398 +91
===========================================
+ Hits 31436 31526 +90
- Misses 6871 6872 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
1118415
to
3c7f70f
Compare
@ramirezfranciscof Did you consider to delegate some of the file-locking related implementation to this or any other library? I have had good experiences with https://pypi.org/project/filelock/. |
Ha yep, this is the package I pointed to on slack (in the reviewing channel) |
3c7f70f
to
3606f9e
Compare
@csadorf I checked the https://pypi.org/project/filelock/ but I don't think it works for our purposes. Long story short: it is critical here that we are tracking both access as well as locking. I have both an "access track" and a "lock" procedures, and the "lock" needs to check not only for the absence of other locks, but also has to make sure there is no ongoing access. Moreover the tracking of access has the added difficulty that it there is no common point of "closing" (we can't tell when a given process stopped using a profile). Example: If I open a verdi shell it should not lock the profile, I should be able to open a 2nd one. But I do need to keep track that I opened the first verdi shell: if I try to then lock the profile with the second shell, it has to say "I can't lock because there is another shell open". As far as I could see, there was no support for something like this in that library. |
@ramirezfranciscof I might misunderstand this, but I think the example you provide could be realized with a combination of hard and soft locks where the soft lock checks for the existence for the lock file, but does not require exclusive access. Either way, I am not saying that you should definitely use the library (I also have not reviewed the PR in detail (yet)), I just want to make sure that it was adequately considered. |
c48b9b6
to
3c73bec
Compare
I can't see how mixing types of lock would help in this case. If the first process acquires any kind of lock, then the second one won't be able to do so until the first one releases. This is the same for both hard and soft locks as far as I could see from my tests of the library. If you want, you can ping me on slack and I'm happy to arrange a zoom meeting so I can go over the parameters of the problem and we can brainstorm more concrete possible implementations. But from what I saw of this library and the tests I made, I can't think of any way to use |
@csadorf and @chrisjsewell could you please explain how you think the implementation can be simplified using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sphuber I believe I made it abundantly clear that my comments are not to be regarded as blocking to this PR, I just wanted to make the authors and reviewers of this PR aware of an existing solution to this problem to ensure that "re-inventing the wheel is avoided" and existing libraries are adequately considered. I was never formally requested for review for this draft PR, but have reviewed it anyways now.
I think it would be helpful to adopt the term "non-exclusive" access where appropriate. I think @ramirezfranciscof 's explanation would have been easier for me to follow using this terminology. Unfortunately, py-filelock does not currently support non-exclusive/shared locks, which would be required to implement this logic with the library.
To move forward I would suggest to:
- Adopt the terminology "exclusive" and "non-exclusive" access.
- Ensure that the code introduced with this PR is covered by unit tests.
- Revisit the specific implementation of acquiring the lock file via the filelock library.
raise exc | ||
|
||
@contextlib.contextmanager | ||
def lock(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation of the locking logic here is the only concern I have with this PR. A platform independent, well-tested in practice, and overall well-maintained implementation of this logic can be found in: https://github.com/tox-dev/py-filelock/tree/main/src/filelock . The only thing the library does not do is unlinking the lock file after releasing it, but I would argue that we can either adapt to that behavior or simply wrap the context manager and perform that step ourselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that at this point I think we would be introducing a new library not only at normal cost (i.e. dependency maintenance), but also with the extra of having to coordinate it awkwardly for our use case (which adds another point of possible break), and with little to nothing to gain:
- Our
lock
method still has to check the access files which is basically 80% of its logic, so not a lot of simplification there. The internal call to theirlock
would only be replacing the file creation basically (which should not have a lot of platform dependency issues). - The typical benefit of "outsourcing" the effort of having to deal with maintaining the implementation is reduced given that we anyways would need to maintain the other half of tracking "non-exclusive" access.
- With respect to the difference in behavior (our
lock
currently raises if you try to lock a profile that isalready locked, their locking mechanism has the process waiting for the other lock to release) I think I actually prefer ours. The situation where this is relevant is when a racing condition triggers and we have two processes trying to acquire the lock at almost the same time. I think I personally prefer our resolution to this situation.
I don't know, maybe I'm underestimating the possible complications of locking with files, but I don't feel this is a good tradeoff. @sphuber you are quite familiar with the code by now, do you feel like we could gain something from trying to fit in the py-filelock
in this design?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see both points. Normally, for such an important thing, I would also be tempted to use the external package, but I think you expressed the counter-arguments pretty well. We could indeed use it for our own lock
implementation, but for the request_access
we would still have our own implementation.
There would still be a robustness advantage in replacing our own locking implementation with the filelock
package in the lock
method. If we simply use their FileLock
, we don't have to worry about racing conditions and robustness. As simple as it may seem, it is often quite tricky to be sure there are no subtle issues. The problem is that we cannot use the process ID in the filename, because the lock filename should be identical for all processes. But we can also not write the process ID in the lockfile, which the docs of filelock
clearly state not to do. If we want to keep the functionality to be able to tell which process already has a lock when another tries to acquire it, we would have to start writing that to a separate file. But that has problems in and of its own.
TLDR: I think it would be preferable to use the package, but I think this is not possible due to our requirements of having non-exclusive locks and wanting to notify a user which processes are accessing it or locking it if a lock is denied. Only if we are willing to give up this user functionality, can we simplify the implementation and use the package. So in the end, this is the decision we have to take I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest move forward with the current implementation, ensure that is extremely well tested (that's one of those things that should have 100% coverage IMO), and then maybe we can spend 1-2hrs trying to refactor it with the libraries that I have mentioned. If it is too complicated, then we just stick with the current approach until we run into problems.
) | ||
self._raise_if_active(error_message) | ||
|
||
filepath = self._dirpath_records / f'{self.process.pid}.lock' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the exclusive lock file to be PID specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we record the PID
we can inform the user which process is locking when raising errors about not being able to access a profile and we can also perform some checks ourselves (i.e. decide that we will automatically check if the process is running and unlock if it is not). Putting it in the name was the easiest way.
Moreover, having separate locking files gives an easy way to prevent racing conditions: if by the end of the process of acquiring the lock there is more than one lock file, something "went wrong" (more than one process were simultaneously trying to acquire the lock) and we remove at least the one that correspond to the process checking. This makes it impossible for two processes to each think they acquired a lock for exclusive access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow this argument, isn't the whole point of using a single exclusive lock file to prevent race conditions? If it possible that two processes can obtain the same lock at the same time, then the file system does not support locking. If you need to have the PID, then just obtain the lock and then write the PID file into a distinct PID file.
To my understanding, the file locking procedure should be an atomic operation, whereas checking for files, creating a file, and then checking for file existence again is not.
There is also https://pypi.org/project/pid/ which might support that exact use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moreover, having separate locking files gives an easy way to prevent racing conditions:
@csadorf is right here. The whole point of a lock file is to prevent race conditions, so the filelock
library would naturally protect against this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit lost in this thread now. I don't know if maybe a "lock file" is a more technical term with very specific details of how it is implemented; I'm using files to track access to the AiiDA profile and provide a context in which I guarantee no other process was accessing it.
My design indeed can't get completely rid of the racing condition, but it exchanges the situation of two competing processes both thinking they acquired the lock for a situation in which they both think the other one beat them to it and except. I think this should be good enough for our use case since it effectively prevents any corruption of the data and the user can just try again (I can't imagine a case where there are multiple processes trying to lock the profile by design). Using filelock
would solve this but I still think this does not outweigh the costs of including an extra dependency and extra complexity for coordinating it with our other .pid
tracking files (and having to keep track separately of the id of the locking process).
There is also https://pypi.org/project/pid/ which might support that exact use case.
Mmm maybe? The "stale detection" line sounds interesting, but honestly I can't even understand what exactly does this do from the documentation they have (which seems to be reduced to that small readme).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, I am also not trying to antagonize here. I am just trying to help by trying to see if the suggestion of @chrisjsewell and @csadorf could work. The best way to do this and to show that we are taking it seriously, is to actually try it and propose an alternate implementation. Making it concrete is the best way to visualize any problems and could help us finalize the decision with the design. Otherwise, if we stick with just discussions we never move forward. But I think we have done enough now and I think we should just stick with this design then. Unless the others really think that having the additional information of which PIDs are blocking is not worth the additional complexity of the design and forcing a custom implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have already suggested to move forward with the current implementation with extremely high test coverage and then see if we can recover the exact same behavior using the available libraries. I would personally be extremely surprised if that was not possible.
I am a bit disappointed that my suggestion to adopt more concise semantics was ignored. @ramirezfranciscof You are using the term lock file, but are not technically locking anything and also state that you are merely tracking "access". At the same time, these access files are used to simulate exclusive and non-exclusive locking behavior.
What I am seeing here is the need for
- Locking the profile exclusively for specific operations (e.g. migrations),
- locking the profile non-exclusively to to prevent exclusive locking (standard access),
- tracking which processes are obtaining any of these locks.
All of these use cases could certainly be implemented with aforementioned libraries. It is just open whether the implementation would be more or less complex. What appears obvious to me is that at least the exclusive locking would be more robust.
Please move forward with this PR any way you see fit. I believe that especially when it comes to sensitive operations such as file locking, well-tested and platform independent libraries should be considered which is why I suggested the specific library that I have had good experience with. I am starting to get somewhat frustrated needing to defend that suggestion over and over again. In the end it does not matter to me whether we use the library or not, what I care about is that things work and work well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these use cases could certainly be implemented with aforementioned libraries.
This is what you keep saying, but then when we describe potential reasons why this is not possible, we don't get a specific response whether you agree with that analysis. I tried to fix this by making it more concrete, going out of my way to provide a potential implementation on a PR that is not mine, but still we get a repetition of platitudes that it should be possible.
I am starting to get somewhat frustrated needing to defend that suggestion over and over again.
Well that makes two of us. I am not asking you to defend it, I am asking you to look at our analysis of trying to implement your suggestion and the limitations that we think are there, and either provide a clear solution to those limitations or accept that they are there and so sign off on the design. Again, I am doing this in order to not discard your suggestion without proper consideration, out of respect to you and the time you took to review the concept of this PR, not to cause you grief. Just reiterating that "you think it should be possible with the library" is not helping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these use cases could certainly be implemented with aforementioned libraries.
This is what you keep saying, but then when we describe potential reasons why this is not possible, we don't get a specific response whether you agree with that analysis. I tried to fix this by making it more concrete, going out of my way to provide a potential implementation on a PR that is not mine, but still we get a repetition of platitudes that it should be possible.
I do not understand what you are expecting from me here. Do I have to implement it myself? I think I provided a clear analysis on what I interpret we are trying to implement here and made suggestion on how to improve the language and abstraction as well as provided guidance as to how I would implement it. When I inquired about specific obstacles the presented arguments were not convincing.
I am starting to get somewhat frustrated needing to defend that suggestion over and over again.
Well that makes two of us. I am not asking you to defend it, I am asking you to look at our analysis of trying to implement your suggestion and the limitations that we think are there, and either provide a clear solution to those limitations or accept that they are there and so sign off on the design. Again, I am doing this in order to not discard your suggestion without proper consideration, out of respect to you and the time you took to review the concept of this PR, not to cause you grief. Just reiterating that "you think it should be possible with the library" is not helping.
Again, not quite sure what is expected here despite providing an actual implementation. You have already received my approval to go ahead multiple times.
I apologize for not providing a constructive review and will withdraw from the discussion. I am very sorry that I wasted everybody's time with my suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apologize for not providing a constructive review and will withdraw from the discussion. I am very sorry that I wasted everybody's time with my suggestion.
I am sorry if anything I said gave the impression that this was my stance, it is not at all. I really appreciate you taking the time to give feedback on this, and I truly think it was very valuable even if in the end I might opt or push to not apply all of it. I assure I did not just dismissed it: I really tried looking into the filelock
library and playing around with it, but couldn't figure out a non-forceful way to leverage it for our specific conditions. I wanted to respond to all your points not as an attack to them, nor expecting you to defend the suggestions, but because they were perfectly valid points and thus merited an explanation of why, for this particular use case, I might prefer to decide against them.
I also tried my best to start using the proposed terminology throughout the discussion, is just that it is a bit hard for me to immediately understand how exactly the terms are intended and what is the extent of their influence (e.g. is "exclusive access" a qualifier when talking about locking? does it completely replace "locking"? can I still talk about lock files or how do I refer to the files I use for tracking this now?). Again, this is not to say it was on you to clarify all this right away, I'm just trying to explain why I may not have seem to fully adopt these semantics right away.
Lastly I hope my comments on the pid
library's documentation didn't contribute to this. It was meant to express my frustration at the library itself, not at your recommendation of it which was very pertinent. Apologies if it seemed that way.
I think @giovannipizzi may be right, this has already eroded all of us a bit and it might be better to just go ahead and merge it. If at some point someone wishes to revisit incorporating any of these libraries with some fresh eyes or improving the feature in any other way, I would be more than happy to help.
Running this locally against a test profile, I get two failing tests:
I made sure that before running the tests there are no residual pid or lock files in the access directory of the test profile. The errors seem reproducible (ran them several times). After each run of the tests, the access directory contains a pid file. Not sure by which test this is created and not cleaned. |
Figured out the problem. Your tests are relying on the fact that the test profile is also marked as the default profile. Because when they launch a new external process, they do not explicitly specify the profile but rely on the default. This is fragile and the tests should explicitly specify the profile (being the same as the test profile) when launching an external process. |
Hi all, just to avoid this goes into an infinite loop and everybody is disappointed:
I'm suggesting this because I now think that in the balance of human energy invested in this, and the risk of a bug and its consequences, now it's time to merge this (otherwise we just spend more time, and people get more and more frustrated - I think we still have other things we also need to focus on before the release). And also because my analysis of the code (at least in my first review, I think now the code changed quite a lot) convinced me that the behaviour was correct in all edge cases I could think about. Again, thanks again to everybody! |
Hey, thanks for the testing and finding the problem! I could reproduce locally and I think the last commit fixed it, let me know. Also, sidenote, after merging the develop branch again, I'm having problems when
The PR in question mentions that:
Does this mean then I can no longer use editable mode until I update to pip v21 or is there another syntax to do it? |
I would certainly suggest updating pip (should be as simple as |
One thing that just crossed my mind. Does anybody see a problem if AiiDA is used from two machines sharing the same filesystem? (I guess there might be many other problems we are not aware of in this case, e.g. with detecting if the daemon is working, but just pointing out the issue - if also detecting if the daemon is running has the same problem, then I would say we shouldn't worry in this PR). E.g. if one installs AiiDA on a login node of a supercomputer, that connects you to one of many possible login nodes. Then the PID written to file could be not existing on the machine you connect (because it's in a different machine). Again, this might be problematic also when checking if the daemon is running, so probably this is not a usecase we support? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ramirezfranciscof . I have to say that I haven't had the chance to look into depth in the tests yet, but given the current situation, let's just get this thing merged. I just noticed some minor mistakes in the docstrings and a suggestion for the naming of the test class to avoid annoying warnings. Finally, I think the new module should simply go in aiida.manage.configuration
since aiida.backends.managers
might not be long for this world.
########################################################################### | ||
|
||
|
||
class TestProcess(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest renaming this, since with the current name pytest
think it is an actual class of tests and will emit a warning
class TestProcess(): | |
class MockProcess(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't notice this warning. Thanks! Is changed now.
raise LockedProfileError(error_msg) | ||
|
||
def _raise_if_active(self, message_start): | ||
"""This method will raise the exception given in `ExceptionClass` if the profile is being accessed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""This method will raise the exception given in `ExceptionClass` if the profile is being accessed. | |
"""Raise a ``LockingProfileError`` if the profile is being accessed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
return list_of_files | ||
|
||
def _raise_if_locked(self, message_start): | ||
"""This method will raise the exception given in `ExceptionClass` if the profile is locked. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""This method will raise the exception given in `ExceptionClass` if the profile is locked. | |
"""Raise a ``LockedProfileError`` if the profile is locked. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -0,0 +1,222 @@ | |||
# -*- coding: utf-8 -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this file also not just belong in aiida.manage.configuration
? It is there where config and profile loading/access is organized and where this is also being called. I think @chrisjsewell was getting rid of this entire module soon anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to aiida.manage.profile_access
, it seemed more fitting that configuration
, but let me know if you disagree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
Random note that, as it does pertain to this line, but https://www.python.org/dev/peps/pep-3120/ makes the default encoding for python source UTF-8, so these # -*- coding: utf-8 -*-
lines are kinda a legacy thing
Yeah, indeed it does not currently support this use case, although I don't see how it would be possible to account for such a distributed system without migrating the profile loading to a context manager. I mean, it will probably be highly disruptive but necessary if we want to have good control over the access to the AiiDA instances. Maybe we can already start adding the features without making them required just to see how it feels to use them. Something like this: with loaded_profile('profile_name') as aiida_profile:
qb = aiida_profile.get_querybuilder()
dict_node = orm.Dict({'example num': 1})
dict_node.store(aiida_profile)
same_node = aiida_profile.load_node(dict_node.pk) |
@ramirezfranciscof this is literally what the whole of #5172 and #5145 (and all the other PRs I've been doing) is trying to achieve 😉 i.e. you can already now do: In [1]: from aiida.tools.archive.abstract import get_format
In [2]: archive_format = get_format()
In [3]: with archive_format.open("2d-export-new.aiida", "r") as reader:
...: qb = reader.querybuilder()
...: node = reader.get(Dict, pk=10)
...: print(qb.append(ProcessNode, tag="tag").append(Code, with_outgoing="tag").distinct().count())
...:
10817 |
@chrisjsewell yes, there seems to be a lot going on in those refactorings 😅 , but I guess essentially this is what we are talking about:
Looking forward to that 👍🏽 |
yep getting there cheers 😅 |
Requests addressed
From now on aiida will keep track of all processes that request access
to the profile by saving their PIDs inside:
$AIIDA_PATH/access/profile_name/tracked/<process_id>.pid
Before returning control to the client, it will also check that there is
no files of the form:
$AIIDA_PATH/access/profile_name/locking/<process_id>.pid
As this would indicate that the profile is being locked by a process that
requires exclusive access for safety of its operations.
For a process to request such access it will first have to check that there
are no active processes using the profile, so it will look at all files in
the
tracked
folder and compare those to the currently running processesin the system to check that none is actually active (it will also delete
those outdated tracking files in the process).
The design is as follows:
A ProcessData class was defined to store the information relevant to the
processes.
It can be initialized either with a PID (of a process to be looked
among those currently running in the system) or with a filepath
(where the data of a previous process was stored, typically either
in the
locking
or thetracked
folders). If none of these areprovided, it will load the info of the currently running process
that is calling the execution.
The class also has a method to write the information to a given
filepath (typically in the
locking
or thetracked
folders).An AccessManager class to control the access to a profile.
It can be initialized with a profile to which the class will control
the access to (by default it loads the one currenly in use).
It has a couple of internal methods to distribute and modularize the
different tasks, but the most important externally are:
profile_locking_context: a context that can be called to work
with the profile locked. It will raise LockedProfileError if the
profile is already locked or LockingProfileError if the profile
is being accessed by other processes.
record_process_access: a method to record the accessing to the
profile. It is now being called in
load_backend_environment
to make sure every process that loads the backend gets recorded.
It will raise LockedProfileError if the profile is already
locked.