
🐛 FIX: Ensure Container DB always closed after access #5123

Merged: 4 commits from fix-repo-access into develop on Jan 20, 2022

Conversation

@chrisjsewell (Member) commented Sep 9, 2021

Possibly fixes #4899

This is the "super cautious" approach, whereby we ensure the sqlite database is closed after every access. Obviously this may not be performant if you want to perform multiple operations.

Alternatively, we could add __enter__ / __exit__ methods to the AbstractRepositoryBackend / Repository itself, and then move the responsibility of using a context manager to its users, possibly changing Profile.get_repository to a context manager Profile.with_repository.
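
For illustration, the context-manager alternative could look roughly like this (just a sketch, using the disk-objectstore Container directly; the helper name and the path are made up):

    from contextlib import contextmanager

    from disk_objectstore import Container

    @contextmanager
    def open_container(path):
        """Sketch only: callers take responsibility for closing the SQLite session."""
        container = Container(path)
        try:
            yield container
        finally:
            container.close()  # closes the cached sqlalchemy session

    # several reads then share a single open session, e.g.:
    # with open_container('/path/to/container') as container:  # illustrative path
    #     for key in hashkeys:
    #         container.get_object_content(key)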

@chrisjsewell (Member, Author) commented Sep 9, 2021

This certainly removes the warning for open files in test_add_multiply.py

Still not entirely sure how this relates to threads though, but note that the thread check can be turned on/off at the sqlite level:
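
For reference, this is roughly what turning the check off looks like at the sqlalchemy/sqlite level (a sketch; the database path is illustrative, and I'm not saying disk-objectstore should necessarily do this):

    from sqlalchemy import create_engine

    # `check_same_thread` is passed straight through to sqlite3.connect();
    # setting it to False disables the "SQLite objects created in a thread can
    # only be used in that same thread" check.
    engine = create_engine(
        'sqlite:///packs.idx',  # illustrative path to the container's index DB
        connect_args={'check_same_thread': False},
    )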

@chrisjsewell (Member, Author) commented:

Another consideration here is that in Container.close we close the sqlalchemy session, but also throw it away:

    def close(self) -> None:
        """Close open files (in particular, the connection to the SQLite DB)."""
        if self._session is not None:
            self._session.close()
            self._session = None

In the code here, we may just want to close the session, but not throw it away (which then requires having to create a whole new engine).
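
Roughly what I have in mind, as a standalone sketch (not the actual disk-objectstore code; the attribute names are made up):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    class _ContainerSketch:
        """Sketch: cache the engine separately from the session, so closing the session is cheap."""

        def __init__(self, db_path: str):
            self._db_path = db_path
            self._engine = None
            self._session = None

        def _get_session(self):
            """Create the engine only once; recreate only the (cheap) session on demand."""
            if self._engine is None:
                self._engine = create_engine(f'sqlite:///{self._db_path}')
            if self._session is None:
                self._session = sessionmaker(bind=self._engine)()
            return self._session

        def close(self) -> None:
            """Close the connection to the SQLite DB, but keep the engine for reuse."""
            if self._session is not None:
                self._session.close()
                self._session = None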

@codecov bot commented Sep 9, 2021

Codecov Report

Merging #5123 (7e0cb19) into develop (e880ab7) will increase coverage by 0.01%.
The diff coverage is 98.53%.


@@             Coverage Diff             @@
##           develop    #5123      +/-   ##
===========================================
+ Coverage    82.06%   82.06%   +0.01%     
===========================================
  Files          533      533              
  Lines        38296    38307      +11     
===========================================
+ Hits         31424    31434      +10     
- Misses        6872     6873       +1     
Flag         Coverage Δ
django       77.12% <98.53%> (+0.04%) ⬆️
sqlalchemy   76.42% <98.53%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                  Coverage Δ
aiida/repository/backend/disk_object_store.py   95.73% <98.53%> (+0.45%) ⬆️
aiida/transports/plugins/local.py               81.46% <0.00%> (-0.25%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sphuber (Contributor) commented Sep 9, 2021

Trying to understand how this is fixing the problem that you describe. Firstly, I am not sure where in the disk-objectstore you close any files. In this commit you seem to add the context manager methods but they don't seem to do anything. Where are the files being closed?

@chrisjsewell (Member, Author) commented Sep 9, 2021

Where are the files being closed?

The file is the sqlite database; when you close the session you close the file (or at least you allow it to be closed).

@chrisjsewell (Member, Author) commented:

In this commit you seem to add the context manager methods but they don't seem to do anything

That's the wrong context manager you are looking at; I'm talking about the one on the Container class (that closes the session)

@chrisjsewell (Member, Author) commented:

Basically, I believe it is not ideal that we never explicitly close connections to the sqlite DB.
With the Postgres DB, I guess we also do not explicitly close these connections; you open one when you load the profile, then it dies when the python process dies. But since Postgres is a service, it can actively handle dropped connections and clean them up on the server side.
In contrast, the sqlite DB is a static file, so when you "drop" a connection you leave a file in the open state, similar to doing file.open() but never calling file.close().

@chrisjsewell (Member, Author) commented Sep 9, 2021

My feeling in general is that it would probably be "optimal" to abstract the database connection from the Container, to allow something like:

container = Container("path/to/container", database=database)

and just have the container's DB table stored within the Postgres DB. After all, the container is basically useless anyway without the Postgres DB to tell you the (one-to-many) relationship between the hashkeys and the node files.

@sphuber (Contributor) commented Sep 9, 2021

Did a very quick and dirty test of performance between develop and this branch. Just writing some content and reading it again:

#!/usr/bin/env python
import io
from aiida.orm import Data

count = 100
node = Data().store()  # Need to store it to actually use the disk object store, otherwise the sandbox is used

def write_files(count):
    for i in range(count):
        filename = f'file{i}.txt'
        node._repository.put_object_from_filelike(io.BytesIO(f'some\ncontent{i}'.encode('utf-8')), filename)

def read_files(count):
    for i in range(count):
        filename = f'file{i}.txt'
        with node.open(filename) as handle:
            handle.read()


%timeit write_files(100)
%timeit read_files(100)

develop

In [5]: %timeit write_files(100)
187 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit read_files(100)
48.9 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This PR

In [13]: %timeit write_files(100)
188 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit read_files(100)
270 ms ± 5.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Seems writing is not affected that much but reading is slowed down by a factor of 5. This should just be a constant per-read overhead though, so for bigger files the increase will not be as noticeable.

@chrisjsewell (Member, Author) commented:

Seems writing is not affected that much but reading is slowed down by a factor of 5. This should just be a constant per-read overhead though, so for bigger files the increase will not be as noticeable.

Yep, I believe writing should not be affected at all, because you are just writing the "loose" file to the container folder (and not touching the database).
But then reading is where you need to check the database for packed files.

@sphuber (Contributor) commented Sep 9, 2021

Yep, I believe writing should not be affected at all, because you are just writing the "loose" file to the container folder (and not touching the database).

It would affect the places where we write directly to pack files though, which would require a database connection. Although currently we only do this in the export command I believe. There we can of course make sure that we perform all operations in a single context manager, so it wouldn't be too problematic.
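
Conceptually something like this (a sketch; `add_objects_to_pack` stands for the kind of bulk method I mean, and the path is illustrative):

    from disk_objectstore import Container

    container = Container('/path/to/container')  # illustrative path
    try:
        # a single SQLite session is opened (lazily) and reused for the whole bulk write
        hashkeys = container.add_objects_to_pack([b'content of file one', b'content of file two'])
    finally:
        container.close()  # one close at the end, instead of one per object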

@chrisjsewell (Member, Author) commented:

One thing to quickly note here: when reading a file, currently a database session is always started, even if the hashkey ends up being related to a "loose" file, since the database is checked first before the loose folder.
There may well be a reason for this ordering (I can't remember offhand), but obviously looking in the loose folder first would minimize db opens

@giovannipizzi (Member) commented:

Thanks @chrisjsewell for looking into this and @sphuber for testing.

Obviously this may not be performant if you want to perform multiple operations

Indeed, that is my main concern (as you both stated I think, the issue is mostly for a lot of small files).
The thing that also puzzles me is, indeed, how this fixes the problem (I mean: I understand why it manages to fix it, but I don't understand why the problem is there in the first place and why it relates to threads?)

Related to this PR, I guess that, if the speed is sub-optimal but still acceptable and the behaviour is more robust, it's OK to merge for now and, at the same time, open a new issue about checking the performance of this and how to improve it (probably looking into bulk methods). At least we can move on with the release, and then improve performance later, or add some "bulk" operations where everything is done within the context manager. But I agree that it's sub-optimal to reopen the DB every time; this has a cost, as shown by Sebastiaan (reading becomes even slower than writing).

BTW - can you @sphuber (or @chrisjsewell) run the same test, but pack the loose objects before reading? I think the performance we really care about is for packed objects, as we tell people to pack anyway and there's no way back (and if the performance is OK with packed objects, then we're OK - they can just pack to improve performance). Even though I guess the slowdown will be identical, since the SQLite file needs to be re-opened every time.
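
For the record, packing before re-running the read benchmark should be as simple as something like this (a sketch; `pack_all_loose` is the disk-objectstore method I have in mind, and the container path is illustrative):

    from disk_objectstore import Container

    container = Container('/path/to/profile/repository/container')  # illustrative path
    try:
        container.pack_all_loose()  # move all loose objects into pack files (indexed in the SQLite DB)
    finally:
        container.close()

    # then re-run read_files(100) from the benchmark above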

A few additional comments just to answer to some points:

In the code here, we may just want to close the session, but not throw it away (which then requires having to create a whole new engine).

I see - one of course needs to slightly change the code for caching the session. @chrisjsewell if you think this helps, you can try doing this at the disk-objectstore level and I think that, if all tests continue to pass (and performance improves), then we're safe (just open an issue there if you don't want to do it now but still think it can help).

Basically, I believe it is not ideal that we are not specifically closing connections with the sqlite DB ever.

I see the general reasoning, but is this really a problem? I'm not aware of explicit suggestions to close the connection (file) to SQLite (but I didn't check, so maybe I'm wrong). But maybe this interacts with some other things that SQLA does (connection pooling etc.?).
Indeed, my main reasoning in disk-objectstore was to keep it open for performance reasons. BTW, when reading, there are conditions under which the connection is closed and then reopened, to be sure that an object wasn't packed in the meantime: see this (and the respective comment):
https://github.com/aiidateam/disk-objectstore/blob/develop/disk_objectstore/container.py#L722-L731
(but also a few more places, like this https://github.com/aiidateam/disk-objectstore/blob/7a09ea2a953f0b0dfa79a6688306c51a501f874b/disk_objectstore/container.py#L1956 and I think a couple more).

@chrisjsewell just to double check - is this logic going to interact correctly with your new context manager?

just have the container's DB table stored within the Postgres DB

I see, and you're probably right for AiiDA. My original design was with SQLite to keep it separate from AiiDA; the "disk" in "disk-objectstore" was meant to convey "stores the state on disk, no need for any running service".

For AiiDA, it might be interesting to look at moving it all inside AiiDA, but: 1. there are a few subtleties related e.g. to how transactions are managed in PSQL vs. SQLite (or other things like vacuuming) that would require a bit of thought and testing; 2. one would need to re-benchmark speed; 3. backups of the repository would also need a dump of the DB (again, OK for AiiDA, but less so for the library used standalone); and 4. by the way, SQLite is even more robust and has fewer issues related to storing files than even PSQL :-D
(just to say that it's additional work, that we might not want to do now).

when reading a file, currently a database session is always started, even if the hashkey ends up being related to a "loose" file, since the database is checked first before the loose folder.

There may well be a reason for this ordering (I can't remember offhand), but obviously looking in the loose folder first would minimize db opens

I might need to think a bit more about whether there are more reasons. One is the way the various operations are organised in order to guarantee that concurrent writing and packing from many processes does not create issues (but I would need to think a bit more to give an example where the other order would be problematic - and maybe one could change the overall logic to overcome this issue, I don't know).
But most importantly, my goal is that client applications use the packed version by default over the loose version, if both exist. Both existing is a valid condition, needed for safe concurrency of packing (a single process) and reading/writing (many processes), and to overcome two issues: 1. on Windows, reading a file locks it so one cannot delete it (e.g. when cleaning up the repository after packing), so if there are processes reading loose objects it's annoying, as some loose objects cannot be cleaned up; 2. on recent Macs (past few years) there is a nasty filesystem bug that I discovered. I never heard back from Apple, and admittedly when I try to test it again I can't reproduce it - maybe they fixed this in recent macOS Big Sur versions? Anyway, the issue is that there were race conditions where, when opening a file that was being (atomically) deleted by a different Unix process, the reading process would neither read the content nor get a missing-file error, but would instead open it and find zero content, which is quite annoying.
Therefore, my reasoning is that we should try to ensure that clients read the packed objects first (so that we reduce the risk that we clean up the repository and a client app gets confused).

@sphuber (Contributor) commented Sep 19, 2021

Let's make a decision on this one. There is one thing I am not clear on yet: @chrisjsewell you claim that this may fix the problem of the warnings of files being accessed on different threads. May I ask why you think this? I don't really see how closing the files would get rid of this.

Anyway, if this really does fix the problem, I would be ok with merging this and accepting the performance hit.

@giovannipizzi (Member) commented:

@chrisjsewell - do you happen to have some simple tests on how much this might affect performance? (E.g. storing new nodes (even if this stores loose objects, so might not be affected), loading nodes (after packing their files), or more specifically getting files from nodes.)
I think a simple %timeit check of these three things before and after this PR would be enough and if the performance hit is minimal, I would encourage merging this (I keep seeing quite a few of these, even in the shell sometimes, while submitting). I'm happy to test running develop after this is merged to check if we still see the warning.

@sphuber (Contributor) commented Dec 17, 2021

I think a simple %timeit check of these three things before and after this PR would be enough and if the performance hit is minimal, I would encourage merging this (I keep seeing quite a few of these, even in the shell sometimes, while submitting). I'm happy to test running develop after this is merged to check if we still see the warning.

I have already performed this quick benchmarking and attached the results in this PR (see this comment). Or are you thinking of different tests?

@giovannipizzi (Member) commented:

Ah sorry - I had completely forgotten that you had already done the tests; sorry for not checking back in the thread.
I think that, in the interest of having something reliable and having time to test whether this PR fixes the original issue, I would suggest going ahead with this PR and opening a post-2.0 issue to monitor whether this needs to be improved.

@sphuber (Contributor) commented Dec 17, 2021

@giovannipizzi Fine for me to go with that approach. @chrisjsewell what was your original rule of thumb on when to use a context manager when using the container? There were quite a few places that still didn't use it, which I have now corrected, but now I am wondering if you were excluding these on purpose?

@sphuber (Contributor) commented Jan 17, 2022

@chrisjsewell shall we wrap this one up as well before the release? Could you please look at my previous comment?

@chrisjsewell (Member, Author) commented:

There were quite a few places that still didn't use it, which I have now corrected, but now I am wondering if you were excluding these on purpose?

No, I think it should be fine to use it everywhere, given that the only thing it closes is the sqlalchemy database session (and only if it was opened).

One quick thought, we have:

    @property
    def container(self) -> Container:
        return self._container

but do we actually need to expose this publicly, given that it should only ever be used as a context manager?
It is only used in other methods on the class (which could directly use self._container) and a few tests.
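
i.e. internally the backend methods could just do something like this (sketch only; the method shown is illustrative, not the exact aiida code):

    def has_object(self, key: str) -> bool:
        with self._container as container:  # opens the SQLite session lazily, closes it on exit
            return container.has_object(key)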

@sphuber (Contributor) commented Jan 20, 2022

Good point; the current uses outside of the class seem to be tests in tests/orm/nodes/data/test_array.py and tests/tools/archive/test_repository.py, but I don't see a justification for either use. The first one is something I added, and I think it stems from the beginning of the implementation, when there was a subtle bug with ArrayData and the disk-objectstore implementation at that time, so I had to explicitly test reading the array after the DOS was packed. But that should not be a test for ArrayData, so I am tempted to simply remove that test, or at least the explicit call to pack all loose files on the repo.

And a similar point goes for the second test. It is testing export and import of repo files, but that should be already done elsewhere. So I think that test can also be removed. So if you agree, I will go ahead and remove the public property and update/remove those tests.

@chrisjsewell (Member, Author) commented:

So if you agree, I will go ahead and remove the public property and update/remove those tests.

yep sure 👍

@sphuber force-pushed the fix-repo-access branch 2 times, most recently from 5cad5c0 to c9c5afe on January 20, 2022 16:43
This should not be used externally. The two tests that were using it
should also not have to use it directly.
@sphuber merged commit 8e52d18 into develop on Jan 20, 2022
@sphuber deleted the fix-repo-access branch on January 20, 2022 22:16
Development
Successfully merging this pull request may close these issues: SQLite used in multithreaded mode by the engine
3 participants