
🐛 FIX: Ensure Container DB always closed after access #5123

Merged: 4 commits from fix-repo-access into develop on Jan 20, 2022

Conversation

@chrisjsewell (Member) commented Sep 9, 2021

Possibly fixes #4899

This is the "super cautious" approach, whereby we ensure the sqlite database is closed after every access. Obviously this may not be performant if you want to perform multiple operations.

Alternatively, we could add __enter__ / __exit__ methods to the AbstractRepositoryBackend / Repository itself, and then move the responsibility of using a context manager to its users, possibly changing Profile.get_repository to a context manager Profile.with_repository.
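
For illustration, the context-manager alternative could look roughly like this (just a sketch, using the disk-objectstore Container directly; the helper name and the path are made up):

    from contextlib import contextmanager

    from disk_objectstore import Container

    @contextmanager
    def open_container(path):
        """Sketch only: callers take responsibility for closing the SQLite session."""
        container = Container(path)
        try:
            yield container
        finally:
            container.close()  # closes the cached sqlalchemy session

    # several reads then share a single open session, e.g.:
    # with open_container('/path/to/container') as container:  # illustrative path
    #     for key in hashkeys:
    #         container.get_object_content(key)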

@chrisjsewell (Member, Author) commented Sep 9, 2021

This certainly removes the warning for open files in test_add_multiply.py

Still not entirely sure how this relates to threads though, but note that the thread check can be turned on/off at the sqlite level:
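
For reference, this is roughly what turning the check off looks like at the sqlalchemy/sqlite level (a sketch; the database path is illustrative, and I'm not saying disk-objectstore should necessarily do this):

    from sqlalchemy import create_engine

    # `check_same_thread` is passed straight through to sqlite3.connect();
    # setting it to False disables the "SQLite objects created in a thread can
    # only be used in that same thread" check.
    engine = create_engine(
        'sqlite:///packs.idx',  # illustrative path to the container's index DB
        connect_args={'check_same_thread': False},
    )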

@chrisjsewell (Member, Author) commented:

Another consideration here is that in Container.close we close the sqlalchemy session, but also throw it away:

    def close(self) -> None:
        """Close open files (in particular, the connection to the SQLite DB)."""
        if self._session is not None:
            self._session.close()
            self._session = None

In the code here, we may just want to close the session, but not throw it away (which then requires having to create a whole new engine).
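
Roughly what I have in mind, as a standalone sketch (not the actual disk-objectstore code; the attribute names are made up):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    class _ContainerSketch:
        """Sketch: cache the engine separately from the session, so closing the session is cheap."""

        def __init__(self, db_path: str):
            self._db_path = db_path
            self._engine = None
            self._session = None

        def _get_session(self):
            """Create the engine only once; recreate only the (cheap) session on demand."""
            if self._engine is None:
                self._engine = create_engine(f'sqlite:///{self._db_path}')
            if self._session is None:
                self._session = sessionmaker(bind=self._engine)()
            return self._session

        def close(self) -> None:
            """Close the connection to the SQLite DB, but keep the engine for reuse."""
            if self._session is not None:
                self._session.close()
                self._session = None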

@codecov bot commented Sep 9, 2021

Codecov Report

Merging #5123 (7e0cb19) into develop (e880ab7) will increase coverage by 0.01%.
The diff coverage is 98.53%.


@@             Coverage Diff             @@
##           develop    #5123      +/-   ##
===========================================
+ Coverage    82.06%   82.06%   +0.01%     
===========================================
  Files          533      533              
  Lines        38296    38307      +11     
===========================================
+ Hits         31424    31434      +10     
- Misses        6872     6873       +1     
Flag         Coverage Δ
django       77.12% <98.53%> (+0.04%) ⬆️
sqlalchemy   76.42% <98.53%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                  Coverage Δ
aiida/repository/backend/disk_object_store.py   95.73% <98.53%> (+0.45%) ⬆️
aiida/transports/plugins/local.py               81.46% <0.00%> (-0.25%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sphuber (Contributor) commented Sep 9, 2021

Trying to understand how this is fixing the problem that you describe. Firstly, I am not sure where in the disk-objectstore you close any files. In this commit you seem to add the context manager methods but they don't seem to do anything. Where are the files being closed?

@chrisjsewell (Member, Author) commented Sep 9, 2021

Where are the files being closed?

The file is the sqlite database; when you close the session you close the file (or at least you allow it to be closed).

@chrisjsewell (Member, Author) commented:

In this commit you seem to add the context manager methods but they don't seem to do anything

That's the wrong context manager you are looking at; I'm talking about the one on the Container class (that closes the session)

@chrisjsewell (Member, Author) commented:

Basically, I believe it is not ideal that we never explicitly close connections to the sqlite DB.
With the Postgres DB, I guess we also do not explicitly close these connections; you open one when you load the profile, then it dies when the python process dies. But since Postgres is a service, it can actively handle dropped connections and clean them up on the server side.
In contrast, the sqlite DB is a static file, so when you "drop" a connection you leave a file in the open state, similar to doing file.open() but never calling file.close().

@chrisjsewell (Member, Author) commented Sep 9, 2021

My feeling in general is that it would probably be "optimal" to abstract the database connection from the Container, to allow something like:

container = Container("path/to/container", database=database)

and just have the container's DB table stored within the Postgres DB. After all, the container is basically useless anyway without the Postgres DB to tell you the (one-to-many) relationship between the hashkeys and the node files.

@sphuber (Contributor) commented Sep 9, 2021

Did a very quick and dirty test of performance between develop and this branch. Just writing some content and reading it again:

#!/usr/bin/env python
import io
from aiida.orm import Data

count = 100
node = Data().store()  # Need to store it to actually use the disk object store, otherwise the sandbox is used

def write_files(count):
    for i in range(count):
        filename = f'file{i}.txt'
        node._repository.put_object_from_filelike(io.BytesIO(f'some\ncontent{i}'.encode('utf-8')), filename)

def read_files(count):
    for i in range(count):
        filename = f'file{i}.txt'
        with node.open(filename) as handle:
            handle.read()


%timeit write_files(100)
%timeit read_files(100)

develop

In [5]: %timeit write_files(100)
187 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit read_files(100)
48.9 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This PR

In [13]: %timeit write_files(100)
188 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit read_files(100)
270 ms ± 5.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Seems writing is not affected that much but reading is slowed down by a factor of 5. This should just be a constant per-read overhead though, so for bigger files the increase will not be as noticeable.

@chrisjsewell (Member, Author) commented:

Seems writing is not affected that much but reading is slowed down by a factor of 5. This should just be a constant per-read overhead though, so for bigger files the increase will not be as noticeable.

Yep, I believe writing should not be affected at all, because you are just writing the "loose" file to the container folder (and not touching the database).
But then reading is where you need to check the database for packed files.

@sphuber (Contributor) commented Sep 9, 2021

Yep, I believe writing should not be affected at all, because you are just writing the "loose" file to the container folder (and not touching the database).

It would affect the places where we write directly to pack files though, which would require a database connection. Although currently we only do this in the export command I believe. There we can of course make sure that we perform all operations in a single context manager, so it wouldn't be too problematic.
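
Conceptually something like this (a sketch; `add_objects_to_pack` stands for the kind of bulk method I mean, and the path is illustrative):

    from disk_objectstore import Container

    container = Container('/path/to/container')  # illustrative path
    try:
        # a single SQLite session is opened (lazily) and reused for the whole bulk write
        hashkeys = container.add_objects_to_pack([b'content of file one', b'content of file two'])
    finally:
        container.close()  # one close at the end, instead of one per object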

@chrisjsewell (Member, Author) commented:

One thing to quickly note here: when reading a file, currently a database session is always started, even if the hashkey ends up being related to a "loose" file, since the database is checked first before the loose folder.
There may well be a reason for this ordering (I can't remember offhand), but obviously looking in the loose folder first would minimize db opens

@giovannipizzi (Member) commented:

Thanks @chrisjsewell for looking into this and @sphuber for testing.

Obviously this may not be performant if you want to perform multiple operations

Indeed, that is my main concern (as you both stated I think, the issue is mostly for a lot of small files).
The thing that also puzzles me is, indeed, how this fixes the problem (I mean: I understand why it manages to fix it, but I don't understand why the problem is there in the first place and why it relates to threads?)

Related to this PR, I guess that, if the speed is sub-optimal but still acceptable and the behaviour is more robust, it's OK to merge for now and, at the same time, open a new issue about checking the performance of this and how to improve it (probably looking into bulk methods). At least we can move on with the release, and then improve performance later, or add some "bulk" operations where everything is done within the context manager. But I agree that it's sub-optimal to reopen the DB every time; this has a cost, as shown by Sebastiaan (reading becomes even slower than writing).

BTW - can you @sphuber (or @chrisjsewell) run the same test, but pack the loose objects before reading? I think the performance we really care about is for packed objects, as we tell people to pack anyway and there's no way back (and if the performance is OK with packed objects, then we're OK - they can just pack to improve performance). Even though I guess the slowdown will be identical, since the SQLite file needs to be re-opened every time.
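
For the record, packing before re-running the read benchmark should be as simple as something like this (a sketch; `pack_all_loose` is the disk-objectstore method I have in mind, and the container path is illustrative):

    from disk_objectstore import Container

    container = Container('/path/to/profile/repository/container')  # illustrative path
    try:
        container.pack_all_loose()  # move all loose objects into pack files (indexed in the SQLite DB)
    finally:
        container.close()

    # then re-run read_files(100) from the benchmark above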

A few additional comments just to answer to some points:

In the code here, we may just want to close the session, but not throw it away (which then requires having to create a whole new engine).

I see - one of course needs to slightly change the code for caching the session. @chrisjsewell if you think this helps, you can try doing this at the disk-objectstore level and I think that, if all tests continue to pass (and performance improves), then we're safe (just open an issue there if you don't want to do it now but still think it can help).

Basically, I believe it is not ideal that we are not specifically closing connections with the sqlite DB ever.

I see the general reasoning, but is this really a problem? I'm not aware of explicit suggestions to close the connection (file) to SQLite (but I didn't check, so maybe I'm wrong). But maybe this interacts with some other things that SQLA does (connection pooling etc.?).
Indeed, my main reasoning in disk-objectstore was to keep it open for performance reasons. BTW, when reading, there are conditions under which the connection is closed and then reopened, to be sure that an object wasn't packed in the meantime: see this (and the respective comment):
https://github.com/aiidateam/disk-objectstore/blob/develop/disk_objectstore/container.py#L722-L731
(but also a few more places, like this https://github.com/aiidateam/disk-objectstore/blob/7a09ea2a953f0b0dfa79a6688306c51a501f874b/disk_objectstore/container.py#L1956 and I think a couple more).

@chrisjsewell just to double check - is this logic going to interact correctly with your new context manager?

just have the container's DB table stored within the Postgres DB

I see, and you're probably right for AiiDA. My original design was with SQLite to keep it separate from AiiDA; the "disk" in "disk-objectstore" was meant to convey "stores the state on disk, no need for any running service".

For AiiDA, it might be interesting to look at moving it all inside AiiDA, but: 1. there are a few subtleties related e.g. to how transactions are managed in PSQL vs. SQLite (or other things like vacuuming) that would require a bit of thought and testing; 2. one would need to re-benchmark speed; 3. backups of the repository would also need a dump of the DB (again, OK for AiiDA, but less so for the library used standalone); and 4. by the way, SQLite is even more robust and has fewer issues related to storing files than even PSQL :-D
(just to say that it's additional work, that we might not want to do now).

when reading a file, currently a database session is always started, even if the hashkey ends up being related to a "loose" file, since the database is checked first before the loose folder.

There may well be a reason for this ordering (I can't remember offhand), but obviously looking in the loose folder first would minimize db opens

I might need to think a bit more about whether there are more reasons. One is the way the various operations are organised in order to guarantee that concurrent writing and packing from many processes does not create issues (but I would need to think a bit more to give an example where the other order would be problematic - and maybe one could change the overall logic to overcome this issue, I don't know).
But most importantly, my goal is that client applications use the packed version by default over the loose version, if both exist. Both existing is a valid condition, needed for safe concurrency of packing (a single process) and reading/writing (many processes), and to overcome two issues: 1. on Windows, reading a file locks it so one cannot delete it (e.g. when cleaning up the repository after packing), so if there are processes reading loose objects it's annoying, as some loose objects cannot be cleaned up; 2. on recent Macs (past few years) there is a nasty filesystem bug that I discovered. I never heard back from Apple, and admittedly when I try to test it again I can't reproduce it - maybe they fixed this in recent macOS Big Sur versions? Anyway, the issue is that there were race conditions where, when opening a file that was being (atomically) deleted by a different Unix process, the reading process would neither read the content nor get a missing-file error, but would instead open it and find zero content, which is quite annoying.
Therefore, my reasoning is that we should try to ensure that clients read the packed objects first (so that we reduce the risk that we clean up the repository and a client app gets confused).

@sphuber (Contributor) commented Sep 19, 2021

Let's make a decision on this one. There is one thing I am not clear on yet: @chrisjsewell you claim that this may fix the problem of the warnings of files being accessed on different threads. May I ask why you think this? I don't really see how closing the files would get rid of this.

Anyway, if this really does fix the problem, I would be ok with merging this and accepting the performance hit.

@giovannipizzi (Member) commented:

@chrisjsewell - do you happen to have some simple tests on how much this might affect performance? (E.g. storing new nodes (even if this stores loose objects, so might not be affected), loading nodes (after packing their files), or more specifically getting files from nodes.)
I think a simple %timeit check of these three things before and after this PR would be enough and if the performance hit is minimal, I would encourage merging this (I keep seeing quite a few of these, even in the shell sometimes, while submitting). I'm happy to test running develop after this is merged to check if we still see the warning.

@sphuber (Contributor) commented Dec 17, 2021

I think a simple %timeit check of these three things before and after this PR would be enough and if the performance hit is minimal, I would encourage merging this (I keep seeing quite a few of these, even in the shell sometimes, while submitting). I'm happy to test running develop after this is merged to check if we still see the warning.

I have already performed this quick benchmarking and attached the results in this PR (see this comment). Or are you thinking of different tests?

@giovannipizzi (Member) commented:

Ah sorry - I had completely forgotten that you had already done the tests; sorry for not checking back in the thread.
I think that, in the interest of having something reliable and having time to test whether this PR fixes the original issue, I would suggest going ahead with this PR and opening a post-2.0 issue to monitor whether this needs to be improved.

@sphuber (Contributor) commented Dec 17, 2021

@giovannipizzi Fine for me to go with that approach. @chrisjsewell what was your original rule of thumb on when to use a context manager when using the container? There were quite a few places that still didn't use it, which I have now corrected, but now I am wondering if you were excluding these on purpose?

@sphuber (Contributor) commented Jan 17, 2022

@chrisjsewell shall we wrap this one up as well before the release? Could you please look at my previous comment?

@chrisjsewell (Member, Author) commented:

There were quite a few places that still didn't use it, which I have now corrected, but now I am wondering if you were excluding these on purpose?

No, I think it should be fine to use it everywhere, given that the only thing it closes is the sqlalchemy database session (and only if it was opened).

One quick thought, we have:

    @property
    def container(self) -> Container:
        return self._container

but do we actually need to expose this publicly, given that it should only ever be used as a context manager?
It is only used in other methods on the class (which could directly use self._container) and a few tests.
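
i.e. internally the backend methods could just do something like this (sketch only; the method shown is illustrative, not the exact aiida code):

    def has_object(self, key: str) -> bool:
        with self._container as container:  # opens the SQLite session lazily, closes it on exit
            return container.has_object(key)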

@sphuber (Contributor) commented Jan 20, 2022

Good point; the current uses outside of the class seem to be tests in tests/orm/nodes/data/test_array.py and tests/tools/archive/test_repository.py, but I don't see a justification for either use. The first one is something I added, and I think it stems from the beginning of the implementation, when there was a subtle bug with ArrayData and the disk-objectstore implementation at that time, so I had to explicitly test reading the array after the DOS was packed. But that should not be a test for ArrayData, so I am tempted to simply remove that test, or at least the explicit call to pack all loose files on the repo.

And a similar point goes for the second test. It is testing export and import of repo files, but that should be already done elsewhere. So I think that test can also be removed. So if you agree, I will go ahead and remove the public property and update/remove those tests.

@chrisjsewell (Member, Author) commented:

So if you agree, I will go ahead and remove the public property and update/remove those tests.

yep sure 👍

@sphuber force-pushed the fix-repo-access branch 2 times, most recently from 5cad5c0 to c9c5afe on January 20, 2022 16:43
This should not be used externally. The two tests that were using it
should also not have to use it directly.
@sphuber merged commit 8e52d18 into develop on Jan 20, 2022
@sphuber deleted the fix-repo-access branch on January 20, 2022 22:16
Development
Successfully merging this pull request may close these issues: SQLite used in multithreaded mode by the engine
3 participants