SQLite used in multithreaded mode by the engine #4899

Closed
giovannipizzi opened this issue Apr 30, 2021 · 20 comments · Fixed by #5123
@giovannipizzi
Member

Describe the bug

When running many simulations with the daemon (many concurrent aiida-quantumespresso WorkChains) with the new version of AiiDA (current develop, after 1.6.3, with the new repository merged), I get a lot of these errors in my daemon log file:

sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 139722984797952 and this is thread id 139722082379520.
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140098881738496 and this is thread id 140097979320064.
...

that are interleaved with lines like this:

04/30/2021 03:53:21 PM <52504> sqlalchemy.pool.impl.NullPool: [ERROR] Exception closing connection <sqlite3.Connection object at 0x7f13bca321f0>

(Note: I currently have over 6000 such lines from the past 24h; I've been running ~15000 processes in the same time frame.)

Steps to reproduce

Steps to reproduce the behavior:

  1. Install the current develop
  2. Run some workchains via the daemon
  3. Check your daemon log

Expected behavior

I'm not sure whether it's easy to double check if these errors actually created problems in storing or retrieving data (the simulations seem to be running fine, but there are many, so I didn't do a thorough check).
However, they clearly show a problem.
The strange thing is:

  • SQLite should be used only by the new repository
  • Threads should be used in AiiDA only for the heartbeats, and shouldn't (I think?) try to access the node repository or even more generally the ORM
    So it's not clear to me why we get these errors, or what is triggering them

Your environment

  • Operating system [e.g. Linux]: Linux Ubuntu 16.04
  • Python version [e.g. 3.7.1]: 3.7.9
  • aiida-core version [e.g. 1.2.1]: develop, post 1.6.3 (c7897c5)
@giovannipizzi
Member Author

Pinging @sphuber @chrisjsewell

@giovannipizzi giovannipizzi added this to the v2.0.0 milestone Apr 30, 2021
@sphuber
Contributor

sphuber commented Apr 30, 2021

Just to be clear, it doesn't actually cause any of the AiiDA processes to except? There don't seem to be any actual negative side effects for now?

@chrisjsewell
Member

chrisjsewell commented Apr 30, 2021

Yep, I can recreate this with aiida-integration-tests (aiida-sleep workchain -nw 10 -nc 10 -t 10 --submit)

I also get:

04/30/2021 02:22:43 PM <1613> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 110, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 82, in do_upload
    with SandboxFolder() as folder:
  File "/root/aiida-core/aiida/common/folders.py", line 421, in __init__
    os.makedirs(sandbox)
  File "/usr/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/root/.aiida/repository/default/sandbox'

04/30/2021 02:22:43 PM <1613> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 1 of do_upload excepted, retrying after 20 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 82, in do_upload
    with SandboxFolder() as folder:
  File "/root/aiida-core/aiida/common/folders.py", line 421, in __init__
    os.makedirs(sandbox)
  File "/usr/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/root/.aiida/repository/default/sandbox'

has this been identified/fixed yet? (I'm running with c7897c5, which is missing a few recent commits)

@sphuber
Contributor

sphuber commented Apr 30, 2021

Haven't seen that exception yet, so it would maybe be best to open a separate issue for that

@chrisjsewell
Member

but yeah, all workchains completed successfully

@chrisjsewell
Member

chrisjsewell commented Apr 30, 2021

Threads should be used in AiiDA only for the heartbeats, and shouldn't (I think?) try to access the node repository or even more generally the ORM
So it's not clear to me why we get these errors, or what is triggering them

well surely this is if you are running multiple daemon workers, then each python process is a separate thread trying to write to the repository concurrently
(from what I've seen so far, you only get these warnings when you are using multiple workers)

@sphuber
Contributor

sphuber commented Apr 30, 2021

well surely this is if you are running multiple daemon workers, then each python process is a separate thread trying to write to the repository concurrently

Are you sure? I think this error message may really come from a single Python interpreter. I don't think it is checking thread ids across system processes. And yes, the only threads that should be used are the main thread and the communicator thread of a daemon worker. That one should not deal with any data from the database, but there is no 100% guarantee, and a leak might explain this error message.

@chrisjsewell
Member

chrisjsewell commented Apr 30, 2021

Perhaps not, but so far I have only been able to reproduce with greater than one daemon worker running.

It's also of note that it only occurs when retrieving calculation outputs. It took me a while to realise why I could not reproduce the warnings on subsequent runs: my aiida-sleep runs always produce the same output file content, and thus (due to de-duplication) it did not need to be stored again, i.e. it only occurs when the file is not already in the repository.
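To illustrate the de-duplication point with a toy example (this is not the actual disk-objectstore code, just the idea of content-addressed storage, which as far as I recall uses sha256 by default): identical bytes always map to the same key, so the second run has nothing new to store.

import hashlib

content = b"output of an aiida-sleep run\n"  # identical content on every run

key_first_run = hashlib.sha256(content).hexdigest()
key_second_run = hashlib.sha256(content).hexdigest()

# Same bytes -> same key -> the object is already present and is not written again.
assert key_first_run == key_second_run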

I've also made a note in our repo CLI document that verdi node delete now does not actually delete anything from the repository, so users may be confused as to why they are not reclaiming any disk space after running it (it might be good to have an option on this command to also clean the repo)

(to reproduce myself, for now, I used the disk_objectstore API after deleting all process nodes: container.delete_objects(container.list_all_objects()))

@giovannipizzi
Member Author

giovannipizzi commented May 2, 2021

I also get:
...
FileExistsError: [Errno 17] File exists: '/root/.aiida/repository/default/sandbox'
...
has this been identified/fixed yet? (I'm running with c7897c5, which is missing a few recent commits)

I'm not aware of one, but it would indeed be great if you, @chrisjsewell, could open another issue for this (I don't see one yet).
Anyway, the fix is easy: just replace these two lines:

if not os.path.exists(sandbox):
    os.makedirs(sandbox)

with a single line:

os.makedirs(sandbox, exist_ok=True)

(This is a concurrency issue, happening if the very first usage of AiiDA happens through more than one daemon worker: both will try to create the sandbox at the same time, and the second will fail because it has already been created by the first one.)

@giovannipizzi
Member Author

well surely this is if you are running multiple daemon workers, then each python process is a separate thread trying to write to the repository concurrently

I agree with @sphuber that this is not the cause. I'm quite sure that the error

SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 139722984797952 and this is thread id 139722082379520.

is output when there are two threads in the same Python process, and one of them uses an object created in the other thread. Also because, in any case, you cannot reuse an object from a different Python process, as processes don't share memory :-)

I don't know why the error only occurs when multiple daemon workers are running (note that I didn't test this myself, though; I'm just repeating what @chrisjsewell said above).
Even if nothing excepts (which I cannot confirm for sure: I have many failed processes, mostly because of Quantum ESPRESSO failures, but one or two of them could have been caused by this; I'll try to verify when I do more checks), I think we need to understand why these two errors are thrown. They might be signatures of subtle bugs that could lead to random data corruption, so I would at least try to understand what's happening.

I guess the first thing is to track down which part of the code is printing that error message... any ideas?

@giovannipizzi
Member Author

I've also made a note in our repo CLI document that verdi node delete now does not actually delete anything from the repository, so users may be confused as to why they are not reclaiming any disk space after running it (it might be good to have an option on this command to also clean the repo)

That is correct - the verdi repository clean command will also need to delete objects, indeed.

container.delete_objects(container.list_all_objects()))

That's quite drastic :-)
The important thing to remember is that the correct way to delete files is to:

  1. get the list of all object hash keys from the container
  2. get the list of all object hash keys known to AiiDA in the PostgreSQL DB
  3. compute the set difference of the two hash key sets (point 1 - point 2)
  4. call container.delete_objects on the hash keys of point 3

Very important: do NOT swap points 1 and 2!
Otherwise, nodes that are being created at the same time (more precisely, whose objects are written between the end of the DB listing of point 2 and the container listing of point 1) will end up in the set of point 3, and their objects will thus be immediately deleted at point 4!

Also very important: what I write above is not even enough!! A new node could be created in a PostgreSQL transaction. The objects will be created first in the container, and another process listing the objects known to AiiDA (point 2) might not see the hash keys, depending on the isolation of the transactions - and those objects would therefore be deleted.

Therefore, this kind of deletion is quite "dangerous" and MUST be performed with all daemons shut down. (This is also why deletion is not done immediately: in addition, since de-duplication means a single object is stored even if it belongs to multiple AiiDA nodes, it's not straightforward to decide whether the corresponding object can actually be deleted, and searching for all other nodes that reference it is slow since you need to iterate over all nodes, so it's better to delay this to a repository clean operation.)
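To make the order above concrete, a minimal sketch (assuming all daemons are stopped; get_hashkeys_referenced_by_aiida is a hypothetical placeholder for the PostgreSQL query, not an existing API):

from disk_objectstore import Container

def clean_repository(container_path, get_hashkeys_referenced_by_aiida):
    """Delete only objects that AiiDA no longer references. Run with all daemons stopped."""
    container = Container(container_path)
    # Point 1: list what is in the container FIRST.
    in_container = set(container.list_all_objects())
    # Point 2: only THEN list the hash keys known to AiiDA in the PostgreSQL DB.
    referenced = set(get_hashkeys_referenced_by_aiida())
    # Point 3: objects in the container that AiiDA does not know about.
    to_delete = in_container - referenced
    # Point 4: delete only those.
    container.delete_objects(list(to_delete))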

Anyway, this comment is more for a different issue, just mentioning here since it was brought up.

@chrisjsewell
Member

chrisjsewell commented May 2, 2021

@giovannipizzi as a "developer" I already understand all this (you didn't have to write it all out here 😜). But the point is, as a "user", I don't care about any of it (and shouldn't have to); all I know is that functionality I previously used has been broken, and now I have to go through this more complex and potentially dangerous process just to reclaim some disk space.

@giovannipizzi
Member Author

Great if it was already clear! I still preferred to write it out because if someone else implements it and we mess that part up, we end up with big data losses, so better safe than sorry :-)

However, I wouldn't call the current (new) functionality "broken": I don't see a safe and efficient way to reclaim space immediately, so it's a trade-off (still, I agree that there should be a single and simple way to reclaim space).
Note that for the same reason, SQLite and PostgreSQL (and I guess most DBs) also do not free disk space when you delete data; you need to call a specific VACUUM operation for that, for similar reasons (see the small illustration below).
Of course, if you see a way to delete and reclaim the space immediately and efficiently, I'm happy to hear it!
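Small sqlite3 illustration of that analogy (just a sketch with the standard library; table and file names are made up):

import sqlite3

conn = sqlite3.connect('example.db')
conn.execute('CREATE TABLE IF NOT EXISTS data (payload BLOB)')
conn.execute('DELETE FROM data')
conn.commit()
# At this point the file on disk has NOT shrunk: the freed pages are only marked reusable.
conn.execute('VACUUM')  # only now is the space actually returned to the OS
conn.close()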

@giovannipizzi
Member Author

Quick question/note: here

node = load_node(pk=pk)

we load in memory the node and (I think?) pass it to threads:
self._poll_process(node, functools.partial(inline_callback, event))

Do we need to pass the ORM node? Might this be the cause of (some of) the problems?

It seems to me that only two things are needed:
https://github.com/aiidateam/aiida-core/blob/develop/aiida/engine/runners.py#L347-L352

  1. the class name (easy, we can just pass that)
  2. being able to check is_terminated - maybe we could just have a utility function that returns is_terminated given a PK, for instance, making sure we don't pass a fully loaded node around? (A rough sketch is below.)
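Something along these lines, for instance (just a sketch; whether querying by PK like this actually avoids touching the repository/SQLite is exactly what would need to be checked):

from aiida.orm import ProcessNode, QueryBuilder

def is_terminated(pk: int) -> bool:
    """Return whether the process node with the given pk is in a terminal state."""
    query = QueryBuilder()
    query.append(ProcessNode, filters={'id': pk}, project='attributes.process_state')
    process_state, = query.one()
    # Terminal process states in AiiDA
    return process_state in ('finished', 'excepted', 'killed')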

@giovannipizzi
Member Author

Also pinging @CasperWA - I realise that we might be using multithreading for some web-serving engines for the API etc.
Could you check (maybe coordinating with @asle85) a proper multithreaded deployment (e.g. with Apache, making sure we're actually using multithreading) with this new version of AiiDA, to see if we run into issues? It's important that you do it on a container that has some packed files, meaning:

  1. you get a file content via the REST API from some repository you migrated (e.g. a UPF, the output of a calculation etc.)
  2. if you fetch file contents from simulations run after the migrations, the objects will not be packed; you'll need to pack the files first - Chris is writing the verdi commands for that, for now you might need to do:
    from disk_objectstore import Container
    c = Container('PATH_TO_CONTAINER_FOLDER_INSIDE_AIIDA_REPOSITORY')
    c.pack_all_loose()

@sphuber
Contributor

sphuber commented May 5, 2021

we load in memory the node and (I think?) pass it to threads:

I don't see this being passed to the communication thread. This is purely an internal polling function that remains on the main thread of execution. It simply schedules a callback for that node. I doubt this is the origin of the message, but you can try to change it. Also, passing the loaded node around doesn't seem any different to me than simply passing the pk and using that to perform a query on the database, is it? Or maybe with a direct query there is no possibility of some indirect connection to the repository? Because that is ultimately the problem, right: that the repository is being accessed. I don't really see how calling is_terminated is doing that.

@giovannipizzi
Member Author

Thanks @chrisjsewell, the aiida-integration-tests setup is super nice and useful!
I managed to reproduce it as well; however, it seems that if I turn on extensive debugging with:

verdi config set logging.aiida_loglevel DEBUG
verdi config set logging.sqlalchemy_loglevel DEBUG 

the message disappears. @chrisjsewell, can you confirm this?

Also, this therefore seems to happen (according to @chrisjsewell's report above) only while retrieving, and these are all loose objects (there are no packs in aiida-integration-tests).
However, when writing a loose object (which is what the daemon does), the SQLite DB should never be accessed, not even in read mode.
So my guess is that the error happens when reading an object (e.g. during parsing): the disk-objectstore will first try to check if a packed object exists with that hash key, before going to the loose version.
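Conceptually (this is not the actual disk-objectstore code, just a sketch of the lookup order I describe above):

import os

def open_object(hashkey, pack_index, loose_dir):
    """pack_index is a dict-like view of the packs SQLite DB: hashkey -> (pack_path, offset, length)."""
    # Checking the pack index is the step that touches the SQLite DB.
    if hashkey in pack_index:
        pack_path, offset, length = pack_index[hashkey]
        handle = open(pack_path, 'rb')
        handle.seek(offset)
        return handle, length
    # Loose objects are plain files and do not involve SQLite at all.
    return open(os.path.join(loose_dir, hashkey), 'rb'), None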

It's still a bit weird, though: the error message seems to hint that there was a 'reset' or 'close' or similar operation. Indeed, I realise now that I probably copied only part of the error; the full error seems to be this:

04/29/2021 03:26:37 PM <43296> sqlalchemy.pool.impl.NullPool: [ERROR] Exception during reset or similar
Traceback (most recent call last):
  File "/home/pizzi/.virtualenvs/aiida-dev2.0/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 697, in _finalize_fairy
    fairy._reset(pool)
  File "/home/pizzi/.virtualenvs/aiida-dev2.0/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 893, in _reset
    pool._dialect.do_rollback(self)
  File "/home/pizzi/.virtualenvs/aiida-dev2.0/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 559, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140241880110848 and this is thread id 140240977692416.

or this

04/29/2021 03:26:37 PM <43296> sqlalchemy.pool.impl.NullPool: [ERROR] Exception closing connection <sqlite3.Connection object at 0x7f8c8de13f10>
Traceback (most recent call last):
  File "/home/pizzi/.virtualenvs/aiida-dev2.0/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 697, in _finalize_fairy
    fairy._reset(pool)
  File "/home/pizzi/.virtualenvs/aiida-dev2.0/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 893, in _reset
    pool._dialect.do_rollback(self)
  File "/home/pizzi/.virtualenvs/aiida-dev2.0/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 559, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140241880110848 and this is thread id 140240977692416.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pizzi/.virtualenvs/aiida-dev2.0/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 270, in _close_connection
    self._dialect.do_close(connection)
  File "/home/pizzi/.virtualenvs/aiida-dev2.0/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 565, in do_close
    dbapi_connection.close()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140241880110848 and this is thread id 140240977692416.

and the only place where I close and reopen a session is here:
https://github.com/aiidateam/disk-objectstore/blob/a2561a40cf2e9a4d58325387f913ccf08111f5fd/disk_objectstore/container.py#L634

This should, however, only happen when trying to access a non-existent hash key. But that should never happen, should it? Why would AiiDA try to access a hash key that does not exist at all? It should only try to access those whose hash key is stored in the PSQL DB (and, as such, should already be in the disk-objectstore container), right?

In any case, I won't have time to further debug this for at least 7-10 days. If anyone has time to try to shed more light on this, that would be great.

@unkcpz
Member

unkcpz commented Jun 18, 2021

I encountered the same issue when I updated to aiida-1.6.3.
The difference is that I do not run multiple workers, but run my CalcJob in the interpreter.
My CalcJob is as follows; it receives orm.SinglefileData as inputs, which is where I guess the issue comes from:

from aiida import orm
from aiida.common.datastructures import CalcInfo, CodeInfo, CodeRunMode
from aiida.engine import CalcJob


class AtatMcsqsCalculation(CalcJob):
    """
    Calculation Job of mcsqs
    """

    @classmethod
    def define(cls, spec):
        #yapf: disable
        super().define(spec)
        spec.input('rndstr', valid_type=orm.SinglefileData, 
                   help='rndstr.in')
        spec.input('sqscell', valid_type=orm.SinglefileData, 
                   help='sqscell.out')
        spec.input('code_corrdump', valid_type=orm.Code, 
                   help='The `Code` to run corrdump before mcsqs.')

        spec.output('bestsqs', valid_type=orm.SinglefileData, help='sqs structure file')
        spec.output('bestcorr', valid_type=orm.SinglefileData, help='bestcorr of sqs')
        
        spec.input('metadata.options.parser_name', valid_type=str, default='atat.mcsqs')
        spec.input('metadata.options.resources', valid_type=dict, default={'num_machines': 1})
        spec.input('metadata.options.max_wallclock_seconds', valid_type=int, default=1800, required=True)
        spec.input('metadata.options.input_rndstr_filename', valid_type=str, default='rndstr.in')
        spec.input('metadata.options.input_sqscell_filename', valid_type=str, default='sqscell.out')
        spec.input('metadata.options.output_bestsqs_filename', valid_type=str, default='bestsqs.out')
        spec.input('metadata.options.output_bestcorr_filename', valid_type=str, default='bestcorr.out')

    def prepare_for_submission(self, folder):
        """
        In preparing the inputs of mcsqs, first generate a clusters.out which 
        contains the clusters information and then run `mcsqs`.
        """
        with folder.open(self.options.input_rndstr_filename, 'w', encoding='utf8') as handle:
            handle.write(self.inputs.rndstr.get_content())
            
        helpercodeinfo = CodeInfo()
        helpercodeinfo.code_uuid = self.inputs.code_corrdump.uuid
        helpercodeinfo.cmdline_params = ['-ro', '-noe', '-nop', '-clus', '-2=1.1', f'-l={self.options.input_rndstr_filename}']
        helpercodeinfo.withmpi = False
        
        with folder.open(self.options.input_sqscell_filename, 'w', encoding='utf8') as handle:
            handle.write(self.inputs.sqscell.get_content())
            
        codeinfo = CodeInfo()
        codeinfo.code_uuid = self.inputs.code.uuid
        codeinfo.cmdline_params = ['-rc', '-sd=1234']
        codeinfo.withmpi = False
        
        calcinfo = CalcInfo()
        calcinfo.codes_info = [helpercodeinfo, codeinfo]
        calcinfo.codes_run_mode = CodeRunMode.SERIAL
        calcinfo.retrieve_list = [self.options.output_bestcorr_filename, self.options.output_bestsqs_filename]

        return calcinfo

I stored two SinglefileData nodes and loaded them to run this calculation. Please tell me how I can give you more information for debugging.

@sphuber
Contributor

sphuber commented Jun 18, 2021

@unkcpz I take it you mean you are on develop, because we only use SQLite in the new file repository implementation, which is not part of v1.6.3 or any earlier version. Note that we don't preemptively bump the version number on develop, so it still reports v1.6.3, but really it is what will become v2.0. As for the actual problem, I currently don't have an idea for further debugging or for the cause of the problem.

@ramirezfranciscof ramirezfranciscof added the priority/critical-blocking must be resolved before next release label Aug 10, 2021
@chrisjsewell
Member

One thing I also noted in the tests is this warning:

tests/workflows/arithmetic/test_add_multiply.py::test_run
  /opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/sqlalchemy/sql/base.py:1147: ResourceWarning: unclosed file <_io.FileIO name=31 mode='wb' closefd=True>
    object.__setattr__(self, "_colset", set())

So a possible reason is that you are not closing the sqlite connection (i.e. closing the file) after accessing the repository.
In disk-objectstore v0.6.0, I added context-manager methods to the container, which perform this close on exit, i.e.

from disk_objectstore import Container

container = Container("abc")
with container:
    # perform operations; the sqlite connection is closed on exiting the block
    ...

Whenever we use the container in AiiDA, it may be prudent to do it in this manner.
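For example, a small read helper could look something like this (just a sketch: get_object_content is an existing Container method, but the wrapper itself is hypothetical and not part of AiiDA):

from disk_objectstore import Container

def read_repository_object(container_path: str, hashkey: str) -> bytes:
    """Read a single object, ensuring the container (and its sqlite connection) is closed afterwards."""
    with Container(container_path) as container:
        return container.get_object_content(hashkey)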
