SQLite used in multithreaded mode by the engine #4899
Pinging @sphuber @chrisjsewell
Just to be clear, it doesn't actually cause any of the AiiDA processes to except? There don't seem to be any actual negative side effects for now?
yep, can recreate with aiida-integration-tests (I also get:
has this been identified/fixed yet? (I'm running with c7897c5, which is missing a few recent commits)
Haven't seen that exception yet, so maybe open a separate issue for that
but yeah, all workchains completed successfully
well, surely this is because you are running multiple daemon workers: each Python process is then a separate thread trying to write to the repository concurrently
Are you sure? I think this error message may really come from a single Python interpreter; I don't think it is checking thread ids across system processes. And yes, the only threads that should be used are the main thread and the communicator thread of a daemon worker. The latter should not deal with any data from the database, but there is no 100% guarantee, and a leak might explain this error message.
Perhaps not, but so far I have only been able to reproduce with more than one daemon worker running. It is also of note that it only occurs when retrieving calculation outputs, and it took me a while to realise why I could not reproduce the warnings on subsequent runs: because my … I've also made a note in our repo CLI document that now … (To reproduce myself, for now, I used the …)
I'm not aware, but indeed it would be great if you @chrisjsewell could open another issue for this (I don't see one yet). Those lines (aiida-core/aiida/common/folders.py, lines 420 to 421 at 944f61f) could be replaced
with a single line: os.makedirs(sandbox, exist_ok=True). (This is a concurrency issue, happening if the very first usage of AiiDA happens through more than one daemon worker: both will try to create the sandbox at the same time, and the second will fail because it has already been created by the first. See the sketch below.)
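A minimal sketch of the race and the one-line fix; the sandbox path below is a placeholder, not the real AiiDA sandbox location:

```python
import os

sandbox = '/tmp/aiida-sandbox'  # placeholder path for illustration

# Racy pattern: the existence check and the creation are two separate steps,
# so a second daemon worker can pass the check before the first worker has
# created the folder, and then fail with FileExistsError.
if not os.path.exists(sandbox):
    os.makedirs(sandbox)

# Atomic alternative: let makedirs tolerate an already-existing folder.
os.makedirs(sandbox, exist_ok=True)
```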
I agree with @sphuber that this is not the cause. I'm quite sure that the error
is output if there are two threads in the same Python process, and one of the two is using an object created in a different thread. Also because, in any case, you cannot reuse an object from a different Python process, as processes don't share memory :-) I don't know why the error only occurs when multiple daemon workers are running (note that I didn't test this, though; I'm just repeating what @chrisjsewell said above). I guess the first thing is trying to track down which part of the code is printing that error message… any idea?
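For reference, this class of error can be reproduced with the standard library alone. A minimal sketch (the exact thread ids in the message will vary):

```python
import sqlite3
import threading

# By default the sqlite3 module refuses to use a connection from any thread
# other than the one that created it (check_same_thread=True).
conn = sqlite3.connect(':memory:')

def use_connection_from_other_thread():
    try:
        conn.execute('SELECT 1')
    except sqlite3.ProgrammingError as exc:
        # "SQLite objects created in a thread can only be used in that same
        # thread. The object was created in thread id X and this is thread id Y."
        print(exc)

thread = threading.Thread(target=use_connection_from_other_thread)
thread.start()
thread.join()
```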
That is correct - the
That's quite drastic :-)
Very important: do NOT swap points 1 and 2! Also very important: what I write above is not even enough! A new node could be created inside a PostgreSQL transaction. The objects will be created first in the container, and another process listing the objects known to AiiDA (point 2) might not see the hash keys, depending on the isolation level of the transactions, and those objects would therefore be deleted. This kind of deletion is therefore quite "dangerous" and MUST be performed with all daemons shut down (this is why deletion is not done immediately; in addition, since there is deduplication, with only one object stored even if it belongs to multiple AiiDA nodes, it is not straightforward to decide whether to actually delete the corresponding objects or not: searching for all other instances is slow, as you need to iterate over all nodes, so it is better to delay it). Anyway, this comment is more for a different issue; I am just mentioning it here since it was brought up.
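To make the ordering concern concrete, here is a hypothetical sketch of the set difference such a cleanup would compute; the function and argument names are made up, and the whole thing assumes all daemon workers are stopped so no new objects can appear mid-computation:

```python
def find_orphaned_hashkeys(container_hashkeys, db_hashkeys):
    """Return hash keys present in the disk-objectstore container but not
    referenced by any node in the database.

    If a node is being created inside an open PostgreSQL transaction, its
    objects are already in the container while its hash keys are not yet
    visible in `db_hashkeys`: computing this difference with daemons running
    would wrongly mark those objects for deletion.
    """
    return set(container_hashkeys) - set(db_hashkeys)
```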
@giovannipizzi as a "developer" I already understand all this (you didn't have to write it all out here 😜). But the point is, as a "user", I don't care about any of it (and shouldn't have to); all I know is that you have broken functionality that I previously used, and now I have to go through this more complex and potentially dangerous process just to reclaim some disk space.
Great if it was already clear! I still preferred to write it out, because if someone else implements it and we mess that part up, we end up with big data losses, so better safe than sorry :-) However, I wouldn't call the current (new) functionality "broken": I don't see a safe and efficient way to reclaim space immediately, so there is a trade-off (still, I agree that there should be a single and simple way to reclaim space).
Quick question/note: here (aiida-core/aiida/engine/runners.py, line 305 at 6c4ced3) we load the node in memory and (I think?) pass it to threads (aiida-core/aiida/engine/runners.py, line 330 at 6c4ced3).
Is it needed to pass the ORM node? Might this be the cause of (some of) the problems? It seems to me that only two things are needed (see the sketch after this comment):
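A hypothetical sketch of the pk-based alternative being discussed; `poll_process` is a made-up name, and the assumption is that the callback only needs database-backed state, not the in-memory ORM object:

```python
from aiida.orm import load_node

def poll_process(pk):
    """Poll a process by its pk, reloading the node inside the callback.

    Passing only the integer pk across contexts avoids sharing an ORM object
    (and whatever repository/session handles it holds) between threads.
    """
    node = load_node(pk)       # fresh ORM object, bound to the current session
    return node.is_terminated  # state is read via a database query
```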
Also pinging @CasperWA - I realise that we might be using multithreading for some web-serving engines for the API etc.
I don't see this being passed to the communication thread. This is purely an internal polling function that remains on the main thread of execution; it simply schedules a callback for that node. I doubt this is the origin of the message, but you can try to change it. Also, passing the loaded node around doesn't seem any different to me than simply passing the pk and using that to perform a query on the database, is it? Or maybe through a direct query there is no possibility for some indirect connection to the repository? Because that is what is ultimately the problem, right: that the repository is being accessed. I don't really see how calling …
Thanks @chrisjsewell, the aiida-integration-tests setup is super nice and useful!
the message disappears? @chrisjsewell do you confirm this? Also, this therefore seems to happen (according to @chrisjsewell's report above) only while retrieving, and these are all loose objects (there are no packs in aiida-integration-tests). It's still a bit weird, though: the error message seems to hint that there was a 'reset' or 'close' or similar operation (indeed, I realise now that I probably copied only part of the error; the full error seems to be this:
or this
and the only place where I close and reopen a new session is here: … This should however only happen when trying to access a non-existent hash key. But this should never happen? Why would AiiDA try to access a hash key that does not exist at all? It should only try to access those whose hash key is stored in the PSQL DB (and as such, they should already be in the disk-objectstore container), right? In any case, I won't have time to further debug this for at least 7-10 days. If anyone has time to try to shed more light, that would be great.
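For what it's worth, the failure mode being described would look roughly like this sketch; the container path and hash key are placeholders, and the re-check behaviour mentioned in the comment is only described, not verified here:

```python
from disk_objectstore import Container
from disk_objectstore.exceptions import NotExistent

container = Container('/path/to/container')  # placeholder path
try:
    container.get_object_content('deadbeef' * 8)  # hypothetical missing hash key
except NotExistent:
    # Per the comment above, disk-objectstore closes and reopens its SQLite
    # session before concluding an object truly does not exist, so this branch
    # should only be reached for hash keys genuinely absent from the container.
    print('hash key not found in the container')
```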
I encountered the same issue when I updated to aiida-1.6.3.

from aiida import orm
from aiida.common.datastructures import CalcInfo, CodeInfo, CodeRunMode
from aiida.engine import CalcJob


class AtatMcsqsCalculation(CalcJob):
    """Calculation job of mcsqs."""

    @classmethod
    def define(cls, spec):
        # yapf: disable
        super().define(spec)
        spec.input('rndstr', valid_type=orm.SinglefileData,
                   help='rndstr.in')
        spec.input('sqscell', valid_type=orm.SinglefileData,
                   help='sqscell.out')
        spec.input('code_corrdump', valid_type=orm.Code,
                   help='The `Code` to run corrdump before mcsqs.')
        spec.output('bestsqs', valid_type=orm.SinglefileData, help='sqs structure file')
        spec.output('bestcorr', valid_type=orm.SinglefileData, help='bestcorr of sqs')
        spec.input('metadata.options.parser_name', valid_type=str, default='atat.mcsqs')
        spec.input('metadata.options.resources', valid_type=dict, default={'num_machines': 1})
        spec.input('metadata.options.max_wallclock_seconds', valid_type=int, default=1800, required=True)
        spec.input('metadata.options.input_rndstr_filename', valid_type=str, default='rndstr.in')
        spec.input('metadata.options.input_sqscell_filename', valid_type=str, default='sqscell.out')
        spec.input('metadata.options.output_bestsqs_filename', valid_type=str, default='bestsqs.out')
        spec.input('metadata.options.output_bestcorr_filename', valid_type=str, default='bestcorr.out')

    def prepare_for_submission(self, folder):
        """In preparing the inputs of mcsqs, first generate a clusters.out which
        contains the clusters information, and then run `mcsqs`."""
        with folder.open(self.options.input_rndstr_filename, 'w', encoding='utf8') as handle:
            handle.write(self.inputs.rndstr.get_content())

        helpercodeinfo = CodeInfo()
        helpercodeinfo.code_uuid = self.inputs.code_corrdump.uuid
        helpercodeinfo.cmdline_params = ['-ro', '-noe', '-nop', '-clus', '-2=1.1', f'-l={self.options.input_rndstr_filename}']
        helpercodeinfo.withmpi = False

        with folder.open(self.options.input_sqscell_filename, 'w', encoding='utf8') as handle:
            handle.write(self.inputs.sqscell.get_content())

        codeinfo = CodeInfo()
        codeinfo.code_uuid = self.inputs.code.uuid
        codeinfo.cmdline_params = ['-rc', '-sd=1234']
        codeinfo.withmpi = False

        calcinfo = CalcInfo()
        calcinfo.codes_info = [helpercodeinfo, codeinfo]
        calcinfo.codes_run_mode = CodeRunMode.SERIAL
        calcinfo.retrieve_list = [self.options.output_bestcorr_filename, self.options.output_bestsqs_filename]
        return calcinfo

I stored two …
@unkcpz I take it you mean you are on …
One thing I also noted in the tests is this warning:
So a possible reason is that you are not closing the SQLite connection (i.e. closing the file) after accessing the repository.
whenever we use the container in aiida, it may be prudent to do it in this manner (see the sketch below)
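A minimal sketch of the "close after use" pattern being suggested, using the plain disk-objectstore API; whether Container also supports the context-manager protocol depends on the disk-objectstore version, so an explicit close() is used here, and the path and hash key are placeholders:

```python
from disk_objectstore import Container

container = Container('/path/to/container')  # placeholder path
hashkey = 'deadbeef' * 8                     # hypothetical hash key
try:
    content = container.get_object_content(hashkey)
finally:
    # Release the underlying SQLite connection (and any open pack files) so
    # another thread never inherits a connection it did not create.
    container.close()
```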
Describe the bug
When running many simulations with the daemon (many concurrent aiida-quantumespresso WorkChains) with the new version of AiiDA (current develop, after 1.6.3, with the new repository merged), I get a lot of these errors in my daemon log file:
that are interleaved with lines like this:
(currently I have over 6000 such lines from the past 24h; note that I've been running ~15000 processes in the same time frame).
Steps to reproduce
Steps to reproduce the behavior:
Expected behavior
I'm not sure if it's easy to double-check whether these errors actually created problems in storing or retrieving data (simulations seem to be running fine, but there are many, so I didn't do a thorough check).
However, they clearly show a problem.
The strange thing is:
So it's not clear to me why we get these errors, and what is triggering them.
Your environment