Query much slower when using two filters instead of one #4856

Open · danieleongari opened this issue Apr 13, 2021 · 2 comments
danieleongari commented Apr 13, 2021

Hi,
when looping on:

from aiida.orm import QueryBuilder, CifData, WorkChainNode

for cif in cif_list:
    qb = QueryBuilder()
    qb.append(CifData, filters={'id': cif.pk}, tag='cif')
    qb.append(WorkChainNode, with_incoming='cif',
              filters={
                  'attributes.process_label': 'Cp2kPhonopyWorkChain',
                  'label': 'phonopy_test_1',  # adding this filter bumps the loop time from 25 s to 25 min
              })
    qball_check = qb.all()

the query becomes extremely slow when using two filters.
With just one filter the timing is acceptable: to give an idea, for a list of 1000 entries the loop takes ca. 25 minutes instead of ca. 25 seconds (a 60x slowdown)!
As extra info, only some of the materials in the list have this outgoing WorkChain, but judging from the tqdm progress bar I'm using, the time spent on each material is the same.
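As an aside, one way to sidestep the per-item loop entirely would be to batch all the pks into a single query using an 'in' filter; a minimal sketch of what I mean (untested on this DB, and assuming cif_list holds loaded CifData nodes):

from aiida.orm import QueryBuilder, CifData, WorkChainNode

# Collect all pks once, instead of issuing one query per CifData node.
pks = [cif.pk for cif in cif_list]

qb = QueryBuilder()
qb.append(CifData, filters={'id': {'in': pks}}, tag='cif', project='id')
qb.append(WorkChainNode, with_incoming='cif',
          filters={
              'attributes.process_label': 'Cp2kPhonopyWorkChain',
              'label': 'phonopy_test_1',
          },
          project='*')

# Each row is [cif_pk, workchain_node]; group by cif pk afterwards if needed.
results = qb.all()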

I recently updated to python-3.8.8/aiida-core-1.6.1 and noticed this weird behaviour of the QueryBuilder.
I did not have the chance to try it with version 1.5, which I was using before, so I'm not sure it is version specific, but I certainly did similar queries with multiple filters before and they were never this slow.

fyi @ltalirz

Edit (leo): the size of the db is 689681 nodes

ltalirz commented Apr 13, 2021

Some more experiments on the individual query:

In [1]: pk=522398

In [5]: def query(pk, fast=False):
   ...:     if fast:
   ...:         filters = {'attributes.process_label': 'Cp2kPhonopyWorkChain'}
   ...:     else:
   ...:         filters = {'attributes.process_label': 'Cp2kPhonopyWorkChain', 'label': 'phonopy_test_1'}
   ...:     qb = QueryBuilder()
   ...:     qb.append(Node, filters={'id': pk}, tag='cif')
   ...:     qb.append(WorkChainNode, with_incoming='cif', filters=filters)
   ...:     print(qb.all())

In [8]: time query(pk,fast=True)
[[<WorkChainNode: uuid: 6708952e-3433-4da3-84f0-198e73d2507f (pk: 613039) (aiida_lsmo.workchains.cp2k_phonopy.Cp2kPhonopyWorkChain)>]]
CPU times: user 8.74 ms, sys: 0 ns, total: 8.74 ms
Wall time: 9.48 ms

In [9]: time query(pk,fast=False)
[[<WorkChainNode: uuid: 6708952e-3433-4da3-84f0-198e73d2507f (pk: 613039) (aiida_lsmo.workchains.cp2k_phonopy.Cp2kPhonopyWorkChain)>]]
CPU times: user 10.9 ms, sys: 0 ns, total: 10.9 ms
Wall time: 1.26 s

If I'm interpreting the difference between CPU time (of the Python process) and wall time correctly (let me know if I'm wrong), the first function call waits roughly ~1 ms for postgresql to return the query, while the second call waits ~1.25 s, i.e. the second postgresql query is roughly a factor of 1000 slower.
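For reference, the same CPU-vs-wall split can be measured outside IPython with the standard library alone; a minimal sketch (the timing helper is mine, not part of AiiDA):

import time

from aiida.orm import QueryBuilder, Node, WorkChainNode

def timed_query(pk, filters):
    qb = QueryBuilder()
    qb.append(Node, filters={'id': pk}, tag='cif')
    qb.append(WorkChainNode, with_incoming='cif', filters=filters)
    cpu0, wall0 = time.process_time(), time.perf_counter()
    results = qb.all()
    cpu1, wall1 = time.process_time(), time.perf_counter()
    # Wall time not accounted for by Python CPU time is (mostly) time spent waiting for postgresql.
    print(f'cpu: {cpu1 - cpu0:.4f} s, wall: {wall1 - wall0:.4f} s, '
          f'waiting: {(wall1 - wall0) - (cpu1 - cpu0):.4f} s, rows: {len(results)}')
    return results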

@giovannipizzi is this expected?

giovannipizzi commented

I doubt this is due to a change in the AiiDA code; most probably it is due to an increase in the DB size (if I had to bet).
And yes, I think your analysis is correct: it's simply the query itself that got much slower in the DB.

This is clearly an "issue" at the DB level.
Some hints for debugging:

  1. Print the query (print(qb)) to see the raw SQL. The slow one looks like this:

SELECT db_dbnode_1.id, db_dbnode_1.uuid, db_dbnode_1.node_type, db_dbnode_1.process_type, db_dbnode_1.label, db_dbnode_1.description, db_dbnode_1.ctime, db_dbnode_1.mtime, db_dbnode_1.attributes, db_dbnode_1.extras, db_dbnode_1.user_id, db_dbnode_1.dbcomputer_id
FROM db_dbnode AS db_dbnode_2
JOIN db_dblink AS db_dblink_1 ON db_dblink_1.input_id = db_dbnode_2.id
JOIN db_dbnode AS db_dbnode_1 ON db_dblink_1.output_id = db_dbnode_1.id
WHERE CAST(db_dbnode_2.node_type AS VARCHAR) LIKE '%%' AND db_dbnode_2.id = 1 AND
CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'process.workflow.workchain.%%' AND
CASE WHEN (jsonb_typeof(db_dbnode_1.attributes #> %(attributes_1)s) = 'string') THEN (db_dbnode_1.attributes #>> '{process_label}') = 'Cp2kPhonopyWorkChain' ELSE false END
AND db_dbnode_1.label = 'phonopy_test_1'

and the fast one is identical, just without the last line (AND db_dbnode_1.label = 'phonopy_test_1').

  2. Report the number of results returned by your query, as this can strongly affect the timings.

  3. In general, avoid benchmarking functions that print (just store the results in a list in memory): if the result list is long, a lot of the time might go into printing alone.

  4. Given the above, it's a question of how the query is executed. Note: SQL engines (PostgreSQL here) do a lot of "guessing" (well, it's called query optimisation :-) ) to decide in which order to apply the filters, trying to predict what will be faster. This depends on the data, so it is not an "exact science": the very same query on a slightly different DB might behave very differently, and it's hard to provide a "fix" in AiiDA that would work in all cases. To check what the query is actually doing, run it in psql with EXPLAIN ANALYZE in front (see the sketch after this list), or use e.g. the suggestion in QueryBuilder: add shortcut to EXPLAIN ANALYZE #3845 (BTW: I really think we should get that code as a method of the QueryBuilder! :-) ). I can run it on my DB, but since I have zero matching results the output is not really useful; yours might behave differently, so it would be great if you could provide your results.

  5. It would be good to double-check that the fields you're querying on are indexed. The label should be, according to the AiiDA schema (worth verifying in psql, though; the sketch after this list also prints the indexes); the attributes are not. It's a bit strange that the query gets slower when adding a filter on an indexed field, but there are a few joins involved and the DB might decide to reorder parts of the query, or, even if an index exists, it might decide that a sequential scan is going to be faster anyway.

  6. The heuristics discussed above depend on the statistics the DB collects on its actual content, and these might be outdated. In addition, if you have been deleting data from your DB, the data on disk might be arranged in a non-optimal way. Therefore, after doing all the tests above, I suggest stopping all daemons/verdi shells, entering psql for that profile, and running VACUUM ANALYZE. Then go back to the verdi shell, repeat the timing (maybe a few times), and report whether it changed (and whether the query plan changed).
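To make points 4 and 5 concrete, here is a minimal sketch that runs EXPLAIN ANALYZE and lists the indexes on db_dbnode directly from Python. It assumes psycopg2 is installed; the connection parameters and the simplified SQL string are placeholders, to be replaced with your profile's settings and the actual output of print(qb):

import psycopg2

# Placeholder connection parameters: take the real ones from your AiiDA profile.
conn = psycopg2.connect(dbname='aiida_db', user='aiida', host='localhost', password='...')
cur = conn.cursor()

# Simplified stand-in for the SQL printed by print(qb), with literal values substituted.
slow_sql = """
SELECT db_dbnode_1.id FROM db_dbnode AS db_dbnode_2
JOIN db_dblink AS db_dblink_1 ON db_dblink_1.input_id = db_dbnode_2.id
JOIN db_dbnode AS db_dbnode_1 ON db_dblink_1.output_id = db_dbnode_1.id
WHERE db_dbnode_2.id = 522398
AND db_dbnode_1.node_type LIKE 'process.workflow.workchain.%'
AND db_dbnode_1.attributes ->> 'process_label' = 'Cp2kPhonopyWorkChain'
AND db_dbnode_1.label = 'phonopy_test_1'
"""

# EXPLAIN ANALYZE actually executes the query and reports the chosen plan with real timings.
cur.execute('EXPLAIN ANALYZE ' + slow_sql)
for (line,) in cur.fetchall():
    print(line)

# List the indexes on db_dbnode, to check whether 'label' is indexed (point 5).
cur.execute("SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'db_dbnode'")
for name, definition in cur.fetchall():
    print(name, '->', definition)

cur.close()
conn.close()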

Hope this helps!
