Query much slower when using two filters instead of one #4856

Open · danieleongari opened this issue Apr 13, 2021 · 2 comments
danieleongari commented Apr 13, 2021

Hi,
when looping on:

from aiida.orm import QueryBuilder, CifData, WorkChainNode

for cif in cif_list:
    qb = QueryBuilder()
    qb.append(CifData, filters={'id': cif.pk}, tag='cif')
    qb.append(WorkChainNode, with_incoming='cif',
              filters={
                  'attributes.process_label': 'Cp2kPhonopyWorkChain',
                  'label': 'phonopy_test_1',  # adding this filter bumps the loop time from 25 s to 25 min
              })
    qball_check = qb.all()

the query becomes extremely slow when using two filters.
With just one filter the timing is acceptable: to give an idea, for a list of 1000 entries the loop takes ca. 25 minutes instead of ca. 25 seconds (a 60x slowdown)!
As extra info, only some of the materials in the list have this outgoing WorkChain, but judging from the tqdm progress bar I'm using, the time spent on each material is the same.
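As an aside, one way to sidestep the per-item loop entirely would be to batch all the pks into a single query using an 'in' filter; a minimal sketch of what I mean (untested on this DB, and assuming cif_list holds loaded CifData nodes):

from aiida.orm import QueryBuilder, CifData, WorkChainNode

# Collect all pks once, instead of issuing one query per CifData node.
pks = [cif.pk for cif in cif_list]

qb = QueryBuilder()
qb.append(CifData, filters={'id': {'in': pks}}, tag='cif', project='id')
qb.append(WorkChainNode, with_incoming='cif',
          filters={
              'attributes.process_label': 'Cp2kPhonopyWorkChain',
              'label': 'phonopy_test_1',
          },
          project='*')

# Each row is [cif_pk, workchain_node]; group by cif pk afterwards if needed.
results = qb.all()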

I recently updated to python-3.8.8/aiida-core-1.6.1 and noticed this weird behaviour of the QueryBuilder.
I did not have the chance to try it with version 1.5, which I was using before, so I'm not sure it is version specific, but I certainly did similar queries with multiple filters before and they were never this slow.

fyi @ltalirz

Edit (leo): the size of the db is 689681 nodes

ltalirz commented Apr 13, 2021

Some more experiments on the individual query:

In [1]: pk=522398

In [5]: def query(pk, fast=False):
   ...:     if fast:
   ...:         filters = {'attributes.process_label': 'Cp2kPhonopyWorkChain'}
   ...:     else:
   ...:         filters = {'attributes.process_label': 'Cp2kPhonopyWorkChain', 'label': 'phonopy_test_1'}
   ...:     qb = QueryBuilder()
   ...:     qb.append(Node, filters={'id': pk}, tag='cif')
   ...:     qb.append(WorkChainNode, with_incoming='cif', filters=filters)
   ...:     print(qb.all())

In [8]: time query(pk,fast=True)
[[<WorkChainNode: uuid: 6708952e-3433-4da3-84f0-198e73d2507f (pk: 613039) (aiida_lsmo.workchains.cp2k_phonopy.Cp2kPhonopyWorkChain)>]]
CPU times: user 8.74 ms, sys: 0 ns, total: 8.74 ms
Wall time: 9.48 ms

In [9]: time query(pk,fast=False)
[[<WorkChainNode: uuid: 6708952e-3433-4da3-84f0-198e73d2507f (pk: 613039) (aiida_lsmo.workchains.cp2k_phonopy.Cp2kPhonopyWorkChain)>]]
CPU times: user 10.9 ms, sys: 0 ns, total: 10.9 ms
Wall time: 1.26 s

If I'm interpreting the difference between CPU time (of the Python process) and wall time correctly (let me know if I'm wrong), the first function call waits roughly ~1 ms for postgresql to return the query, while the second call waits ~1.25 s, i.e. the second postgresql query is roughly a factor of 1000 slower.
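For reference, the same CPU-vs-wall split can be measured outside IPython with the standard library alone; a minimal sketch (the timing helper is mine, not part of AiiDA):

import time

from aiida.orm import QueryBuilder, Node, WorkChainNode

def timed_query(pk, filters):
    qb = QueryBuilder()
    qb.append(Node, filters={'id': pk}, tag='cif')
    qb.append(WorkChainNode, with_incoming='cif', filters=filters)
    cpu0, wall0 = time.process_time(), time.perf_counter()
    results = qb.all()
    cpu1, wall1 = time.process_time(), time.perf_counter()
    # Wall time not accounted for by Python CPU time is (mostly) time spent waiting for postgresql.
    print(f'cpu: {cpu1 - cpu0:.4f} s, wall: {wall1 - wall0:.4f} s, '
          f'waiting: {(wall1 - wall0) - (cpu1 - cpu0):.4f} s, rows: {len(results)}')
    return results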

@giovannipizzi is this expected?

giovannipizzi commented

I doubt this is due to a change in the AiiDA code; most probably it is due to an increase in the DB size (if I had to bet).
And yes, I think your analysis is correct: it's simply the query itself that got much slower in the DB.

This is clearly an "issue" at the DB level.
Some hints for debugging:

  1. Print the query (print(qb)) to see the raw SQL. The slow one looks like this:

SELECT db_dbnode_1.id, db_dbnode_1.uuid, db_dbnode_1.node_type, db_dbnode_1.process_type, db_dbnode_1.label, db_dbnode_1.description, db_dbnode_1.ctime, db_dbnode_1.mtime, db_dbnode_1.attributes, db_dbnode_1.extras, db_dbnode_1.user_id, db_dbnode_1.dbcomputer_id
FROM db_dbnode AS db_dbnode_2
JOIN db_dblink AS db_dblink_1 ON db_dblink_1.input_id = db_dbnode_2.id
JOIN db_dbnode AS db_dbnode_1 ON db_dblink_1.output_id = db_dbnode_1.id
WHERE CAST(db_dbnode_2.node_type AS VARCHAR) LIKE '%%' AND db_dbnode_2.id = 1 AND
CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'process.workflow.workchain.%%' AND
CASE WHEN (jsonb_typeof(db_dbnode_1.attributes #> %(attributes_1)s) = 'string') THEN (db_dbnode_1.attributes #>> '{process_label}') = 'Cp2kPhonopyWorkChain' ELSE false END
AND db_dbnode_1.label = 'phonopy_test_1'

and the fast one is identical, just without the last line (AND db_dbnode_1.label = 'phonopy_test_1').

  2. Report the number of results returned by your query, as this can strongly affect the timings.

  3. In general, avoid benchmarking functions that print (just store the results in a list in memory): if the result list is long, a lot of the time might go into printing alone.

  4. Given the above, it's a question of how the query is executed. Note: SQL engines (PostgreSQL here) do a lot of "guessing" (well, it's called query optimisation :-) ) to decide in which order to apply the filters, trying to predict what will be faster. This depends on the data, so it is not an "exact science": the very same query on a slightly different DB might behave very differently, and it's hard to provide a "fix" in AiiDA that would work in all cases. To check what the query is actually doing, run it in psql with EXPLAIN ANALYZE in front (see the sketch after this list), or use e.g. the suggestion in QueryBuilder: add shortcut to EXPLAIN ANALYZE #3845 (BTW: I really think we should get that code as a method of the QueryBuilder! :-) ). I can run it on my DB, but since I have zero matching results the output is not really useful; yours might behave differently, so it would be great if you could provide your results.

  5. It would be good to double-check that the fields you're querying on are indexed. The label should be, according to the AiiDA schema (worth verifying in psql, though; the sketch after this list also prints the indexes); the attributes are not. It's a bit strange that the query gets slower when adding a filter on an indexed field, but there are a few joins involved and the DB might decide to reorder parts of the query, or, even if an index exists, it might decide that a sequential scan is going to be faster anyway.

  6. The heuristics discussed above depend on the statistics the DB collects on its actual content, and these might be outdated. In addition, if you have been deleting data from your DB, the data on disk might be arranged in a non-optimal way. Therefore, after doing all the tests above, I suggest stopping all daemons/verdi shells, entering psql for that profile, and running VACUUM ANALYZE. Then go back to the verdi shell, repeat the timing (maybe a few times), and report whether it changed (and whether the query plan changed).
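To make points 4 and 5 concrete, here is a minimal sketch that runs EXPLAIN ANALYZE and lists the indexes on db_dbnode directly from Python. It assumes psycopg2 is installed; the connection parameters and the simplified SQL string are placeholders, to be replaced with your profile's settings and the actual output of print(qb):

import psycopg2

# Placeholder connection parameters: take the real ones from your AiiDA profile.
conn = psycopg2.connect(dbname='aiida_db', user='aiida', host='localhost', password='...')
cur = conn.cursor()

# Simplified stand-in for the SQL printed by print(qb), with literal values substituted.
slow_sql = """
SELECT db_dbnode_1.id FROM db_dbnode AS db_dbnode_2
JOIN db_dblink AS db_dblink_1 ON db_dblink_1.input_id = db_dbnode_2.id
JOIN db_dbnode AS db_dbnode_1 ON db_dblink_1.output_id = db_dbnode_1.id
WHERE db_dbnode_2.id = 522398
AND db_dbnode_1.node_type LIKE 'process.workflow.workchain.%'
AND db_dbnode_1.attributes ->> 'process_label' = 'Cp2kPhonopyWorkChain'
AND db_dbnode_1.label = 'phonopy_test_1'
"""

# EXPLAIN ANALYZE actually executes the query and reports the chosen plan with real timings.
cur.execute('EXPLAIN ANALYZE ' + slow_sql)
for (line,) in cur.fetchall():
    print(line)

# List the indexes on db_dbnode, to check whether 'label' is indexed (point 5).
cur.execute("SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'db_dbnode'")
for name, definition in cur.fetchall():
    print(name, '->', definition)

cur.close()
conn.close()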

Hope this helps!
