sqlalchemy: adding 1 filter makes query very slow #1389

Open
ltalirz opened this issue Apr 9, 2018 · 1 comment
Labels: topic/database-sqlalchemy, type/performance (issue related to how quickly AiiDA works)

Comments

ltalirz (Member) commented Apr 9, 2018

Consider the following AiiDA query:

from aiida.orm.querybuilder import QueryBuilder
from aiida.orm.data.cif import CifData
from aiida.orm.data.parameter import ParameterData

qb = QueryBuilder()
qb.append(CifData, project=['uuid'], tag='cifs')

filters = {'attributes.deliverable_capacity': {'>=': 175}}
project = ['attributes.density', 'attributes.deliverable_capacity', 'attributes.surface_area', 'uuid', 'attributes.name']
qb.append(ParameterData, filters=filters, project=project, descendant_of='cifs')

print(qb)
result = qb.all()

Using the CoF database available here.
In short, we filter ParameterData nodes by the deliverable_capacity attribute, requiring it to be larger than 175 (and also project the uuid of the CifData node that each ParameterData descends from).
This query runs in ~2 s on 70k ParameterData and CifData nodes on my machine.
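For reference, a minimal sketch for timing the call; it simply wraps the qb.all() from the snippet above:

import time

start = time.time()
result = qb.all()
print('{} rows in {:.1f} s'.format(len(result), time.time() - start))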

If instead I want to query for deliverable_capacity in the range from 175 to 300, i.e. if I replace the filter by

filters = {
    'attributes.deliverable_capacity':
        {'and': [{'>=': 175}, {'<': 300}]}
}

the query suddenly takes 5 minutes.

Below I paste the generated SQL for both queries:

>= 175

WITH RECURSIVE anon_6(ancestor_id, descendant_id, depth) AS
(SELECT db_dblink_1.input_id AS ancestor_id, db_dblink_1.output_id AS descendant_id, CAST(0 AS INTEGER) AS depth
FROM db_dbnode AS db_dbnode_3 JOIN db_dblink AS db_dblink_1 ON db_dblink_1.input_id = db_dbnode_3.id
WHERE CAST(db_dbnode_3.type AS VARCHAR) LIKE 'data.cif.%' AND db_dblink_1.type IN ('createlink', 'inputlink') UNION ALL SELECT anon_6.ancestor_id AS ancestor_id, db_dblink_2.output_id AS descendant_id, anon_6.depth + CAST(1 AS INTEGER) AS current_depth
FROM anon_6 JOIN db_dblink AS db_dblink_2 ON db_dblink_2.input_id = anon_6.descendant_id
WHERE db_dblink_2.type IN ('createlink', 'inputlink'))
 SELECT db_dbnode_1.uuid, db_dbnode_2.attributes #> '{density}' AS anon_1, db_dbnode_2.attributes #> '{deliverable_capacity}' AS anon_2, db_dbnode_2.attributes #> '{surface_area}' AS anon_3, db_dbnode_2.uuid, db_dbnode_2.attributes #> '{name}' AS anon_4
FROM db_dbnode AS db_dbnode_1 JOIN anon_6 AS anon_5 ON anon_5.ancestor_id = db_dbnode_1.id JOIN db_dbnode AS db_dbnode_2 ON anon_5.descendant_id = db_dbnode_2.id
WHERE CAST(db_dbnode_1.type AS VARCHAR) LIKE 'data.cif.%' AND CAST(db_dbnode_2.type AS VARCHAR) LIKE 'data.parameter.%' AND jsonb_typeof(db_dbnode_2.attributes #> %(attributes_1)s) = 'number' AND CAST(db_dbnode_2.attributes #>> '{deliverable_capacity}' AS FLOAT) >= 175

>= 175, < 300

WITH RECURSIVE anon_6(ancestor_id, descendant_id, depth) AS
(SELECT db_dblink_1.input_id AS ancestor_id, db_dblink_1.output_id AS descendant_id, CAST(0 AS INTEGER) AS depth
FROM db_dbnode AS db_dbnode_3 JOIN db_dblink AS db_dblink_1 ON db_dblink_1.input_id = db_dbnode_3.id
WHERE CAST(db_dbnode_3.type AS VARCHAR) LIKE 'data.cif.%' AND db_dblink_1.type IN ('createlink', 'inputlink') UNION ALL SELECT anon_6.ancestor_id AS ancestor_id, db_dblink_2.output_id AS descendant_id, anon_6.depth + CAST(1 AS INTEGER) AS current_depth
FROM anon_6 JOIN db_dblink AS db_dblink_2 ON db_dblink_2.input_id = anon_6.descendant_id
WHERE db_dblink_2.type IN ('createlink', 'inputlink'))
 SELECT db_dbnode_1.uuid, db_dbnode_2.attributes #> '{density}' AS anon_1, db_dbnode_2.attributes #> '{deliverable_capacity}' AS anon_2, db_dbnode_2.attributes #> '{surface_area}' AS anon_3, db_dbnode_2.uuid, db_dbnode_2.attributes #> '{name}' AS anon_4
FROM db_dbnode AS db_dbnode_1 JOIN anon_6 AS anon_5 ON anon_5.ancestor_id = db_dbnode_1.id JOIN db_dbnode AS db_dbnode_2 ON anon_5.descendant_id = db_dbnode_2.id
WHERE CAST(db_dbnode_1.type AS VARCHAR) LIKE 'data.cif.%' AND CAST(db_dbnode_2.type AS VARCHAR) LIKE 'data.parameter.%' AND jsonb_typeof(db_dbnode_2.attributes #> %(attributes_1)s) = 'number' AND CAST(db_dbnode_2.attributes #>> '{deliverable_capacity}' AS FLOAT) >= 175 AND jsonb_typeof(db_dbnode_2.attributes #> %(attributes_2)s) = 'number' AND CAST(db_dbnode_2.attributes #>> '{deliverable_capacity}' AS FLOAT) < 300

The only difference between the two queries is in the last line of the WHERE clause.
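One way to see why the extra predicate is so costly is to compare the two execution plans. A minimal sketch, assuming psycopg2 and made-up connection settings; sql is assumed to hold the query text printed by print(qb) with the %(attributes_1)s-style parameters substituted:

import psycopg2

# Print the planner's choices for one of the two query variants.
# The connection settings below are assumptions; adjust them to your setup.
conn = psycopg2.connect(dbname='aiidadb', user='aiida')
with conn.cursor() as cur:
    cur.execute('EXPLAIN ANALYZE ' + sql)  # sql: the generated query, parameters filled in
    print('\n'.join(row[0] for row in cur.fetchall()))
conn.close()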

P.S. I'm running aiida 0.11.4 and postgres 9.5.9

giovannipizzi (Member) commented

What is the speed if you just do this filter?

filters = {
    'attributes.deliverable_capacity':
        {'<': 300}
}
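For concreteness, the full test being suggested, as a sketch that re-uses the CifData/ParameterData query and the project list from the report above:

qb = QueryBuilder()
qb.append(CifData, project=['uuid'], tag='cifs')
qb.append(ParameterData,
          filters={'attributes.deliverable_capacity': {'<': 300}},
          project=project, descendant_of='cifs')
result = qb.all()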

Also, as a second, independent test (do it afterwards, though): could you run a VACUUM ANALYZE on the relevant tables in your DB (from psql), and see whether the speed changes afterwards?
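The same maintenance step expressed from Python, as a sketch assuming psycopg2 and made-up connection settings (from psql it is simply VACUUM ANALYZE db_dbnode; and VACUUM ANALYZE db_dblink;):

import psycopg2

# VACUUM cannot run inside a transaction block, so autocommit must be enabled.
conn = psycopg2.connect(dbname='aiidadb', user='aiida')  # assumed settings
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute('VACUUM ANALYZE db_dbnode;')
    cur.execute('VACUUM ANALYZE db_dblink;')
conn.close()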

@giovannipizzi giovannipizzi added this to the v1.1.0 milestone Dec 3, 2018
@sphuber sphuber removed this from the v1.1.0 milestone Feb 28, 2020
@chrisjsewell chrisjsewell added the type/performance Issue related to how quickly AiiDA works label Apr 7, 2020