Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 FIX: Import archive into large DB #5740

Merged
merged 8 commits into from
Nov 4, 2022

Conversation

chrisjsewell
Copy link
Member

As detailed in https://www.sqlite.org/limits.html, SQLITE_MAX_VARIABLE_NUMBER puts a limit on how many variables can be used in a single SQL query. This size can be easily reached, if filtering by nodes in a large database, leading to:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) too many SQL variables
[SQL: SELECT db_dbnode_1.id, db_dbnode_1.uuid, db_dbnode_1.node_type, db_dbnode_1.process_type, db_dbnode_1.label, db_dbnode_1.description, db_dbnode_1.ctime, db_dbnode_1.mtime, JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.attributes, ?)) AS anon_1, JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.extras, ?)) AS anon_2, db_dbnode_1.repository_metadata, db_dbnode_1.dbcomputer_id, db_dbnode_1.user_id 
FROM db_dbnode AS db_dbnode_1 
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE ? ESCAPE '\' AND (db_dbnode_1.uuid NOT IN (?, ?, ....

Therefore, this commit changes the filtering of UUIDs to be on the client side, then batches queries for the full node fields by a fixed number (filter_size).

As detailed in https://www.sqlite.org/limits.html, SQLITE_MAX_VARIABLE_NUMBER puts a limit on how many variables can be used in a single SQL query.
This size can be easily reached, if filtering by nodes in a large database.
Therefore, this commit changes the filtering of UUIDs to be on the client side,
then batches queries for the full nodes by a fixed number.
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chrisjsewell apart from the comments in the code, some other questions:

  • Given that this is fixing a problem for importing from large sqlite archives, does the same problem not also likely exist for exporting to/from sqlite database? I think that since these are often temporary databases the exporting-from is unlikely, but exporting-to would actually be a common use case, would it not?
  • I presume you have tested this manually? Would it be impossible or too costly to implement a test? Or can we have a test that reproduces this, simply by creating one with >999 nodes and importing that?

def import_archive(
path: Union[str, Path],
*,
archive_format: Optional[ArchiveFormatAbstract] = None,
filter_size: int = 999,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be useful to put a comment here (or in the commit message) saying that this is the default limit in versions prior to 3.32.0 (2020-05-22)

aiida/tools/archive/imports.py Outdated Show resolved Hide resolved
@@ -117,6 +129,7 @@ def import_archive(
type_check(test_run, bool)
backend = backend or get_manager().get_profile_storage()
type_check(backend, StorageBackend)
qparams = QueryParams(batch_size=batch_size, filter_size=filter_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: why not just have query_params for increased legibility. The few extra characters shouldn't hurt, should they?

).iterdict(batch_size=batch_size)

# collect the unique entities from the input backend to be added to the output backend
ufields = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that now here you read in all the unique fields for this entity into memory from the input backend, whereas before this was directly streamed into the bulk_insert. Could this lead to memory problems, given that this very fix is to deal with large imports?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well only the those that are not already in the output backend, e.g. it will be a list of all node UUIDs that are in the archive but not in the profile.
So no I don't think this is a particularly big hit on memory usage, because you are only reading the UUIDs, as opposed to the full node content (attributes etc)

@chrisjsewell
Copy link
Member Author

Given that this is fixing a problem for importing from large sqlite archives

To clarify, this is a problem importing into large profiles, not from large archives

@sphuber
Copy link
Contributor

sphuber commented Nov 4, 2022

To clarify, this is a problem importing into large profiles, not from large archives

I see, I didn't get that clearly. Could you perhaps add more information in the comments in the relevant code changes and the commit message?

And if it is an sqlite limitation, does that mean there is something with a large sqlite database that they are importing data to? This should anyway not be recommended, should it? Where did you come across this bug?

Looking at the code, I don't really understand why the problem is with the target backend. The only point where the target backend (backend_to) is concerned is in the backend_to.bulk_insert call, but that code is untouched. Just before it, you change the batch_iter to batch on filter_size instead of batch_size but that still operates on backend_from. On top of that, filter_size is just 1 smaller compared to batch_size. So if the problem was there, was it really the problem that batch_size=1000 was 1 too big?

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Nov 4, 2022

This reproduces the problem, and is fixed after the PR, for python 3.8 from conda-forge, including sqlite v3.39.4 (h9ae0607_0). you'll note that I couldn't actually recreate it until there were 300,000 nodes (at least 200,000 was fine), so perhaps SQLITE_MAX_VARIABLE_NUMBER is bigger here 🤷

The test takes ~105 seconds to run, so maybe not great to include

from aiida import orm, manage, tools
from aiida.tools.archive import create_archive, import_archive


def test_import_into_large_profile(aiida_profile_clean, tmp_path):
    backend = manage.get_manager().get_profile_storage()
    user_id = backend.default_user.id
    pks = backend.bulk_insert(orm.EntityTypes.NODE, [{"user_id": user_id} for _ in range(300_000)], allow_defaults=True)
    assert orm.QueryBuilder().append(orm.Node).count() == 300_000
    create_archive(None, filename=tmp_path / 'archive.aiida')
    tools.delete_nodes([pks[0]], dry_run=False)
    assert orm.QueryBuilder().append(orm.Node).count() == 300_000 - 1
    import_archive(tmp_path / 'archive.aiida')

@chrisjsewell
Copy link
Member Author

chrisjsewell commented Nov 4, 2022

Looking at the code, I don't really understand why the problem is with the target backend.

@sphuber its not a problem with the target backend, its when the target backend and input backend have lots of nodes in common, so that backend_unique_id below is very long, and you hit the parameter limit for the (input) SQLITE db query

       filters={
            unique_field: {
                '!in': list(backend_unique_id)
            }
        } if backend_unique_id else {},

@sphuber
Copy link
Contributor

sphuber commented Nov 4, 2022

This reproduces the problem, and is fixed after the PR, for python 3.8 from conda-forge

Cheers, appreciate that. Confirmed that it also breaks for me locally. I agree that we don't want to add this in the unit tests 😅

its not a problem with the target backend, its when the target backend and input backend have lots of nodes in common, so that backend_unique_id below is very long, and you hit the parameter limit for the (input) SQLITE db query

Thanks, that really helps, I now understand the problem. Think it would be great to add a comment at the line with for nrows, ufields_batch in batch_iter(ufields, qparams.filter_size): saying something to the effect of:

Batch the bulk insert in batch sizes of query_params.filter_size since the query is filtered on ufields_batch. If ufields is large and would not be batched, the maximum number of query variables that certain database backends impose (such as sqlite) can be exceeded.

Last thing: I noticed you change qparams to query_params (thanks for that) but only did a few. If you can grep-replace the remaining and add the comment described above, I will approve.

@chrisjsewell
Copy link
Member Author

Where did you come across this bug?

it was reported to me independently by @Crivella and @azadoks

@chrisjsewell chrisjsewell requested a review from sphuber November 4, 2022 17:29
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good, thanks @chrisjsewell

@chrisjsewell chrisjsewell merged commit 899471a into aiidateam:main Nov 4, 2022
@chrisjsewell chrisjsewell deleted the fix-archive-import branch November 4, 2022 18:15
@chrisjsewell
Copy link
Member Author

Thanks for the review 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants