export archive format forces reading of data.json file to memory #493
I started looking into this issue as well. It turns out that while JSON is not really meant to be streamed, in practice people have to deal with large JSON files, and there actually are a couple of JSON stream-parsing libraries. The one I'd probably go for is ijson - this shows how to iterate over the groups array in metadata.json:

```python
import ijson

f = open('metadata.json', 'r')
objects = ijson.items(f, 'export_parameters.entities_starting_set.Group.item')
for uuid in objects:
    print(uuid)
```

It has built-in support for iterating over list items at arbitrary places in the JSON via the prefix notation above. However, for some reason, iterating over the keys of a dictionary (which we need for the node attributes) is not supported. I'll look into how to do this if I find the time.
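As a rough sketch (not from the original comment): ijson's low-level `ijson.parse` event stream can already be used to walk over dictionary keys by hand, by filtering for `map_key` events under the desired prefix. The file name and the `node_attributes` prefix are taken from the follow-up comment below.

```python
import ijson

# Sketch: iterate over the keys of a top-level dictionary using
# ijson's event stream. ijson.parse() yields (prefix, event, value)
# tuples; a 'map_key' event at prefix 'node_attributes' carries one
# dictionary key (here: a node id) at a time.
with open('data.json', 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        if prefix == 'node_attributes' and event == 'map_key':
            print(value)  # one key; its value follows as nested events
```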
Support has been added in ICRAR/ijson@d4cca87, i.e. one can now do

```python
import ijson

f = open('data.json', 'r')
for k, v in ijson.kvitems(f, 'node_attributes'):
    print(k, v)
```

In my test, the Python implementation takes about 0.2 ms per key-value pair (= per node).

There's one more issue - the layout of AiiDA's data.json, which splits the information belonging to a single node across several top-level dictionaries. There are several possible ways forward:

A) We stick with the current format and try to implement a (suboptimal and slightly complex) batch parser with three iterators going at once (see the sketch below).

B) We change the layout of data.json.

C) We switch to a new file format that is made for "seeking" and "slicing" (like HDF5).

Mentioning @giovannipizzi @sphuber for comment
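A minimal sketch of what option A could look like, using `ijson.kvitems` as above. The key names `export_data.Node` and `node_extras` are guesses (only `node_attributes` appears in this thread), `process_node` is a hypothetical handler, and the `zip` relies on the unverified assumption that all three dictionaries enumerate nodes in the same order - which is part of why this option is suboptimal.

```python
import ijson

# Sketch of option A: three independent kvitems iterators over the
# same file, advanced in lockstep. Key names other than
# 'node_attributes' are hypothetical.
handles = [open('data.json', 'rb') for _ in range(3)]
nodes = ijson.kvitems(handles[0], 'export_data.Node')
attributes = ijson.kvitems(handles[1], 'node_attributes')
extras = ijson.kvitems(handles[2], 'node_extras')

for (pk, fields), (_, attrs), (_, node_extras) in zip(nodes, attributes, extras):
    # Import one node at a time; peak memory stays bounded by one record.
    process_node(pk, fields, attrs, node_extras)  # hypothetical handler

for h in handles:
    h.close()
```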
I would definitely go for B, but possibly taking into account also other requirements, rather than just this one, in the redesign of the export format (which could end up in solutions like C). If, however, implementing B is fast and C is (probably) going to take longer, it's OK to start working on B.
Just as a comment: there is a new format called JSON Lines (or "newline-delimited JSON"), which is essentially one JSON document per line and is suitable for storing large numbers of nested data structures in one file (Google adopted it for BigQuery). See also the discussion on "Loading Data Efficiently" for binary file formats to consider in option C.
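A minimal sketch of the JSON Lines idea with toy data (not AiiDA's actual format): writing one complete JSON document per line lets a reader process records one at a time with memory bounded by the largest single record.

```python
import json

# Toy example of JSON Lines: each line is a complete JSON document.
records = {'uuid-1': {'energy': -1.2}, 'uuid-2': {'energy': -3.4}}

with open('node_attributes.jsonl', 'w') as f:
    for node_id, attrs in records.items():
        f.write(json.dumps({'id': node_id, 'attributes': attrs}) + '\n')

# Reading back: one json.loads per line, so memory use is bounded
# by the largest single record, not by the whole file.
with open('node_attributes.jsonl') as f:
    for line in f:
        record = json.loads(line)
        print(record['id'], record['attributes'])
```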
It seems to me that when doing a verdi import, such as

```
verdi import <export_file>.aiida
```

the content of <export_file>.aiida is first cached in RAM - if not entirely, at least for what concerns the database. As a consequence, this effectively limits the largest file that can be imported to the instantaneous amount of free RAM.
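For illustration, a sketch of the general problem (not AiiDA's actual import code): a whole-file `json.load` materializes the entire decoded document in memory, while a streaming parser keeps memory bounded by one record at a time.

```python
import json
import ijson

# Whole-file parse: peak memory scales with the size of the decoded
# document, which is what caps importable archives at available RAM.
with open('data.json', 'rb') as f:
    data = json.load(f)

# Streaming parse: only one key-value pair is held at a time.
with open('data.json', 'rb') as f:
    for node_id, attrs in ijson.kvitems(f, 'node_attributes'):
        pass  # handle one node's attributes, then let it be garbage-collected
```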