export archive format forces reading of data.json file to memory #493
I started looking into this issue as well. It turns out that while JSON is not really meant to be streamed, in practice people have to deal with large JSON files, and there actually are a couple of JSON stream-parsing libraries. The one I'd probably go for is ijson - this shows how to iterate over the groups array in metadata.json:

```python
import ijson

f = open('metadata.json', 'r')
objects = ijson.items(f, 'export_parameters.entities_starting_set.Group.item')
for uuid in objects:
    print(uuid)
```

It has built-in support for iterating over list items at arbitrary places in the JSON via the prefix notation above. However, for some reason, iterating over the keys of a dictionary (which we need for the node attributes) is not supported. I'll look into how to do this if I find the time.
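As a rough sketch (not from the original comment): ijson's low-level `ijson.parse` event stream can already be used to walk over dictionary keys by hand, by filtering for `map_key` events under the desired prefix. The file name and the `node_attributes` prefix are taken from the follow-up comment below.

```python
import ijson

# Sketch: iterate over the keys of a top-level dictionary using
# ijson's event stream. ijson.parse() yields (prefix, event, value)
# tuples; a 'map_key' event at prefix 'node_attributes' carries one
# dictionary key (here: a node id) at a time.
with open('data.json', 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        if prefix == 'node_attributes' and event == 'map_key':
            print(value)  # one key; its value follows as nested events
```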
Support has been added in ICRAR/ijson@d4cca87, i.e. one can now do

```python
import ijson

f = open('data.json', 'r')
for k, v in ijson.kvitems(f, 'node_attributes'):
    print(k, v)
```

In my test, the Python implementation takes about 0.2 ms per key-value pair (= per node).

There's one more issue - the layout of AiiDA's data.json, which splits the information belonging to a single node across several top-level dictionaries. There are several possible ways forward:

A) We stick with the current format and try to implement a (suboptimal and slightly complex) batch parser with three iterators going at once (see the sketch below).

B) We change the layout of data.json.

C) We switch to a new file format that is made for "seeking" and "slicing" (like HDF5).

Mentioning @giovannipizzi @sphuber for comment
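A minimal sketch of what option A could look like, using `ijson.kvitems` as above. The key names `export_data.Node` and `node_extras` are guesses (only `node_attributes` appears in this thread), `process_node` is a hypothetical handler, and the `zip` relies on the unverified assumption that all three dictionaries enumerate nodes in the same order - which is part of why this option is suboptimal.

```python
import ijson

# Sketch of option A: three independent kvitems iterators over the
# same file, advanced in lockstep. Key names other than
# 'node_attributes' are hypothetical.
handles = [open('data.json', 'rb') for _ in range(3)]
nodes = ijson.kvitems(handles[0], 'export_data.Node')
attributes = ijson.kvitems(handles[1], 'node_attributes')
extras = ijson.kvitems(handles[2], 'node_extras')

for (pk, fields), (_, attrs), (_, node_extras) in zip(nodes, attributes, extras):
    # Import one node at a time; peak memory stays bounded by one record.
    process_node(pk, fields, attrs, node_extras)  # hypothetical handler

for h in handles:
    h.close()
```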
I would definitely go for B, but possibly taking into account also other requirements, rather than just this one, in the redesign of the export format (which could end up in solutions like C). If, however, implementing B is fast and C is (probably) going to take longer, it's OK to start working on B.
Just as a comment: there is a new format called JSON Lines (or "newline-delimited JSON"), which is essentially one JSON document per line and is suitable for storing large numbers of nested data structures in one file (Google adopted it for BigQuery). See also the discussion on "Loading Data Efficiently" for binary file formats to consider in option C.
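A minimal sketch of the JSON Lines idea with toy data (not AiiDA's actual format): writing one complete JSON document per line lets a reader process records one at a time with memory bounded by the largest single record.

```python
import json

# Toy example of JSON Lines: each line is a complete JSON document.
records = {'uuid-1': {'energy': -1.2}, 'uuid-2': {'energy': -3.4}}

with open('node_attributes.jsonl', 'w') as f:
    for node_id, attrs in records.items():
        f.write(json.dumps({'id': node_id, 'attributes': attrs}) + '\n')

# Reading back: one json.loads per line, so memory use is bounded
# by the largest single record, not by the whole file.
with open('node_attributes.jsonl') as f:
    for line in f:
        record = json.loads(line)
        print(record['id'], record['attributes'])
```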
It seems to me that when doing a verdi import, such as

```
verdi import <export_file>.aiida
```

the content of <export_file>.aiida is first cached in RAM - if not entirely, at least for what concerns the database. As a consequence, this effectively limits the largest file that can be imported to the instantaneous amount of free RAM.
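For illustration, a sketch of the general problem (not AiiDA's actual import code): a whole-file `json.load` materializes the entire decoded document in memory, while a streaming parser keeps memory bounded by one record at a time.

```python
import json
import ijson

# Whole-file parse: peak memory scales with the size of the decoded
# document, which is what caps importable archives at available RAM.
with open('data.json', 'rb') as f:
    data = json.load(f)

# Streaming parse: only one key-value pair is held at a time.
with open('data.json', 'rb') as f:
    for node_id, attrs in ijson.kvitems(f, 'node_attributes'):
        pass  # handle one node's attributes, then let it be garbage-collected
```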