Archive export refactor (2) #4534
Conversation
Ok, I will check this after #4532 is merged.
Is this already impacting memory significantly? If not, I suggest keeping this parameter in the Python API only for the moment and not exposing it on the CLI yet - we should limit the number of flags, and having the flag there raises expectations that we may not be able to meet (yet).
Force-pushed from 4721f26 to 2c2c7a3
Force-pushed from 79ef2ed to 187725f
thanks @chrisjsewell
I started having a look through and adding comments; I then saw that it's still in draft status
let me know when I should re-review this
aiida/cmdline/commands/cmd_export.py (outdated)
@@ -101,10 +101,18 @@ def inspect(archive, version, data, meta_data):
    show_default=True,
    help='Include or exclude comments for node(s) in export. (Will also export extra users who commented).'
)
@click.option(
See comment #4534 (comment)
This allows for performance (CPU/memory) testing from the CLI, as we have done for other PRs (see below). I would just comment it out before merging, because it is definitely something very useful and something that will be beneficial, if not now then for the new format.
commented out
},
    'conversion_info': export_data.metadata.conversion_info
}

def close(self, excepted: bool):
What happens with this bool?
Also, pylint does not complain about this unused argument, perhaps I'm missing something here...
it allows the writer to decide whether it wants to actually write the output file, given that the export process excepted
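For illustration, a minimal sketch of how a writer could honour this flag (the class, attribute and path names here are hypothetical, not the actual implementation):

```python
import os
import shutil
import tempfile


class ExampleWriter:
    """Hypothetical archive writer: keeps the output file only on success."""

    def __init__(self, output_path: str):
        self.output_path = output_path
        # stream everything to a temporary location first;
        # the write_* methods (not shown) would populate self._workpath
        self._tempdir = tempfile.mkdtemp()
        self._workpath = os.path.join(self._tempdir, 'archive.tmp')

    def close(self, excepted: bool):
        if excepted:
            # the export process raised part-way through,
            # so discard the partially written archive
            shutil.rmtree(self._tempdir, ignore_errors=True)
            return
        # success: move the finished file to its final destination
        shutil.move(self._workpath, self.output_path)
        shutil.rmtree(self._tempdir, ignore_errors=True)
```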
sharded_uuid = export_shard_uuid(uuid)

@abstractmethod
def write_metadata(self, data: ArchiveMetadata):
    """ """
hehe, that's an interesting docstring. could use some love (also below)?
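For example, something along these lines (just a suggestion):

```python
@abstractmethod
def write_metadata(self, data: ArchiveMetadata):
    """Write the archive metadata to the output.

    :param data: the dataclass of metadata to write
    """
```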
Yes indeed, patience lol.
@ltalirz this is now ready for review
Codecov Report

@@            Coverage Diff            @@
##           develop    #4534    +/-   ##
==========================================
- Coverage    79.50%   79.41%   -0.08%
==========================================
  Files          482      481       -1
  Lines        35325    35279      -46
==========================================
- Hits         28083    28015      -68
- Misses        7242     7264      +22
With a ~40,000 node repo: big speedup, a little more memory usage.

develop:

(base) aiida@088cc428b4c4:/$ cd tmp
(base) aiida@088cc428b4c4:/tmp$ syrupy.py -i 1 --separator=, --no-align verdi export create -G 1 -- mount_folder/export_develop.aiida
SYRUPY: Writing process resource usage samples to 'syrupy_20201105105553.ps.log'
SYRUPY: Writing raw process resource usage logs to 'syrupy_20201105105553.ps.raw'
SYRUPY: Executing command 'verdi export create -G 1 -- mount_folder/export_develop.aiida'
SYRUPY: Redirecting command output stream to 'syrupy_20201105105553.out.log'
SYRUPY: Redirecting command error stream to 'syrupy_20201105105553.err.log'
SYRUPY: Completed running: verdi export create -G 1 -- mount_folder/export_develop.aiida
SYRUPY: Started at 2020-11-05 10:55:53.406105
SYRUPY: Ended at 2020-11-05 10:59:35.599326
SYRUPY: Total run time: 0 hour(s), 03 minute(s), 42.193221 second(s)
(base) aiida@088cc428b4c4:/tmp$ python -c 'import pandas as pd; ax = pd.read_csv("syrupy_20201105105553.ps.log").set_index("ELAPSED").plot(y="RSS", grid=True); ax.get_figure().savefig("mount_folder/output_develop.png")'

this PR:

(base) aiida@088cc428b4c4:/tmp$ syrupy.py -i 1 --separator=, --no-align verdi export create -G 1 -- mount_folder/export_new.aiida
SYRUPY: Writing process resource usage samples to 'syrupy_20201105110111.ps.log'
SYRUPY: Writing raw process resource usage logs to 'syrupy_20201105110111.ps.raw'
SYRUPY: Executing command 'verdi export create -G 1 -- mount_folder/export_new.aiida'
SYRUPY: Redirecting command output stream to 'syrupy_20201105110111.out.log'
SYRUPY: Redirecting command error stream to 'syrupy_20201105110111.err.log'
SYRUPY: Completed running: verdi export create -G 1 -- mount_folder/export_new.aiida
SYRUPY: Started at 2020-11-05 11:01:11.163114
SYRUPY: Ended at 2020-11-05 11:03:34.289323
SYRUPY: Total run time: 0 hour(s), 02 minute(s), 23.126209 second(s)
(base) aiida@088cc428b4c4:/tmp$ python -c 'import pandas as pd; ax = pd.read_csv("syrupy_20201105110111.ps.log").set_index("ELAPSED").plot(y="RSS", grid=True); ax.get_figure().savefig("mount_folder/output_new.png")'

(base) aiida@088cc428b4c4:/tmp$ verdi export create -G 1 --verbosity DEBUG -- mount_folder/export_new2.aiida
EXPORT
-------------- ------------------------------
Archive mount_folder/export_new2.aiida
Format JSON Zip (compression=8)
Export version 0.9
Inclusion rules
----------------- ----
Include Comments True
Include Logs True
Traversal rules
--------------------------------- -----
Follow links input calc forwards False
Follow links input calc backwards True
Follow links create forwards True
Follow links create backwards True
Follow links return forwards True
Follow links return backwards False
Follow links input work forwards False
Follow links input work backwards True
Follow links call calc forwards True
Follow links call calc backwards True
Follow links call work forwards True
Follow links call work backwards True
STARTING EXPORT...
Collecting nodes in groups           100.0%|████████| 40065/40065
Traversing provenance via links ...  100.0%|████████| 1/1
WRITING METADATA...
Writing links                        100.0%|████████| 40064/40064
Building entity database queries     100.0%|████████| 4/4
Writing entity data                  100.0%|████████| 40066/40066
Writing group UUID -> [nodes UUIDs]
Exporting node repositories: 40065   100.0%|████████| 40065/40065
FINALIZING EXPORT...
Exported Entities:
- Node : 40065
- User : 1
- Group : 1
Success: wrote the export archive file to mount_folder/export_new2.aiida
Thanks for the performance tests! Do you know why the memory usage increases? And what are the units here - are 30'000 = 30 GB? That's a lot (of course also before this PR).
I think you're missing a zero there lol, it's in
Could not say offhand
thanks @chrisjsewell; looks good to me
@@ -47,7 +47,7 @@ class ArchiveMetadata:
     # optional data
     graph_traversal_rules: Optional[Dict[str, bool]] = dataclasses.field(default=None)
     # Entity type -> UUID list
-    entities_starting_set: Optional[Dict[str, Set[str]]] = dataclasses.field(default=None)
+    entities_starting_set: Optional[Dict[str, List[str]]] = dataclasses.field(default=None)
I guess there is a good reason for this change; just mentioning that set is also in the name of the attribute...
Yeh, I went back and forth on this, because it has to be converted to a list before storing as JSON, and so it was a little easier to convert before passing to the writer.
But you're right that the naming is now a little off.
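(For context, the conversion is needed because the standard library json module cannot serialize sets; a minimal demonstration:)

```python
import json

data = {'Node': {'uuid-1', 'uuid-2'}}
# json.dumps(data) raises: TypeError: Object of type set is not JSON serializable
serializable = {key: sorted(values) for key, values in data.items()}
print(json.dumps(serializable))  # {"Node": ["uuid-1", "uuid-2"]}
```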
Ah, right. Thanks, that is much more sensible.
Ok. Might be worth a quick look, since memory can be a hard limit, i.e. it may prevent people from exporting.
thanks, I've responded to the comments
archive-path has just been merged into conda-forge 🎉
FYI, in case it was not clear: the test coverage % is reduced because I've actually removed more code than I've added (including all the code for compression being moved to archive-path), so fewer code lines are now tested.
Ok! Just one last suggestion to accept and then we can merge.
Co-authored-by: Leopold Talirz <[email protected]>
done cheers
approved!
FYI @ltalirz just a heads up, this is the final PR I alluded to (written on top of #4532).
Based on the experience from the import refactor, it refactors the archive writer to be somewhat similar: using it as a context manager and "streaming" the data to its methods, as opposed to the current approach where all the data is passed to it in a single dataclass.
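As a rough sketch of the shape of this pattern (method names here are illustrative, not necessarily the exact interface in the PR):

```python
from abc import ABC, abstractmethod


class StreamingWriter(ABC):
    """Illustrative streaming-writer base class, used as a context manager."""

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # pass on whether the export excepted, so the writer can decide
        # whether to finalise or discard its output file
        self.close(excepted=exc_type is not None)
        return False

    @abstractmethod
    def open(self):
        """Prepare the output resources."""

    @abstractmethod
    def close(self, excepted: bool):
        """Finalise (or discard) the output."""

    @abstractmethod
    def write_metadata(self, data):
        """Write the archive metadata."""
```

so the export process can do `with writer: ...` and call the write methods as the data is gathered, rather than building everything up in memory first.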
The refactor also removes a few spurious database queries, merging them into the entity queries.

A `--batch-size` argument is also added to the CLI, to allow control of query batch sizing (a sketch of how such a flag might look is below).

The implementation tentatively works for zip so far (as in, that is what I was running in the meeting demo) but requires a bit more work to clean up and apply to tar as well.
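For illustration, wiring such a flag through click might look roughly like this (the option name matches the PR, but the default, help text and function names here are assumptions):

```python
import click


def export_entities(batch_size: int):
    """Stand-in for the real export function (hypothetical)."""
    print(f'querying the database in batches of {batch_size}')


@click.command()
@click.option(
    '--batch-size',
    default=100,
    type=int,
    show_default=True,
    help='Stream database rows in batches, to reduce memory usage.',
)
def create(batch_size):
    """Hypothetical command passing the flag through to the export call."""
    export_entities(batch_size=batch_size)
```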