Archive export refactor (2) #4534
Conversation
Ok, I will check this after #4532 is merged.
Is this already impacting memory significantly? If not, I suggest keeping this parameter in the Python API only for the moment and not exposing it on the CLI yet - we should limit the number of flags, and having the flag there raises expectations that we may not be able to meet (yet).
Force-pushed from 4721f26 to 2c2c7a3
Force-pushed from 79ef2ed to 187725f
thanks @chrisjsewell
I started having a look through and adding comments; I then saw that it's still in draft status
let me know when I should re-review this
aiida/cmdline/commands/cmd_export.py (outdated)
@@ -101,10 +101,18 @@ def inspect(archive, version, data, meta_data):
    show_default=True,
    help='Include or exclude comments for node(s) in export. (Will also export extra users who commented).'
)
@click.option(
See comment #4534 (comment)
This allows for performance (CPU/memory) testing from the CLI, as we have done for other PRs (see below). I would just comment it out before merging, because it is definitely something very useful and something that will be beneficial, if not now then for the new format.
commented out
},
    'conversion_info': export_data.metadata.conversion_info
}

def close(self, excepted: bool):
What happens with this bool?
Also, pylint does not complain about this unused argument, perhaps I'm missing something here...
it allows the writer to decide whether it wants to actually write the output file, given that the export process excepted
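For illustration, a minimal sketch of how a writer could honour this flag (the class, attribute and path names here are hypothetical, not the actual implementation):

```python
import os
import shutil
import tempfile


class ExampleWriter:
    """Hypothetical archive writer: keeps the output file only on success."""

    def __init__(self, output_path: str):
        self.output_path = output_path
        # stream everything to a temporary location first;
        # the write_* methods (not shown) would populate self._workpath
        self._tempdir = tempfile.mkdtemp()
        self._workpath = os.path.join(self._tempdir, 'archive.tmp')

    def close(self, excepted: bool):
        if excepted:
            # the export process raised part-way through,
            # so discard the partially written archive
            shutil.rmtree(self._tempdir, ignore_errors=True)
            return
        # success: move the finished file to its final destination
        shutil.move(self._workpath, self.output_path)
        shutil.rmtree(self._tempdir, ignore_errors=True)
```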
sharded_uuid = export_shard_uuid(uuid)

@abstractmethod
def write_metadata(self, data: ArchiveMetadata):
    """ """
hehe, that's an interesting docstring. could use some love (also below)?
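For example, something along these lines (just a suggestion):

```python
@abstractmethod
def write_metadata(self, data: ArchiveMetadata):
    """Write the archive metadata to the output.

    :param data: the dataclass of metadata to write
    """
```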
Yes indeed, patience lol.
@ltalirz this is now ready for review
Codecov Report

@@            Coverage Diff            @@
##           develop    #4534    +/-   ##
==========================================
- Coverage    79.50%   79.41%   -0.08%
==========================================
  Files          482      481       -1
  Lines        35325    35279      -46
==========================================
- Hits         28083    28015      -68
- Misses        7242     7264      +22
With a ~40,000 node repo: big speedup, a little more memory usage.

develop:

(base) aiida@088cc428b4c4:/$ cd tmp
(base) aiida@088cc428b4c4:/tmp$ syrupy.py -i 1 --separator=, --no-align verdi export create -G 1 -- mount_folder/export_develop.aiida
SYRUPY: Writing process resource usage samples to 'syrupy_20201105105553.ps.log'
SYRUPY: Writing raw process resource usage logs to 'syrupy_20201105105553.ps.raw'
SYRUPY: Executing command 'verdi export create -G 1 -- mount_folder/export_develop.aiida'
SYRUPY: Redirecting command output stream to 'syrupy_20201105105553.out.log'
SYRUPY: Redirecting command error stream to 'syrupy_20201105105553.err.log'
SYRUPY: Completed running: verdi export create -G 1 -- mount_folder/export_develop.aiida
SYRUPY: Started at 2020-11-05 10:55:53.406105
SYRUPY: Ended at 2020-11-05 10:59:35.599326
SYRUPY: Total run time: 0 hour(s), 03 minute(s), 42.193221 second(s)
(base) aiida@088cc428b4c4:/tmp$ python -c 'import pandas as pd; ax = pd.read_csv("syrupy_20201105105553.ps.log").set_index("ELAPSED").plot(y="RSS", grid=True); ax.get_figure().savefig("mount_folder/output_develop.png")'

this PR:

(base) aiida@088cc428b4c4:/tmp$ syrupy.py -i 1 --separator=, --no-align verdi export create -G 1 -- mount_folder/export_new.aiida
SYRUPY: Writing process resource usage samples to 'syrupy_20201105110111.ps.log'
SYRUPY: Writing raw process resource usage logs to 'syrupy_20201105110111.ps.raw'
SYRUPY: Executing command 'verdi export create -G 1 -- mount_folder/export_new.aiida'
SYRUPY: Redirecting command output stream to 'syrupy_20201105110111.out.log'
SYRUPY: Redirecting command error stream to 'syrupy_20201105110111.err.log'
SYRUPY: Completed running: verdi export create -G 1 -- mount_folder/export_new.aiida
SYRUPY: Started at 2020-11-05 11:01:11.163114
SYRUPY: Ended at 2020-11-05 11:03:34.289323
SYRUPY: Total run time: 0 hour(s), 02 minute(s), 23.126209 second(s)
(base) aiida@088cc428b4c4:/tmp$ python -c 'import pandas as pd; ax = pd.read_csv("syrupy_20201105110111.ps.log").set_index("ELAPSED").plot(y="RSS", grid=True); ax.get_figure().savefig("mount_folder/output_new.png")'

(base) aiida@088cc428b4c4:/tmp$ verdi export create -G 1 --verbosity DEBUG -- mount_folder/export_new2.aiida
EXPORT
-------------- ------------------------------
Archive mount_folder/export_new2.aiida
Format JSON Zip (compression=8)
Export version 0.9
Inclusion rules
----------------- ----
Include Comments True
Include Logs True
Traversal rules
--------------------------------- -----
Follow links input calc forwards False
Follow links input calc backwards True
Follow links create forwards True
Follow links create backwards True
Follow links return forwards True
Follow links return backwards False
Follow links input work forwards False
Follow links input work backwards True
Follow links call calc forwards True
Follow links call calc backwards True
Follow links call work forwards True
Follow links call work backwards True
STARTING EXPORT...
Collecting nodes in groups           100.0%|████████| 40065/40065
Traversing provenance via links ...  100.0%|████████| 1/1
WRITING METADATA...
Writing links                        100.0%|████████| 40064/40064
Building entity database queries     100.0%|████████| 4/4
Writing entity data                  100.0%|████████| 40066/40066
Writing group UUID -> [nodes UUIDs]
Exporting node repositories: 40065   100.0%|████████| 40065/40065
FINALIZING EXPORT...
Exported Entities:
- Node : 40065
- User : 1
- Group : 1
Success: wrote the export archive file to mount_folder/export_new2.aiida
Thanks for the performance tests! Do you know why the memory usage increases? And what are the units here - are 30'000 = 30 GB? That's a lot (of course also before this PR).
I think you're missing a zero there lol, it's in
Could not say offhand
thanks @chrisjsewell; looks good to me
@@ -47,7 +47,7 @@ class ArchiveMetadata:
     # optional data
     graph_traversal_rules: Optional[Dict[str, bool]] = dataclasses.field(default=None)
     # Entity type -> UUID list
-    entities_starting_set: Optional[Dict[str, Set[str]]] = dataclasses.field(default=None)
+    entities_starting_set: Optional[Dict[str, List[str]]] = dataclasses.field(default=None)
I guess there is a good reason for this change; just mentioning that set is also in the name of the attribute...
Yeh, I went back and forth on this, because it has to be converted to a list before storing as JSON, and so it was a little easier to convert before passing to the writer.
But you're right that the naming is now a little off.
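(For context, the conversion is needed because the standard library json module cannot serialize sets; a minimal demonstration:)

```python
import json

data = {'Node': {'uuid-1', 'uuid-2'}}
# json.dumps(data) raises: TypeError: Object of type set is not JSON serializable
serializable = {key: sorted(values) for key, values in data.items()}
print(json.dumps(serializable))  # {"Node": ["uuid-1", "uuid-2"]}
```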
Ah, right. Thanks, that is much more sensible.
Ok. Might be worth a quick look, since memory can be a hard limit, i.e. it may prevent people from exporting.
thanks, I've responded to the comments
archive-path has just been merged into conda-forge 🎉
FYI, in case it was not clear: the test coverage % is reduced because I've actually removed more code than I've added (including all the code for compression being moved to archive-path), so fewer code lines are now tested.
Ok! Just one last suggestion to accept and then we can merge.
Co-authored-by: Leopold Talirz <[email protected]>
done cheers
approved!
FYI @ltalirz just a heads up, this is the final PR I alluded to (written on top of #4532).
Based on the experience from the import refactor, it refactors the archive writer to be somewhat similar: using it as a context manager and "streaming" the data to its methods, as opposed to the current approach where all the data is passed to it in a single dataclass.
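As a rough sketch of the shape of this pattern (method names here are illustrative, not necessarily the exact interface in the PR):

```python
from abc import ABC, abstractmethod


class StreamingWriter(ABC):
    """Illustrative streaming-writer base class, used as a context manager."""

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # pass on whether the export excepted, so the writer can decide
        # whether to finalise or discard its output file
        self.close(excepted=exc_type is not None)
        return False

    @abstractmethod
    def open(self):
        """Prepare the output resources."""

    @abstractmethod
    def close(self, excepted: bool):
        """Finalise (or discard) the output."""

    @abstractmethod
    def write_metadata(self, data):
        """Write the archive metadata."""
```

so the export process can do `with writer: ...` and call the write methods as the data is gathered, rather than building everything up in memory first.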
The refactor also removes a few spurious database queries, merging them into the entity queries.

A `--batch-size` argument is also added to the CLI, to allow control of query batch sizing (a sketch of how such a flag might look is below).

The implementation tentatively works for zip so far (as in, that is what I was running in the meeting demo) but requires a bit more work to clean up and apply to tar as well.
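For illustration, wiring such a flag through click might look roughly like this (the option name matches the PR, but the default, help text and function names here are assumptions):

```python
import click


def export_entities(batch_size: int):
    """Stand-in for the real export function (hypothetical)."""
    print(f'querying the database in batches of {batch_size}')


@click.command()
@click.option(
    '--batch-size',
    default=100,
    type=int,
    show_default=True,
    help='Stream database rows in batches, to reduce memory usage.',
)
def create(batch_size):
    """Hypothetical command passing the flag through to the export call."""
    export_entities(batch_size=batch_size)
```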