Make export data more robust, i.e maybe allow for dirty exports, and or create helpers for cleaning data #4631

Open
broeder-j opened this issue Dec 14, 2020 · 2 comments

Comments

@broeder-j
Member

Is your feature request related to a problem? Please describe

Exporting unsealed nodes is not allowed. (I understand why: they should not run in an imported database, and exporting a process node without outputs is not good.)
Also, by default, a missing repository folder for a node causes the export to fail with a critical error.

While this is all good for keeping exports clean, it is very annoying, because the engine currently "loses" some processes very easily and they never get sealed (every one of my databases has some of these; sometimes process kills are incomplete, etc.). You then have to seal them by hand, or explicitly exclude them and all the provenance that would pull them into the export. And if you seal them by hand (sometimes even this does not work, no idea why), you might end up with an error that some of them have no repository folder, and you have to create dummy folders for these.

For a publication it is clear that I want to go through all this, clean the whole graph, and publish clean data (so I export only the "good" and "relevant" stuff).
But there are situations where one does not care, e.g. if I just want to do a quick export for a backup, or give my database to a colleague.
Also consider the case where you want to create a backup while there are still unfinished processes but the daemon is not running.

Describe the solution you'd like

Of course the ideal solution would be for the engine never to lose any processes and never to fail to seal nodes along the way, but since I am not sure that this can be fixed, it might make more sense to work with it.

As a user, I do not want to write hundreds of lines of code and retry the export several times just so that AiiDA allows me to export some data.
(Also, non-expert users might not be able to do this at all.)

  • Maybe add a verdi helper command which allows you to get rid of these "zombie" processes and makes sure everything in your database is 'exportable'.
  • and/or add an option to auto-exclude such processes from the export (Pro: export clean. Con: Might be bad for a backup, because it might be incomplete)
  • and/or add an option to auto-seal all these nodes just in the export. (Might be bad for a backup, because if they then still finish or change further, what do you do on import?)
  • and/or add an option to export unsealed nodes nonetheless (i.e. let the user force it)
    (This might be problematic overall and not wanted, so it should be discussed carefully)

In each of the last three cases I would still want a warning in place of the critical error message.
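
To make the auto-exclude option concrete, here is a minimal sketch in plain Python. It does not use the AiiDA API: `NodeInfo`, `split_exportable` and the field names are hypothetical stand-ins, and a real implementation would also have to drop any provenance that pulls an excluded node back into the export. The only point is the behaviour: warn and continue, do not abort.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    """Hypothetical stand-in for the facts an export check needs about a node."""
    uuid: str
    is_process: bool
    sealed: bool
    has_repo_folder: bool


def split_exportable(nodes):
    """Partition nodes into (exportable, excluded), warning once per excluded node."""
    exportable, excluded = [], []
    for node in nodes:
        problems = []
        # Assumption: only process nodes are subject to the sealing check.
        if node.is_process and not node.sealed:
            problems.append("unsealed process")
        if not node.has_repo_folder:
            problems.append("missing repository folder")
        if problems:
            # A warning replaces the critical error; the export goes on.
            warnings.warn(f"excluding node {node.uuid}: {', '.join(problems)}")
            excluded.append(node)
        else:
            exportable.append(node)
    return exportable, excluded
```

The excluded list is returned alongside the exportable one so that the caller can report exactly what is missing from the archive, which matters if the export is used as a quick backup.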

What do you think? Maybe the aiida-fleur plugin generates these 'zombie' processes more easily and the general 'user experience' is a different one.

@broeder-j broeder-j changed the title Make export data more robust, i.e maybe allow for dirty exports, and or create helpers for cleaning work Make export data more robust, i.e maybe allow for dirty exports, and or create helpers for cleaning data Dec 14, 2020
@broeder-j
Member Author

Add on, I have also unsealed data nodes. So these are not processes only

@sphuber
Contributor

sphuber commented Dec 14, 2020

If I understand correctly, you have two use cases for exporting:

  1. Backing up the profile
  2. Exporting a portion of the graph to share

Regarding backing up: I don't think we should ever allow unsealed nodes to be backed up because, as you say, what do you do on import? The logic to adjudicate between "clashes" will be intractable. But most importantly, backing up should not be done through export archives. There is so much overhead that it is extremely inefficient. The real way to back up is to dump the database contents and copy the repository folder. The only reason we haven't made a verdi command to do this automatically is that backing up the repository is intractable due to its design. With the new repository implementation this approach will be feasible and in fact very fast. This will be released with aiida-core==2.0.0, which is slated for the beginning of Q2 of 2021. We will ship that with a command verdi backup that will automatically back up (and maybe restore) profiles.

Then there is the use case of sharing (intermediate) data with a colleague. We are faced with the same problem. I would be very hesitant to allow unsealed nodes to be exported, because what do you do when these are imported? What if you export the same node again later, after it has been sealed? If you import it now, there are two nodes with the same UUID but very different properties. You might say just take the sealed one as the correct one, but I am not sure it is easy to foresee all the consequences.
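
The clash can be illustrated with a small sketch in plain Python. It does not use the AiiDA API, and `NodeRecord` and `merge_prefer_sealed` are hypothetical names. It implements the naive "take the sealed one" rule for two records with the same UUID, and shows where that rule silently discards whatever the unsealed copy carried.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    """Hypothetical stand-in for one copy of a node, as stored or as imported."""
    uuid: str
    sealed: bool
    attributes: dict = field(default_factory=dict)


def merge_prefer_sealed(existing, incoming):
    """Resolve a UUID clash with the naive rule: a sealed copy beats an unsealed one."""
    if existing.uuid != incoming.uuid:
        raise ValueError("records describe different nodes")
    if incoming.sealed and not existing.sealed:
        # The unsealed copy already in the database is dropped wholesale,
        # including anything that was attached to it in the meantime.
        return incoming
    # Otherwise keep what is already stored.
    return existing
```

Even this two-line rule leaves open what happens to links, extras or comments attached to the discarded copy, which is the kind of consequence that is hard to enumerate up front.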

The point where I agree is that we should limit the problems when exporting due to bugs in AiiDA itself. So I would be tempted to go in the direction of adding helpers as you propose to clean faulty nodes. This is easy in some cases, but not all.

  1. Nodes missing a repository folder. This one is easy. With v2.0 this problem will no longer exist. If we think this really needs to work on 1.x and so needs a fix there, we need to agree on what the behavior should be. Maybe there we can change the critical error into a warning and still let the node be exported.
  2. Unsealed nodes. This one I see as more complicated. How would such a universal helper work? It cannot simply seal all unsealed process nodes, because some of them may actually still be running. So how do we distinguish "zombie" processes? I don't know. The only solution I see is for the user to manually indicate exactly which process nodes are to be sealed, but that is exactly the situation we already have.

So ultimately, I am not really sure what to do here other than:

  1. Make backing up a profile easy (will be in v2.0 with verdi backup)
  2. Fix bugs that generate zombie processes.

Allowing unsealed nodes to be exported (even with a force flag) will have huge (unforeseen) consequences.

Add on, I have also unsealed data nodes. So these are not processes only

Data nodes can never be sealed. The sealing concept only applies to process nodes.
