Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate CREATE links when importing old export file #4125

Open
greschd opened this issue May 27, 2020 · 12 comments
Open

Duplicate CREATE links when importing old export file #4125

greschd opened this issue May 27, 2020 · 12 comments

Comments

@greschd
Copy link
Member

greschd commented May 27, 2020

While importing an old export file (export version 0.3), the migrations run through without error, but the actual import raises:

Error: an exception occurred while importing the archive export-migrated.tar.gz
Traceback (most recent call last):
  File "/home/a-dogres/.virtualenvs/aiida_data_import/lib/python3.7/site-packages/aiida/cmdline/commands/cmd_import.py", line 65, in _try_import
    import_data(file_to_import, group, **kwargs)
  File "/home/a-dogres/.virtualenvs/aiida_data_import/lib/python3.7/site-packages/aiida/tools/importexport/dbimport/__init__.py", line 73, in import_data
    return import_data_dj(in_path, group=group, silent=silent, **kwargs)
  File "/home/a-dogres/.virtualenvs/aiida_data_import/lib/python3.7/site-packages/aiida/tools/importexport/dbimport/backends/django/__init__.py", line 590, in import_data_dj
    source.uuid, link_type, link['label']
aiida.tools.importexport.common.exceptions.ImportValidationError: Node<c4e6207a-c205-47d8-9b52-97c94d08657d> already has an outgoing LinkType.CREATE link with label "parameters"

This is due to the "unique pair" check, here: https://github.com/aiidateam/aiida-core/blob/develop/aiida/tools/importexport/dbimport/backends/django/__init__.py#L656

I can't quite remember, was it allowed in a previous version of AiiDA to have multiple outgoing link labels of the same name? If so, what would be a good way to resolve this issue?

In this particular case I can just patch the code to rename the links (since I don't care too much about their names), but in general that might be dangerous.

The export I'm trying to import here can be found at https://polybox.ethz.ch/index.php/s/I1se1WGAiP2iLYX, but it's large and also suffers from #3450 (which I fixed with a sed on the data.json) - so it's not great for testing on.

@greschd greschd changed the title Duplicate create links when importing old export file Duplicate CREATE links when importing old export file May 27, 2020
@giovannipizzi
Copy link
Member

Pinging @CasperWA for comments as he's been working a lot with the migrations of export files

@greschd could you please quickly check all the outgoing links (type and label pairs) and report them here?
In the past there was the possibility of both a return and create outgoing links from the same node. However, the migration tools should detect this specific case and not complain, so it's maybe something else

@greschd
Copy link
Member Author

greschd commented Jun 2, 2020

From the data.json, I checked the links where (input, label, type) is not unique:

  • There are a bunch of input_work and input_calc links. Those are harmless I think, it's just the same input being passed to different sub-workflows/calculations.
  • The create links are all from CalcFunctionNode to Dict. AFAICT these are what is causing this issue.

I think the code producing these issues is here, although I'm not 100% sure this is the right commit version: https://github.com/greschd/aiida-tbextraction/blob/034f8a6ceaa300118064f1eb287037e310f3b45f/aiida_tbextraction/fp_run/_helpers/_inline_calcs.py

The labels of the offending links are {'bands', 'kpoints', 'parameters', 'result', 'wannier_parameters'}.

Also, I was using a sketchy development version of AiiDA at the time - it's entirely possible this is some edge case related to that.

@CasperWA
Copy link
Contributor

CasperWA commented Jun 2, 2020

We've tried to tackle the most general patches of these, but to get just right for a particular graph it ususally takes a bit of manual labour, since the desired outcome may vary from owner to owner, e.g., you don't care about the labels, others might.

Since this migration is most likely due to migrating from v0.3 to v0.4 of the export version, we can try to do this the "hard" way, by migrating to only this version first, then going over the export data.json file and adapt whatever you may need / try to implement your use case in the migration.

There might also be a log file somewhere with the list of links affected (perhaps in the folder where from where you ran the migrate/export command?

@greschd
Copy link
Member Author

greschd commented Jun 2, 2020

Yeah, the problem here is that the links aren't affected by the migration at all - so they also don't appear in the migration-<id>.log files.

It appears to me the "inline calculation" produced duplicates of the output nodes, which are both linked with the same label.

I'll give it a quick try at reproducing this with a clean 1.0.0a1. If that doesn't work I think we can close this - I can get around it for my own use case, and unless someone else reports the same problem it might just as well be due to the wonky setup I had at the time.

@greschd
Copy link
Member Author

greschd commented Jun 2, 2020

Found the culprit...
image
it's caching.

In 1.0.0a1, running an InlineCalculation with caching enabled produced two output links, one to the "initial" output and one to a cloned duplicate.

If this is the only place where this can occur I'd be tempted to mark this wontfix - it's an alpha release, and I was pretty much the only one using caching back then.

However, this made me remember how you could have duplicate CREATE links in previous versions: Outputs did not have to be unique, but then you needed to access them with linkname_<pk> to get a specific one:

In [17]: inline_calc.get_outputs_dict()
Out[17]: 
{u'res': <Data: uuid: 52bed73b-cabc-49ce-aae1-588479454123 (pk: 15)>,
 u'res_15': <Data: uuid: 52bed73b-cabc-49ce-aae1-588479454123 (pk: 15)>,
 u'res_16': <Data: uuid: c4005890-1bb5-47f4-b1fb-eeb85cfe0144 (pk: 16)>}

In [18]: inline_calc.out.res
Out[18]: <Data: uuid: 52bed73b-cabc-49ce-aae1-588479454123 (pk: 15)>

In [19]: inline_calc.out.res_15
Out[19]: <Data: uuid: 52bed73b-cabc-49ce-aae1-588479454123 (pk: 15)>

In [20]: inline_calc.out.res_16
Out[20]: <Data: uuid: c4005890-1bb5-47f4-b1fb-eeb85cfe0144 (pk: 16)>

The link labels don't contain the _<pk> suffix:

In [28]: for link in inline_calc.dbnode.output_links.all():
    ...:     print(link, link.label)
    ...:     
(<DbLink: InlineCalculation (14) --> Float (16)>, u'res')
(<DbLink: InlineCalculation (14) --> Float (15)>, u'res')

Here's the link to the relevant documentation in 0.12.3 (the "note" at the very bottom): https://aiida.readthedocs.io/projects/aiida-core/en/v0.12.3/working_with_aiida/resultmanager.html

If memory serves me right, this could happen in any calculation / workchain / whatever, and the caching issue isn't actually creating an invalid link state as far as the old AiiDA version is concerned.

Maybe a way to fix this would be explicitly adding the _<pk> suffix to these link labels - with the "bare" link referring to the lowest-pk descendant (I think this was how it was decided, but that would need checking in the old code).

@giovannipizzi
Copy link
Member

Hi!
Great that you tracked this!
The _pk was added only in python when using node.out. as, for outgoing links of data nodes, uniqueness is not required (as you noted also above).

Anyway - I agree. Let's set this to wontfix unless someone else has similar issues!

@giovannipizzi giovannipizzi added type/wontfix apply only to closed issues and removed type/bug labels Jun 3, 2020
@greschd
Copy link
Member Author

greschd commented Jun 3, 2020

Sorry, I wasn't being totally clear: Since the caching / inline calcs are in fact not the only place where this can occur I would suggest not to mark this as wontfix.

The uniqueness was not required in the previous version, but now you can only have one CREATE link with the same label. I think there should be some workaround for this, since it is "normal" use of the previous AiiDA version.

P.S.: Even if we decide to mark it wontfix, the bug label would still apply IMO.

@greschd greschd reopened this Jun 3, 2020
@greschd greschd added type/bug and removed type/wontfix apply only to closed issues labels Jun 3, 2020
@greschd
Copy link
Member Author

greschd commented Jun 3, 2020

Hmm, I think I misunderstood something: Duplicate outgoing links with the same label still appear to be valid:

In [10]: from aiida.orm import CalcJobNode

In [11]: from aiida.orm import Data

In [12]: calc = CalcJobNode()

In [13]: d1 = Data()

In [14]: d2 = Data()

In [15]: d1.add_incoming(calc, link_type=LinkType.CREATE, link_label='test')

In [16]: d2.add_incoming(calc, link_type=LinkType.CREATE, link_label='test')

So.. is this just an extra validation in the import that actually shouldn't be there?

@greschd
Copy link
Member Author

greschd commented Jun 3, 2020

Ha, just figured out how I tricked the link validation in the example above: Links only show up in get_outgoing after the target has been stored

In [16]: calc = CalcJobNode()

In [17]: calc.store()
Out[17]: <CalcJobNode: uuid: 6e3b970f-cc00-465e-a6c5-67507a3a7740 (pk: 49831)>

In [18]: d1 = Data()

In [19]: d1.add_incoming(calc, link_type=LinkType.CREATE, link_label='test')

In [20]: calc.get_outgoing(link_type=LinkType.CREATE, link_label_filter='test', only_uuid=True).all()
Out[20]: []

In [21]: d1.store()
Out[21]: <Data: uuid: a776a906-dbfc-48d8-8b56-ac2627b07a28 (pk: 49832)>

In [22]: calc.get_outgoing(link_type=LinkType.CREATE, link_label_filter='test', only_uuid=True).all()
Out[22]: [LinkTriple(node='a776a906-dbfc-48d8-8b56-ac2627b07a28', link_type=<LinkType.CREATE: 'create'>, link_label='test')]

This is relevant here because validate_links uses get_outgoing to check for existing links, see https://github.com/aiidateam/aiida-core/blob/develop/aiida/orm/utils/links.py#L163

I'm not sure if

  • get_outgoing should include also unstored links
  • validate_link should not rely on get_outgoing
  • those both work as intended, because we never link unstored nodes anyway - and the issue is in fact with the export migration
  • something else?

Tagging @sphuber for comment.

@sphuber
Copy link
Contributor

sphuber commented Jun 3, 2020

I am confused. I thought this behavior was not allowed and therefore your example should be a bug. If you take your example and then store the nodes, there will be no exception and we have two links that violate the uniqueness constraints:

In [4]: from aiida.orm import CalcJobNode 
   ...: from aiida.orm import Data 
   ...: calc = CalcJobNode() 
   ...: d1 = Data() 
   ...: d2 = Data() 
   ...: d1.add_incoming(calc, link_type=LinkType.CREATE, link_label='test') 
   ...: d2.add_incoming(calc, link_type=LinkType.CREATE, link_label='test') 
   ...: calc.store()                                                                                                                                                                                 
Out[4]: <CalcJobNode: uuid: 94def715-33ab-4065-b750-aca73266d0f0 (pk: 11081)>

In [5]: calc.get_outgoing().all()                                                                                                                                                                                
Out[5]: []

In [6]: d1.store()                                                                                                                                                                                               
Out[6]: <Data: uuid: 023c9af7-328d-43ca-8ef5-05f4f0bb773c (pk: 11082)>

In [7]: calc.get_outgoing().all()                                                                                                                                                                                
Out[7]: [LinkTriple(node=<Data: uuid: 023c9af7-328d-43ca-8ef5-05f4f0bb773c (pk: 11082)>, link_type=<LinkType.CREATE: 'create'>, link_label='test')]

In [8]: d2.store()                                                                                                                                                                                               
Out[8]: <Data: uuid: e24e1f7f-6337-469b-a624-c1d13aa1c8eb (pk: 11083)>

In [9]: calc.get_outgoing().all()                                                                                                                                                                                
Out[9]: 
[LinkTriple(node=<Data: uuid: 023c9af7-328d-43ca-8ef5-05f4f0bb773c (pk: 11082)>, link_type=<LinkType.CREATE: 'create'>, link_label='test'),
 LinkTriple(node=<Data: uuid: e24e1f7f-6337-469b-a624-c1d13aa1c8eb (pk: 11083)>, link_type=<LinkType.CREATE: 'create'>, link_label='test')]

both links are stored but that should not have been allowed. That can be shown by calling calc.outputs which expects the labels to be unique:

In [10]: calc.outputs.test                                                                                                                                                                                       
---------------------------------------------------------------------------
MultipleObjectsError                      Traceback (most recent call last)
<ipython-input-10-6f9a623c1bc3> in <module>
----> 1 calc.outputs.test

~/code/aiida/env/dev/aiida-core/aiida/orm/utils/managers.py in __getattr__(self, name)
     81         """
     82         try:
---> 83             return self._get_node_by_link_label(label=name)
     84         except NotExistent:
     85             # Note: in order for TAB-completion to work, we need to raise an

~/code/aiida/env/dev/aiida-core/aiida/orm/utils/managers.py in _get_node_by_link_label(self, label)
     62         if self._incoming:
     63             return self._node.get_incoming(link_type=self._link_type).get_node_by_label(label)
---> 64         return self._node.get_outgoing(link_type=self._link_type).get_node_by_label(label)
     65 
     66     def __dir__(self):

~/code/aiida/env/dev/aiida-core/aiida/orm/utils/links.py in get_node_by_label(self, label)
    296                 else:
    297                     raise exceptions.MultipleObjectsError(
--> 298                         'more than one neighbor with the label {} found'.format(label)
    299                     )
    300 

MultipleObjectsError: more than one neighbor with the label test found

Weirdly, when looking at the tests in tests/orm/test_node.py we find the following test:

    def test_node_outdegree_unique_triple(self):
        """Test that the validation of links with outdegree `unique_triple` works correctly

        The example here is a `CalculationNode` that has two outgoing CREATE links with the same label, but to different
        target nodes. This is legal and should pass validation.
        """
        creator = CalculationNode().store()
        data_one = Data()
        data_two = Data()

        # Verify that adding two create links with the same link label but to different target is allowed from the
        # perspective of the source node (the CalculationNode in this case)
        data_one.add_incoming(creator, link_type=LinkType.CREATE, link_label='create')
        data_two.add_incoming(creator, link_type=LinkType.CREATE, link_label='create')
        data_one.store()
        data_two.store()

        uuids_outgoing = set(node.uuid for node in creator.get_outgoing().all_nodes())
        uuids_expected = set([data_one.uuid, data_two.uuid])
        self.assertEqual(uuids_outgoing, uuids_expected)

which tests exactly the example here and claims this should be find. However, the link validation code in aiida.orm.utils.links.validate_link we see:

    link_mapping = {
        LinkType.CALL_CALC: (WorkflowNode, CalculationNode, 'unique_triple', 'unique'),
        LinkType.CALL_WORK: (WorkflowNode, WorkflowNode, 'unique_triple', 'unique'),
        LinkType.CREATE: (CalculationNode, Data, 'unique_pair', 'unique'),
        LinkType.INPUT_CALC: (Data, CalculationNode, 'unique_triple', 'unique_pair'),
        LinkType.INPUT_WORK: (Data, WorkflowNode, 'unique_triple', 'unique_pair'),
        LinkType.RETURN: (WorkflowNode, Data, 'unique_pair', 'unique_triple'),
    }

clearly stating that CalculationNode -> Data should be unique_pair so target node + link label should be unique. I think the test is just wrong and it is not something we want to allow and there is a bug in the validation which I haven't found yet. Will report later.

@greschd
Copy link
Member Author

greschd commented Jun 3, 2020

bug in the validation

The validation relies on the outgoing node being stored because it uses get_outgoing.

@greschd
Copy link
Member Author

greschd commented Jun 3, 2020

I agree that the test is probably just wrong, since it's also pretty impossible to get into this situation with "normal" user code (workchains, calcjobs, etc.).

@sphuber sphuber self-assigned this Jun 3, 2020
@sphuber sphuber added the priority/critical-blocking must be resolved before next release label Jun 9, 2020
@sphuber sphuber added this to the v1.2.2 milestone Jun 9, 2020
@sphuber sphuber modified the milestones: v1.2.2, v1.3.1 Jun 22, 2020
@sphuber sphuber modified the milestones: v1.3.1, v1.3.2 Aug 27, 2020
@sphuber sphuber removed this from the v1.3.2 milestone Sep 24, 2020
@sphuber sphuber modified the milestones: v1.4.1, v1.4.2 Sep 24, 2020
@sphuber sphuber modified the milestones: v1.4.2, v1.5.1 Oct 28, 2020
@sphuber sphuber modified the milestones: v1.5.1, v1.6.1 Dec 2, 2020
@sphuber sphuber modified the milestones: Preparations for the new repository, v2.0.0 Apr 28, 2021
@chrisjsewell chrisjsewell added priority/important and removed priority/critical-blocking must be resolved before next release labels May 5, 2021
@chrisjsewell chrisjsewell modified the milestones: v2.0.0, Post 2.0 May 5, 2021
@sphuber sphuber removed this from the v2.3.0 milestone Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants