Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running with an executor _sometimes_ hits a bug #48

Open
liamhuber opened this issue Oct 23, 2023 · 12 comments
Open

Running with an executor _sometimes_ hits a bug #48

liamhuber opened this issue Oct 23, 2023 · 12 comments
Labels
bug Something isn't working wontfix This will not be worked on

Comments

@liamhuber
Copy link
Member

It's quite infrequent (< 1 in 10), but sometimes this test fails in the following way:

======================================================================
FAIL: test_with_executor (test_macro.TestMacro.test_with_executor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\pyiron_workflow\pyiron_workflow\tests\unit\test_macro.py", line 522, in test_with_executor
    self.assertIs(
AssertionError: <pyiron_workflow.macro.Macro object at 0x0000021B611DB610> is not None : Returned nodes should get the macro as their parent

----------------------------------------------------------------------

I.e. when a composite is run on an executor and copies of its children are returned as the result, these new children are somehow failing to get the parent-child relationship set up reflexively (which is the expected behaviour). I vaguely suspect it has something to do with Future.result() returning new instances each call, or some sort of delayed execution that I am wrongly assuming is happening synchronously, or similar, but working with executors is pretty new to me and the error is not consistently reproducible, so I haven't managed to track it down yet.

@liamhuber liamhuber added the bug Something isn't working label Oct 23, 2023
@liamhuber
Copy link
Member Author

Happened again today. Was on the Windows 3.11 CI (I want to start noting this here in case there is some pattern of where it comes up, e.g. on windows)

@liamhuber liamhuber added the wontfix This will not be worked on label Nov 2, 2023
@liamhuber
Copy link
Member Author

Happened again on the windows 3.11 CI

1/10 seems like an overestimate of how often it crops up, and so far really seems like a windows thing

@liamhuber
Copy link
Member Author

Happened on Ubuntu 3.11 today, so I guess it's not just windows.

@liamhuber
Copy link
Member Author

Happened on Ubuntu 3.11 today.

I can't remember the last time it happened on my local machine, so maybe it's exclusively an issue in the CI environment?

@liamhuber
Copy link
Member Author

Seems to be happening consistency today in #70, so maybe there's a chance to nail this down

@liamhuber
Copy link
Member Author

Ok, I added a little sleep and then suddenly the CI passed, so maybe this is fixed after all? Could it be simply that the callback function is executing to slowly on the CI (but fast enough on my local machine) that the test comes up before the state is ready for it?

returned_nodes = result.result()  # Wait for the process to finish
        from time import sleep  # The "fixing" change (so far)
        sleep(1)  # The "fixing" change (so far)
        self.assertIsNot(
            original_one,
            returned_nodes.one,
            msg="Executing in a parallel process should be returning new instances"
        )
        self.assertIs(
            macro,
            macro.nodes.one.parent,
            msg="Returned nodes should get the macro as their parent"
            # Once upon a time there was some evidence that this test was failing
            # stochastically, but I just ran the whole test suite 6 times and this test
            # 8 times and it always passed fine, so maybe the issue is resolved...
        )

@liamhuber
Copy link
Member Author

Finally happened again today (windows), but it's been over a month, so I guess the sleep call is doing the trick.

@liamhuber
Copy link
Member Author

Again on windows today. Pretty rare, but horrible that it exists at all. Adding the sleep when collecting results really seems to have helped, but it's already a full second of extra waiting -- more would be a big pain, but is this going to have horrible scaling when the amount of data on the nodes increases? I still find this bug overall very disconcerting.

@liamhuber
Copy link
Member Author

It finally happened on windows again! #241. It's been a while.

@liamhuber
Copy link
Member Author

Again today on windows on #773. Maybe the months long stretch was just some kind of good luck?

@liamhuber
Copy link
Member Author

The interval is nice and long, but it is still happening. Windows running on merge to main. https://github.com/pyiron/pyiron_workflow/actions/runs/9420579496/job/25952940926

@liamhuber
Copy link
Member Author

Again after two months. Must be a thread safety thing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

1 participant