likely race condition between task garbage collection and @join_app intense app launching #1981

Closed
benclifford opened this issue Feb 26, 2021 · 4 comments

@benclifford
Collaborator

Describe the bug

In a CI run of a PR against master 02d3b93, the following exception occurred on the htex_local test run:

>       assert fibonacci(10).result() == 55
parsl/tests/test_python_apps/test_fibonacci_recursive.py:28: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/python/3.6.7/lib/python3.6/concurrent/futures/_base.py:432: in result
    return self.__get_result()
/opt/python/3.6.7/lib/python3.6/concurrent/futures/_base.py:384: in __get_result
    raise self._exception
parsl/dataflow/dflow.py:365: in handle_join_update
    res = self._unwrap_remote_exception_wrapper(inner_app_future)
parsl/dataflow/dflow.py:439: in _unwrap_remote_exception_wrapper
    result = future.result()
/opt/python/3.6.7/lib/python3.6/concurrent/futures/_base.py:425: in result
    return self.__get_result()
/opt/python/3.6.7/lib/python3.6/concurrent/futures/_base.py:384: in __get_result
    raise self._exception
parsl/dataflow/dflow.py:286: in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
parsl/dataflow/dflow.py:439: in _unwrap_remote_exception_wrapper
    result = future.result()
/opt/python/3.6.7/lib/python3.6/concurrent/futures/_base.py:425: in result
    return self.__get_result()

...

ERROR    parsl.dataflow.dflow:dflow.py:317 Task 411 failed after 0 retry attempts
Traceback (most recent call last):
  File "/home/travis/build/Parsl/parsl/parsl/dataflow/dflow.py", line 286, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
  File "/home/travis/build/Parsl/parsl/parsl/dataflow/dflow.py", line 439, in _unwrap_remote_exception_wrapper
    result = future.result()
  File "/opt/python/3.6.7/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/opt/python/3.6.7/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/travis/build/Parsl/parsl/parsl/dataflow/dflow.py", line 493, in launch_if_ready
    task_id, task_record['func'], *new_args, **kwargs)
  File "/home/travis/build/Parsl/parsl/parsl/dataflow/dflow.py", line 582, in launch_task
    exec_fu = executor.submit(executable, self.tasks[task_id]['resource_specification'], *args, **kwargs)
  File "/home/travis/build/Parsl/parsl/parsl/executors/high_throughput/executor.py", line 579, in submit
    return self.tasks[task_id]
KeyError: 187

That kind of KeyError against self.tasks has previously shown up when there has been a race condition between tasks completing and other parts of parsl trying to interact with that task (for example, when tasks complete very fast).

Alternatively, this might be happening before task 187 was stored in the task table (a race at job creation, not job completion).

I am suspicious that the increased concurrency introduced by join apps might be making this happen a bit more.
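
To illustrate the first hypothesis (a race at task completion), here is a minimal, self-contained sketch. It is not parsl code: ToyExecutor, submit_racy, submit_safe and run are hypothetical names, and the two events exist only to force the unlucky interleaving deterministically. It assumes a submit method that stores a future in a shared task table and then re-reads the table to return it, while another thread removes the entry as soon as the task finishes.

```python
import threading
from concurrent.futures import Future


class ToyExecutor:
    """Hypothetical stand-in for an executor with a shared task table."""

    def __init__(self):
        self.tasks = {}
        self._submitted = threading.Event()
        self._collected = threading.Event()

    def _collector(self, task_id):
        # Plays the role of a result-handling thread for a task that
        # completes almost immediately: it removes the table entry.
        self._submitted.wait()
        fut = self.tasks.pop(task_id)
        fut.set_result("done")
        self._collected.set()

    def submit_racy(self, task_id):
        self.tasks[task_id] = Future()
        self._submitted.set()
        self._collected.wait()  # force the collector to run first
        # Second lookup in the shared table: the entry is already gone.
        return self.tasks[task_id]

    def submit_safe(self, task_id):
        fut = Future()  # keep a local reference to the future
        self.tasks[task_id] = fut
        self._submitted.set()
        self._collected.wait()
        return fut  # no second lookup, so nothing to race with


def run(submit_name, task_id=187):
    """Run one submit variant against a concurrent collector thread."""
    ex = ToyExecutor()
    collector = threading.Thread(target=ex._collector, args=(task_id,))
    collector.start()
    try:
        return getattr(ex, submit_name)(task_id).result()
    except KeyError as e:
        return f"KeyError: {e}"
    finally:
        collector.join()


print(run("submit_racy"))
print(run("submit_safe"))
```

In this toy model, holding a local reference to the future avoids the second lookup. Whether the real failure follows this interleaving or the job-creation one is exactly what is still unknown here.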

To Reproduce

This is non-deterministic. I have only seen it once.

Expected behavior
The task record related exception should not occur.

Environment
CI

@yadudoc
Member

yadudoc commented Feb 26, 2021

We could try setting Config.garbage_collect = False for some tests.

@benclifford
Collaborator Author

or we could fix parsl.

@yadudoc
Member

yadudoc commented Feb 26, 2021

Before fixing, we need to find out whether the race condition is at the launch stage or at garbage collection. I was just saying that turning that feature off might be a useful quick test.
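
A minimal sketch of that quick test, assuming parsl is installed and that a test configuration is built roughly like this (the executor label is illustrative, not taken from the actual test suite):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

# Assumed test configuration: same kind of executor as the failing run,
# but with task garbage collection turned off.
config = Config(
    executors=[HighThroughputExecutor(label="htex_local")],
    garbage_collect=False,
)
```

If the KeyError still appears with garbage collection disabled, that would point away from garbage collection and toward the launch stage.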

@benclifford
Collaborator Author

closing in favour of #2033
