Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pants 2.15.0rc4 can hang under pantsd #18211

Closed
jsirois opened this issue Feb 9, 2023 · 7 comments · Fixed by #18216
Closed

Pants 2.15.0rc4 can hang under pantsd #18211

jsirois opened this issue Feb 9, 2023 · 7 comments · Fixed by #18216
Assignees
Labels
Milestone

Comments

@jsirois
Copy link
Contributor

jsirois commented Feb 9, 2023

I have suffered hangs over in scie-pants using Pants 2.15.0rc4, see:

The context is an integration test where Pants is launched (--pantsd mode) in a subprocess with stdout piped for test use in assertions. This code here: https://github.com/pantsbuild/scie-pants/blob/3d269eb21ab7d764054496cd33530aaa364c6252/package/src/main.rs#L863-L892 The hang happens in the assert_pants_bin_name calls. The links above on the 1st or 2nd call, the example below in the 3rd or 4th call (can't say for sure if the hang is on pantsd down or pantsd up, but I think its on the bring-down which would make these 1st and 3rd).

I used this ssh rig to debug: pantsbuild/scie-pants#114

The commands run in the ssh session were:

$ sudo rm -rf target/
$ cargo run -p package -- test --tools-pex-mismatch-warn
<hang>
ctrl-Z
$ sudo apt install gdb
$ sudo gdb -p 16881

log-slim.txt
log-slim2.txt

@jsirois jsirois added the bug label Feb 9, 2023
@jsirois
Copy link
Contributor Author

jsirois commented Feb 9, 2023

So my blundering reading leads here:

///
/// Run a Future and return its resolved Result.
///
/// This should never be called from in a Future context, and should only ever be called in
/// something that resembles a main method.
///
/// Even after this method returns, work `spawn`ed into the background may continue to run on the
/// threads owned by this Executor.
///
pub fn block_on<F: Future>(&self, future: F) -> F::Output {
// Make sure to copy our (thread-local) logging destination into the task.
// When a daemon thread kicks off a future, it should log like a daemon thread (and similarly
// for a user-facing thread).
self.handle.block_on(future_with_correct_context(future))
}

That is implicated in a non-main thread in the backtrace capture attached in the OP.
If the comment is to be believed, I find scheduler_execute used by :

def _execute(
self, execution_request: ExecutionRequest
) -> tuple[tuple[tuple[Any, Return], ...], tuple[tuple[Any, Throw], ...]]:
start_time = time.time()
try:
raw_roots = native_engine.scheduler_execute(
self.py_scheduler,
self.py_session,
execution_request.native,
)
except native_engine.PollTimeout:
raise ExecutionTimeoutError("Timed out")

def execute(self, execution_request: ExecutionRequest) -> list[Any]:
"""Invoke the engine for the given ExecutionRequest, returning successful values or raising.
:return: A sequence of per-request results.
"""
returns, throws = self._execute(execution_request)

def product_request(
self,
product: type,
subjects: Sequence[Any | Params],
*,
poll: bool = False,
poll_delay: float | None = None,
timeout: float | None = None,
) -> list:
"""Executes a request for a single product for some subjects, and returns the products.
:param product: A product type for the request.
:param subjects: A list of subjects or Params instances for the request.
:param poll: See self.execution_request.
:param poll_delay: See self.execution_request.
:param timeout: See self.execution_request.
:returns: A list of the requested products, with length match len(subjects).
"""
request = self.execution_request(
[(product, subject) for subject in subjects],
poll=poll,
poll_delay=poll_delay,
timeout=timeout,
)
return self.execute(request)

Which is used by:

$ git grep determine_bootstrap_environment
src/python/pants/bin/local_pants_runner.py:from pants.core.util_rules.environments import determine_bootstrap_environment
src/python/pants/bin/local_pants_runner.py:            determine_bootstrap_environment(self.graph_session.scheduler_session),
src/python/pants/bsp/protocol.py:from pants.core.util_rules.environments import determine_bootstrap_environment
src/python/pants/bsp/protocol.py:        self._env_name = determine_bootstrap_environment(self._scheduler_session)
src/python/pants/core/util_rules/environments.py:def determine_bootstrap_environment(session: SchedulerSession) -> EnvironmentName:
src/python/pants/engine/streaming_workunit_handler.py:from pants.core.util_rules.environments import determine_bootstrap_environment
src/python/pants/engine/streaming_workunit_handler.py:            determine_bootstrap_environment(self._scheduler),
src/python/pants/goal/explorer.py:from pants.core.util_rules.environments import determine_bootstrap_environment
src/python/pants/goal/explorer.py:        env_name = determine_bootstrap_environment(graph_session.scheduler_session)
src/python/pants/init/engine_initializer.py:from pants.core.util_rules.environments import determine_bootstrap_environment
src/python/pants/init/engine_initializer.py:        env_name = determine_bootstrap_environment(self.scheduler_session)
src/python/pants/init/plugin_resolver.py:from pants.core.util_rules.environments import determine_bootstrap_environment
src/python/pants/init/plugin_resolver.py:        params = Params(request, determine_bootstrap_environment(session))
src/python/pants/init/specs_calculator.py:from pants.core.util_rules.environments import determine_bootstrap_environment
src/python/pants/init/specs_calculator.py:    bootstrap_environment = determine_bootstrap_environment(session)

@stuhood stuhood self-assigned this Feb 9, 2023
@stuhood stuhood added this to the 2.15.x milestone Feb 9, 2023
@stuhood
Copy link
Member

stuhood commented Feb 9, 2023

Thanks John.

The only thing that jumps out at me there is that there are 4 filesystem watching tasks (threads 11, 9, 7, 5), when I would expect 1. I don't see any lock interleavings/inversions in the stacktraces of any of the threads.

It's possible that the 4 filesystem watching tasks are using up enough blocking threads on the tokio runtime to cause a deadlock (depending on the computed value of rule_threads_max), but I'll need to investigate further.

@stuhood
Copy link
Member

stuhood commented Feb 9, 2023

It's possible that the 4 filesystem watching tasks are using up enough blocking threads on the tokio runtime to cause a deadlock (depending on the computed value of rule_threads_max), but I'll need to investigate further.

This was a red herring: the filesystem watching threads are dedicated threads, running outside the runtime. They just confusingly don't have thread names assigned.

Will continue looking.

@stuhood
Copy link
Member

stuhood commented Feb 9, 2023

It looks like this is actually a pantsd client hang instead -- the stack is much more obviously weird: pants-client.bt.txt

Patch forthcoming.

stuhood added a commit that referenced this issue Feb 9, 2023
Give dedicated threads names to assist in debugging. Otherwise, the OS will occasionally give them the names of _other_ threads ([see](#18211 (comment))), which is the opposite of helpful.
@jsirois
Copy link
Contributor Author

jsirois commented Feb 9, 2023

By obviously weird do you mean the non-main thread trying to get the GIL? That is not obviously weird to me. I think it might be good to record your analysis here since - unless I'm the only one who this isn't obvious to - you appear to be a man on the bus.

stuhood added a commit that referenced this issue Feb 9, 2023
Always explicitly shutdown executors, to avoid them being dropped on arbitrary threads (including under the GIL).

Fixes #18211.
stuhood added a commit to stuhood/pants that referenced this issue Feb 9, 2023
Always explicitly shutdown executors, to avoid them being dropped on arbitrary threads (including under the GIL).

Fixes pantsbuild#18211.
stuhood added a commit that referenced this issue Feb 9, 2023
Always explicitly shutdown executors, to avoid them being dropped on arbitrary threads (including under the GIL).

Fixes #18211.
@stuhood
Copy link
Member

stuhood commented Feb 9, 2023

By obviously weird do you mean the non-main thread trying to get the GIL?

Yes, but not just: "trying to get the GIL" -- rather: "blocked forever trying to get the GIL". The reason it can't get the GIL is because the main thread didn't release it before trying to shut down the Executor. The Executor is shutting down due to frame_dealloc -> _Py_Dealloc... i.e., it was dropped on the Python side.

As mentioned on the ticket: impl Drop for PyExecutor could theoretically acquire-then-release the GIL with allow_threads (because pyo3 doesn't expose an operation to ensure that the GIL isn't held without acquiring it first), but that might introduce other races if the object was dropped under some other Python-side lock. So I went for the log::warn, and fixed the case demonstrated in this stack.

@jsirois
Copy link
Contributor Author

jsirois commented Feb 9, 2023

Thanks - that's much better. I'd think obvious to probably no one working Pants right now except for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants