Connections entering unstable state with waiting_for_connection error #2422

joshbautista · 2024-10-07T16:32:00Z

Current Behavior

Beginning in versions 0.38.0 and 0.39.0, we've randomly observed the following error from PGAdapter:

CallOptions deadline exceeded after 54.999971981s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[buffered_nanos=55012051961, waiting_for_connection]]]

The deadline of ~55 seconds is expected since we run SET statement_timeout = 55000; on every Read query to prevent Spanner from running queries longer than necessary.

When a PGAdapter instance enters this unstable state, it seems some connections in the pool are tainted while others work just fine. We can run psql against a faulted instance and run simple queries (e.g. select * from table limit 1;) successfully a few times, but it will stall randomly and eventually time out with the waiting_for_connection error. This stall occurs without a statement_timeout set as well, we just won't see the error come through as it's attempting to run indefinitely.

We have also seen this on write statements (e.g. UPDATE, INSERT) albeit with a lower timeout, 30 seconds, which does seem to match Spanner gRPC configuration defaults.

Reproducing the error state is quite difficult since we get all sorts of traffic to each pod and it only seems to manifest itself against one or two pods out of 120 at a time. It will generally occur 24 hours after a new Deployment, but sometimes will show up in as little as a few hours.

Reverting to version 0.37.0 fixes this behavior (so far at least).

Context (Environment)

Running PGAdapter 0.38.0 or 0.39.0 as a sidecar in GKE
120 Pods in the Deployment
PGAdapter configured with 1 vCPU and 2GB Memory limits (actual CPU use hovers at ~ 100mCPU)
PGAdapter executed with the following args:

- args:
  - -p
  - <REDACTED>
  - -i
  - <REDACTED>
  - -d
  - <REDACTED>
  - -enable_otel
  - -otel_trace_ratio=0.05

Environment Variables:
JDK_JAVA_OPTIONS = "-Xmx1600M -Xms1600M -XshowSettings:vm"

PGAdapter is running as a sidecar with a Node.js app using Knex.js as a query builder and connection pooler (using tarn.js under the hood)
The application creates two Knex.js connection pools, one for reading and one for writing where the read pool is configured to use SET statement_timeout = 55000;
Knex.js read pool is configured with 30/75 min/max connections
Knex.js write pool is configured with 40/40 min/max connections
Both pools point to the same PGAdapter instance
PGAdapter session values are kept at the default, which we believe is 100/400/4 min/max/gRPC Channels

Other Information

CPU resources seem fine as it hovers around 100mCPU
Memory resources seem fine based on values reported in the thread dumps and no indication of aggressive GC
Attempted to issue RESET ALL to see if it was a state issue but there was no improvement

We were able to gather some thread dumps (kill -3 <PID>) in case you're able to glean insight from these:

pgadapter-0.37-good-thread-dump.txt - PGAdapter 0.37.0 in a good state running for a few days
pgadapter-0.39-broken-thread-dump.txt - PGAdapter 0.39.0 in a broken state sitting idle
pgadapter-0.39-broken-thread-dump-during-query.txt - PGAdapter 0.39.0 in a broken state while psql is waiting for a query to execute
pgadapter-0.39-good-thread-dump-new-pod.txt - PGAdapter 0.39.0 in a good state running for 30 minutes
pgadapter-0.39-good-thread-dump-old-pod.txt - PGAdapter 0.39.0 in a good state running for at least 3 days

The text was updated successfully, but these errors were encountered:

olavloite · 2024-10-14T08:26:27Z

Possibly relevant changes between 0.37.0 and 0.38.0:

com.google.cloud:libraries-bom upgraded from 26.44.0 to 26.47.0
The upgrade of libraries-bom means: com.google.cloud:google-cloud-spanner bumped from 6.72.0 to 6.74.1
v0.38.0 added support for shutdown modes SMART, FAST, and IMMEDIATE and added shutdown handlers.
v0.38.0 added support for the open-source Golang migrate project. These changes only add a detection mechanism for the project, and a some simple query workarounds.

Possibly relevant changes between google-cloud-spanner 6.72.0 and 6.74.1:

Cancel queries if the client is closed while queries are still running: feat: add option for cancelling queries when closing client googleapis/java-spanner#3276
Set the core pool size to 1 for the multiplexed session maintainer: fix: use core pool size 1 for maintainer googleapis/java-spanner#3314

One interesting difference in the thread dumps:

Both the broken thread dumps contain this entry:

"grpc-transport-17440" #18151 [18152] daemon prio=5 os_prio=0 cpu=4.50ms elapsed=8555.78s tid=0x00007dd26801e650 nid=18152 waiting on condition  [0x00007dd26d5f9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000000caf17060> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/Unknown Source)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/Unknown Source)
	at java.util.concurrent.LinkedBlockingQueue.poll([email protected]/Unknown Source)
	at com.google.cloud.spanner.GrpcStreamIterator.computeNext(GrpcStreamIterator.java:107)
	at com.google.cloud.spanner.GrpcStreamIterator.computeNext(GrpcStreamIterator.java:35)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
	at com.google.cloud.spanner.ResumableStreamIterator.computeNext(ResumableStreamIterator.java:246)
	at com.google.cloud.spanner.ResumableStreamIterator.computeNext(ResumableStreamIterator.java:55)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
	at com.google.cloud.spanner.GrpcValueIterator.ensureReady(GrpcValueIterator.java:134)
	at com.google.cloud.spanner.GrpcValueIterator.getMetadata(GrpcValueIterator.java:110)
	at com.google.cloud.spanner.GrpcResultSet.next(GrpcResultSet.java:81)
	at com.google.cloud.spanner.SessionPool$PooledSession.keepAlive(SessionPool.java:1804)
	at com.google.cloud.spanner.SessionPool$PooledSession.access$4900(SessionPool.java:1588)
	at com.google.cloud.spanner.SessionPool$PoolMaintainer.keepAliveSessions(SessionPool.java:2133)
	at com.google.cloud.spanner.SessionPool$PoolMaintainer.maintainPool(SessionPool.java:2066)
	at com.google.cloud.spanner.SessionPool$PoolMaintainer$$Lambda/0x00007dd34a31fd40.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Unknown Source)
	at java.util.concurrent.FutureTask.runAndReset([email protected]/Unknown Source)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run([email protected]/Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/Unknown Source)
	at java.lang.Thread.runWith([email protected]/Unknown Source)
	at java.lang.Thread.run([email protected]/Unknown Source)

None of the working thread dumps contain this entry. This stack dump comes from the keep-alive queries in the normal session pool. Note that this should be unrelated to the change to the core pool size for maintaining multiplexed sessions, as that is a completely separate part of the code, and only sounds/look the same due to the names.

The thread dump seems to indicate that the keep-alive query is stuck while waiting for results to be returned from Spanner for a simple SELECT 1 query.

olavloite · 2024-10-14T09:23:12Z

The thread dump where a psql query is stuck contains this stackdump for the query that is stuck:

"ConnectionHandler-51" #18180 [26025] daemon prio=5 os_prio=0 cpu=76.47ms elapsed=443.52s tid=0x00007dd3080503a0 nid=26025 waiting on condition  [0x00007dd26d4f9000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000000b6ab0538> (a com.google.common.util.concurrent.AbstractCatchingFuture$CatchingFuture)
	at java.util.concurrent.locks.LockSupport.park([email protected]/Unknown Source)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:564)
	at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:91)
	at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:67)
	at com.google.cloud.spanner.SpannerApiFutures.getOrNull(SpannerApiFutures.java:31)
	at com.google.cloud.spanner.SpannerApiFutures.get(SpannerApiFutures.java:26)
	at com.google.cloud.spanner.connection.ConnectionImpl.internalExecuteQuery(ConnectionImpl.java:1753)
	at com.google.cloud.spanner.connection.ConnectionImpl.internalExecute(ConnectionImpl.java:1184)
	at com.google.cloud.spanner.connection.ConnectionImpl.execute(ConnectionImpl.java:1163)
	at com.google.cloud.spanner.pgadapter.statements.BackendConnection$Execute.executeOnSpannerWithLogging(BackendConnection.java:481)
	at com.google.cloud.spanner.pgadapter.statements.BackendConnection$Execute.executeOnSpanner(BackendConnection.java:460)
	at com.google.cloud.spanner.pgadapter.statements.BackendConnection$Execute.analyzeOrExecute(BackendConnection.java:442)
	at com.google.cloud.spanner.pgadapter.statements.BackendConnection$Execute.doExecute(BackendConnection.java:368)
	at com.google.cloud.spanner.pgadapter.statements.BackendConnection$BufferedStatement.execute(BackendConnection.java:234)
	at com.google.cloud.spanner.pgadapter.statements.BackendConnection.flush(BackendConnection.java:1165)
	at com.google.cloud.spanner.pgadapter.statements.BackendConnection.sync(BackendConnection.java:1049)
	at com.google.cloud.spanner.pgadapter.statements.ExtendedQueryProtocolHandler.sync(ExtendedQueryProtocolHandler.java:194)
	at com.google.cloud.spanner.pgadapter.wireprotocol.SyncMessage.sendPayload(SyncMessage.java:40)
	at com.google.cloud.spanner.pgadapter.wireprotocol.WireMessage.send(WireMessage.java:81)
	at com.google.cloud.spanner.pgadapter.statements.SimpleQueryStatement.execute(SimpleQueryStatement.java:111)
	at com.google.cloud.spanner.pgadapter.wireprotocol.QueryMessage.sendPayload(QueryMessage.java:65)
	at com.google.cloud.spanner.pgadapter.wireprotocol.WireMessage.send(WireMessage.java:81)
	at com.google.cloud.spanner.pgadapter.ConnectionHandler.handleMessages(ConnectionHandler.java:506)
	at com.google.cloud.spanner.pgadapter.ConnectionHandler.runConnection(ConnectionHandler.java:420)
	at com.google.cloud.spanner.pgadapter.ConnectionHandler.run(ConnectionHandler.java:341)
	at java.lang.Thread.runWith([email protected]/Unknown Source)
	at java.lang.Thread.run([email protected]/Unknown Source)

This indicates that PGAdapter is waiting for the Java client to execute and return the results of a query. The corresponding Spanner connection that is executing the query:

"connection-executor-50" #18181 [26026] daemon prio=5 os_prio=0 cpu=30.51ms elapsed=431.87s tid=0x00007dd1d80201e0 nid=26026 waiting on condition  [0x00007dd26d3fa000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000000b6ab0638> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/Unknown Source)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/Unknown Source)
	at java.util.concurrent.LinkedBlockingQueue.poll([email protected]/Unknown Source)
	at com.google.cloud.spanner.GrpcStreamIterator.computeNext(GrpcStreamIterator.java:107)
	at com.google.cloud.spanner.GrpcStreamIterator.computeNext(GrpcStreamIterator.java:35)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
	at com.google.cloud.spanner.ResumableStreamIterator.computeNext(ResumableStreamIterator.java:246)
	at com.google.cloud.spanner.ResumableStreamIterator.computeNext(ResumableStreamIterator.java:55)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
	at com.google.cloud.spanner.GrpcValueIterator.ensureReady(GrpcValueIterator.java:134)
	at com.google.cloud.spanner.GrpcValueIterator.getMetadata(GrpcValueIterator.java:110)
	at com.google.cloud.spanner.GrpcResultSet.next(GrpcResultSet.java:81)
	at com.google.cloud.spanner.connection.DirectExecuteResultSet.<init>(DirectExecuteResultSet.java:68)
	at com.google.cloud.spanner.connection.DirectExecuteResultSet.ofResultSet(DirectExecuteResultSet.java:62)
	at com.google.cloud.spanner.connection.SingleUseTransaction.lambda$executeQueryAsync$0(SingleUseTransaction.java:267)
	at com.google.cloud.spanner.connection.SingleUseTransaction$$Lambda/0x00007dd34a482ca8.call(Unknown Source)
	at io.grpc.Context$2.call(Context.java:583)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at io.opentelemetry.context.Context.lambda$wrap$1(Context.java:212)
	at io.opentelemetry.context.Context$$Lambda/0x00007dd34a487b70.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/Unknown Source)
	at java.lang.Thread.runWith([email protected]/Unknown Source)
	at java.lang.Thread.run([email protected]/Unknown Source)

This is an indication that also this query is just stuck waiting for results to come back from Spanner. The session pool keep-alive query does not use a custom timeout, and instead just uses the default timeouts of the Spanner library. This could be consistent with the observation that INSERT and UPDATE statements from the application that use the default timeouts in the Spanner library also suffer from the same issue.

olavloite · 2024-10-14T10:05:00Z

Interesting additional fact: The keep-alive queries that are executed by the session pools in the cases where the queries are stuck, are being executed by a grpc-transport-executor thread. That is also the only existing grpc-transport-executor thread at the time of those queries, which to me indicates that no results will ever be delivered, because that would normally be done by the grpc-transport-executor thread(s).

The session pool uses the gRPC transport executor for session maintenance. This executor should not be used by any other tasks than internal gRPC tasks. This change therefore creates simple executors for session maintenance and session creation. Updates GoogleCloudPlatform/pgadapter#2422

olavloite · 2024-10-15T14:34:27Z

The grpc-transport-executor threads are a dead end. Those threads do not play a role in the gRPC calls involved in this case.
The keep-alive query that shows up in the broken thread dumps is not the cause, but a logical result of the state that PGAdapter is in. The background thread that executes the keep-alive queries does so once every 10 seconds, and there is never more than 1 of them running at any time. PGAdapter is in a state where no data is being received from Spanner. That means that these keep-alive queries are also just waiting for an answer from Spanner. Any PGAdapter instance that has been in this state for at least 10 seconds would therefore have one of these queries stuck.

The only way so far that I've been able to reproduce this issue is by dropping all incoming network traffic to the underlying Spanner client. That means that outgoing messages do not cause any errors, but queries or other statements just wait indefinitely for a response from Spanner.

olavloite · 2024-10-25T20:41:59Z

The specific waiting_for_connection part of the error message is logged here: https://github.com/grpc/grpc-java/blob/0b2c17d0da84e92fbb104bd46ccb2f61860db04b/core/src/main/java/io/grpc/internal/DelayedStream.java#L111

This is logged when the timeout occurs while the client is waiting for the stream (connection) to be established. That is: The client has started an attempt to execute a streaming RPC. This then times out while waiting for the stream to be established.

grpc-core versions in the relevant PGAdapter versions are:

v0.37.0: io.grpc:grpc-core:jar:1.65.1
v0.38.0: io.grpc:grpc-core:jar:1.66.0

olavloite self-assigned this Oct 14, 2024

olavloite mentioned this issue Oct 14, 2024

chore: use an own executor for session maintenance googleapis/java-spanner#3409

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Connections entering unstable state with waiting_for_connection error #2422

Connections entering unstable state with waiting_for_connection error #2422

joshbautista commented Oct 7, 2024

olavloite commented Oct 14, 2024

olavloite commented Oct 14, 2024

olavloite commented Oct 14, 2024

olavloite commented Oct 15, 2024

olavloite commented Oct 25, 2024

Connections entering unstable state with waiting_for_connection error #2422

Connections entering unstable state with waiting_for_connection error #2422

Comments

joshbautista commented Oct 7, 2024

Current Behavior

Context (Environment)

Other Information

olavloite commented Oct 14, 2024

olavloite commented Oct 14, 2024

olavloite commented Oct 14, 2024

olavloite commented Oct 15, 2024

olavloite commented Oct 25, 2024