server: Investigate ECONNRESET #271

pmespresso · 2020-03-31T03:51:34Z

frontend == ExternalIP ==> Loadbalancer ====> Server Deployment == ClusterIP ==> Prisma (Node Watcher), port 4466 <== problem area here

the error comes from the server => prisma networking.

so far we attempted:

1. turn it off and back on again (delete/replace server and nodewatcher deployments)
2. scale the server (add a replicaset)

neither seems to have actually solved the underlying issue.

trying this now....

3. scale the nodewatcher (double replicas)


pmespresso · 2020-03-31T10:57:09Z

so it seems like the nomidotwatcher Loadbalancer Service is abruptly cutting the connection....

https://stackoverflow.com/questions/17245881/how-do-i-debug-error-econnreset-in-node-js#17637900

indeed there seem to be some spikes in resource consumption around the problem periods.

pmespresso · 2020-03-31T12:54:10Z

ok so as we discussed on Riot: kubernetes/kubernetes#79365 (comment)

it looks very likely that our issue is that the nodewatcher pod is made up of multiple containers, since we use the sidecar model for GCP.

according to that github comment, it means we need to explicitly set the resources we need.

otherwise, we get this type of error: https://matrix.parity.io/_matrix/media/r0/download/matrix.parity.io/BlNsHgxHVFckekECbDLzOeYL
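For illustration, explicitly setting resources on both containers of the nodewatcher pod could look roughly like the sketch below. It builds the `resources` stanza as a JSON patch body. The container names `prisma` and `cloudsql-proxy` come from the pod describe output in this thread; every CPU and memory value is a hypothetical placeholder, not something measured or agreed on here.

```python
import json


def container_resources(cpu_request, mem_request, cpu_limit, mem_limit):
    """Build the `resources` stanza for one container in a pod spec."""
    return {
        "requests": {"cpu": cpu_request, "memory": mem_request},
        "limits": {"cpu": cpu_limit, "memory": mem_limit},
    }


# Hypothetical numbers: real values should be sized from observed usage.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "prisma",
                        "resources": container_resources("500m", "1536Mi", "1", "2Gi"),
                    },
                    {
                        "name": "cloudsql-proxy",
                        "resources": container_resources("50m", "32Mi", "100m", "64Mi"),
                    },
                ]
            }
        }
    }
}

# Print the patch body so it can be reviewed before being applied.
print(json.dumps(patch, indent=2))
```

The point is only that each container in a multi-container pod needs its own requests and limits; setting them on one container leaves the sidecar unaccounted for.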

Tbaut · 2020-03-31T15:29:52Z

For the record, we used to have:

kubectl get hpa
NAME          REFERENCE                TARGETS                        MINPODS   MAXPODS   REPLICAS   AGE
nodewatcher   Deployment/nodewatcher   <unknown>/85%, <unknown>/80%   1         5         4          41d

Which I removed for now, to see if this happens again. We should configure an hpa correctly then.
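A possible reason for the `<unknown>` targets, not confirmed in this thread: the HPA expresses resource utilization as a percentage of the container's request, so with no request set there is nothing to divide by. The sketch below shows that calculation together with the scaling rule from the Kubernetes documentation, `ceil(currentReplicas * currentMetric / targetMetric)`. It ignores the HPA's tolerance window and stabilization behaviour; the function names are made up for this example.

```python
import math
from typing import Optional


def utilization_percent(usage: float, request: float) -> Optional[float]:
    """Usage as a percentage of the request; None when no request is set."""
    if request <= 0:
        return None  # nothing to divide by, comparable to "<unknown>"
    return 100.0 * usage / request


def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float, min_pods: int, max_pods: int) -> int:
    """Simplified HPA rule, clamped to the configured MINPODS/MAXPODS."""
    raw = math.ceil(current_replicas * current_util / target_util)
    return max(min_pods, min(max_pods, raw))


# No request set: utilization cannot be computed.
print(utilization_percent(400, 0))
# 4 replicas at 120% against an 80% target would want 6, capped at MAXPODS=5.
print(desired_replicas(4, 120.0, 80.0, min_pods=1, max_pods=5))
```

If this reading is right, a correctly configured hpa depends on the resource requests being set first.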

Tbaut · 2020-04-03T09:23:58Z

Nodewatcher recently got a couple pods evicted, the GCP console says:

Pod The node was low on resource: [MemoryPressure].

pod describe says the same:

Message: The node was low on resource: memory. Container prisma was using 1278848Ki, which exceeds its request of 0. Container cloudsql-proxy was using 9364Ki, which exceeds its request of 0.

that's on last 50k. The running pod is as usual healthy and not showing any error. I'm worried about the "exceeds its request of 0." though
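To make the "exceeds its request of 0" part easier to read, here is a small sketch that parses the eviction message quoted above and converts the usage to MiB. A request of 0 means no memory request was set for that container, so any usage at all counts as exceeding it; as far as I understand, the kubelet prefers to evict pods whose usage exceeds their requests when a node is under memory pressure. The regular expression is only an assumption based on this one message.

```python
import re

MESSAGE = (
    "The node was low on resource: memory. "
    "Container prisma was using 1278848Ki, which exceeds its request of 0. "
    "Container cloudsql-proxy was using 9364Ki, which exceeds its request of 0."
)

# Matches one "Container <name> was using <N>Ki, ... request of <R>." clause.
PATTERN = re.compile(
    r"Container (?P<name>[\w-]+) was using (?P<used>\d+)Ki, "
    r"which exceeds its request of (?P<request>\w+)\."
)


def parse_eviction(message):
    """Return per-container usage (in MiB) and the request it exceeded."""
    return [
        {
            "container": m["name"],
            "used_mib": int(m["used"]) / 1024,
            "request": m["request"],
        }
        for m in PATTERN.finditer(message)
    ]


for entry in parse_eviction(MESSAGE):
    print(f'{entry["container"]}: {entry["used_mib"]:.1f} MiB used, request {entry["request"]}')
```

Read this way, prisma was using roughly 1.2 GiB with nothing requested, which is a starting point for sizing its request.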

Tbaut · 2020-04-27T08:31:55Z

It happened and it looks like there's a memory leak in prisma:
This is the nodewatcher pod

The deployment:

describe pod didn't give any info.
Logs from the prisma pod:

2020-04-27 07:28:32.786 CEST
{"key":"error/handled","payload":{"message":"No Node for the model Session with value 3612 for index found.","variables":"{\"data\":{\"index\":717,\"totalPoints\":\"0x00000000\",\"individualPoints\":{\"set\":\"0x00\"},\"eraStartSessionIndex\":{\"connect\":{\"index\":3612}}}}","stack_trace":"com.pris…
2020-04-27 07:28:38.335 CEST
{"key":"error/handled","payload":{"variables":"{\"data\":{\"index\":717,\"totalPoints\":\"0x00000000\",\"individualPoints\":{\"set\":\"0x00\"},\"eraStartSessionIndex\":{\"connect\":{\"index\":3612}}}}","stack_trace":"com.prisma.api.connector.jdbc.impl.NestedConnectInterpreter.$anonfun$addAction$1(Ne…
2020-04-27 07:28:44.756 CEST
{"clientId":"default$default","key":"error/handled","payload":{"stack_trace":"com.prisma.api.connector.jdbc.impl.NestedConnectInterpreter.$anonfun$addAction$1(NestedConnectInterpreter.scala:97)\\n slick.basic.BasicBackend$DatabaseDef.$anonfun$runInContextInline$1(BasicBackend.scala:172)\\n scala.con…
2020-04-27 07:28:50.219 CEST
{"requestId":"local:ck9i1k2liik3e0734mctmftpd","clientId":"default$default","key":"error/handled","payload":{"stack_trace":"com.prisma.api.connector.jdbc.impl.NestedConnectInterpreter.$anonfun$addAction$1(NestedConnectInterpreter.scala:97)\\n slick.basic.BasicBackend$DatabaseDef.$anonfun$runInContex…
2020-04-27 07:29:02.522 CEST
{"requestId":"local:ck9i1kc37ik460734vbdt9c8z","clientId":"default$default","key":"error/handled","payload":{"stack_trace":"com.prisma.api.connector.jdbc.impl.NestedConnectInterpreter.$anonfun$addAction$1(NestedConnectInterpreter.scala:97)\\n slick.basic.BasicBackend$DatabaseDef.$anonfun$runInContex…
2020-04-27 08:11:51.346 CEST
[Warning] Management authentication is disabled. Enable it in your Prisma config to secure your server.
2020-04-27 08:11:51.350 CEST
Warning: Management API authentication is disabled. To protect your management server you should provide one (not both) of the environment variables 'CLUSTER_PUBLIC_KEY' (asymmetric, deprecated soon) or 'PRISMA_MANAGEMENT_API_JWT_SECRET' (symmetric JWT).
2020-04-27 08:11:54.104 CEST
Warning: Management API authentication is disabled. To protect your management server you should provide one (not both) of the environment variables 'CLUSTER_PUBLIC_KEY' (asymmetric, deprecated soon) or 'PRISMA_MANAGEMENT_API_JWT_SECRET' (symmetric JWT).
2020-04-27 08:11:57.831 CEST
Warning: Management API authentication is disabled. To protect your management server you should provide one (not both) of the environment variables 'CLUSTER_PUBLIC_KEY' (asymmetric, deprecated soon) or 'PRISMA_MANAGEMENT_API_JWT_SECRET' (symmetric JWT).
2020-04-27 08:13:48.584 CEST
Exception in thread "database-3" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:14:29.522 CEST
[WARNING] {} - Thread starvation or clock leap detected (housekeeper delta={}).
2020-04-27 08:14:57.620 CEST
Exception in thread "database-5" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:15:31.389 CEST
[WARNING] {} - Thread starvation or clock leap detected (housekeeper delta={}).
2020-04-27 08:16:24.234 CEST
Exception in thread "database-2" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:16:54.198 CEST
[WARNING] {} - Thread starvation or clock leap detected (housekeeper delta={}).
2020-04-27 08:34:40.609 CEST
Exception in thread "database-1" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:38:35.107 CEST
Warning: Management API authentication is disabled. To protect your management server you should provide one (not both) of the environment variables 'CLUSTER_PUBLIC_KEY' (asymmetric, deprecated soon) or 'PRISMA_MANAGEMENT_API_JWT_SECRET' (symmetric JWT).
2020-04-27 08:38:46.618 CEST
Warning: Management API authentication is disabled. To protect your management server you should provide one (not both) of the environment variables 'CLUSTER_PUBLIC_KEY' (asymmetric, deprecated soon) or 'PRISMA_MANAGEMENT_API_JWT_SECRET' (symmetric JWT).
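To get a quick overview of an export like the one above, a sketch along these lines pairs each timestamp line with the message that follows it and tallies the messages by kind. The category names and matching rules are assumptions made for this example, and the sample input is a few lines copied from the log above.

```python
import re
from collections import Counter

TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} CEST$")


def pair_entries(lines):
    """Pair each timestamp line with the message line that follows it."""
    entries, ts = [], None
    for line in lines:
        line = line.strip()
        if TIMESTAMP.match(line):
            ts = line
        elif line and ts is not None:
            entries.append((ts, line))
            ts = None
    return entries


def classify(message):
    """Bucket a message into a coarse category (assumed rules)."""
    if "java.lang.OutOfMemoryError" in message:
        return "oom"
    if "Thread starvation or clock leap" in message:
        return "starvation"
    if '"error/handled"' in message:
        return "handled_error"
    if "Management" in message and "authentication is disabled" in message:
        return "auth_warning"
    return "other"


SAMPLE = """\
2020-04-27 08:11:51.346 CEST
[Warning] Management authentication is disabled. Enable it in your Prisma config to secure your server.
2020-04-27 08:13:48.584 CEST
Exception in thread "database-3" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:14:29.522 CEST
[WARNING] {} - Thread starvation or clock leap detected (housekeeper delta={}).
2020-04-27 08:14:57.620 CEST
Exception in thread "database-5" java.lang.OutOfMemoryError: Java heap space
"""

entries = pair_entries(SAMPLE.splitlines())
counts = Counter(classify(msg) for _, msg in entries)
first_oom = next(ts for ts, msg in entries if classify(msg) == "oom")
print(counts)
print("first OOM at", first_oom)
```

Running it over the full export would show how the heap-space errors cluster in time relative to the handled errors before them.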

pmespresso added the help wanted label Mar 31, 2020

This was referenced Mar 31, 2020

fix: server deployment strategy, nodewatcher replicas #270

Merged

fix: remove fillin-job, update image for last50k #272

Merged

server: subscriptions for validators and blocknumber not working #273

Closed

Tbaut mentioned this issue Mar 31, 2020

Prisma DB not reachable despite no error #249

Closed

pmespresso mentioned this issue Apr 1, 2020

feat: hpa and nodewatcher deployment resources #281

Closed

pmespresso added the P0-dropeverything label Apr 11, 2020

pmespresso mentioned this issue Apr 11, 2020

feat: set cpu/mem requests and limits for prisma and cloudsql #298

Merged

pmespresso linked a pull request Apr 27, 2020 that will close this issue

fix(#267): nodewatcher memory leak(s) #321

Draft

6 tasks
