This repository has been archived by the owner on Apr 11, 2022. It is now read-only.

server: Investigate ECONNRESET #271

Open
pmespresso opened this issue Mar 31, 2020 · 5 comments · May be fixed by #321
Labels
help wanted Extra attention is needed P0-dropeverything

Comments

@pmespresso
Contributor

frontend == ExternalIP ==> LoadBalancer ====> Server Deployment == ClusterIP ==> Prisma (Node Watcher), port 4466 (problem area here)

the error comes from the server => prisma networking.

so far we attempted:

  1. turn it off and back on again (delete/replace server and nodewatcher deployments)
  2. scale the server (add a replicaset)

neither seems to have solved the underlying issue.

trying this now:

  3. scale the nodewatcher (double the replicas)

@pmespresso
Contributor Author

pmespresso commented Mar 31, 2020

so it seems like the nomidotwatcher Loadbalancer Service is abruptly cutting the connection....

https://stackoverflow.com/questions/17245881/how-do-i-debug-error-econnreset-in-node-js#17637900

indeed, there seem to be spikes in resource consumption around the times the problem occurs:

Screenshot 2020-03-31 at 18 59 54

@pmespresso
Contributor Author

ok so, as we discussed on Riot, see kubernetes/kubernetes#79365 (comment)

it looks very likely that our issue is that the nodewatcher is a pod made up of multiple containers, since we use the sidecar model for GCP.

according to that GitHub comment, this means we need to explicitly set the resources we need.

otherwise, we get this type of error: https://matrix.parity.io/_matrix/media/r0/download/matrix.parity.io/BlNsHgxHVFckekECbDLzOeYL
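
To make "explicitly set the resources" concrete, here is a minimal sketch that builds a patch body giving each container in the pod its own requests and limits. The container names `prisma` and `cloudsql-proxy` are the ones reported in the eviction message later in this thread; every CPU and memory number is a hypothetical placeholder, not a measured value, and `resources_patch` is a helper invented for this illustration.

```python
import json


def resources_patch(containers):
    """Build a Deployment patch body that sets resources per container.

    `containers` maps a container name to its `resources` stanza.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "resources": resources}
                        for name, resources in containers.items()
                    ]
                }
            }
        }
    }


# ASSUMPTION: all numbers below are placeholders for illustration only.
# Real values should come from the observed usage of each container.
patch = resources_patch({
    "prisma": {
        "requests": {"cpu": "250m", "memory": "1536Mi"},
        "limits": {"memory": "2Gi"},
    },
    "cloudsql-proxy": {
        "requests": {"cpu": "50m", "memory": "32Mi"},
        "limits": {"memory": "64Mi"},
    },
})

print(json.dumps(patch, indent=2))
```

The printed JSON could in principle be passed to `kubectl patch deployment nodewatcher --patch '...'`, assuming the Deployment is named `nodewatcher` as in the `kubectl get hpa` output in this thread. Setting the same values in the Deployment manifest itself is the more durable option.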

@Tbaut
Contributor

Tbaut commented Mar 31, 2020

For the record, we used to have:

kubectl get hpa
NAME          REFERENCE                TARGETS                        MINPODS   MAXPODS   REPLICAS   AGE
nodewatcher   Deployment/nodewatcher   <unknown>/85%, <unknown>/80%   1         5         4          41d

I removed it for now to see whether this happens again. After that, we should configure an HPA correctly.
image
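
One possible reading of the `<unknown>` targets above: a utilization target is a percentage of the container's resource request, so with no request set there is nothing to divide by. That is a common cause, not a confirmed diagnosis for this cluster. The sketch below is a simplified version of the replica calculation described in the Kubernetes HPA documentation; it ignores the tolerance band and per-pod averaging, and the function names are made up for this example.

```python
import math
from typing import Optional


def utilization_percent(usage: float, request: float) -> Optional[float]:
    """Usage as a percentage of the request, or None when no request is set."""
    if request <= 0:
        return None  # nothing to compare against, shown as <unknown>
    return 100.0 * usage / request


def desired_replicas(current: int, utilization: Optional[float], target: float,
                     min_pods: int, max_pods: int) -> int:
    """ceil(current * utilization / target), clamped to [min_pods, max_pods]."""
    if utilization is None:
        return current  # without a metric the autoscaler cannot decide
    wanted = math.ceil(current * utilization / target)
    return max(min_pods, min(max_pods, wanted))


# With a request of 0 the utilization is undefined and the replica count stays put.
print(utilization_percent(usage=512.0, request=0.0))
print(desired_replicas(4, None, target=85.0, min_pods=1, max_pods=5))

# With a request set, the same usage gives a usable number.
print(utilization_percent(usage=512.0, request=1024.0))
print(desired_replicas(4, 170.0, target=85.0, min_pods=1, max_pods=5))
```

The `min_pods=1` and `max_pods=5` values mirror the MINPODS and MAXPODS columns above; the usage figures are invented.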

@Tbaut
Contributor

Tbaut commented Apr 3, 2020

Nodewatcher recently had a couple of pods evicted. The GCP console says:

Pod The node was low on resource: [MemoryPressure].

pod describe says the same:

Message: The node was low on resource: memory. Container prisma was using 1278848Ki, which exceeds its request of 0. Container cloudsql-proxy was using 9364Ki, which exceeds its request of 0.

that's on last 50k. The running pod is healthy as usual and not showing any errors. I'm worried about the "exceeds its request of 0" part, though.
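
To put the eviction message in more readable units, here is a small sketch that pulls the per-container usage out of the message quoted above and converts Ki to Mi. The message text is copied from this comment; the regular expression and the helper name are assumptions made for this example, not a format the kubelet guarantees.

```python
import re

# Copied verbatim from the `pod describe` output above.
EVICTION_MESSAGE = (
    "The node was low on resource: memory. "
    "Container prisma was using 1278848Ki, which exceeds its request of 0. "
    "Container cloudsql-proxy was using 9364Ki, which exceeds its request of 0."
)

PATTERN = re.compile(
    r"Container (\S+) was using (\d+)Ki, which exceeds its request of (\d+)"
)


def container_usage_mi(message):
    """Map container name -> (usage in Mi, request as reported in the message)."""
    return {
        name: (int(used_ki) / 1024, int(request))
        for name, used_ki, request in PATTERN.findall(message)
    }


for name, (used_mi, request) in container_usage_mi(EVICTION_MESSAGE).items():
    print(f"{name}: {used_mi:.1f}Mi used, request {request}")
```

So prisma was holding about 1249Mi while requesting nothing. As far as I understand the kubelet's eviction ordering, pods whose usage exceeds their requests are evicted first under memory pressure, and with a request of 0 any usage at all exceeds it.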

@Tbaut
Contributor

Tbaut commented Apr 27, 2020

It happened, and it looks like there's a memory leak in prisma:
This is the nodewatcher pod
image

The deployment:
image

describe pod didn't give any info.
Logs from the prisma pod:

2020-04-27 07:28:32.786 CEST
{"key":"error/handled","payload":{"message":"No Node for the model Session with value 3612 for index found.","variables":"{\"data\":{\"index\":717,\"totalPoints\":\"0x00000000\",\"individualPoints\":{\"set\":\"0x00\"},\"eraStartSessionIndex\":{\"connect\":{\"index\":3612}}}}","stack_trace":"com.pris…
2020-04-27 07:28:38.335 CEST
{"key":"error/handled","payload":{"variables":"{\"data\":{\"index\":717,\"totalPoints\":\"0x00000000\",\"individualPoints\":{\"set\":\"0x00\"},\"eraStartSessionIndex\":{\"connect\":{\"index\":3612}}}}","stack_trace":"com.prisma.api.connector.jdbc.impl.NestedConnectInterpreter.$anonfun$addAction$1(Ne…
2020-04-27 07:28:44.756 CEST
{"clientId":"default$default","key":"error/handled","payload":{"stack_trace":"com.prisma.api.connector.jdbc.impl.NestedConnectInterpreter.$anonfun$addAction$1(NestedConnectInterpreter.scala:97)\\n slick.basic.BasicBackend$DatabaseDef.$anonfun$runInContextInline$1(BasicBackend.scala:172)\\n scala.con…
2020-04-27 07:28:50.219 CEST
{"requestId":"local:ck9i1k2liik3e0734mctmftpd","clientId":"default$default","key":"error/handled","payload":{"stack_trace":"com.prisma.api.connector.jdbc.impl.NestedConnectInterpreter.$anonfun$addAction$1(NestedConnectInterpreter.scala:97)\\n slick.basic.BasicBackend$DatabaseDef.$anonfun$runInContex…
2020-04-27 07:29:02.522 CEST
{"requestId":"local:ck9i1kc37ik460734vbdt9c8z","clientId":"default$default","key":"error/handled","payload":{"stack_trace":"com.prisma.api.connector.jdbc.impl.NestedConnectInterpreter.$anonfun$addAction$1(NestedConnectInterpreter.scala:97)\\n slick.basic.BasicBackend$DatabaseDef.$anonfun$runInContex
2020-04-27 08:11:51.346 CEST
[Warning] Management authentication is disabled. Enable it in your Prisma config to secure your server.
2020-04-27 08:11:51.350 CEST
Warning: Management API authentication is disabled. To protect your management server you should provide one (not both) of the environment variables 'CLUSTER_PUBLIC_KEY' (asymmetric, deprecated soon) or 'PRISMA_MANAGEMENT_API_JWT_SECRET' (symmetric JWT).
2020-04-27 08:11:54.104 CEST
Warning: Management API authentication is disabled. To protect your management server you should provide one (not both) of the environment variables 'CLUSTER_PUBLIC_KEY' (asymmetric, deprecated soon) or 'PRISMA_MANAGEMENT_API_JWT_SECRET' (symmetric JWT).
2020-04-27 08:11:57.831 CEST
Warning: Management API authentication is disabled. To protect your management server you should provide one (not both) of the environment variables 'CLUSTER_PUBLIC_KEY' (asymmetric, deprecated soon) or 'PRISMA_MANAGEMENT_API_JWT_SECRET' (symmetric JWT).
2020-04-27 08:13:48.584 CEST
Exception in thread "database-3" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:14:29.522 CEST
[WARNING] {} - Thread starvation or clock leap detected (housekeeper delta={}).
2020-04-27 08:14:57.620 CEST
Exception in thread "database-5" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:15:31.389 CEST
[WARNING] {} - Thread starvation or clock leap detected (housekeeper delta={}).
2020-04-27 08:16:24.234 CEST
Exception in thread "database-2" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:16:54.198 CEST
[WARNING] {} - Thread starvation or clock leap detected (housekeeper delta={}).
2020-04-27 08:34:40.609 CEST
Exception in thread "database-1" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:38:35.107 CEST
Warning: Management API authentication is disabled. To protect your management server you should provide one (not both) of the environment variables 'CLUSTER_PUBLIC_KEY' (asymmetric, deprecated soon) or 'PRISMA_MANAGEMENT_API_JWT_SECRET' (symmetric JWT).
2020-04-27 08:38:46.618 CEST
Warning: Management API authentication is disabled. To protect your management server you should provide one (not both) of the environment variables 'CLUSTER_PUBLIC_KEY' (asymmetric, deprecated soon) or 'PRISMA_MANAGEMENT_API_JWT_SECRET' (symmetric JWT).
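
For picking the heap errors out of a log like the one above, a minimal sketch follows. It assumes the export format shown here, where each timestamp sits on its own line followed by the message line. The excerpt is copied from the log above (some lines are left out), and the helper name is made up for this example.

```python
import re

# Excerpt copied from the log above: a timestamp line, then the message line.
LOG = """\
2020-04-27 08:13:48.584 CEST
Exception in thread "database-3" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:14:29.522 CEST
[WARNING] {} - Thread starvation or clock leap detected (housekeeper delta={}).
2020-04-27 08:14:57.620 CEST
Exception in thread "database-5" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:16:24.234 CEST
Exception in thread "database-2" java.lang.OutOfMemoryError: Java heap space
2020-04-27 08:34:40.609 CEST
Exception in thread "database-1" java.lang.OutOfMemoryError: Java heap space
"""

TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} \w+$")
OOM = re.compile(r'Exception in thread "([^"]+)" java\.lang\.OutOfMemoryError')


def oom_events(log):
    """Return (timestamp, thread name) for every OutOfMemoryError line."""
    events = []
    last_timestamp = None
    for line in log.splitlines():
        if TIMESTAMP.match(line):
            last_timestamp = line  # remember it for the message that follows
            continue
        match = OOM.search(line)
        if match:
            events.append((last_timestamp, match.group(1)))
    return events


for timestamp, thread in oom_events(LOG):
    print(timestamp, thread)
```

Lining these timestamps up against the memory graph of the nodewatcher pod may help tell a slow leak apart from a single oversized query, though the log alone does not settle that.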

@pmespresso pmespresso linked a pull request Apr 27, 2020 that will close this issue