Aeron Reliability #890
We had a bunch of issues with aeron (and we also run in GKE 😉), and for long-running jobs here are our de-facto settings, to be taken with a grain of salt, I might add...
# Timeout for client liveness in nanoseconds.
aeron.client.liveness.timeout=20000000000
# Timeout for image liveness in nanoseconds.
aeron.image.liveness.timeout=20000000000
# Increase the size of the maximum transmission unit to reduce system calls in a throughput scenario.
aeron.mtu.length=16384
# Set the initial window for flow control to account for BDP.
#aeron.rcv.initial.window.length=2097152
# Increase the size of OS socket receive buffer (SO_RCVBUF) to account for Bandwidth Delay Product (BDP) on a high bandwidth network.
#aeron.socket.so_rcvbuf=2097152
# Increase the size of OS socket send buffer (SO_SNDBUF) to account for Bandwidth Delay Product (BDP) on a high bandwidth network.
#aeron.socket.so_sndbuf=2097152
# Length (in bytes) of the log buffers for publication terms.
aeron.term.buffer.length=65536
# Use sparse files for the term buffers to reduce memory footprint (set to false to pre-allocate and avoid page faults).
aeron.term.buffer.sparse.file=true
# Disable bound checking to reduce instruction path on private secure networks.
agrona.disable.bounds.checks=true

We run aeron as a sidecar container with the following spec (an embedded-driver sketch of the same settings is at the end of this comment):
- args:
  - /oscaro/etc/aeron.properties
  env:
  - name: JAVA_OPTS
    value: -Xmx256m
  - name: PROMETHEUS_METRICS_PORT
    value: "8091"
  image: eu.gcr.io/oscaro-cloud/oscaro/aeron-driver:1.9.3-e678e95
  imagePullPolicy: IfNotPresent
  name: aeron
  ports:
  - containerPort: 40200
    protocol: TCP
  - containerPort: 40200
    protocol: UDP
  - containerPort: 8091
    name: aeron-metrics
    protocol: TCP
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 250m
      memory: 256Mi
  volumeMounts:
  - mountPath: /oscaro/etc
    name: config
  - mountPath: /dev/shm
    name: aeron
...
volumes:
- configMap:
    name: pipeline
  name: config
- emptyDir:
    medium: Memory
  name: aeron

Apparently, as was mentioned elsewhere, we should not be setting CPU limits on the container. We'll see what happens, but for now it seems relatively stable, even if latency takes a bit of a hit. We are unable to set the UDP socket buffers because we run our cluster on COS, which as of 1.10 doesn't allow changing sysctl parameters from within pods.
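If you embed the media driver in your own process instead of running the standalone image, the same knobs are exposed on MediaDriver.Context. A minimal sketch mirroring the properties above (method names as in the Aeron 1.9.x Java API; this is not the entrypoint our image actually uses):

import io.aeron.driver.MediaDriver;
import org.agrona.concurrent.ShutdownSignalBarrier;

public class EmbeddedDriver
{
    public static void main(final String[] args)
    {
        final MediaDriver.Context ctx = new MediaDriver.Context()
            .clientLivenessTimeoutNs(20_000_000_000L)  // aeron.client.liveness.timeout
            .imageLivenessTimeoutNs(20_000_000_000L)   // aeron.image.liveness.timeout
            .mtuLength(16 * 1024)                      // aeron.mtu.length
            .publicationTermBufferLength(64 * 1024)    // aeron.term.buffer.length
            .termBufferSparseFile(true);               // aeron.term.buffer.sparse.file

        // agrona.disable.bounds.checks is a system property, not a Context setting:
        // pass -Dagrona.disable.bounds.checks=true on the JVM command line.
        try (MediaDriver ignored = MediaDriver.launch(ctx))
        {
            new ShutdownSignalBarrier().await();       // run until SIGINT/SIGTERM
        }
    }
}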
That's a ton of great information, thanks! I was reluctant to increase settings like the liveness timeout (beyond our current 10 seconds) because I was afraid we were just masking the issue. Did you confirm that CPU throttling is your issue and you're just trying to mitigate it at this point?
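For what it's worth, one quick way to confirm whether the kernel is actually throttling the container is to read the cgroup CPU accounting from inside the pod. A minimal sketch, assuming the cgroup v1 layout (on some nodes the file lives under /sys/fs/cgroup/cpu,cpuacct/ instead):

import java.nio.file.Files;
import java.nio.file.Paths;

public class ThrottleCheck
{
    public static void main(final String[] args) throws Exception
    {
        // nr_throttled   = number of scheduling periods in which the cgroup hit its CPU quota
        // throttled_time = total time (ns) the cgroup spent throttled
        for (final String line : Files.readAllLines(Paths.get("/sys/fs/cgroup/cpu/cpu.stat")))
        {
            if (line.startsWith("nr_throttled") || line.startsWith("throttled_time"))
            {
                System.out.println(line);
            }
        }
    }
}

If nr_throttled keeps climbing while the job runs, the conductor threads are likely being held back by the CPU quota rather than by GC or load.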
Just for reference here are our
We also spent countless hours trying to find a good configuration in archived Slack discussions, so this thread is much appreciated.
We took our onyx cluster (well, the 0.14 one), isolated it into its own pool, and dropped the CPU limits. That seems to have done the trick. I didn't want to jinx it over the weekend, but we've been running since Friday afternoon with no Aeron exceptions. Previously we couldn't go 24 hours without an exception and a killed job. No matter which way you slice it, even if we get the exception today, this is a tremendous improvement.
We're having trouble with aeron exceptions in the onyx client. They are most often Client Conductor Timeouts, though occasionally we see other aeron-related exceptions. These exceptions kill the job 1-4 times per day (it's a long-running job).
We can't seem to make these exceptions go away. GC does not appear to be an issue, nor do we see CPU usage spikes (our systems are running in GKE). Increasing CPU limits doesn't appear to help. The threads just don't seem to be woken up in time to conduct their checks.
I'm pretty much stuck at this point, trying various fixes while drawing up backup plans that don't involve Onyx. Any help pointing me in the right direction would be appreciated.
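In case it helps anyone poking at this from the client side: the raw Aeron client exposes the driver-liveness timeout and an error handler on its context. A minimal sketch of the Aeron-level knobs only (the directory path and values are placeholders, and I haven't checked whether Onyx lets you inject a pre-built context):

import io.aeron.Aeron;

public class ClientContextSketch
{
    public static void main(final String[] args)
    {
        final Aeron.Context ctx = new Aeron.Context();
        ctx.aeronDirectoryName("/dev/shm/aeron");   // must match the sidecar's aeron.dir (hypothetical path)
        ctx.driverTimeoutMs(20_000);                // how long the client waits on driver keepalives before giving up
        ctx.errorHandler(t -> t.printStackTrace()); // surface conductor/driver timeouts instead of dying silently

        try (Aeron aeron = Aeron.connect(ctx))
        {
            // publications/subscriptions would be created here; Onyx normally does this wiring for us
        }
    }
}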