I have deployed an m3db cluster with the operator (m3db-operator-0.13.0). I have a default namespace for live data and an aggregated namespace for long-term storage. My spec is:
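(The values below are illustrative placeholders rather than my exact spec; the relevant shape is two namespaces, one unaggregated for live data and one aggregated for long-term retention.)

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: m3db-cluster
spec:
  image: quay.io/m3db/m3dbnode:v1.2.0        # placeholder image tag
  replicationFactor: 3
  numberOfShards: 256
  etcdEndpoints:                             # placeholder etcd endpoints
    - http://etcd-0.etcd:2379
    - http://etcd-1.etcd:2379
    - http://etcd-2.etcd:2379
  isolationGroups:
    - name: group1
      numInstances: 1
      nodeAffinityTerms:
        - key: topology.kubernetes.io/zone   # placeholder zone key/values
          values: ["zone-a"]
    - name: group2
      numInstances: 1
      nodeAffinityTerms:
        - key: topology.kubernetes.io/zone
          values: ["zone-b"]
    - name: group3
      numInstances: 1
      nodeAffinityTerms:
        - key: topology.kubernetes.io/zone
          values: ["zone-c"]
  namespaces:
    - name: default                          # live data, unaggregated
      preset: "10s:2d"
    - name: aggregated                       # long-term storage, aggregated
      preset: "1m:40d"
```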
My namespaces are initializing correctly, and I have configured a couple of test Prometheus environments to remote_write to M3DB.

Source Prometheus logs are filled with remote_write errors:
ts=2021-12-28T17:22:17.510Z caller=dedupe.go:112 component=remote level=error remote_name=8ae741 url=https://REDACTED/api/v1/prom/remote/write msg="non-recoverable error" count=3000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: {\"status\":\"error\",\"error\":\"bad_request_errors: count=58, last=datapoint for aggregation too far in past: off_by=10m17.435332333s, timestamp=2021-12-28T17:10:00Z, past_limit=2021-12-28T17:20:17Z, timestamp_unix_nanos=1640711400000000000, past_limit_unix_nanos=1640712017435332333\"}"
ts=2021-12-28T17:24:58.540Z caller=dedupe.go:112 component=remote level=error remote_name=8ae741 url=https://REDACTED/api/v1/prom/remote/write msg="non-recoverable error" count=3000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: {\"status\":\"error\",\"error\":\"bad_request_errors: count=2, last=datapoint for aggregation too far in past: off_by=1m5.057778422s, timestamp=2021-12-28T17:21:53Z, past_limit=2021-12-28T17:22:58Z, timestamp_unix_nanos=1640712113450000000, past_limit_unix_nanos=1640712178507778422\"}"
And on the m3db nodes I get `datapoint for aggregation too far in past` errors, with the datapoints reportedly off by anywhere from a few seconds to about 10 minutes:
{"level":"error","ts":1640712238.5445695,"msg":"write error","rqID":"f13324a3-0126-45cd-b05b-d81b8ddc7a15","remoteAddr":"192.168.4.56:37888","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":1,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=5.067515455s, timestamp=2021-12-28T17:21:53Z, past_limit=2021-12-28T17:21:58Z, timestamp_unix_nanos=1640712113450000000, past_limit_unix_nanos=1640712118517515455"}
{"level":"error","ts":1640712257.5282857,"msg":"write error","rqID":"51c82440-a10a-464d-a701-fe4e29dfa196","remoteAddr":"192.168.92.76:26034","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":59,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=10m17.485669938s, timestamp=2021-12-28T17:12:00Z, past_limit=2021-12-28T17:22:17Z, timestamp_unix_nanos=1640711520000000000, past_limit_unix_nanos=1640712137485669938"}
{"level":"error","ts":1640712267.5369058,"msg":"write error","rqID":"7da4065e-50db-48e4-afcb-3d721cad7d93","remoteAddr":"192.168.55.139:51756","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":2,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=42.383129547s, timestamp=2021-12-28T17:21:45Z, past_limit=2021-12-28T17:22:27Z, timestamp_unix_nanos=1640712105132000000, past_limit_unix_nanos=1640712147515129547"}
{"level":"error","ts":1640712329.0677392,"msg":"write error","rqID":"1a31c55c-b894-4208-a279-4559e9f12ae9","remoteAddr":"192.168.92.76:26034","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":2,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=1m43.902580804s, timestamp=2021-12-28T17:21:45Z, past_limit=2021-12-28T17:23:29Z, timestamp_unix_nanos=1640712105132000000, past_limit_unix_nanos=1640712209034580804"}
{"level":"error","ts":1640712387.462137,"msg":"write error","rqID":"72c66e32-cfe1-4d65-b999-5050cd3c0f93","remoteAddr":"192.168.92.76:26032","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":2,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=2m42.280133072s, timestamp=2021-12-28T17:21:45Z, past_limit=2021-12-28T17:24:27Z, timestamp_unix_nanos=1640712105132000000, past_limit_unix_nanos=1640712267412133072"}
{"level":"error","ts":1640712387.5179935,"msg":"write error","rqID":"c3bf9e0b-b8c0-4871-b61f-bcfcdd45d6b0","remoteAddr":"192.168.92.76:26030","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":1,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=2m42.356698292s, timestamp=2021-12-28T17:21:45Z, past_limit=2021-12-28T17:24:27Z, timestamp_unix_nanos=1640712105132000000, past_limit_unix_nanos=1640712267488698292"}
{"level":"error","ts":1640712387.5640953,"msg":"write error","rqID":"e03314d0-3466-40fa-a968-435cc0094e36","remoteAddr":"192.168.55.139:51756","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":2,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=2m42.410037855s, timestamp=2021-12-28T17:21:45Z, past_limit=2021-12-28T17:24:27Z, timestamp_unix_nanos=1640712105132000000, past_limit_unix_nanos=1640712267542037855"}
{"level":"error","ts":1640712429.2815886,"msg":"write error","rqID":"3204ed52-ece5-45f9-a9ee-203dd75766e6","remoteAddr":"192.168.92.76:26032","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":60,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=10m9.226459202s, timestamp=2021-12-28T17:15:00Z, past_limit=2021-12-28T17:25:09Z, timestamp_unix_nanos=1640711700000000000, past_limit_unix_nanos=1640712309226459202"}
Checking the Prometheus remote_write metrics, it looks like no samples at all are successfully sent.
It doesn't seem to be a throughput issue: max shards is never reached and the remote_write duration p99 is fairly low. There's no large backlog of samples either. I've also verified that there's no time synchronization issue on either the Prometheus sources or the M3DB cluster.
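For completeness, the remote_write side is plain Prometheus pointed at the coordinator endpoint seen in the logs; the block below is only illustrative (the queue_config values are placeholders, not my exact settings):

```yaml
# Illustrative Prometheus remote_write block; queue_config values are placeholders.
remote_write:
  - url: https://REDACTED/api/v1/prom/remote/write
    queue_config:
      min_shards: 1
      max_shards: 200
      max_samples_per_send: 3000   # the count=3000 batches in the logs suggest a value around here
      capacity: 10000
      batch_send_deadline: 5s
```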
I've seen references on GitHub and in Slack to the downsample bufferPastLimits setting (e.g. m3db/m3#2355), but I don't see any way to customize this value with the operator.
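If it helps, my understanding from m3db/m3#2355 is that on a hand-rolled deployment these limits live in the coordinator config under downsample.bufferPastLimits, roughly as sketched below (key names and durations are taken from that discussion, so treat them as assumptions to verify against the coordinator version in use). What I can't find is a supported way to feed this through the operator, short of overriding the whole generated config via the spec's configMapName field, if I understand that field correctly.

```yaml
# Assumed m3coordinator config fragment based on m3db/m3#2355 -- verify the
# key names and pick durations appropriate to your resolutions.
downsample:
  bufferPastLimits:
    - resolution: 0s      # applies to the lowest/unaggregated resolution
      bufferPast: 90s
    - resolution: 30s
      bufferPast: 5m
    - resolution: 1m
      bufferPast: 15m     # would need to cover the ~10m-late datapoints seen above
```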