
Thanos Compactor Fails to Progress, Goes OOM #517

Closed
wleese opened this issue Sep 13, 2018 · 6 comments

@wleese (Contributor) commented Sep 13, 2018

Thanos, Prometheus and Golang version used
thanos, version 0.1.0-rc.2 (branch: master, revision: 8b7169b)

What happened
Compactor runs, uses up 23GB of RAM, gets killed, rinse and repeat

Full logs to relevant components

level=info ts=2018-09-13T08:25:44.302939213Z caller=compact.go:231 msg="starting compact node"
level=info ts=2018-09-13T08:25:44.303181744Z caller=compact.go:126 msg="start sync of metas"
level=info ts=2018-09-13T08:25:44.303183157Z caller=main.go:243 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-09-13T08:25:46.085395594Z caller=compact.go:132 msg="start of GC"
level=info ts=2018-09-13T08:25:46.147804176Z caller=compact.go:171 msg="start first pass of downsampling"
level=info ts=2018-09-13T08:28:28.280610487Z caller=downsample.go:213 msg="downloaded block" id=01CMEKAPKRKB35ST3SJA9Q594B du...

Anything else we need to know

The log messages always seem to reference block 01CMEKAPKRKB35ST3SJA9Q594B.
It takes about 20 minutes from container startup to OOM, which is at 23GB.
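
For anyone triaging: a quick way to confirm the kill is memory-driven and to see the configured limit. This is not part of the original report; the pod name is a placeholder and the monitoring namespace is taken from the kubectl command later in this thread.

# Termination reason of the previous container instance; "OOMKilled" confirms the memory limit was hit.
kubectl -n monitoring get pod <compactor-pod> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Memory limit configured on the compactor container.
kubectl -n monitoring get pod <compactor-pod> \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}'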
@bwplotka (Member) commented:

When does it OOM? After exactly which log line? The logs you provided do not suggest anything.

I suspect this is a duplicate of #297.

Let me know if it's not and I will reopen.

@bwplotka (Member) commented:

Also, running the freshest master can help. Use the tag master-<date>-<sha>
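
A rough sketch of how that swap could look in a Kubernetes setup like the one above; the image repository (improbable/thanos) and the deployment/container names are assumptions, and the tag placeholder is left as given:

# Point the compactor at a recent master build; adjust repository, deployment and
# container names to your setup (these are assumptions, not taken from this thread).
kubectl -n monitoring set image deployment/thanos-compactor-platform-001 \
  thanos-compactor=improbable/thanos:master-<date>-<sha>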

@wleese (Contributor, Author) commented Sep 13, 2018

> When does it OOM? After exactly which log line? The logs you provided do not suggest anything.

Indeed, that was the last log entry. After that, k8s kills the pod.

@bwplotka (Member) commented:

Can you try kubectl logs -p <container>?

@wleese (Contributor, Author) commented Sep 13, 2018

 kubectl logs -f -n monitoring thanos-compactor-platform-001-98b9ff865-gzth7
level=info ts=2018-09-13T12:52:38.295421948Z caller=compact.go:231 msg="starting compact node"
level=info ts=2018-09-13T12:52:38.295661682Z caller=compact.go:126 msg="start sync of metas"
level=info ts=2018-09-13T12:52:38.295833249Z caller=main.go:243 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-09-13T12:52:40.87061162Z caller=compact.go:132 msg="start of GC"
level=info ts=2018-09-13T12:53:24.94102998Z caller=compact.go:347 msg="compact blocks" count=4 mint=1536796800000 maxt=1536825600000 ulid=01CQ9FVKFMP7CKCFCF1T56ZAJC sources="[01CQ8DXZZDS5SQ0M1WEE0A19WE 01CQ8MSQ7DZC8C4K2RG6VM35EA 01CQ8VNEFE431RN7T5ZVD8M8K0 01CQ92H5QPE24M3Z4C4482ZBB0]"
level=info ts=2018-09-13T12:53:44.594828247Z caller=compact.go:767 compactionGroup="0@{prometheus=\"monitoring/platform-001\",prometheus_replica=\"prometheus-platform-001-0\"}" msg="deleting compacted block" old_block=01CQ8DXZZDS5SQ0M1WEE0A19WE result_block=01CQ9FVKFMP7CKCFCF1T56ZAJC
level=info ts=2018-09-13T12:53:44.934144275Z caller=compact.go:767 compactionGroup="0@{prometheus=\"monitoring/platform-001\",prometheus_replica=\"prometheus-platform-001-0\"}" msg="deleting compacted block" old_block=01CQ8MSQ7DZC8C4K2RG6VM35EA result_block=01CQ9FVKFMP7CKCFCF1T56ZAJC
level=info ts=2018-09-13T12:53:45.210870696Z caller=compact.go:767 compactionGroup="0@{prometheus=\"monitoring/platform-001\",prometheus_replica=\"prometheus-platform-001-0\"}" msg="deleting compacted block" old_block=01CQ8VNEFE431RN7T5ZVD8M8K0 result_block=01CQ9FVKFMP7CKCFCF1T56ZAJC
level=info ts=2018-09-13T12:53:45.50357404Z caller=compact.go:767 compactionGroup="0@{prometheus=\"monitoring/platform-001\",prometheus_replica=\"prometheus-platform-001-0\"}" msg="deleting compacted block" old_block=01CQ92H5QPE24M3Z4C4482ZBB0 result_block=01CQ9FVKFMP7CKCFCF1T56ZAJC
level=info ts=2018-09-13T12:54:23.116394883Z caller=compact.go:347 msg="compact blocks" count=4 mint=1536796800000 maxt=1536825600000 ulid=01CQ9FXMBDDWRAN7J2QJES0A0V sources="[01CQ8DXZZWPW71F97TZN6TW7ES 01CQ8MSQ7JW8PN0ZRSM4RQY325 01CQ8VNEFT58WDZF612QP7R99B 01CQ92H5M6ZNTNZBTBKAQP4YJY]"
level=info ts=2018-09-13T12:54:42.299413443Z caller=compact.go:767 compactionGroup="0@{prometheus=\"monitoring/platform-001\",prometheus_replica=\"prometheus-platform-001-1\"}" msg="deleting compacted block" old_block=01CQ8DXZZWPW71F97TZN6TW7ES result_block=01CQ9FXMBDDWRAN7J2QJES0A0V
level=info ts=2018-09-13T12:54:42.559009551Z caller=compact.go:767 compactionGroup="0@{prometheus=\"monitoring/platform-001\",prometheus_replica=\"prometheus-platform-001-1\"}" msg="deleting compacted block" old_block=01CQ8MSQ7JW8PN0ZRSM4RQY325 result_block=01CQ9FXMBDDWRAN7J2QJES0A0V
level=info ts=2018-09-13T12:54:42.837562197Z caller=compact.go:767 compactionGroup="0@{prometheus=\"monitoring/platform-001\",prometheus_replica=\"prometheus-platform-001-1\"}" msg="deleting compacted block" old_block=01CQ8VNEFT58WDZF612QP7R99B result_block=01CQ9FXMBDDWRAN7J2QJES0A0V
level=info ts=2018-09-13T12:54:43.300771064Z caller=compact.go:767 compactionGroup="0@{prometheus=\"monitoring/platform-001\",prometheus_replica=\"prometheus-platform-001-1\"}" msg="deleting compacted block" old_block=01CQ92H5M6ZNTNZBTBKAQP4YJY result_block=01CQ9FXMBDDWRAN7J2QJES0A0V
level=info ts=2018-09-13T12:54:43.540063915Z caller=compact.go:126 msg="start sync of metas"
level=info ts=2018-09-13T12:54:43.75699419Z caller=compact.go:132 msg="start of GC"
level=info ts=2018-09-13T12:54:44.045586745Z caller=compact.go:171 msg="start first pass of downsampling"
level=info ts=2018-09-13T12:57:30.299085098Z caller=downsample.go:213 msg="downloaded block" id=01CMEKAPKRKB35ST3SJA9Q594B duration=2m44.888409649s
rpc error: code = Unknown desc = Error: No such container: 7ac5e68d6062fd02f315b495e807395047569e9a20b2c4385a407a68294c1c21%     

This is reproducible, btw.
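
For what it's worth, since the OOM is reproducible, here is a minimal way to watch heap growth before the kill, assuming the metrics port from the "Listening for metrics" log line and the standard Go runtime metrics exposed by the Prometheus client library (the pod name is a placeholder):

# Forward the compactor's HTTP port (10902 is the address from the log line above).
kubectl -n monitoring port-forward <compactor-pod> 10902:10902 &

# Sample the Go heap every 30s; go_memstats_* come from the default Prometheus Go
# collector, so their presence in this particular build is an assumption.
while true; do
  curl -s localhost:10902/metrics | grep -E '^go_memstats_(alloc|heap_inuse)_bytes '
  sleep 30
done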

@bwplotka (Member) commented:

Nice, so clearly downsampling is not optimized. Let's move the discussion to the original ticket: #297
