Scope stops collecting container metrics #1795

Closed
foot opened this issue Aug 15, 2016 · 18 comments
Labels
bug Broken end user or developer functionality; not working as the developers intended it

Comments

@foot
Contributor

foot commented Aug 15, 2016

In the node-details response, node.metrics.0.samples is null, which the UI doesn't handle very gracefully right now.

Process/Host metrics are still collected.

env: docker-machine

report.2.json.zip
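
To confirm the symptom described above from a saved node-details response, a small check along these lines may help. This is only a sketch in Python (Scope itself is Go and JavaScript): the only structure taken from the report is `node.metrics[i].samples`; the `id` values and the sample payload are hypothetical placeholders.

```python
import json


def metrics_with_null_samples(node_details):
    """Return the indices of metrics whose "samples" field is null or missing."""
    # Tolerate a missing or null "node" / "metrics" entry.
    metrics = (node_details.get("node") or {}).get("metrics") or []
    return [i for i, metric in enumerate(metrics) if metric.get("samples") is None]


# Hypothetical, trimmed-down response; the metric ids and values are made up.
response = json.loads("""
{"node": {"metrics": [
  {"id": "metric_a", "samples": null},
  {"id": "metric_b", "samples": [{"value": 1.5}]}
]}}
""")

print(metrics_with_null_samples(response))  # prints [0]
```

An empty result would suggest the null samples are not in the response being inspected.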

@foot foot added the bug label Aug 15, 2016
@rade
Member

rade commented Aug 15, 2016

what do you mean by "stops"? you see metrics and then you don't?

@foot
Contributor Author

foot commented Aug 15, 2016

Yeah, I was testing out the 0.17.0 release, clicking around. At some point I started getting JS errors. On investigation, it was the sparklines failing to render on the details panel of an alpine container, where there had been sparklines a few minutes earlier.

@foot
Contributor Author

foot commented Aug 15, 2016

After the metrics stopped:

  1. ./scope stop succeeded.
  2. ./scope launch hung.
  3. docker-machine restart weave-1 succeeded; I launched scope again for more testing and metrics are back.

@foot
Contributor Author

foot commented Aug 15, 2016

This did happen last week too. It might be my env.

What command should I try running next time this happens?

  1. docker logs weavescope
  2. ??

@2opremio
Contributor

@foot Could you try to reproduce with 0.16.2 to confirm whether it's a regression?

> What command should I try running next time this happens?

The logs and the report are the first step, yep.

@foot
Contributor Author

foot commented Aug 16, 2016

I can't solidly repro :(. Here are logs from last time it happened.

logs.txt

If no-one has seen this on dev/prod I wouldn't block the release. I'll keep an eye out for it and start running scope w/ --debug.

@2opremio
Contributor

2opremio commented Aug 16, 2016

Uhm, this is most probably the culprit:

<probe> ERRO: 2016/08/15 16:56:00.691200 docker container: error reading event for 5720c372a042f761befd46912841f65b8c621f076d7f7e4c2cbe5a9b56b5d155: only encoded map or array can be decoded into a struct
<probe> ERRO: 2016/08/15 16:56:41.761662 Error gather stats for container: 2b80711ecba7327cec9e2eb3c5195a441f5e16377b5a1940702a3c42e8fc7f68

However, the latter type of message could be a leftover (expected error) from #1687. It's difficult to tell because we don't print the underlying error, but I also get that message for containers which have stopped existing.

What docker version are you running? I recently bumped the docker client library (#1787), maybe that's part of the problem?

Can you check whether the log errors are systematic? (regardless of the UI error)

Also, the logs suggest you were pausing/unpausing containers. I also get the second error when doing so.

In the meantime, I am going to try silencing the expected stats errors.
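
One way to check whether the log errors are systematic is to group the probe's ERRO lines by message, with container IDs masked. The following is a rough sketch in Python, assuming the line format of the two excerpts quoted above; the function name is made up for illustration, and the third sample line is a made-up repeat rather than real log output.

```python
import re
from collections import Counter

# Probe error lines look like: "<probe> ERRO: <date> <time> <message>".
ERROR_LINE = re.compile(r"^<probe> ERRO: \S+ \S+ (?P<message>.*)$")
# Container IDs are 64 hex characters; mask them so identical errors group together.
CONTAINER_ID = re.compile(r"\b[0-9a-f]{64}\b")


def count_probe_errors(lines):
    """Count probe error messages, ignoring which container they refer to."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[CONTAINER_ID.sub("<id>", match.group("message"))] += 1
    return counts


sample = [
    "<probe> ERRO: 2016/08/15 16:56:00.691200 docker container: error reading event for "
    "5720c372a042f761befd46912841f65b8c621f076d7f7e4c2cbe5a9b56b5d155: "
    "only encoded map or array can be decoded into a struct",
    "<probe> ERRO: 2016/08/15 16:56:41.761662 Error gather stats for container: "
    "2b80711ecba7327cec9e2eb3c5195a441f5e16377b5a1940702a3c42e8fc7f68",
    # Hypothetical repeat of the stats error, added only to show grouping.
    "<probe> ERRO: 2016/08/15 16:57:41.000000 Error gather stats for container: "
    "2b80711ecba7327cec9e2eb3c5195a441f5e16377b5a1940702a3c42e8fc7f68",
]

for message, count in count_probe_errors(sample).most_common():
    print(count, message)
```

The same function could be run over the lines of a saved `docker logs weavescope` dump; a message that recurs at a steady rate would point to a systematic error rather than a one-off.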

@2opremio
Contributor

@foot I am starting to suspect that the problem is that metrics are null while the container is paused, which I believe is legitimate. I have managed to reproduce with the following error in the UI:

screen shot 2016-08-16 at 1 46 06 pm

Is that what you saw?

I am worried about the decoding error though. That one I cannot reproduce.

@foot
Contributor Author

foot commented Aug 16, 2016

Yes!!! Nice one @2opremio!

I can't repro immediately, do you have to wait a little after pausing for the metrics to be dropped?

@2opremio
Contributor

> I can't repro immediately, do you have to wait a little after pausing for the metrics to be dropped?

I am not sure, I just paused and unpaused a few times.

I am more worried about the only encoded map or array can be decoded into a struct error though. It can be due to a bug in the ugorji library (which we use for decoding the docker events).

@foot
Contributor Author

foot commented Aug 16, 2016

Still can't repro w/ the pause/unpause. I have been getting a couple of the other error though; what is the id it gives? I can't find any containers/images w/ that id.

@2opremio
Contributor

> what is the id it gives? I can't find any containers/images w/ that id.

Are you referring to the Error gather stats problem? I just fixed it with #1798.

Would you mind rebuilding scope with the latest commits in the release branch?

@foot
Contributor Author

foot commented Aug 16, 2016

Nope, the other one:

<probe> ERRO: 2016/08/15 16:56:00.691200 docker container: error reading event for 5720c372a042f761befd46912841f65b8c621f076d7f7e4c2cbe5a9b56b5d155: only encoded map or array can be decoded into a struct

What is the 5720c372a042f761befd46912841f65b8c621f076d7f7e4c2cbe5a9b56b5d155 ?

@foot
Contributor Author

foot commented Aug 16, 2016

I got the system into a bad state again, by stopping and starting and pausing and restarting a container until docker stopped responding.

  • I get the JS error we've both experienced consistently.
  • But underlying docker seems to be in a bad place too, can't exec into container anymore from command line.
  • docker ps hanging sometimes.
  • sometimes pausing hangs and sometimes it returns w/ "API error (500): Cannot pause container e08c76d956eed562ecc69ece49e8f049e8fda72c0bff75695af8ad92f098587f: rpc error: code = 2 desc = \"container not running\\n\"\n" Even though docker ps reports that it is running.

So it seems that some undefined docker state results in samples: null.

@2opremio
Contributor

> What is the 5720c372a042f761befd46912841f65b8c621f076d7f7e4c2cbe5a9b56b5d155 ?

It's the container ID

@foot
Contributor Author

foot commented Aug 16, 2016

Cool, I couldn't find 5720c372a042f761befd46912841f65b8c621f076d7f7e4c2cbe5a9b56b5d155 in docker ps -a, so maybe that's where the decoding error is coming from: a very short-lived container.

But I guess we should handle null samples somewhere, BE or FE?

It would be cleaner to strip them on the BE, but leaving them in provides some extra info about the state of the system through the http endpoints too, which is nice.
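
The two options in this trade-off can be sketched side by side. This is an illustration in Python, not Scope's actual backend or frontend code; the function name, the `drop_empty` flag and the metric ids are all hypothetical. Replacing null with an empty list keeps the metric visible to clients (preserving the extra state information), while dropping the metric is the "strip" variant.

```python
def normalize_metrics(metrics, drop_empty=False):
    """Return a copy of metrics in which "samples" is never None.

    With drop_empty=False, null samples become an empty list, so the metric
    still shows up but has no data. With drop_empty=True, metrics without
    any samples are removed entirely.
    """
    normalized = []
    for metric in metrics:
        samples = metric.get("samples") or []
        if drop_empty and not samples:
            continue
        # Build a new dict so the caller's data is left untouched.
        normalized.append({**metric, "samples": samples})
    return normalized


# Hypothetical metrics; only the null "samples" case comes from the issue.
metrics = [
    {"id": "metric_a", "samples": None},
    {"id": "metric_b", "samples": [{"value": 1}]},
]

print(normalize_metrics(metrics))
print(normalize_metrics(metrics, drop_empty=True))
```

Either variant would let a renderer iterate over `samples` without a null check; which one fits better depends on whether clients should be able to tell that a metric exists but currently has no data.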

@rade rade added this to the August2016 milestone Aug 24, 2016
@2opremio
Contributor

@foot Have you run into this again?

@foot
Contributor Author

foot commented Sep 13, 2016

Nope I have not, closing for now, will re-open if it comes up again.

@foot foot closed this as completed Sep 13, 2016