Grafana dashboard not displaying any logs in loki panels #761

Closed
candlerb opened this issue Apr 17, 2024 · 8 comments
@candlerb (Contributor) commented Apr 17, 2024

Required information

  • Distribution: Ubuntu
  • Distribution version: 22.04
  • Incus: 6.0-202404040310-ubuntu22.04

Issue description

This is an issue with the Grafana dashboard, https://grafana.com/grafana/dashboards/19727-incus/

The metrics panels work fine, but the Loki panels at the bottom are empty. This is because their LogQL queries are wrong.

They have {app="incus",type="lifecycle",instance="$job"} for the first panel, and {app="incus",type="logging",instance="$job"} for the second.

However, the "instance" label in logs doesn't contain the job name or the container name. Also, the $job variable in Grafana is the Prometheus scrape job name and has nothing to do with Loki logs.

Here are some example logs:

# logcli query '{app="incus"}' --since=1h --limit=2000
2024/04/17 08:18:06 http://localhost:3100/loki/api/v1/query_range?direction=BACKWARD&end=1713341886833710643&limit=1000&query=%7Bapp%3D%22incus%22%7D&start=1713338286833710643
2024/04/17 08:18:06 Common labels: {app="incus", instance="none", location="none"}
2024-04-17T07:57:25Z {name="netbox4", project="default", type="lifecycle"} requester-username="root" action="instance-started" source="/1.0/instances/netbox4" requester-address="@" requester-protocol="unix" instance-started
2024-04-17T07:57:25Z {type="logging"}                                      context-action="start" context-created="2024-04-03 13:50:03.830202559 +0000 UTC" context-ephemeral="false" context-instance="netbox4" context-instanceType="container" context-project="default" context-stateful="false" context-used="2024-04-03 13:50:09.473557693 +0000 UTC" level="info" Started instance
2024-04-17T07:57:24Z {type="logging"}                                      context-action="start" context-created="2024-04-03 13:50:03.830202559 +0000 UTC" context-ephemeral="false" context-instance="netbox4" context-instanceType="container" context-project="default" context-stateful="false" context-used="2024-04-03 13:50:09.473557693 +0000 UTC" level="info" Starting instance
2024-04-17T07:57:23Z {name="netbox4", project="default", type="lifecycle"} requester-protocol="unix" requester-username="root" action="instance-shutdown" source="/1.0/instances/netbox4" requester-address="@" instance-shutdown
2024-04-17T07:57:20Z {type="logging"}                                      context-action="shutdown" context-created="2024-04-03 13:50:03.830202559 +0000 UTC" context-ephemeral="false" context-instance="netbox4" context-instanceType="container" context-project="default" context-timeout="10m0s" context-used="2024-04-03 13:50:09.473557693 +0000 UTC" level="info" Shutting down instance
2024-04-17T07:55:50Z {name="nfsen", project="default", type="lifecycle"}   source="/1.0/instances/nfsen" context-command="[su -l]" action="instance-exec" instance-exec
2024-04-17T07:41:45Z {type="logging"}                                      level="info" Done updating images
2024-04-17T07:41:45Z {type="logging"}                                      context-err="Failed getting remote image info: Failed getting image: The requested image couldn't be found" context-fingerprint="d9802562f513aa78eb45bfb68a47194b144da9a08639e6fcf8c54f9b261e3c65" level="error" Failed to update the image
2024-04-17T07:41:44Z {type="logging"}                                      context-err="Failed getting remote image info: Failed getting image: The requested image couldn't be found" context-fingerprint="c533845b5db1747674ee915cbb20df6eb47c953bb7caf1fec5b35ae9ccf98c18" level="error" Failed to update the image
2024-04-17T07:41:41Z {type="logging"}                                      level="info" Done pruning expired backups
2024-04-17T07:41:41Z {type="logging"}                                      level="info" Pruning expired backups
2024-04-17T07:41:41Z {type="logging"}                                      level="info" Updating images
2024-04-17T07:40:43Z {type="logging"}                                      level="info" Done updating images
2024-04-17T07:40:43Z {type="logging"}                                      context-err="Failed getting remote image info: Failed getting image: The requested image couldn't be found" context-fingerprint="d9802562f513aa78eb45bfb68a47194b144da9a08639e6fcf8c54f9b261e3c65" level="error" Failed to update the image
2024-04-17T07:40:42Z {type="logging"}                                      level="info" Done pruning expired backups
2024-04-17T07:40:42Z {type="logging"}                                      level="info" Updating images
2024-04-17T07:40:42Z {type="logging"}                                      level="info" Pruning expired backups
2024-04-17T07:35:46Z {type="logging"}                                      level="info" Done pruning expired backups
2024-04-17T07:35:46Z {type="logging"}                                      level="info" Done updating images
2024-04-17T07:35:46Z {type="logging"}                                      level="info" Updating images
2024-04-17T07:35:46Z {type="logging"}                                      level="info" Pruning expired backups
2024/04/17 08:18:06 http://localhost:3100/loki/api/v1/query_range?direction=BACKWARD&end=1713339346929598024&limit=1000&query=%7Bapp%3D%22incus%22%7D&start=1713338286833710643
2024/04/17 08:18:06 Common labels: {app="incus", instance="none", location="none", type="logging"}
#

Notice how some logs have the name of the container as a label (name="netbox4"), but some other logs relating to this container don't. They may have it buried in the logfmt data though, e.g. context-instance="nfsen" or source="/1.0/instances/nfsen". If you filter logs by container, you still want to see those logs.
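To illustrate what "buried in the logfmt data" means, here is a minimal Python sketch (not Loki's actual parser; it ignores unquoted values, escaped quotes, and the trailing free-text message) that extracts the key="value" pairs from such a line:

```python
import re

def parse_logfmt(line: str) -> dict:
    # Grab key="value" pairs; real logfmt also allows unquoted values
    # and escape sequences, which this sketch ignores.
    return dict(re.findall(r'([\w-]+)="([^"]*)"', line))

line = ('context-action="start" context-instance="netbox4" '
        'context-project="default" level="info" Started instance')
fields = parse_logfmt(line)
print(fields["context-instance"])  # netbox4
```

So even when the stream has no name label, the container name is still recoverable from fields like context-instance.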

I propose that at simplest, the queries need to change to:

# option 1 (simple)
{app="incus", type="lifecycle", name=~"|$name", project=~"|$project"}
{app="incus", type="logging", name=~"|$name", project=~"|$project"}

The vertical bar inside the regexp is because the "name" and "project" labels may be missing (even for logs specific to one container), so we must allow through lines where this label is missing.
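In Python terms (Loki anchors the =~ matcher to the full label value, like re.fullmatch), with $name expanded to the hypothetical container name netbox4:

```python
import re

# "|netbox4" has an empty first alternative, so an absent label
# (which Loki treats as the empty string) also matches.
pattern = "|netbox4"  # what name=~"|$name" expands to for $name=netbox4
assert re.fullmatch(pattern, "")           # label missing: line kept
assert re.fullmatch(pattern, "netbox4")    # selected container: line kept
assert not re.fullmatch(pattern, "nfsen")  # other container: line dropped
```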

However, that will also show logs for other containers, when those logs have no name or project label. A bit of additional filtering can ensure that the container name appears somewhere in the log line:

# option 2 (more selective)
{app="incus", type="lifecycle", name=~"|$name", project=~"|$project"} |~ "$name"
{app="incus", type="logging", name=~"|$name", project=~"|$project"} |~ "$name"
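The |~ filter is an unanchored regex match over the whole log line, which is both the benefit and the caveat here. A quick Python illustration (container names are hypothetical):

```python
import re

# |~ "$name" keeps any line that mentions the name anywhere, so lines
# that only carry the container in the logfmt data still pass.
line = 'context-instance="netbox4" level="info" Started instance'
assert re.search("netbox4", line)  # kept when $name=netbox4
# The catch: a name that is a prefix of another name also matches.
assert re.search("netbox", line)   # "netbox" would also keep this line
```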

This now works as expected:

(screenshots: Loki panels now showing logs)

However, if one container name is a prefix of another container name, or two projects have containers with the same name, it may show some logs for another container. A more sophisticated filter is possible:

# option 3 (complex)
{app="incus", type="lifecycle", name=~"|$name", project=~"|$project"}  | logfmt | context_instance=~"|$name" | context_project=~"|$project"
{app="incus", type="logging", name=~"|$name", project=~"|$project"}  | logfmt | context_instance=~"|$name"  | context_project=~"|$project"

This assumes that every log relating to container X either has label name="X" or the log message contains context_instance="X". (Note that hyphens in logfmt attributes are converted to underscores, so that they become valid LogQL label names)
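A rough sketch of that key sanitization (not Loki's exact implementation, which also has rules for things like leading digits):

```python
import re

def sanitize_label(key: str) -> str:
    # Replace every character that is not valid in a LogQL label name
    # with "_"; this is why context-instance is queried as context_instance.
    return re.sub(r"[^A-Za-z0-9_]", "_", key)

print(sanitize_label("context-instance"))  # context_instance
```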

This is true for the logs shown above. In fact, in these examples the lifecycle logs all have name="X",project="Y" and the event logs have context_instance="X",context_project="Y", so the queries can simplify to:

# option 4 (final)
{app="incus", type="lifecycle", name=~"|$name", project=~"|$project"}
{app="incus", type="logging"}  | logfmt | context_instance=~"|$name" | context_project=~"|$project"

I've tried this and it works for me. However, I'm not sure whether those conditions hold in general for all possible logs from Incus. It could be argued that this hard-codes too much information about the log attributes.

Steps to reproduce

  1. Install loki
  2. On incus host, turn on Loki logging: incus config set loki.api.url=http://loki.example.net:3100
  3. Start and stop a container, check it creates some logs in Loki
  4. Install grafana, add Loki as data source. Use "Explore" to browse Loki logs and check you can see the incus logs.
  5. Install the incus grafana dashboard, 19727
  6. Open the incus dashboard, find that no Loki logs are visible
  7. Edit panels and adjust the queries to those given above; logs appear
  8. Try changing the selected project and/or container, and check that logs are filtered appropriately.

Cross-reference

Issue appears to be inherited from lxd dashboard, raised previously: canonical/lxd#13165

@stgraber (Member)

You need to set loki.instance to match the name of your prometheus job as that's what we get through the dropdown at the top of the dashboard.

@candlerb (Contributor, Author)

But if you do that, you lose the visibility of which host each log message came from. If your prometheus job for scraping metrics is called "incus" then you'll have to set loki.instance="incus" on all nodes, and then all logs will just say instance="incus". That isn't very useful.

Whereas for metrics, instance="XXX" tells you the host which is scraped, and therefore which host the container is running on. This is true even for clusters according to the documentation:

In a cluster environment, Incus returns only the values for instances running on the server that is being accessed. Therefore, you must scrape each cluster member separately.

It would be strange if metrics had instance="nuc1", instance="nuc2", instance="nuc3" but logs all had instance="incus".

@stgraber (Member)

You don't, the location field still tells you where things are.

@stgraber (Member)

(screenshot)

@candlerb (Contributor, Author)

candlerb commented Apr 17, 2024

I get location="none" in my loki logs (you can see it in the examples above), and I don't see a way to override it. Unless you mean I should use loki.labels to set it?

I also get instance="none" when loki.instance is unset (reported separately at #762). Documentation does say it should default to the hostname, which implies it should work like a prometheus instance label.

@stgraber (Member)

Yeah, those two fields seem wrong in the non-clustered case: both should default to your hostname, and instance should be overridable through loki.instance.

@stgraber (Member)

From your suggested changes above, I'll be taking the project=~"|$project" part, as that makes sense and works well.

The $name part doesn't work because when you select All, it will pass a regexp of all individual instances through the filter rather than passing an empty string. So in my case it means passing several thousand instance names through, which seriously impacts Loki, and it also means that any log or lifecycle event which doesn't have a name field set won't work.

Also name in a Loki entry doesn't necessarily mean instance name, if the event is network or storage related, it may refer to a network or storage pool. So I think we need to stay away from filtering based on instances for now.

@candlerb (Contributor, Author)

The $name part doesn't work because when you select All, it will pass a regexp of all individual instances through the filter rather than passing an empty string. So in my case it means passing several thousand instance names through, which seriously impacts Loki, and it also means that any log or lifecycle event which doesn't have a name field set won't work.

Oops, I had hoped it would do something sensible like empty string or ".*".

However, apparently there is a "custom all value" setting for dashboard variables.
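For reference, the custom all value is set per dashboard variable; in the dashboard JSON the relevant fragment would look roughly like this (field names from Grafana's dashboard variable model; the variable name and value here are illustrative):

```json
{
  "name": "name",
  "includeAll": true,
  "allValue": ".*"
}
```

With allValue set to ".*", selecting All expands $name to a match-everything regex instead of an enumerated list of every instance name.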

Also name in a Loki entry doesn't necessarily mean instance name, if the event is network or storage related, it may refer to a network or storage pool. So I think we need to stay away from filtering based on instances for now.

OK, fair enough - although if "All" did match .* then it would be OK.

stgraber added a commit to stgraber/incus that referenced this issue Apr 17, 2024
@tych0 tych0 closed this as completed in 9d31814 Apr 17, 2024
simondeziel pushed a commit to simondeziel/lxd that referenced this issue May 3, 2024
Closes canonical#13165 and lxc/incus#761

Signed-off-by: Stéphane Graber <[email protected]>
(cherry picked from commit 9d31814a93669e38d6b6a2a8215175f546f582d1)
Signed-off-by: Simon Deziel <[email protected]>
License: Apache-2.0
stgraber added a commit that referenced this issue May 27, 2024
tomponline pushed a commit to tomponline/lxd that referenced this issue Jun 6, 2024