Loki queries not split across queriers #9195
Comments
```yaml
querier:
  max_concurrent: 5
```

This configuration should help you get a more balanced load. This is a scheduling problem. After seeing this config, I believe you can understand what is going on.
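For illustration, here is a minimal sketch of where that setting sits in a Loki config, assuming SSD mode; the `frontend_address` value is a placeholder for this example, not something taken from this thread:

```yaml
# Sketch only, not a complete config: querier concurrency plus worker wiring.
querier:
  # How many (sub-)queries a single querier runs in parallel.
  max_concurrent: 5

frontend_worker:
  # Placeholder gRPC address of the query-frontend; in SSD mode this is
  # normally discovered automatically rather than set by hand.
  frontend_address: loki-read-headless.loki.svc.cluster.local:9095
```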
@liguozhong thanks for your feedback here. While waiting for a response from the community, I tried a few more things including what you suggested:
Unfortunately …
You set …
Any updates @gtorre on solving your issue? I'm struggling with the same thing and any insights you can share would be valuable! :)
@luddskunk no progress unfortunately, we're now looking at alternatives.
@gtorre sorry for the delay in the response. I was looking at the issue and found the root cause for it. I am looking at a temporary fix for this with config changes until the permanent fix is out. I will keep posting the progress here.
@sandeepsukhani Thanks for letting us know, and for getting back to it. Do you have any details on the root cause to share?
The issue is in the code for initializing …
@sandeepsukhani It would be interesting if we could explicitly set this somewhere in the Helm chart values file or similar.
Thanks @sandeepsukhani! Please keep us posted 🤞
…-scheduler replicas (#9477)

**What this PR does / why we need it**:
Currently, we have a bug in our code when running Loki in SSD mode and using the ring for query-scheduler discovery. It causes queries to not be distributed to all the available read pods. I have explained the issue in detail in [the PR which fixes the code](#9471). Since this bug causes a major query performance impact and a code release might take time, in this PR we are doing a new helm release which fixes the issue by using the k8s service for discovering `query-scheduler` replicas.

**Which issue(s) this PR fixes**:
Fixes #9195
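As a rough illustration of the workaround described above (discovering the scheduler via a Kubernetes Service instead of the ring), a minimal sketch follows; the service DNS name and port are placeholders and are not taken from the chart:

```yaml
# Sketch only: explicit query-scheduler discovery via a k8s Service.
frontend:
  # The query-frontend hands queries to this scheduler.
  scheduler_address: query-scheduler.loki.svc.cluster.local:9095

frontend_worker:
  # Querier workers pull work from the same scheduler address.
  scheduler_address: query-scheduler.loki.svc.cluster.local:9095

query_scheduler:
  # With an explicit address configured, ring-based discovery is not used.
  use_scheduler_ring: false
```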
Hey Folks, we have released a new helm chart version, i.e. 5.5.3, which should fix the issue. Please reopen this issue if you folks still face any problems with query distribution to all the available …
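For anyone consuming the chart as a dependency, a minimal sketch of pinning the fixed version in a `Chart.yaml`, assuming the `grafana/loki` chart from the public Grafana Helm repository (the parent chart name here is hypothetical):

```yaml
# Sketch only: pinning the chart version that contains the discovery fix.
apiVersion: v2
name: my-loki-stack   # hypothetical parent chart
version: 0.1.0
dependencies:
  - name: loki
    version: 5.5.3
    repository: https://grafana.github.io/helm-charts
```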
Just wanted to let you know I tried it out and I see a dramatic increase in query performance. Thanks a lot for your efforts here @sandeepsukhani!
**What this PR does / why we need it**:
When we run the `query-scheduler` in `ring` mode, `queriers` and `query-frontend` discover the available `query-scheduler` instances using the ring. However, we have a problem when `query-schedulers` are not running in the same process as the queriers and query-frontend, since [we try to get the ring client interface from the scheduler instance](https://github.com/grafana/loki/blob/abd6131bba18db7f3575241c5e6dc4eed879fbc0/pkg/loki/modules.go#L358). This causes queries not to be spread across all the available queriers when running in SSD mode because [we point querier workers to the query-frontend when there is no ring client and no scheduler address configured](https://github.com/grafana/loki/blob/b05f4fced305800b32641ae84e3bed5f1794fa7d/pkg/querier/worker_service.go#L115).

I have fixed this issue by adding a new hidden target to initialize the ring client in `reader`/`member` mode based on which service is initializing it. `reader` mode will be used by `queriers` and `query-frontend` for discovering `query-scheduler` instances from the ring. `member` mode will be used by `query-schedulers` for registering themselves in the ring.

I have also made a couple of changes not directly related to the issue, but they fix some problems:

* [reset metric registry for each integration test](18c4fe5) - Previously we were reusing the same registry for all the tests and just [ignored the attempts to register the same metrics](https://github.com/grafana/loki/blob/01f0ded7fcb57e3a7b26ffc1e8e3abf04a403825/integration/cluster/cluster.go#L113). This caused the registry to have metrics registered only from the first test, so any updates from subsequent tests would not be reflected in the metrics. Metrics were the only reliable way for me to verify that `query-schedulers` were connected to `queriers` and `query-frontend` when running in ring mode in the integration test that I added to test my changes. This should also help with other tests where it was previously hard to reliably check the metrics.
* [load config from cli as well before applying dynamic config](f9e2448) - Previously we were applying dynamic config considering just the config from the config file. This resulted in unexpected config changes; for example, [this config change](https://github.com/grafana/loki/blob/4148dd2c51cb827ec3889298508b95ec7731e7fd/integration/loki_micro_services_test.go#L66) was getting ignored and [dynamic config tuning was unexpectedly turning on ring mode](https://github.com/grafana/loki/blob/52cd0a39b8266564352c61ab9b845ab597008770/pkg/loki/config_wrapper.go#L94) in the config. It is better to do any config tuning based on both the config file and CLI args.

**Which issue(s) this PR fixes**:
Fixes #9195
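For context, a minimal sketch of what ring-based `query-scheduler` discovery looks like in a config, assuming memberlist as the ring KV store; the service name is a placeholder and the values are not taken from this PR:

```yaml
# Sketch only: query-scheduler discovery via the ring.
query_scheduler:
  # Schedulers register in the ring; queriers and the query-frontend
  # look them up there instead of using a fixed address.
  use_scheduler_ring: true

common:
  ring:
    kvstore:
      store: memberlist   # ring state shared over memberlist gossip

memberlist:
  join_members:
    - loki-memberlist.loki.svc.cluster.local:7946   # placeholder service
```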
Hi! As a result of 79b876b, the configuration for SSD mode is gone...
TL;DR: How do I split queries across all of the available `loki-read` pods (queriers)? Do I need to set up the query-frontend as a separate deployment per the docs, or is there something I'm missing in our config?

Hello all! I've recently set up Loki in our environment; we ship logs from 200K+ assets to Loki every 30 minutes or so via the Loki push API. We have two labels, and Loki seems pretty happy right now in regards to writes (our backend is S3). The problem is querying logs. When I search a few hours back, queries are pretty snappy. However, we need to look back as far as 30 days, and this is where we're seeing issues. I believe queries are not being split up and load balanced across the available `loki-read` pods (queriers).

We're using Loki in single binary mode with `loki-read` and `loki-write` pods after reading this blog post. There are quite a few Helm charts, but it seems the developers are saying to use the official `grafana/loki` chart. What I'm confused about is that even though the recommendation seems to be to use the `grafana/loki` chart, which creates two statefulsets and a deployment for the gateway, I still see the query-frontend being mentioned in the official docs, which makes it seem like we should be setting up a separate instance of the `query-frontend`. However, when I hit the `/services` endpoint on the `loki-read` pods, I can see that the `query-frontend` service is already running. 🤔

Here is our config currently:

When we query for two days of logs, we get this error:

During this query ☝️ some of the queriers weren't doing anything at all:

In addition, when I look through the logs, I would expect each querier to perform a small query different from other queriers, i.e. I should see different values for `start_delta` and `end_delta` in each query, but I see the same exact queries being performed by the queriers.

Finally, here are our `loki-read` pod resources:
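Not part of the original report, but for readers with the same symptom: a minimal sketch of the settings that control how a long time range is split and parallelised, with illustrative values only:

```yaml
# Sketch only: query splitting and parallelism knobs (illustrative values).
limits_config:
  # A long-range query is cut into sub-queries of this interval.
  split_queries_by_interval: 30m
  # Upper bound on sub-queries a tenant can run in parallel.
  max_query_parallelism: 32

query_range:
  # Shard sub-queries further so they spread across more queriers.
  parallelise_shardable_queries: true

querier:
  # Sub-queries each querier processes concurrently.
  max_concurrent: 4
```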