Missing logs in query when using boltdb-shipper on local filesystem #3214

Closed
woodytec opened this issue Jan 22, 2021 · 8 comments

@woodytec

I noticed that some queries did not return any output for some time ranges. After restarting Loki, the exact same query and time range do return results.

Steps to reproduce the behavior:

  1. Loki version : version=2.0.0, branch=HEAD, revision=6978ee5d
  2. Logs are shipped from Vector to Loki (a minimal, hypothetical sink sketch is shown right after this list)
  3. Whatever the query is, some data is missing on some time ranges before restarting Loki
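
For reference, a purely illustrative sketch of the Vector side of step 2 (the source name, file paths, and labels are hypothetical; only the endpoint matches the Loki config below):

sources:
  app_logs:
    type: file
    include:
      - "C:\\logs\\*.log"

sinks:
  loki:
    type: loki
    inputs: ["app_logs"]
    endpoint: "http://127.0.0.1:30631"   # matches http_listen_port below
    encoding:
      codec: json
    labels:
      job: "vector"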

Loki configuration:

auth_enabled: false

server:
  http_listen_port: 30631

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1m 
  chunk_retain_period: 30s

schema_config:
  configs:
  - from: 2020-11-13
    store: boltdb-shipper
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: C:\ProgramData\loki\boltdb-shipper-active
    cache_location: C:\ProgramData\loki\boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: C:\ProgramData\loki\data

compactor:
  working_directory: C:\ProgramData\loki\boltdb-shipper-compactor
  shared_store: filesystem

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

Expected behavior
The query should return results, but it does not, as if there were no logs stored. After a restart, I finally see all the logs without any gap, which means the logs are in fact stored in Loki!

Environment:

  • Infrastructure: Loki is running in monolithic mode on Windows 10
  • Deployment: Deploying Loki through the executable

Loki logs:
Having a look at Loki's logs gave me some information; however, I have no idea what causes it, nor whether it is linked to the bug I'm reporting:

level=error ts=2021-01-20T11:09:59.6501308Z caller=table_manager.go:81 msg="error syncing local boltdb files with storage" err="rename C:\\ProgramData\\loki\\boltdb-shipper-cache/index_18647/R640-VM-HA01-1611138599617813500-1611138750.gz.temp C:\\ProgramData\\loki\\boltdb-shipper-cache/index_18647/R640-VM-HA01-1611138599617813500-1611138750.gz: The process cannot access the file because it is being used by another process."

I also get this type of error:

level=error ts=2021-01-22T08:29:59.7459788Z caller=table.go:235 msg="failed to remove file" path="C:\\ProgramData\\Woody technologies\\Woody in2it\\loki\\boltdb-shipper-compactor\\index_18649\\R640-VM-HA01-1611138599617813500-1611301500.gz" err="remove C:\\ProgramData\\Woody technologies\\Woody in2it\\loki\\boltdb-shipper-compactor\\index_18649\\R640-VM-HA01-1611138599617813500-1611301500.gz: The process cannot access the file because it is being used by another process."

I'm not able to find any process using these files, except for Loki itself, which is running with admin privileges.

Any idea where this could come from?

@woodytec (Author)

Hey, I had time to investigate a bit further. I plotted the log rate in Grafana and here is the output:
Before restarting Loki:
[screenshot: Loki_before_restart]

and after restarting it:
[screenshot: loki_after_restart]

We can observe huge gaps in the first panel... Hope this helps in understanding the matter.

pcbl commented Mar 17, 2021

Hi!
I am facing the same issue. The logs "disappear" until I restart Loki... Only then can I see all the logs again...

@woodytec, did you find anything or have any news from your end since your last answer?

Will keep you posted if I find something. I also get a similar error to yours:

failed to remove file" path=C:\ProgramData\loki\compactor\index_18703\EC2AMAZ-J742J6O-1615856721972207000-1615980600.gz err="remove C:\ProgramData\loki\compactor\index_18703\EC2AMAZ-J742J6O-1615856721972207000-1615980600.gz: The process cannot access the file because it is being used by another process.

BR
Pedro

pcbl commented Mar 17, 2021

A sample screenshot with the line counts before (15) and after (22) the Loki restart:

[screenshot]

pcbl commented Mar 17, 2021

It could be that it was solved by this PR: #3300

@woodytec (Author)

Hi!

Glad to hear I'm not the only one to face this issue :)

The workaround I found was to go back to plain boltdb storage on the local filesystem without using boltdb-shipper, which gives, in terms of configuration:

storage_config:
  boltdb:
    directory: '/path'
  filesystem:
    directory: '/otherpath'
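
Presumably the schema_config then also needs an entry pointing at the plain boltdb store (added as a new entry with a later from date rather than by editing the existing one); a minimal sketch, with an illustrative date and index period:

schema_config:
  configs:
  - from: 2021-01-23            # illustrative date for the new schema entry
    store: boltdb               # plain boltdb index instead of boltdb-shipper
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 168h              # adjust to your own setup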

I do not face the issue anymore, but I would enjoy going back to boltdb-shipper if the problem is solved!

KR,

pcbl commented Mar 17, 2021

@woodytec, from my end, moving away from boltdb-shipper would bring back some old issues we had (I have already forgotten them, but they were gone once we moved to boltdb-shipper).

I will give it a try with the newest version (2.2, https://github.com/grafana/loki/releases/tag/v2.2.0), which includes PR #3300.

From the PR description, this looks promising:

We encountered that internal canaries were reporting missing logs intermittently.
While the logs were there in the storage they were sometimes not being returned in the query results.
Investigating the queries which had the problem I noticed that we were not retaining dbs long enough to cover the expiry of the cached index by queriers. We need to cover the expiry as well since queriers could keep serving chunks for stale index which could be missing recently flushed chunk ids.
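
For context, in a boltdb-shipper setup the queriers keep a local copy of the downloaded index files and refresh it periodically; the settings that presumably govern this caching are shown below with illustrative values (not a tuning recommendation; the actual fix is included in 2.2):

storage_config:
  boltdb_shipper:
    cache_ttl: 24h          # how long downloaded index files are kept in the local cache
    resync_interval: 5m     # how often they are re-synced from the shared store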

pcbl commented Mar 18, 2021

@woodytec, FYI: so far, everything has worked as expected with v2.2! :-)
I updated it on my production setup and will use it in more places... I think the bug is gone (hopefully! 👍).

stale bot commented Jun 3, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Jun 3, 2021.
The stale bot closed this issue as completed on Jun 11, 2021.