Missing logs in query when using boltdb-shipper on local filesystem #3214

Closed
woodytec opened this issue Jan 22, 2021 · 8 comments

@woodytec

I noticed that some queries did not return any output for some time ranges. After restarting Loki, the exact same query and time range do return results.

Steps to reproduce the behavior:

  1. Loki version : version=2.0.0, branch=HEAD, revision=6978ee5d
  2. Logs are shipped from Vector to Loki (a minimal, hypothetical sink sketch is shown right after this list)
  3. Whatever the query is, some data is missing on some time ranges before restarting Loki
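
For reference, a purely illustrative sketch of the Vector side of step 2 (the source name, file paths, and labels are hypothetical; only the endpoint matches the Loki config below):

sources:
  app_logs:
    type: file
    include:
      - "C:\\logs\\*.log"

sinks:
  loki:
    type: loki
    inputs: ["app_logs"]
    endpoint: "http://127.0.0.1:30631"   # matches http_listen_port below
    encoding:
      codec: json
    labels:
      job: "vector"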

Loki configuration:

auth_enabled: false

server:
  http_listen_port: 30631

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1m 
  chunk_retain_period: 30s

schema_config:
  configs:
  - from: 2020-11-13
    store: boltdb-shipper
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: C:\ProgramData\loki\boltdb-shipper-active
    cache_location: C:\ProgramData\loki\boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: C:\ProgramData\loki\data

compactor:
  working_directory: C:\ProgramData\loki\boltdb-shipper-compactor
  shared_store: filesystem

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

Expected behavior
The query should return results, but it does not, as if there were no logs stored. After a restart, I finally see all the logs without any gap, which means the logs are in fact stored in Loki!

Environment:

  • Infrastructure: Loki is running in monolithic mode on Windows 10
  • Deployment: Deploying Loki through the executable

Loki logs:
Having a look at Loki's logs gave me some information; however, I have no idea what causes it, nor whether it is linked to the bug I'm reporting:

level=error ts=2021-01-20T11:09:59.6501308Z caller=table_manager.go:81 msg="error syncing local boltdb files with storage" err="rename C:\\ProgramData\\loki\\boltdb-shipper-cache/index_18647/R640-VM-HA01-1611138599617813500-1611138750.gz.temp C:\\ProgramData\\loki\\boltdb-shipper-cache/index_18647/R640-VM-HA01-1611138599617813500-1611138750.gz: The process cannot access the file because it is being used by another process."

I also get this type of error:

level=error ts=2021-01-22T08:29:59.7459788Z caller=table.go:235 msg="failed to remove file" path="C:\\ProgramData\\Woody technologies\\Woody in2it\\loki\\boltdb-shipper-compactor\\index_18649\\R640-VM-HA01-1611138599617813500-1611301500.gz" err="remove C:\\ProgramData\\Woody technologies\\Woody in2it\\loki\\boltdb-shipper-compactor\\index_18649\\R640-VM-HA01-1611138599617813500-1611301500.gz: The process cannot access the file because it is being used by another process."

I'm not able to find any process using these files, except for Loki itself, which is running with admin privileges.

Any idea where this could come from?

@woodytec (Author)

Hey, I had time to investigate a bit further. I plotted the log rate in Grafana and here is the output:
Before restarting Loki:
[screenshot: Loki_before_restart]

and after restarting it:
[screenshot: loki_after_restart]

We can observe huge gaps in the first panel... Hope this helps in understanding the matter.

pcbl commented Mar 17, 2021

Hi!
I am facing the same issue. The logs "disappear" until I restart Loki... Only then can I see all the logs again...

@woodytec, did you find anything or have any news from your end since your last answer?

Will keep you posted if I find something. I also get a similar error to yours:

failed to remove file" path=C:\ProgramData\loki\compactor\index_18703\EC2AMAZ-J742J6O-1615856721972207000-1615980600.gz err="remove C:\ProgramData\loki\compactor\index_18703\EC2AMAZ-J742J6O-1615856721972207000-1615980600.gz: The process cannot access the file because it is being used by another process.

BR
Pedro

pcbl commented Mar 17, 2021

A sample screenshot with the line counts before (15) and after (22) the Loki restart:

[screenshot]

pcbl commented Mar 17, 2021

It could be that it was solved by this PR: #3300

@woodytec (Author)

Hi!

Glad to hear I'm not the only one to face this issue :)

The workaround I found was to go back to plain boltdb storage on the local filesystem without using boltdb-shipper, which gives, in terms of configuration:

storage_config:
  boltdb:
    directory: '/path'
  filesystem:
    directory: '/otherpath'
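
Presumably the schema_config then also needs an entry pointing at the plain boltdb store (added as a new entry with a later from date rather than by editing the existing one); a minimal sketch, with an illustrative date and index period:

schema_config:
  configs:
  - from: 2021-01-23            # illustrative date for the new schema entry
    store: boltdb               # plain boltdb index instead of boltdb-shipper
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 168h              # adjust to your own setup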

I do not face the issue anymore, but I would enjoy going back to boltdb-shipper if the problem is solved!

KR,

pcbl commented Mar 17, 2021

@woodytec, from my end, moving away from boltdb-shipper would bring back some old issues we had (I have already forgotten them, but they were gone once we moved to boltdb-shipper).

I will give it a try with the newest version (2.2, https://github.com/grafana/loki/releases/tag/v2.2.0), which includes PR #3300.

From the PR description, this looks promising:

We encountered that internal canaries were reporting missing logs intermittently.
While the logs were there in the storage they were sometimes not being returned in the query results.
Investigating the queries which had the problem I noticed that we were not retaining dbs long enough to cover the expiry of the cached index by queriers. We need to cover the expiry as well since queriers could keep serving chunks for stale index which could be missing recently flushed chunk ids.
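
For context, in a boltdb-shipper setup the queriers keep a local copy of the downloaded index files and refresh it periodically; the settings that presumably govern this caching are shown below with illustrative values (not a tuning recommendation; the actual fix is included in 2.2):

storage_config:
  boltdb_shipper:
    cache_ttl: 24h          # how long downloaded index files are kept in the local cache
    resync_interval: 5m     # how often they are re-synced from the shared store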

pcbl commented Mar 18, 2021

@woodytec, FYI: so far, everything has worked as expected with v2.2! :-)
I updated it on my production setup and will use it in more places... I think the bug is gone (hopefully! 👍).

stale bot commented Jun 3, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Jun 3, 2021.
The stale bot closed this issue as completed on Jun 11, 2021.