
[Bug] Backup stuck forever on v0.18.0, eating up resources #3261

Closed

ghost opened this issue Sep 5, 2023 · 16 comments

Comments

@ghost

ghost commented Sep 5, 2023

I'm no expert on PocketBase or SQLite, so I have no clue what's going on here.

This is something I did not observe on previous versions, but it has been happening 80-90% of the time (though not always!) on 0.18.0.

Taking a backup:

  • gets stuck forever
  • consumes more and more disk space, eventually filling the entire disk
  • drives CPU usage to 100%

Hard resetting the machine or killing the container resets disk usage to normal.

Backups on the instance are configured to save to S3 (confirmed working previously).

There is nothing relevant in stdout/stderr.

@ganigeorgiev
Member

I'm not able to reproduce it. We also have tests that run without error.

How big is your database?
Could you test with an earlier version to rule out a v0.18.0-specific issue?

@ghost
Author

ghost commented Sep 5, 2023

You are right, it is also happening with the previous version (0.17.7).

DB size:

kalmard@instance-1:~$ sudo du -hsc /data/pb_data/*
32K     /data/pb_data/backups
139M    /data/pb_data/data.db
32K     /data/pb_data/data.db-shm
3.1M    /data/pb_data/data.db-wal
160M    /data/pb_data/logs.db
32K     /data/pb_data/logs.db-shm
120M    /data/pb_data/logs.db-wal
420M    total

I admit I have recently been experimenting with Litestream; I now suspect that adding (and then removing) it has caused this issue. Unfortunately I still have no clue how to fix it.

@ganigeorgiev
Member

I don't think this is PocketBase related; most likely it is some misconfiguration with Litestream, but I'm not very familiar with it and I'm not sure how to help you debug it.

Note that when performing PocketBase backups we wrap the archive generation in a transaction, so double check that Litestream isn't holding a lock on the db or doing something else similar.
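
For reference, one quick way to check whether another process is still holding a write lock on the db is a small standalone Go program along these lines (a sketch only, not PocketBase code; it assumes the mattn/go-sqlite3 driver and the /data/pb_data/data.db path from the du output above):

package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// Path taken from the earlier du output; adjust for your own setup.
	db, err := sql.Open("sqlite3", "file:/data/pb_data/data.db?_busy_timeout=1000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// BEGIN IMMEDIATE asks SQLite for a write lock right away; if another
	// process (e.g. Litestream) is holding one, this fails with
	// "database is locked" after the 1s busy timeout.
	if _, err := conn.ExecContext(ctx, "BEGIN IMMEDIATE"); err != nil {
		fmt.Println("could not acquire a write lock:", err)
		return
	}
	conn.ExecContext(ctx, "ROLLBACK")
	fmt.Println("no other process appears to be holding a write lock")
}

If the BEGIN IMMEDIATE fails while PocketBase itself is idle, something else is still attached to the database.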

In any case, if you think that this is a PocketBase issue, please provide more information about your deployment setup and ideally some minimal reproducible steps and I'll have a look.

@ghost
Author

ghost commented Sep 5, 2023

This persists even with Litestream removed, so at the very least it's not holding a lock, right?

@ganigeorgiev
Member

ganigeorgiev commented Sep 5, 2023

I've tested it both locally (Linux) and with S3 (Wasabi free trial) and it works fine for me, both the automated backups and manually initiated ones. My test pb_data is around 500MB.

If you use Docker, ensure that the volume is mounted properly and accessible.

If you use S3, double check your storage configuration and permissions.
It could also be an issue with a vendor's S3 API compatibility (in the past several users reported issues with Cloudflare R2 - #2952, #2068).
To rule out an S3 issue, you can temporarily disable it and test with the local filesystem.

There is something specific to your setup going on, but I'm not really sure how to help you further without more info or at least some reproduction steps.

@ganigeorgiev
Member

Also, I forgot: if you have custom Go/JS hooks, try temporarily commenting them out to rule out a conflicting operation.
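
For context, a typical custom Go hook registration that this refers to looks something like the sketch below (illustrative only; the actual hooks and file layout depend on your own code, this just uses the v0.17/v0.18-era API):

package main

import (
	"log"

	"github.com/pocketbase/pocketbase"
	"github.com/pocketbase/pocketbase/core"
)

func main() {
	app := pocketbase.New()

	// Custom hooks like this one are what to temporarily comment out while
	// debugging, to rule out a conflicting operation during the backup.
	app.OnBeforeServe().Add(func(e *core.ServeEvent) error {
		log.Println("custom hook: server starting")
		return nil
	})

	if err := app.Start(); err != nil {
		log.Fatal(err)
	}
}

Commenting out the Add(...) call (or the whole registration) and rebuilding rules out interference from custom code.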

@ghost
Author

ghost commented Sep 5, 2023

Thank you for the responses, I'm looking into these options. But practically speaking, backups were working fine and nothing has changed besides:

  • upgrading from 0.17.7 to 0.18.0, then downgrading for verification
  • setting up litestream, then disabling it for verification

For the record, I did a manual SQLite VACUUM on the DBs, which brought down their size, but the issue still persists.

@ghost
Author

ghost commented Sep 5, 2023

A few more things I tried with no luck:

  • disabling S3 for backup storage
  • removing all functionality from pocketbase, running only the core app
  • running with --debug to see if there's any useful information

@ganigeorgiev
Member

Again, without information about your deployment setup I'm not really sure how to help you.

Where is PocketBase running and what command are you executing?
Are you able to reproduce it consistently?
Can you reproduce the issue locally?

@ghost
Author

ghost commented Sep 5, 2023

I understand that you can't help without more information; at this point I'm just writing down my process for record keeping. I will try to share more if I can't resolve this otherwise.

Right now it is 100% reproducible on both versions 0.17.7 and 0.18.0. I have not yet tried to reproduce it locally, but I will.

PB is running in a Docker container on a GCP VM; the command to run it is ./backend serve (for 0.18.0, and the old syntax for the previous version).

@ghost
Author

ghost commented Sep 5, 2023

I may have found the issue. I realized there are some leftover Litestream files in pb_data:

kalmard@instance-1:/data/pb_data$ find .
.
./logs.db-shm
./.autocert_cache
find: ‘./.autocert_cache’: Permission denied
./logs.db-wal
./.logs.db-litestream
./.logs.db-litestream/generation
./.logs.db-litestream/generations
./.logs.db-litestream/generations/459f183b154035ce
./.logs.db-litestream/generations/459f183b154035ce/wal
./.logs.db-litestream/generations/459f183b154035ce/wal/00000083.wal
./.logs.db-litestream/generations/459f183b154035ce/wal/00000082.wal
./data.db
./.data.db-litestream
./.data.db-litestream/generation
./.data.db-litestream/generations
./.data.db-litestream/generations/f23109187d440d8a
./.data.db-litestream/generations/f23109187d440d8a/wal
./.data.db-litestream/generations/f23109187d440d8a/wal/00000037.wal
./.data.db-litestream/generations/f23109187d440d8a/wal/00000038.wal
./backups
./backups/pb_backup_20230814172737.zip.attrs
./backups/pb_backup_20230814172737.zip
./data.db-shm
./logs.db
./data.db-wal

The Litestream folders were tiny, <100k each. However, removing them fixed the backup issue (so far). I have now managed to take backups consistently on both tested versions.

I'll try to find the time to reproduce the problem based on this information.
Thanks again for the help!

@ghost
Author

ghost commented Sep 5, 2023

I tried to recreate this issue and failed. The only thing I can think of is that the Litestream files were very large on disk, causing significant load during compression, which basically killed the system. Unfortunately I'm not sure about their actual size.

@ghost
Author

ghost commented Sep 5, 2023

Actually, after more investigation I managed to reproduce the issue locally. Based on this, I believe there is a genuine bug in PocketBase.

A minimal Dockerfile for repro:

FROM alpine:3.18
ADD https://github.com/pocketbase/pocketbase/releases/download/v0.18.0/pocketbase_0.18.0_linux_amd64.zip /tmp/pocketbase.zip
RUN unzip /tmp/pocketbase.zip -d /usr/local/bin && rm /tmp/pocketbase.zip
ADD /tmp /pb_data
CMD ["pocketbase", "serve", "--http=0.0.0.0:8090", "--debug"]

The contents of tmp:

$ find tmp/
tmp/
tmp/.data.db-litestream
tmp/.data.db-litestream/generation
tmp/.data.db-litestream/generations
tmp/.data.db-litestream/generations/b8788b0b2b06c882
tmp/.data.db-litestream/generations/b8788b0b2b06c882/wal
tmp/.data.db-litestream/generations/b8788b0b2b06c882/wal/00000000.wal
tmp/.logs.db-litestream
tmp/.logs.db-litestream/generation
tmp/.logs.db-litestream/generations
tmp/.logs.db-litestream/generations/6940142b5547a8f1
tmp/.logs.db-litestream/generations/6940142b5547a8f1/wal
tmp/.logs.db-litestream/generations/6940142b5547a8f1/wal/00000005.wal
tmp/.logs.db-litestream/generations/6940142b5547a8f1/wal/00000006.wal

File sizes:

$ du -ah tmp/
1.0K    tmp/.data.db-litestream/generation
8.0K    tmp/.data.db-litestream/generations/b8788b0b2b06c882/wal/00000000.wal
8.0K    tmp/.data.db-litestream/generations/b8788b0b2b06c882/wal
8.0K    tmp/.data.db-litestream/generations/b8788b0b2b06c882
8.0K    tmp/.data.db-litestream/generations
9.0K    tmp/.data.db-litestream
1.0K    tmp/.logs.db-litestream/generation
200K    tmp/.logs.db-litestream/generations/6940142b5547a8f1/wal/00000005.wal
196K    tmp/.logs.db-litestream/generations/6940142b5547a8f1/wal/00000006.wal
396K    tmp/.logs.db-litestream/generations/6940142b5547a8f1/wal
396K    tmp/.logs.db-litestream/generations/6940142b5547a8f1
396K    tmp/.logs.db-litestream/generations
397K    tmp/.logs.db-litestream
406K    tmp/

I'm not sure if the exact contents of the files matter. However I'd rather not share them publicly as I'm not sure if they contain anything sensitive. Let me know how to get them to you.

When the server starts up, /pb_data is as expected:

/ # du -hsc /pb_data/
1.2M    /pb_data/

However, once you create an admin user and trigger a backup, the backup keeps running and the temporary backup file in /pb_data grows indefinitely:

$ du -hsc /pb_data/.pb_temp_to_delete/pb_backup_3DPT
5.5G /pb_data/.pb_temp_to_delete/pb_backup_3DPT

@ghost
Author

ghost commented Sep 5, 2023

After playing around with the backup code (https://github.com/pocketbase/pocketbase/blob/f3fc7f78d747c7e8b9c2da74ffe7c7bf73039910/tools/archive/create.go), the explanation is pretty simple: the in-progress backup zip is itself picked up during the iteration over the files to be backed up, creating a recursive backup in which the system tries to back up the current backup archive into itself indefinitely.

I'm not sure why it's triggered in this specific case; I believe it may be due to a race condition depending on the size and number of files being iterated.
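
To illustrate the failure mode outside of PocketBase, here is a minimal, self-contained Go sketch (not the actual backup code) that zips a directory into a file created inside that same directory; without a skip rule, the walk eventually visits the growing archive and tries to copy it into itself:

package main

import (
	"archive/zip"
	"io"
	"log"
	"os"
	"path/filepath"
)

func main() {
	src := "./example_data"                  // hypothetical directory to archive
	dest := filepath.Join(src, "backup.zip") // the archive is created INSIDE the walked directory

	out, err := os.Create(dest)
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	zw := zip.NewWriter(out)
	defer zw.Close()

	err = filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		// With no skip rule, "backup.zip" itself is eventually visited here.
		// Copying it into the archive makes the archive grow while it is being
		// read, so the copy can chase a moving EOF and the output balloons.
		w, err := zw.Create(filepath.Base(path))
		if err != nil {
			return err
		}
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()
		_, err = io.Copy(w, in)
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}

Depending on timing and compression ratios this may eventually terminate, but in practice the output grows far beyond the input, which matches the 5.5G temp file above.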

It can be solved trivially by adding LocalTempDirName to the skipped paths when calling archive.Create(), which currently reads:

return archive.Create(app.DataDir(), tempPath, LocalBackupsDirName)
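
Concretely, assuming archive.Create has the variadic signature Create(src, dest string, skipPaths ...string) error (as it appears to in tools/archive/create.go for v0.18.0), the proposed change would be along the lines of:

// also skip the local temp dir so the in-progress archive can never be
// picked up by the directory walk
return archive.Create(app.DataDir(), tempPath, LocalBackupsDirName, LocalTempDirName)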

@ganigeorgiev
Member

ganigeorgiev commented Sep 6, 2023

Indeed it could be a race condition introduced with v0.17.3 (7a3223e).

I'll investigate it further today, but in any case, as you've noted, LocalTempDirName should definitely be excluded.

@ganigeorgiev
Member

I wasn't able to make it get "stuck" by simulating your directory structure, but it is certainly a bug for the local temp dir to be included in the backups, so that was removed in v0.18.1. Sorry for this.

Feel free to let me know if you stumble on the same issue again and I'll attempt to investigate it further.
