Handle gap in remote xlog sequence #366
Comments
Do you have any details of the cases where this has happened? We upload a very large number of WAL segments each day and recover them frequently as well, and we haven't really run into this. Also, if we actually proceed with some kind of bookkeeping, it needs to be opt-in, since large deployments can upload upwards of 100,000 WAL segments each day and iterating through those could be prohibitively slow.
It happens when:
In this case pghoard cannot recover the missing WAL segments because the server has already deleted them. As for performance, the scan is done only at startup. I ran a test with around 80,000 WAL segments on remote storage; it blocked pghoard startup for about 30 seconds, so with higher traffic it could take much longer. I think it's important to know exactly what is in the remote storage. If pghoard runs for weeks without restarts or errors, it should be fine. I'm not saying that pghoard is unstable, but in some environments such as Docker/Kubernetes, pghoard can be killed by the orchestrator and may not be restarted. Logs can also disappear. In that case, how do you check whether WAL segments are missing? You can check manually with a script or test a restore. So I propose adding these new metrics.
Sometimes pghoard fails to upload some WAL segments. pghoard then restarts and uploads new WAL segments. In this case, there is a gap in the WAL segment sequence on remote storage.
I have implemented some new metrics to detect this case:
In the following example:
To implement these metrics, pghoard needs to list the files uploaded to remote storage and keep this list up to date.
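To illustrate the idea, here is a minimal sketch of gap detection over a list of uploaded WAL file names. It is not pghoard's code: `parse_wal_name` and `find_wal_gaps` are hypothetical names, and the sketch assumes the default 16 MiB segment size on PostgreSQL 9.3 or later (256 segments per log file) and ignores timeline history and `.partial` files.

```python
import re

# 24 uppercase hex digits: timeline (8), log (8), segment (8).
WAL_RE = re.compile(r"^[0-9A-F]{24}$")
# Assumption: default 16 MiB segments, so 0x100 segments per log id.
SEGMENTS_PER_LOG = 0x100


def parse_wal_name(name):
    """Split a WAL file name into (timeline, absolute segment number)."""
    if not WAL_RE.match(name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log * SEGMENTS_PER_LOG + seg


def find_wal_gaps(names):
    """Return (timeline, first_missing, last_missing) for each gap found."""
    by_timeline = {}
    for name in names:
        timeline, segno = parse_wal_name(name)
        by_timeline.setdefault(timeline, set()).add(segno)
    gaps = []
    for timeline in sorted(by_timeline):
        segnos = sorted(by_timeline[timeline])
        # Any jump larger than 1 between neighbours is a run of missing segments.
        for prev, cur in zip(segnos, segnos[1:]):
            if cur - prev > 1:
                gaps.append((timeline, prev + 1, cur - 1))
    return gaps
```

A metric such as "number of gaps" could then be derived from `len(find_wal_gaps(names))`, where `names` is the listing obtained from remote storage at startup and updated after each upload.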
Scanning remote storage (only on startup) can also improve cleanup: when there is a gap in the WAL sequence, pghoard stops deleting obsolete WAL segments. With the list of uploaded files, we can delete all WAL segments older than the first basebackup, even when there is a gap.
https://github.com/aiven/pghoard/blob/master/pghoard/pghoard.py#L255-L258
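A list-based cleanup as proposed above could look roughly like the sketch below. This is only an illustration, not the code at the link: `wal_position` and `segments_to_delete` are hypothetical helpers, the 16 MiB segment size is assumed, and only segments on the same timeline as the basebackup are considered.

```python
# Assumption: default 16 MiB segments, so 0x100 segments per log id.
SEGMENTS_PER_LOG = 0x100


def wal_position(name):
    """Return (timeline, absolute segment number) for a 24-hex-digit WAL name."""
    timeline = int(name[0:8], 16)
    segno = int(name[8:16], 16) * SEGMENTS_PER_LOG + int(name[16:24], 16)
    return timeline, segno


def segments_to_delete(uploaded_names, first_backup_start_wal):
    """Select uploaded segments that precede the first basebackup's start segment.

    Works from the full listing, so a gap in the sequence does not stop the
    selection the way a walk backwards from the start segment would.
    """
    timeline, start_segno = wal_position(first_backup_start_wal)
    selected = []
    for name in uploaded_names:
        name_timeline, segno = wal_position(name)
        if name_timeline == timeline and segno < start_segno:
            selected.append(name)
    return sorted(selected)
```

For example, with segments 1, 2, 5 and 7 uploaded and a first basebackup starting at segment 6, the function selects 1, 2 and 5 even though 3 and 4 are missing.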