Implement WAL replay and markers for loki.write
#5590
Thank you for this! I am super happy for this change. IMO it doesn't have to be super polished in order to be merged, but I just have a few main concerns:
- I'd like us to protect ourselves from corrupted marker files.
- Some of the new interfaces could use a few more comments :)
- I guess performance isn't impacted much, and we don't have to run any special benchmarks?
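One way the first concern could be addressed is to store a checksum next to the marked segment ID, so a truncated or bit-flipped marker file is rejected instead of being trusted. The sketch below is only an illustration: the 12-byte layout and the names `encodeMarker`, `decodeMarker`, and `errCorruptMarker` are assumptions, not the encoding used in this PR.

```go
package main

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
)

// markerSize is the encoded size: 8 bytes of segment ID plus 4 bytes of CRC32.
// This layout is an assumption made for illustration only.
const markerSize = 12

// errCorruptMarker is a hypothetical sentinel error for unreadable markers.
var errCorruptMarker = errors.New("marker file is corrupted")

// encodeMarker serializes a segment ID followed by a checksum of those bytes.
func encodeMarker(segment uint64) []byte {
	buf := make([]byte, markerSize)
	binary.BigEndian.PutUint64(buf[:8], segment)
	binary.BigEndian.PutUint32(buf[8:], crc32.ChecksumIEEE(buf[:8]))
	return buf
}

// decodeMarker returns the segment ID, or errCorruptMarker when the length
// or the checksum does not match what encodeMarker would have produced.
func decodeMarker(buf []byte) (uint64, error) {
	if len(buf) != markerSize {
		return 0, errCorruptMarker
	}
	if crc32.ChecksumIEEE(buf[:8]) != binary.BigEndian.Uint32(buf[8:]) {
		return 0, errCorruptMarker
	}
	return binary.BigEndian.Uint64(buf[:8]), nil
}
```

On a decode error, the caller could fall back to treating the marker as absent rather than resuming from a bogus segment.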
component/common/loki/client/internal/marker_file_handler_test.go
If someone deletes the WAL, but does not delete the marker file, would the WAL replay code work ok? I suppose in that case the code should know to ignore the segment ID in the marker file?
The marking logic is not connected to the WAL, so it can't tell whether the WAL was deleted. I guess for WAL deletions we should advise that the marker has to be deleted as well.
On the other hand, once the watcher reads from the marker, it tries to find the following segment by looking on disk, so since the WAL doesn't exist it will enter a retry loop until the WAL gets re-created.
I think we could add some additional logic to the marker interface, so the WAL can force a write to the marker file. For example, if it detects a WAL deletion, it could force-mark a segment.
* encoding wip
* using in LastMarkedSegment
* atomic marker write
* Apply suggestions from code review
  Co-authored-by: Paulin Todev <[email protected]>
---------
Co-authored-by: Paulin Todev <[email protected]>
@ptodev @mattdurham I just added two runs on top of the previous queue client benchmarks, one comparing the nil marker and real marker implementations. No significant difference whatsoever.
@@ -58,6 +59,15 @@ func NewWatcherMetrics(reg prometheus.Registerer) *WatcherMetrics {
 },
 []string{"id"},
 ),
 replaySegment: prometheus.NewGaugeVec(
I have some additional metrics that would be handy, but if you want to do them in another, more focused PR, that's fine: WAL size in bytes, plus the WAL's oldest timestamp and current timestamp. Kind of like the segment metric, but more actionable for alerts.
Down for checking that in another PR, but since our WAL writer uses Prometheus's wlog implementation, we inherit all these metrics: https://github.com/prometheus/prometheus/blob/3450572840881c870012ef80bff1877568698638/tsdb/wlog/wlog.go#L196-L205. I think we don't have timestamps, so that one might be interesting for alerting on the WAL not being written or falling too far behind (for example, to measure the lag from WAL write to watcher read).
@mattdurham @ptodev the build is green and all conversations are resolved. Can you take a final pass when you have some time?
There are some small nits and a few side conversations but looking solid.
PR Description
This PR branches off #5434, and applies the work done in ptodev/prometheus#1 to improve two things:
loki.write, so upon an agent restart, it will replay "not sent" data saved in the WAL segments that are alive
TODOs
loki.write #5590 (comment)
Which issue(s) this PR fixes
Notes to the Reviewer
PR Checklist