Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libbeat] Track frame counts in saved segments in the disk queue #22970

Merged
merged 17 commits into from
May 25, 2021

Conversation

faec
Copy link
Contributor

@faec faec commented Dec 7, 2020

What does this PR do?

This PR adds the plumbing to keep track of how many frames (events) are stored on disk at a given time, by adding a frameCount field to the segment file header and a frameIndex field to the queuePosition structure stored in the queue state. The initial version of the queue only tracked byte counts and positions, which don't convert easily to frame counts.

On startup, the queue now attempts to load the segment header from any preexisting segments. If the segment header has no frame count (either because it's from a previous version or because the segment was not closed cleanly), it attempts to calculate the value manually with a linear scan of the segment's frame headers.

Since the pipeline metrics are not yet accessible to the queue, this PR has no user-visible changes except for a few log messages.

Why is it important?

This is necessary preparation for reporting the "real" active event count for the disk queue as required by #22602

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@faec faec added enhancement libbeat Team:Integrations Label for the Integrations team labels Dec 7, 2020
@faec faec requested review from urso and fearful-symmetry December 7, 2020 21:21
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Dec 7, 2020
@elasticmachine
Copy link
Collaborator

elasticmachine commented Dec 7, 2020

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #22970 updated

  • Start Time: 2021-05-25T16:48:23.194+0000

  • Duration: 138 min 20 sec

  • Commit: 27c5f09

Test stats 🧪

Test Results
Failed 0
Passed 47538
Skipped 5248
Total 52786

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 47538
Skipped 5248
Total 52786

Copy link
Contributor

@fearful-symmetry fearful-symmetry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with the caveat that I don't know nearly enough about this as you.

libbeat/publisher/queue/diskqueue/reader_loop.go Outdated Show resolved Hide resolved
libbeat/publisher/queue/diskqueue/segments.go Outdated Show resolved Hide resolved
libbeat/publisher/queue/diskqueue/segments.go Outdated Show resolved Hide resolved
libbeat/publisher/queue/diskqueue/segments.go Outdated Show resolved Hide resolved
libbeat/publisher/queue/diskqueue/segments.go Outdated Show resolved Hide resolved
libbeat/publisher/queue/diskqueue/writer_loop.go Outdated Show resolved Hide resolved
@botelastic
Copy link

botelastic bot commented Jan 17, 2021

Hi!
We just realized that we haven't looked into this PR in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it in as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Jan 17, 2021
@botelastic botelastic bot removed the Stalled label Jan 27, 2021
@botelastic
Copy link

botelastic bot commented Feb 26, 2021

Hi!
We just realized that we haven't looked into this PR in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it in as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Feb 26, 2021
@botelastic
Copy link

botelastic bot commented Mar 28, 2021

Hi!
This PR has been stale for a while and we're going to close it as part of our cleanup procedure.
We appreciate your contribution and would like to apologize if we have not been able to review it, due to the current heavy load of the team.
Feel free to re-open this PR if you think it should stay open and is worth rebasing.
Thank you for your contribution!

@botelastic
Copy link

botelastic bot commented Apr 28, 2021

Hi!
We just realized that we haven't looked into this PR in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it in as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Apr 28, 2021
@botelastic botelastic bot removed the Stalled label May 13, 2021
// If the segment is still in the writing list, we can't discard it
// until the writer loop is done with it, but we can hope that advancing
// to the current write position will get us out of our error state.
dq.segments.nextReadPosition = segment.byteCount
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the way readers and writers are coordinated, an error in a segment that is currently written would likely indicate a bug in the business logic, a bug in the framing done by the writer, or an unexpected race condition. The system will most likely recover once we did start a new segment file. In this case we might want to use logger.Criticalf, in order to encourage users to report bugs.

Errors on older, already closed segment files might indicate a broken/invalid segment file, or a bug in the framing. The former is to be expected if the system was not shutdown cleanly. Error level would be enough I think.

fullPath := path.Join(pathStr, file.Name())
header, err := readSegmentHeaderWithFrameCount(fullPath)
if header == nil {
logger.Errorf("couldn't load segment file '%v': %v", fullPath, err)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to encourage us to use structured logging more. e.g. the loop could introduce its own logger like logger := logger.With("segment", segmentID(id)). Would we need the full path if we have had the segment ID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The full path is theoretically redundant since every segment is in the same directory, and the directory depends on the user configuration, but I thought it would be nice for the error to be explicit since I wouldn't expect whoever sees the message to know where the queue path is or how to find it.

@urso
Copy link

urso commented May 25, 2021

/test

@faec faec mentioned this pull request May 25, 2021
9 tasks
@faec faec merged commit 4b14493 into elastic:master May 25, 2021
@faec faec deleted the disk-queue branch May 25, 2021 19:47
@faec faec added the backport-v7.14.0 Automated backport with mergify label Jun 24, 2021
mergify bot pushed a commit that referenced this pull request Jun 24, 2021
faec added a commit that referenced this pull request Jun 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v7.14.0 Automated backport with mergify enhancement libbeat Team:Integrations Label for the Integrations team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants