Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[receiver/filelog] Threadpooling for maxconcurrency #19448

Conversation

VihasMakwana
Copy link
Contributor

Description:
Changes in core logic.
Major changes:

  1. makeReaders -> makeReader: Instead of creating all the readers at once, we'll go through the paths and create readers one by one. We also move known fingerprints and known files to the struct Manager and we reset this every poll cycle.
  2. Removing the maxBatches limit: As we've gone asynchronous, there's no need for such a limit. We can add a new limit like maxFilesPerCycle which limits the number of files matched per cycle. I have also removed the test cases for maxBatches
  3. Many new fields were added to Manager. pathHashLock: For atomically updating the path map. readerLock: For atomically tracking the lostReaders list, which will be used to detect lost files. knownFilesLock to atomically update the m.knownFiles.

Link to tracking Issue: #18908

Testing: The core functionality remains the same, but the approach is different. I have thus updated the test cases taking the asynchronous behavior into consideration.

Documentation: Updated design.md

@VihasMakwana VihasMakwana requested a review from a team March 10, 2023 20:05
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Mar 10, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@VihasMakwana VihasMakwana marked this pull request as draft March 10, 2023 20:06
@VihasMakwana VihasMakwana force-pushed the threadpooling_for_maxconcurrency branch 2 times, most recently from 944cd17 to 10124c9 Compare March 10, 2023 20:09
@djaglowski djaglowski changed the title Threadpooling for maxconcurrency [receiver/filelog] Threadpooling for maxconcurrency Mar 10, 2023
@dmitryax
Copy link
Member

@vihas-splunk, please sign the CLA

@runforesight
Copy link

runforesight bot commented Mar 12, 2023

Foresight Summary

    
Major Impacts

TestReadRotatingFiles ❌ failed 4 times in 8 runs (50% fail rate).
TestReadRotatingFiles/CopyTruncateTimestamped ❌ failed 4 times in 8 runs (50% fail rate).
build-and-test duration(19 minutes 38 seconds) has decreased 26 minutes 46 seconds compared to main branch avg(46 minutes 24 seconds).
View More Details

⭕  build-and-test-windows workflow has finished in 10 seconds (31 minutes 33 seconds less than main branch avg.) and finished at 7th Apr, 2023.


Job Failed Steps Tests
windows-unittest-matrix -     🔗  N/A See Details
windows-unittest -     🔗  N/A See Details

✅  telemetrygen workflow has finished in 1 minute 12 seconds and finished at 7th Apr, 2023.


Job Failed Steps Tests
build-dev -     🔗  N/A See Details
publish-latest -     🔗  N/A See Details
publish-stable -     🔗  N/A See Details

✅  check-links workflow has finished in 1 minute 40 seconds (⚠️ 42 seconds more than main branch avg.) and finished at 7th Apr, 2023.


Job Failed Steps Tests
changed files -     🔗  N/A See Details
check-links -     🔗  N/A See Details

✅  changelog workflow has finished in 2 minutes 52 seconds and finished at 7th Apr, 2023.


Job Failed Steps Tests
changelog -     🔗  N/A See Details

✅  prometheus-compliance-tests workflow has finished in 3 minutes 27 seconds (2 minutes 55 seconds less than main branch avg.) and finished at 7th Apr, 2023.


Job Failed Steps Tests
prometheus-compliance-tests -     🔗  N/A See Details

✅  load-tests workflow has finished in 6 minutes 16 seconds (4 minutes 6 seconds less than main branch avg.) and finished at 7th Apr, 2023.


Job Failed Steps Tests
setup-environment -     🔗  N/A See Details
loadtest (TestIdleMode) -     🔗  N/A See Details
loadtest (TestBallastMemory|TestLog10kDPS) -     🔗  N/A See Details
loadtest (TestMetric10kDPS|TestMetricsFromFile) -     🔗  N/A See Details
loadtest (TestMetricResourceProcessor|TestTrace10kSPS) -     🔗  N/A See Details
loadtest (TestTraceNoBackend10kSPS|TestTrace1kSPSWithAttrs) -     🔗  N/A See Details
loadtest (TestTraceBallast1kSPSWithAttrs|TestTraceBallast1kSPSAddAttrs) -     🔗  N/A See Details
loadtest (TestTraceAttributesProcessor) -     🔗  N/A See Details

✅  e2e-tests workflow has finished in 12 minutes 44 seconds and finished at 7th Apr, 2023.


Job Failed Steps Tests
kubernetes-test (v1.26.0) -     🔗  N/A See Details
kubernetes-test (v1.25.3) -     🔗  N/A See Details
kubernetes-test (v1.24.7) -     🔗  N/A See Details
kubernetes-test (v1.23.13) -     🔗  N/A See Details

❌  build-and-test workflow has finished in 19 minutes 38 seconds (26 minutes 46 seconds less than main branch avg.) and finished at 7th Apr, 2023. 6 jobs failed.


Job Failed Steps Tests
setup-environment -     🔗  N/A See Details
check-codeowners -     🔗  N/A See Details
lint-matrix (receiver-0) -     🔗  N/A See Details
lint-matrix (receiver-1) -     🔗  N/A See Details
lint-matrix (processor) -     🔗  N/A See Details
lint-matrix (exporter) -     🔗  N/A See Details
lint-matrix (extension) -     🔗  N/A See Details
lint-matrix (connector) -     🔗  N/A See Details
lint-matrix (internal) -     🔗  N/A See Details
lint-matrix (other) Lint     🔗  N/A See Details
build-examples -     🔗  N/A See Details
check-collector-module-version -     🔗  N/A See Details
checks Porto     🔗  N/A See Details
correctness-metrics -     🔗  N/A See Details
correctness-traces -     🔗  N/A See Details
integration-tests -     🔗  N/A See Details
unittest-matrix (1.20, receiver-0) Run Unit Tests     🔗  N/A See Details
unittest-matrix (1.20, receiver-1) -     🔗  N/A See Details
unittest-matrix (1.20, processor) -     🔗  N/A See Details
unittest-matrix (1.20, exporter) -     🔗  N/A See Details
unittest-matrix (1.20, extension) -     🔗  N/A See Details
unittest-matrix (1.20, connector) -     🔗  N/A See Details
unittest-matrix (1.20, internal) -     🔗  N/A See Details
unittest-matrix (1.20, other) -     🔗  N/A See Details
unittest-matrix (1.19, receiver-0) -     🔗  N/A See Details
unittest-matrix (1.19, receiver-1) -     🔗  N/A See Details
unittest-matrix (1.19, processor) -     🔗  N/A See Details
unittest-matrix (1.19, exporter) -     🔗  N/A See Details
unittest-matrix (1.19, extension) -     🔗  N/A See Details
unittest-matrix (1.19, connector) -     🔗  N/A See Details
unittest-matrix (1.19, internal) -     🔗  N/A See Details
unittest-matrix (1.19, other) -     🔗  N/A See Details
unittest (1.20) Interpret result     🔗  N/A See Details
unittest (1.19) Interpret result     🔗  N/A See Details
lint Interpret result     🔗  N/A See Details
cross-compile -     🔗  N/A See Details
windows-msi -     🔗  N/A See Details
build-package -     🔗  N/A See Details
publish-dev -     🔗  N/A See Details
publish-check -     🔗  N/A See Details
publish-stable -     🔗  N/A See Details
rotate-milestone -     🔗  N/A See Details

🔎 See details on Foresight

*You can configure Foresight comments in your organization settings page.

@VihasMakwana VihasMakwana force-pushed the threadpooling_for_maxconcurrency branch 2 times, most recently from 7700de9 to 046888d Compare March 13, 2023 05:31
@VihasMakwana VihasMakwana marked this pull request as ready for review March 13, 2023 05:32
@VihasMakwana VihasMakwana force-pushed the threadpooling_for_maxconcurrency branch from 046888d to f301bb4 Compare March 13, 2023 05:52
@VihasMakwana VihasMakwana force-pushed the threadpooling_for_maxconcurrency branch 2 times, most recently from d980857 to b1fa948 Compare March 25, 2023 16:27
@VihasMakwana VihasMakwana requested a review from djaglowski March 28, 2023 10:15
@VihasMakwana VihasMakwana force-pushed the threadpooling_for_maxconcurrency branch from b1fa948 to ac96f12 Compare March 31, 2023 06:23
@VihasMakwana VihasMakwana force-pushed the threadpooling_for_maxconcurrency branch from ac96f12 to 1c27447 Compare April 1, 2023 10:00
1. Each file handle is wrapped into a `Reader` along with some metadata. (See Reader section above)
- During the creation of a `Reader`, the file's fingerprint is cross referenced with previously known fingerprints.
- If a file's fingerprint matches one that has recently been seen, then metadata is copied over from the previous iteration of the Reader. Most importantly, the offset is accurately maintained in this way.
- If a file's fingerprint does not match any recently seen files, then its offset is initialized according to the `start_at` setting.
9. Detection of Lost Files
8. Detection of Lost Files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think behavior described here has been changed in this PR. Moreover, the term "lost files" no longer seems to be an apt description of how we're managing file handles.

What do you think about dropping this terminology and just describing how we hold on to file handles as long as possible, and then read to the end one last time before closing them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

// We have reached end of the current path and all the previous characters have matched
// Return if current node is leaf and it is not root
if node.isLeaf() && node != trie {
return r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't r just the last byte?

Suggested change
return r
return key

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

@VihasMakwana VihasMakwana force-pushed the threadpooling_for_maxconcurrency branch from e7d3380 to a5d6824 Compare April 11, 2023 12:56
@VihasMakwana VihasMakwana requested a review from djaglowski April 11, 2023 12:56
@VihasMakwana VihasMakwana force-pushed the threadpooling_for_maxconcurrency branch 2 times, most recently from 668b4d1 to 4e381b5 Compare April 11, 2023 17:01
Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @VihasMakwana, thanks for your patience on this. I'd still very much like to see this PR through but I've had little time to focus on it.

I will try my best to give a more thorough review later this week, but a couple thoughts at this time:

  • This is a significant change to the way this component works and we have many people depending on it in production. As a result, I think we should enable this functionality initially behind a feature gate. This means we should be able to run the component just as it was before these changes, so we'll need to identify the point of departure within the codebase.
  • Any small parts of this PR that can be submitted separately could potentially be accepted without a feature gate, since they can be evaluated more carefully in isolation. We can do this before or after the feature gate is added. A couple ideas that immediately come to mind are the test refactoring and potentially a minimal switch to makeReader.

// We have reached end of the current path and all the previous characters have matched
// Return if current node is leaf and it is not root
if node.isLeaf() && node != trie {
return r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

@VihasMakwana
Copy link
Contributor Author

Hi @VihasMakwana, thanks for your patience on this. I'd still very much like to see this PR through but I've had little time to focus on it.

I will try my best to give a more thorough review later this week, but a couple thoughts at this time:

  • This is a significant change to the way this component works and we have many people depending on it in production. As a result, I think we should enable this functionality initially behind a feature gate. This means we should be able to run the component just as it was before these changes, so we'll need to identify the point of departure within the codebase.
  • Any small parts of this PR that can be submitted separately could potentially be accepted without a feature gate, since they can be evaluated more carefully in isolation. We can do this before or after the feature gate is added. A couple ideas that immediately come to mind are the test refactoring and potentially a minimal switch to makeReader.

Sounds good to me.
A feature gate makes much more sense as this is a significant change.
I'll separate the parts which can be accepted without a feature gate.

@dmitryax
Copy link
Member

dmitryax commented Apr 30, 2023

@VihasMakwana, did you run this build to confirm that it resolves the original issue?

Also, it'd be great to highlight what other benefits it provides for the end user. Does it improve logs collection throughput? Can you please add benchmarks?

@VihasMakwana
Copy link
Contributor Author

VihasMakwana commented May 3, 2023

@VihasMakwana, did you run this build to confirm that it resolves #17846?

Also, it'd be great to highlight what other benefits it provides for the end user. Does it improve logs collection throughput? Can you please add benchmarks?

Yeah, I did test this out and it resolved the original issue.
I'll add some benchmarks and highlight the benefits.

@VihasMakwana
Copy link
Contributor Author

@djaglowski what I'm thinking is to create some other PRs before finalizing this one,
one to refactor, and one to update the test cases.
I'll put this PR in draft mode till then I guess.
Can you please share your thoughts?

@djaglowski
Copy link
Member

@djaglowski what I'm thinking is to create some other PRs before finalizing this one, one to refactor, and one to update the test cases. I'll put this PR in draft mode till then I guess. Can you please share your thoughts?

@VihasMakwana, that sounds great. Thank you again.

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label May 26, 2023
@github-actions
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jun 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants