Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict the number of results processed per index worker #3269

Merged
merged 26 commits into from
Mar 1, 2021

Conversation

ryanhall07
Copy link
Collaborator

@ryanhall07 ryanhall07 commented Feb 23, 2021

What this PR does / why we need it:

A new MaxResultsPerPermit option is introduced to cap how many index
results an index worker can process at a time. If the max is exceeded, the
index worker must yield the permit back and acquire it again
(potentially waiting) to continue processing the results.

This cap ensures large queries don't dominate the finite number of index
workers allowed to run concurrently and lock out smaller queries. The
idea is users would want to set the max large enough so the vast
majority of typical queries can finish with only a single permit
acquisition.
Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:


Does this PR require updating code package or user-facing documentation?:


@ryanhall07 ryanhall07 force-pushed the rhall-worker-pool-iter branch 3 times, most recently from a05f145 to 706d4c6 Compare February 23, 2021 19:14
A new MaxResultsPerPermit option is introduced to cap how many index
results an index worker can process at a time. If the max is exceeded, the
index worker must yield the permit back and acquire it again
(potentially waiting) to continue processing the results.

This cap ensures large queries don't dominate the finite number of index
workers allowed to run concurrently and lock out smaller queries. The
idea is users would want to set the max large enough so the vast
majority of typical queries can finish with only a single permit
acquisition.
@ryanhall07 ryanhall07 force-pushed the rhall-worker-pool-iter branch from 706d4c6 to 9abfd17 Compare February 23, 2021 19:19
@ryanhall07 ryanhall07 changed the title WIP Restrict the number of results processed per index worker Feb 23, 2021
@ryanhall07 ryanhall07 marked this pull request as ready for review February 23, 2021 19:34
@codecov
Copy link

codecov bot commented Feb 23, 2021

Codecov Report

Merging #3269 (b7ab449) into master (73bac90) will decrease coverage by 0.0%.
The diff coverage is 81.3%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #3269     +/-   ##
=========================================
- Coverage    72.5%    72.4%   -0.1%     
=========================================
  Files        1099     1101      +2     
  Lines      101504   101562     +58     
=========================================
- Hits        73616    73607      -9     
- Misses      22830    22866     +36     
- Partials     5058     5089     +31     
Flag Coverage Δ
aggregator 76.5% <ø> (+<0.1%) ⬆️
cluster 84.9% <ø> (+<0.1%) ⬆️
collector 84.3% <ø> (ø)
dbnode 78.8% <81.3%> (-0.2%) ⬇️
m3em 74.4% <ø> (ø)
m3ninx 73.5% <ø> (ø)
metrics 19.9% <ø> (ø)
msg 74.2% <ø> (-0.2%) ⬇️
query 67.3% <ø> (ø)
x 80.5% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73bac90...1337a6d. Read the comment docs.

}
permits.Release()
}
blockIter.searchTime += blockIter.iter.SearchDuration()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a nice plus of this change is we no longer have to acquire locks to update the timing info on the shared results. each goroutine has its own timing information and the sub results are added together when all queries are done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -64,6 +64,9 @@ const (
aggregateResultsEntryArrayPoolSize = 256
aggregateResultsEntryArrayPoolCapacity = 256
aggregateResultsEntryArrayPoolMaxCapacity = 256 // Do not allow grows, since we know the size

// defaultResultsPerPermit sets the default index results that can be processed per permit acquired.
defaultResultsPerPermit = 10000
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what a good default is. I'm hesitant to make the default really large or off, since we changed the access pattern for acquiring workers. now all blocks eagerly attempt to acquire workers in parallel, so without some kind of max, larger queries will dominate even more.

select {
case f.permits <- struct{}{}:
default:
panic("more permits released than acquired")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure how we feel about panics in the code base. fwiw the std go semaphore panics for the same reason. this can only happen due to a logical bug, which should be caught with a test

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah seems sane. Although yes usually try to avoid panics. Alternatively we could make it return an error and queries start failing.

Copy link
Collaborator

@wesleyk wesleyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome, LGTM!

src/cmd/services/m3dbnode/config/config.go Outdated Show resolved Hide resolved
src/dbnode/storage/index.go Show resolved Hide resolved
QueryOptions{SeriesLimit: 3},
results,
10,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how are we deciding on the limit here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty arbitrary. high enough for the tests to pass with a single iteration.

src/dbnode/storage/index/types.go Outdated Show resolved Hide resolved
@@ -775,15 +814,6 @@ func TestLimits(t *testing.T) {
requireExhaustive: false,
expectedErr: "",
},
{
name: "no limits",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this no longer relevant?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't make sense that exhaustive would be false if there is no limit. just seems made up for the test.

src/dbnode/storage/limits/permits/fixed_permits_test.go Outdated Show resolved Hide resolved
// make sure the query hasn't been canceled
if queryCanceled() {
// acquire a permit before kicking off the goroutine to process the iterator. this limits the number of
// concurrent goroutines to # of permits + large queries that needed multiple iterations to finish.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@@ -387,6 +387,14 @@ type IndexConfiguration struct {
// as they are very CPU-intensive (regex and FST matching).
MaxQueryIDsConcurrency int `yaml:"maxQueryIDsConcurrency" validate:"min=0"`

// MaxResultsPerPermit is the maximum index results a query can process after obtaining a permit. If a query needs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we split this up into different values for 1) regular queries 2) aggregate queries?

Aggregate queries when scoped with match[]=... are relatively more expensive per iteration since they have to do postings list interception with each aggregate term which is progressed to, as per:

fti.current.term, fti.current.postings = fti.termIter.Current()
if fti.restrictByPostings == nil {
// No restrictions.
return true, nil
}
bitmap, ok := roaring.BitmapFromPostingsList(fti.current.postings)
if !ok {
return false, errUnpackBitmapFromPostingsList
}
// Check term is part of at least some of the documents we're
// restricted to providing results for based on intersection
// count.
// Note: IntersectionCount is significantly faster than intersecting and
// counting results and also does not allocate.
if n := fti.restrictByPostings.IntersectionCount(bitmap); n > 0 {
// Matches, this is next result.
return true, nil
}
(term progression)
and
field, pl := fieldIter.Current()
if !fti.opts.allow(field) {
continue
}
if fti.restrictByPostings == nil {
// No restrictions.
fti.current.field = field
return true
}
bitmap, ok := roaring.BitmapFromPostingsList(pl)
if !ok {
fti.err = errUnpackBitmapFromPostingsList
return false
}
// Check field is part of at least some of the documents we're
// restricted to providing results for based on intersection
// count.
// Note: IntersectionCount is significantly faster than intersecting and
// counting results and also does not allocate.
if n := fti.restrictByPostings.IntersectionCount(bitmap); n < 1 {
// No match, not next result.
continue
}
// Matches, this is next result.
fti.current.field = field
return true
(field progression)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could do this in a followup? Might be too much to add to scope of this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's pretty easy to do now, just 2 different config values.

Comment on lines +461 to +463
docs, series := iter.Counts()
b.metrics.queryDocsMatched.RecordValue(float64(docs))
b.metrics.querySeriesMatched.RecordValue(float64(series))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

Comment on lines 422 to 428
// MaxResultsPerWorkerConfiguration configures the max results per index worker.
type MaxResultsPerWorkerConfiguration struct {
// Fetch is the max for fetch queries.
Fetch int `yaml:"fetch"`
// Aggregate is the max for aggregate queries.
Aggregate int `yaml:"aggregate"`
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anymore? Seems like it can be removed.

Copy link
Collaborator

@robskillington robskillington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than removing MaxResultsPerWorkerConfiguration

@ryanhall07 ryanhall07 merged commit 973caaf into master Mar 1, 2021
@ryanhall07 ryanhall07 deleted the rhall-worker-pool-iter branch March 1, 2021 16:48
@@ -561,6 +528,9 @@ func (b *block) QueryWithIter(
}
}

iter.AddSeries(size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryanhall07 @robskillington Just verifying with you as I read this code multiple times, and I think I found a bug:
size is the number of elements in results as effectively returned originally here:

func (r *results) AddDocuments(batch []doc.Document) (int, int, error) {
	r.Lock()
	err := r.addDocumentsBatchWithLock(batch)
	size := r.resultsMap.Len()
	docsCount := r.totalDocsCount + len(batch)
	r.totalDocsCount = docsCount
	r.Unlock()
	return size, docsCount, err
}

Since results only keeps growing, as I haven't seen any reset to this object, it means that when you call iter.Counts() to get the seriesMatched, you will get a wrong number, since the AddSeries(size) keeping adding up total and not adding delta. Maybe the correct way is to measure results before adding and after to get that delta?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea this looks like a bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants