colexec: index out of range panic from vectorized engine #57198

Closed
irfansharif opened this issue Nov 27, 2020 · 3 comments · Fixed by #57483
Labels
A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@irfansharif
Contributor

Saw the following index out of range here.

=== CONT  TestLogic/fakedist-spec-planning/inverted_join_json_array
    logic.go:2419: 
        
        testdata/logic_test/inverted_join_json_array:320: SELECT * FROM
        (SELECT j1.a, j2.a FROM json_tab@foo_inv AS j1, json_tab AS j2 WHERE j1.b @> j2.b) AS inv_join(a1, a2)
        FULL OUTER JOIN
        (SELECT j1.a, j2.a FROM json_tab@primary AS j1, json_tab AS j2 WHERE j1.b @> j2.b) AS cross_join(a1, a2)
        ON inv_join.a1 = cross_join.a1 AND inv_join.a2 = cross_join.a2
        WHERE inv_join.a1 IS NULL OR cross_join.a1 IS NULL
        expected success, but found
        (XX000) internal error: unexpected error from the vectorized engine: runtime error: index out of range [3] with length 3
        error.go:90: in func1()
        DETAIL: stack trace:
        github.com/cockroachdb/cockroach/pkg/sql/colexecbase/colexecerror/error.go:90: func1()
        runtime/panic.go:969: gopanic()
        runtime/panic.go:88: goPanicIndex()
        github.com/cockroachdb/cockroach/pkg/col/coldata/vec.eg.go:620: Copy()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/hashjoiner.go:570: func1()
        github.com/cockroachdb/cockroach/pkg/sql/colmem/allocator.go:292: PerformOperation()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/hashjoiner.go:564: congregate()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/hashjoiner.go:546: exec()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/hashjoiner.go:281: Next()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/disk_spiller.go:201: func1()
        github.com/cockroachdb/cockroach/pkg/sql/colexecbase/colexecerror/error.go:93: CatchVectorizedRuntimeError()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/disk_spiller.go:199: Next()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/operator.go:408: Next()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/buffer.go:62: advance()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/case.go:110: Next()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/bool_vec_to_sel.go:150: Next()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/bool_vec_to_sel.go:57: Next()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/simple_project.go:124: Next()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/invariants_checker.go:44: Next()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/parallel_unordered_synchronizer.go:195: 1()
        github.com/cockroachdb/cockroach/pkg/sql/colexecbase/colexecerror/error.go:93: CatchVectorizedRuntimeError()
        github.com/cockroachdb/cockroach/pkg/sql/colexec/parallel_unordered_synchronizer.go:229: func2()
        runtime/asm_amd64.s:1374: goexit()
        
        NOTE: internal errors may have more details in logs. Use -show-logs.
    logic.go:2164: 
         pq: internal error: unexpected error from the vectorized engine: runtime error: index out of range [3] with length 3
--- done: testdata/logic_test/inverted_join_json_array with config fakedist-spec-planning: 8 tests, 2 failures
    logic.go:2927: 
        testdata/logic_test/inverted_join_json_array:329: error while processing
    logic.go:2927: testdata/logic_test/inverted_join_json_array:329: too many errors encountered, skipping the rest of the input
        --- FAIL: TestLogic/fakedist-spec-planning/inverted_join_json_array (1.18s)

This was on #57155, rebased atop e9d66e5 at the time. I think my PR is unrelated to the panic above.

@irfansharif irfansharif added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-execution Relating to SQL execution. labels Nov 27, 2020
@yuzefovich
Member

Huh, this issue is quite puzzling. I've stressed the query both on my laptop (for 5 min) and on the gceworker (for an hour) on several fakedist configs on e9d66e5 and was unable to reproduce it. I'll kick off a run for the night on the gceworker.

@yuzefovich
Member

I've been stressing the queries for over 16 hours on gceworker on all logic test configs and still wasn't able to reproduce it.

I read through the code where the panic occurred and I'm not sure how it could happen. My observations:

  • panic came from the last line in this loop
for i := range sel[args.SrcStartIdx:args.SrcEndIdx] {
	selIdx := sel[args.SrcStartIdx+i]
	v := fromCol.Get(selIdx)
	toCol[i+args.DestIdx] = v
}

indicating that we tried to assign to toCol[3] but len(toCol) == 3

  • the call to Vec.Copy is made with the following arguments
SliceArgs: coldata.SliceArgs{
	Src:       valCol,
	Sel:       hj.probeState.probeIdx,
	SrcEndIdx: nResults,
},
  • args.DestIdx and args.SrcStartIdx are not set, so both must be 0. That means i in the loop above must have reached 3, so args.SrcEndIdx - args.SrcStartIdx > 3, which gives us nResults > 3. But we called ResetMaybeReallocate with a capacity of nResults, so the output should have had enough capacity.

OK, typing this out made me realize there is a scenario in which we could hit this out-of-bounds error: if coldata.BatchSize() is 3 yet nResults is somehow larger than 3, ResetMaybeReallocate truncates the capacity. I'll keep digging.

@yuzefovich
Member

Hm, I'm starting to think that this issue has the same root cause as #51156: there might be a delay in propagating the cluster setting update.

I've reviewed the hash joiner's collecting code, and I don't see how nResults could be larger than coldata.BatchSize() unless the latter is modified at runtime. I think it is likely that recent work on making the vectorized engine more dynamic (including the hash joiner adjustment) makes #51156 occur more often.
