Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: function resolution errors when testing_optimizer_disable_rule_probability enabled #94890

Closed
SteveLeungYL opened this issue Jan 8, 2023 · 5 comments · Fixed by #105582
Closed
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner

Comments

@SteveLeungYL
Copy link

SteveLeungYL commented Jan 8, 2023

Describe the problem

The latest version of the CockroachDB (d8d16e2) shows Internal Error when executing the following query:

SET testing_optimizer_disable_rule_probability = 1.0;
SELECT * FROM ROWS FROM (COALESCE(CRDB_INTERNAL.DECODE_PLAN_GIST('abc')));

To Reproduce

Here is the detail steps to reproduce the bug.

  1. In operating system Ubuntu 20.04, download the CockroachDB source code from the github source.
  2. Use the latest version of the CockroachDB code (tested version: d8d16e2)
  3. Directly make install in the root repository folder.
  4. Run ./cockroach demo, and then paste the PoC query to the cockroach cli environment.
  5. Observe the Internal Error and log the stack information.

Expected behavior
Shows error such as the following:

ERROR: illegal base64 data at input byte 0

Additional data / screenshots

Here is the outputted stack trace:

[email protected]:26257/movr> SET testing_optimizer_disable_rule_probability = 1.0;
SET

Time: 3ms total (execution 2ms / network 1ms)

warning: error retrieving the database name: ERROR: use of crdb_internal_vtable_pk column not allowed. (SQLSTATE XXUUU)
[email protected]:26257/?> SELECT * FROM ROWS FROM (COALESCE(CRDB_INTERNAL.DECODE_PLAN_GIST('abc')));
ERROR: internal error: unexpected error from the vectorized engine: interface conversion: tree.FnOverload is nil, not func(context.Context, *eval.Context, tree.Datums) (tree.Datum, error)
SQLSTATE: XX000
DETAIL: stack trace:
github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:89: func1()
runtime/panic.go:884: gopanic()
runtime/iface.go:262: panicdottypeE()
github.com/cockroachdb/cockroach/pkg/sql/sem/eval/expr.go:479: EvalFuncExpr()
github.com/cockroachdb/cockroach/pkg/sql/sem/tree/eval_expr_generated.go:276: Eval()
github.com/cockroachdb/cockroach/pkg/sql/sem/eval/expr.go:197: EvalCoalesceExpr()
github.com/cockroachdb/cockroach/pkg/sql/sem/tree/eval_expr_generated.go:101: Eval()
github.com/cockroachdb/cockroach/pkg/sql/sem/eval/expr.go:26: Expr()
github.com/cockroachdb/cockroach/pkg/sql/execinfrapb/expr.go:248: Eval()
github.com/cockroachdb/cockroach/pkg/sql/rowexec/project_set.go:241: nextGeneratorValues()
github.com/cockroachdb/cockroach/pkg/sql/rowexec/project_set.go:292: Next()
github.com/cockroachdb/cockroach/pkg/sql/colexec/columnarizer.go:243: Next()
github.com/cockroachdb/cockroach/pkg/sql/colflow/stats.go:116: next()
github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:92: CatchVectorizedRuntimeError()
github.com/cockroachdb/cockroach/pkg/sql/colflow/stats.go:124: Next()
github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:252: nextAdapter()
github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:92: CatchVectorizedRuntimeError()
github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:256: next()
github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:294: Run()
github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:309: Run()
github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:827: Run()
github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:1795: PlanAndRun()
github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:1547: PlanAndRunAll()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1633: execWithDistSQLEngine()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1253: dispatchToExecutionEngine()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:738: execStmtInOpenState()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:130: func1()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:2522: execWithProfiling()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:129: execStmt()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1959: func1()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1964: execCmd()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1882: run()

HINT: You have encountered an unexpected error.

Environment:

  • CockroachDB version: v23.1.0-alpha
  • Server OS: Ubuntu 20.04 LTS
  • Client app: CockroachDB demo command. (./cockroach demo)
  • Detailed CockroachDB version output:
cockroach version details:
Build Tag:        v23.1.0-alpha.1-642-gd8d16e2824
Build Time:       2023/01/08 00:10:58
Distribution:     CCL
Platform:         linux arm64 (aarch64-linux-gnu)
Go Version:       go1.19
C Compiler:       gcc 9.4.0
Build Commit ID:  d8d16e28249b2834331ebfcba4a5b2b2fde2eb0c
Build Type:       development

Additional context

This Internal Error problem may not happen in the real-world application, since it requires the specific setup: SET testing_optimizer_disable_rule_probability = 1.0;, which is a developer option used to control the query optimization. However, an Internal Error is still unexpected from the resulted behavior. Therefore, this problem is reported here.

Jira issue: CRDB-23209

@SteveLeungYL SteveLeungYL added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 8, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jan 8, 2023

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/sql-queries (found keywords: optimizer,vectorized,distsql,PLAN)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Jan 8, 2023
@rafiss rafiss changed the title sql: query triggers Internal Error sql: unhandled decoding error in decode_plan_gist Jan 17, 2023
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jan 17, 2023
@cucaroach
Copy link
Contributor

This has nothing to do with GISTS, any function will do, ie:

SELECT * FROM ROWS FROM (COALESCE(CRDB_INTERNAL.SHOW_CREATE_ALL_TYPES('abc'))) 

There is some kind of failure to resolve passing the results of a builtin generator function to COALESCE that is somehow only triggered when SET testing_optimizer_disable_rule_probability = 1.0;

@cucaroach cucaroach changed the title sql: unhandled decoding error in decode_plan_gist sql: function resolution errors when testing_optimizer_disable_rule_probability enabled Jan 18, 2023
@cucaroach
Copy link
Contributor

There's really two issues here, the fact that we can't get the database name is weird:

warning: error retrieving the database name: ERROR: use of crdb_internal_vtable_pk column not allowed. (SQLSTATE XXUUU)

This makes me think there's something fundamentally broken when testing_optimizer_disable_rule_probability is set to 1.0 and whatever it is possibly also breaks function resolution somehow.

@cucaroach
Copy link
Contributor

@yuzefovich ostensibly fixed this "column not allowed" error here. But SHOW database doesn't work for me at that commit. There must be more to it. Moving to backlog for now.

craig bot pushed a commit that referenced this issue Jan 18, 2023
94720: sql: add crdb_internal.{node,cluster}_txn_execution_insights r=xinhaoz a=xinhaoz

Closes #93075

This commit adds 2 new virtual tables displaying execution
insights for transaction.
- crdb_internal.cluster_txn_execution_insights
- crdb_internal.node_txn_execution_insights

Release note (sql change):
2 new virtual tables displaying execution insights for transaction:
- crdb_internal.cluster_txn_execution_insights
- crdb_internal.node_txn_execution_insights

----------------

Only the 2nd commit is relevant here.

95258: server: fix UserSQLRoles to account for global privileges r=maryliag a=rafiss

Epic: None
Release note (bug fix): DB Console features that check for the
VIEWACTIVITYREDACTED privilege now also account for global privileges.

95373: server: clear tenant cookies when tenant is missing r=knz a=dhartunian

Previously, when we had cookies in an HTTP Request referencing a tenant, and that tenant didn't exist, we'd be stuck in an error loop constantly attempting to start a tenant that didn't exist.

This commit updates the error log in `server_controller.go` to clear cookies for multi-tenant sessions when we return an error in order to break the browser out of the bad update cycle and have it either continue happily with an insecure cluster that doesn't need a session anyway, or request a new one by routing the user to the login page.

Resolves: #95365
Resolves: #95366
Epic: CRDB-14545

Release note: None

95397: ui: fix sql stats page data refreshing r=xinhaoz a=xinhaoz

This commit fixes a couple of bugs with sql stats page data fetching.

First - we missed clearing the initial timeout to fetch data that was created on mount - it should be cleared when a new time range is selected on the page or when the page is unmounted. Currently, this bug can lead to unexpeted data refetches of the incorrect (previously selected) time range.

Second - with the introduction of new pages that use the time picker, such as insights, sql stats pages are no longer the only page that can change the time range. Previously, data was only fetched on mount for sql stats pages if it was not previously fetched, or a data refresh was scheduled based on last update time if the current time range was not a custom (fixed) range. We should now also refetch when the data is outdated due to a new time range being selected on a different page. This commit introduces the `isDataValid` prop to sql stats pages to determine this and refetch data accordingly.

Epic: none


CC - just showing all the fetching is still working:
https://www.loom.com/share/1b24ec65dbe64dcd9c23cd2be980b877

DB Console -
https://www.loom.com/share/0afe1722d7ab418bb7f41c2462f4d84a

95444: ci: publish `latest-master-build` docker image r=jlinder a=rail

Previously, our automation published latest docker images using `latest-v2x.x-build` tag. External automation that tracked the master branch would need to update the tag, when we change the version.

This PR adds an extra step to tag the master branch builds with `latest-master-build` docker tag.

Epic: none
Release note: None

95447: sem/builtins: fix population of generator builtins r=yuzefovich a=yuzefovich

Generator builtins are such that they should not evaluated as scalars (i.e. `Fn` and `FnWithExprs` evaluation functions should not be called on them). In order to highlight this as an assertion failure we initialize generator builtins with special functions in those two fields. However, previously the initialization was broken since we were modifying a copy of the builtin struct, and this is now fixed. There isn't much of a production impact though (if we ever tried to evaluate the generator as a scalar, it'd result in a nil pointer which would be caught by the vectorized engine panic-catcher; also, it seems very hard if possible to trigger this without the testing randomizations), so there is no release note.

Informs: #94890.
Epic: None

Release note: None

Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
cucaroach added a commit to cucaroach/cockroach that referenced this issue Jan 20, 2023
If a vtable scan query tries to use the dummy "0" column the exec
builder errors out, this typically won't happen thanks to prune columns
normalization rules and those rules are marked as "essential" but the
logic allowing those rules to be applied was flawed.

Epic: CRDB-20535
Informs: cockroachdb#94890
Release note: None
craig bot pushed a commit that referenced this issue Jan 20, 2023
…95564 #95583 #95606

90222: server: add api for decommission pre-flight checks r=AlexTalks a=AlexTalks

While we have an API for checking the status of an in-progress decommission, we did not previously have an API to execute sanity checks prior to requesting a node to move into the `DECOMMISSIONING` state. This adds an API to do just that, intended to be called by the CLI prior to issuing a subsequent `Decommission` RPC request.

Fixes #91568.

Release note: None

91458: roachtest/mixed-versions-compat: use corpus for master r=fqazi a=fqazi

Informs: #91350

The CI build scripts take advantage of the branch name to uplaod the corpus to GCS. Unfortunately, we have no way of know if the current build is master inside the roachtest. To address this, this patch supports fetching the master corpus as a fallback.

Release note: None

92826: multiregionccl: add a cold start latency test  r=ajwerner a=ajwerner

This commit adds a test which creates an MR serverless cluster and then
boots the sql pods in each region while disallowing connectivity to other
regions. It also simulates latency to make sure the routing logic works
and to provide a somewhat realistic picture of what to expect.

Epic: CRDB-18596

Release note: None

93758: server: evaluate decommission pre-checks r=kvoli a=AlexTalks

This adds support for the evaluation of the decommission readiness of a
node (or set of nodes), by simulating their liveness to have the
DECOMMISSIONING status and utilizing the allocator to ensure that we are
able to perform any actions needed to repair the range. This supports a
"strict" mode, in which case we expect all ranges to only need
replacement or removal due to the decommissioning status, or a more
permissive "non-strict" mode, which allows for other actions needed, as
long as they do not encounter errors in finding a suitable allocation
target. The non-strict mode allows us to permit situations where a range
may have more than one action needed to repair it, such as a range that
needs to reach its replication factor before the decommissioning replica
can be replaced, or a range that needs to finalize an atomic replication
change.

Depends on #94024.

Part of #91568

95007: admission: CPU slot adjustment and utilization metrics r=irfansharif a=sumeerbhola

Our existing metrics are gauges (total and used slots) which don't give us insight into what is happening at smaller time scales. This creates uncertainty when we observe admission queueing but the gauge samples show total slots consistenly greater than used slots. Additionally, if total slots is steady during queuing, it doesn't tell us whether that was because of roughly matching increments or decrements, or no increments/decrements.

The following metrics are added:
- admission.granter.slots_exhausted_duration.kv: cumulative duration when the slots were exhausted. This can give insight into how much exhaustion was occurring. It is insufficient to tell us whether 0.5sec/sec of exhaustion is due to a long 500ms of exhaustion and then non-exhaustion or alternating 1ms of exhaustion and non-exhaustion. But this is an improvement over what we have.
- admission.granter.slot_adjuster_{increments,decrements}.kv: Counts the increments and decrements of the total slots.
- admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative duration of short and long ticks, as indicated by the period in the CPULoad callback. We don't expect long period ticks when admission control is active (and we explicitly disable enforcement during long period ticks), but it helps us eliminate some hypothesis during incidents (e.g. long period ticks alternating with short period ticks causing a slow down in how fast we increment slots). Additionally, the sum of the rate of these two, if significantly < 1, would indicate that CPULoad frequency is lower than expected, say due to CPU overload.

Fixes #92673

Epic: none

Release note: None

95145: sql/stats: include partial stats in results of statsCache.GetTableStats r=rytaft a=michae2

We were not including partial stats in the list of table statistics returned by `statsCache.GetTableStats`. This was fine for the optimizer, which currently cannot use partial stats directly, but it was a problem for backup.

We'd like to use partial stats directly in the optimizer eventually, so this commit goes ahead and adds them to the results of `GetTableStats`. The optimizer then must filter them out. To streamline this we add some helper functions.

Finally, in an act of overzealous refactoring, this commit also changes `MergedStatistics` and `ForecastTableStatistics` to accept partial statistics and full statistics mixed together in the same input list. This simplifies the code that calls these functions.

Fixes: #95056
Part of: #93983

Epic: CRDB-19449

Release note: None

95387: kvserver: separate loadstats cpu nanos to raft/req r=andrewbaptist a=kvoli

Previously, loadstats tracked replica raft/request cpu nanos per second separately but returned both summed together in `load.ReplicaLoadStats`. This patch separates `RaftCPUNanosPerSecond` and
`RequestCPUNanosPerSecond` in the returned `load.ReplicaLoadStats` so that they may be used independently.

Informs #95380

Release note: None

95557: sql: Fix testing_optimizer_disable_rule_probability usage with vtables r=cucaroach a=cucaroach

If a vtable scan query tries to use the dummy "0" column the exec
builder errors out, this typically won't happen thanks to prune columns
normalization rules and those rules are marked as "essential" but the
logic allowing those rules to be applied was flawed.

Epic: CRDB-20535
Informs: #94890
Release note: None


95559: rpc: fix comment r=andreimatei a=andreimatei

This copy-pasta comment was mentioning a KV node, which was not right.

Release note: None
Epic: None

95564: cpustopwatch: s/grunning.Difference/grunning.Elapsed r=irfansharif a=irfansharif

`grunning.Elapsed()` is the API to use when measuring the running time spent doing some piece of work, with measurements from the start and end. This only exists due to `grunning.Time()`'s non-monotonicity, a bug in our runtime patch: #95529. The bug results in slight {over,under}-estimation of the running time (the latter breaking monotonicity), but is livable with our current uses of this library, including the one here in cpustopwatch. `grunning.Elapsed()` papers over this bug by 0-ing out when `grunning.Time()`stamps regress. This is unlike `grunning.Difference()` which would return the absolute value of the regression -- not what we want here.

Release note: None

95583: sql: fix cluster setting propagation flake take 2 r=cucaroach a=cucaroach

Previously we tried to fix this with one retry but that was
insufficient.  Extend it to all queries in this section of the test.

Release note: None
Epic: CRDB-20535


95606: backupccl: deflake TestScheduleChainingLifecycle r=adityamaru a=msbutler

This patch will skip the test if the machine clock is close to midnight, and increases the frequency of incremental backups to run every 2 minutes.

Previously, the backup schedule in this test used the following crontab recurrence: '*/5 * * * *'. In english, this means "run a full backup now, and then run a full backup every day at midnight, and an incremental every 5 minutes. This test also relies on an incremental backup running after the first full backup. But, what happens if the first full backup gets scheduled to run within 5 minutes of midnight? A second full backup may get scheduled before the expected incremental backup, breaking the invariant the test expects.

Fixes #88575 #95186

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
@cucaroach cucaroach assigned DrewKimball and unassigned cucaroach Jan 24, 2023
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jun 26, 2023
This patch adds checks during type-checking to ensure that generator functions
are not used in the arguments of `CASE` or `COALESCE` expressions. This mirrors
postgres behavior.

Fixes cockroachdb#97119
Fixes cockroachdb#94890

Release note (bug fix): CASE and COALESCE expressions now return an error when
passed a generator function as an argument. This mirrors postgres behavior.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 14, 2023
This patch adds checks during type-checking to ensure that generator functions
are not used in the arguments of `CASE`, `IF`, `COALESCE`, or `IFNULL`
expressions. This mirrors postgres behavior. This patch also corrects the error
message that is returned in these cases to say "set-returning" instead of
"generator".

Fixes cockroachdb#97119
Fixes cockroachdb#94890

Release note (bug fix): CASE, IF, COALESCE, and IFNULL expressions now return
an error when passed a generator function as an argument. This mirrors postgres
behavior.
craig bot pushed a commit that referenced this issue Jul 18, 2023
105582: opt: check that generator functions are not used in CASE or COALESCE r=DrewKimball a=DrewKimball

#### opt: check that generator functions are not used in CASE or COALESCE

This patch adds checks during type-checking to ensure that generator functions
are not used in the arguments of `CASE`, `IF`, `COALESCE`, or `IFNULL`
expressions. This mirrors postgres behavior. This patch also corrects the error
message that is returned in these cases to say "set-returning" instead of
"generator".

Fixes #97119
Fixes #94890

Release note (bug fix): CASE, IF, COALESCE, and IFNULL expressions now return
an error when passed a generator function as an argument. This mirrors postgres
behavior.

106944: bazel,dev: in `dev`, handle different `--compilation_mode`s correctly... r=rail a=rickystewart

... and switch our default compilation mode to `dbg`.

Under `fastbuild`, built binaries are stripped. This is not the case with `dbg`. Have `dev doctor` recommend using `dbg`, and either way make `dev` resilient to whatever kind of `--compilation_mode` you're using.

In principle or theory `dbg` is slower than `fastbuild`, but for the Go compiler it generates debuggable binaries by default and you have to opt-in to stripping, so it shoould make no real difference.

Now, we think of the `--compilation_mode` simply as follows: `dbg` is the default that everyone is opted into by default unless they explicitly set `-c opt` (we use this for release). For now I don't see a reason anyone would need `fastbuild`.

Epic: CRDB-17171
Release note: None
Closes: #106820

107086: server: minor improvement around TestTenantInterface r=yuzefovich a=yuzefovich

Addresses: #76378.
Fixes: #106903.


Co-authored-by: Drew Kimball <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in 5cc456b Jul 18, 2023
@SteveLeungYL
Copy link
Author

Thank you for the patch and the updated information. Cheers!

@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants