Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add index for range scans on views #95770

Closed
maryliag opened this issue Jan 24, 2023 · 13 comments
Closed

add index for range scans on views #95770

maryliag opened this issue Jan 24, 2023 · 13 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team

Comments

@maryliag
Copy link
Contributor

maryliag commented Jan 24, 2023

Add the ability to add/use index for range scans on virtual tables.

We have a feature where we display the list of fingerprints used per index, and to be possible we do a select such as
SELECT * FROM system.statement_statistics WHERE "1@1"::jsonb <@ indexes_usage
but we have to use the system table, since this is not possible on the crdb_internal.statement_statistics virtual table.
We don't let non-admin users to do selects on system tables, so using the crdb_internal table without the index is very inefficient.

Jira issue: CRDB-23726

@maryliag maryliag added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team labels Jan 24, 2023
@cucaroach
Copy link
Contributor

What's the level of urgency of this?

@cucaroach
Copy link
Contributor

Possibly related: #74806

@maryliag
Copy link
Contributor Author

What's the level of urgency of this?

The feature mentioned on the description (listing fingerprints used by index) will have to be hidden to non-admins until this task is completed, so the sooner the better.

Feel free to close this one in favour of the task you tagged on your message.

@ajwerner
Copy link
Contributor

Another possible approach could be for us to:

  1. Implement set-returning functions (we're planning to do this)
  2. Implement the SECURITY DEFINER feature on functions so that the function can run as its owner

Then we could have the function do the nice query parameterized as we want it and then grant access to the function somehow via a role?

cc @mgartner and @chengxiong-ruan

@mgartner
Copy link
Collaborator

Can we not already do the exact same thing as @ajwerner describes but with views? Or is it impossible to create a role that can access a view but not the tables it references?

@ajwerner
Copy link
Contributor

What is the privilege model here? I assumed it was complex, but if it's simple, views should do the trick.

maryliag added a commit to maryliag/cockroach that referenced this issue Jan 26, 2023
The get the list of fingerprints used by an index,
we use the `system.statement_statistics` directly, but
non-admins don't have access to system table.
Until cockroachdb#95756 or cockroachdb#95770 are done, we need to hide
this feature for non-admins.

Part Of cockroachdb#93087

Release note (ui change): Hide list of used fingerprint per index
on Index Details page, for non-admin users.
craig bot pushed a commit that referenced this issue Jan 26, 2023
94153: sql: Remove type annotations from `column_default` in `information_schema.columns` r=ZhouXing19 a=e-mbrown

Informs: #87774

The type annotation on `column_default` caused confusion in postgres compatible apps. This change formats the `column_default` without the type annotation. It does not address the incorrect type annotation.

Release note (bug fix): The content of `column_default` in `information_schema.columns` no longer have type annotations.

95855: backupccl: run restore/tpce/32tb/aws/nodes=15/cpus=16 once a week r=benbardin a=msbutler

This patch adds the restore/tpce/32TB/aws/nodes=15/cpus=16 test to our weekly
test suite. The backup fixture was generated by initing a tpce cluster with 2
Million customers, and running this workload while 48 incremental backups were
taken at 15 minute increments. The roachtest restores from a system time
captured in the 24th backup.

The cluster used to restore the backup will run on aws with 15 nodes each with
16 vcpus.

To verify this test exists, run:
`roachtest list tag:weekly`

Release note: None

Epic: none

95861: kv: remove lastToReplica and lastFromReplica from raftMu r=nvanbenschoten a=nvanbenschoten

Extracted from #94165 without modification.

----

In 410ef29, we moved these fields from under the Replica.mu to under the Replica.raftMu. This was done to avoid lock contention.

In this commit, we move these fields under their own mutex so that they can be accessed without holding the raftMu. This allows us to send Raft messages from other goroutines.

The commit also switches from calling RawNode.ReportUnreachable directly in sendRaftMessage to using the more flexible unreachablesMu set, which defers the call to ReportUnreachable until the next Raft tick. As a result, the commit closes #84246.

Release note: None
Epic: None

95876: sql: remove round-trips from common DISCARD ALL r=ajwerner a=ajwerner

#### sql: avoid checking if a role exists when setting to the current role

This is an uncached round-trip. It makes `DISCARD` expensive.
In #86485 we implemented
`SET SESSION AUTHORIZATION DEFAULT`, this added an extra sql query to
`DISCARD ALL` to check if the role we'd become exists. This is almost
certainly unintentional in the case where the role we'd become is the
current session role. I think this just happened because of code
consolidation.

We now no longer check if the current session role exists. This removes
the last round-trip from DISCARD ALL.

#### sql: use in-memory session data to decide what to do in DISCARD

In #86246 we introduced logic to discard schemas when running DISCARD ALL and DISCARD TEMP. This logic did expensive kv operations unconditionally; if the session knew it had never created a temporary schema, we'd still fetch all databases and proceed to search all databases for a temp schema. This is very expensive.

Fixes: #95864

Release note (performance improvement): In 22.2, logic was added to make
`SET SESSION AUTHORIZATION DEFAULT` not a no-op. This implementation used
more general code for setting the role for a session which made sure that
the role exists. The check for whether a role exists is currently uncached.
We don't need to check if the role we already are exists. This improves the
performance of `DISCARD ALL` in addition to `SET SESSION AUTHORIZATION
DEFAULT`.


Release note (performance improvement): In 22.2, we introduced support for `DISCARD TEMP` and made `DISCARD ALL` actually discard temp tables. This implementation ran expensive logic to discover temp schemas rather than consulting in-memory data structures. As a result, `DISCARD ALL`, which is issued regularly by connection pools, became an expensive operation when it should be cheap. This problem is now resolved.

95997: ui: hide list of statement for non-admins r=maryliag a=maryliag

To get the list of fingerprints used by an index, we use the `system.statement_statistics` directly, but non-admins don't have access to system table.
Until #95756 or #95770 are done, we need to hide
this feature for non-admins.

Part Of #93087

Release note (ui change): Hide list of used fingerprint per index on Index Details page, for non-admin users.

95998: roachtest: update activerecord blocklist r=ZhouXing19 a=andyyang890

This patch removes a few tests from the blocklist that do not
exist in the test suite for rails 7.0.3, which were being run
prior to the fix for version pinning.

Informs #94211

Release note: None

96003: kv: add MVCC local timestamp details to ReadWithinUncertaintyIntervalError r=nvanbenschoten a=nvanbenschoten

This PR adds details about MVCC local timestamps to ReadWithinUncertaintyIntervalErrors. This provides more information about the cause of uncertainty errors.

The PR also includes a series of minor refactors to clean up this code.

Release note: None
Epic: None

Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: maryliag <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
@michae2
Copy link
Collaborator

michae2 commented Feb 7, 2023

[sql queries triage meeting] @maryliag would using views work, as mentioned above?

@maryliag
Copy link
Contributor Author

maryliag commented Feb 7, 2023

I don't understand how that works, for example, what I would need to do to make a select as
SELECT * FROM crdb_internal.statement_statistics WHERE "1@1"::jsonb <@ indexes_usage ?

@ajwerner
Copy link
Contributor

ajwerner commented Feb 7, 2023

Okay, so the summary of the problem as I understand it is:

  1. We have a view which today combines the data from the system table and the in-memory data (crdb_internal.statement_statistics).
  2. We have a good index on the system table for the query we'd like to write.
  3. What we don't have is a way to push the query bounds into the in-memory data, so we end up materializing all of it.
  4. Maybe we don't care about the in-memory data, and we only care about the durable data, so we'd like to just scan the system table
  5. End users can't just scan the system table because they don't have sufficient privileges.

The proposed solution is:

  • Create a virtual view which wraps just querying the system table.
  • Query that view.

The advantage of this over the system table is that you can grant privileges to these virtual tables now. An example of such a a wrapping is this:

--- a/pkg/sql/crdb_internal.go
+++ b/pkg/sql/crdb_internal.go
@@ -182,6 +182,7 @@ var crdbInternal = virtualSchema{
                catconstants.CrdbInternalTablesTableID:                      crdbInternalTablesTable,
                catconstants.CrdbInternalClusterTxnStatsTableID:             crdbInternalClusterTxnStatsTable,
                catconstants.CrdbInternalTxnStatsTableID:                    crdbInternalTxnStatsView,
+               catconstants.CrdbInternalStatementStatisticsTableID:         crdbInternalStatementStatisticsView,
                catconstants.CrdbInternalTransactionStatsTableID:            crdbInternalTransactionStatisticsTable,
                catconstants.CrdbInternalZonesTableID:                       crdbInternalZonesTable,
                catconstants.CrdbInternalInvalidDescriptorsTableID:          crdbInternalInvalidDescriptorsTable,
@@ -6297,6 +6298,34 @@ CREATE TABLE crdb_internal.cluster_transaction_statistics (
        },
 }
 
+var crdbInternalStatementStatisticsView = virtualSchemaView{
+       schema: `CREATE VIEW crdb_internal.statement_statistics2 AS SELECT
+                       aggregated_ts,
+                       fingerprint_id,
+                       transaction_fingerprint_id,
+                       plan_hash,
+                       app_name,
+                       node_id,
+                       agg_interval,
+                       metadata,
+                       statistics,
+                       plan,
+                       index_recommendations FROM system.statement_statistics`,
+       resultColumns: colinfo.ResultColumns{
+               {Name: "aggregated_ts", Typ: types.TimestampTZ},
+               {Name: "fingerprint_id", Typ: types.Bytes},
+               {Name: "transaction_fingerprint_id", Typ: types.Bytes},
+               {Name: "plan_hash", Typ: types.Bytes},
+               {Name: "app_name", Typ: types.String},
+               {Name: "node_id", Typ: types.Int},
+               {Name: "agg_interval", Typ: types.Interval},
+               {Name: "metadata", Typ: types.Jsonb},
+               {Name: "statistics", Typ: types.Jsonb},
+               {Name: "plan", Typ: types.Jsonb},
+               {Name: "index_recommendations", Typ: types.StringArray},
+       },
+}
+

@maryliag
Copy link
Contributor Author

maryliag commented Feb 7, 2023

To the point 4: I'm only using the system table because there is no way to do that on the combined view. My goal is to do that select on the combined view, because I do care with the latest data (that only exists in-memory)

The point number 5 can be done as a workaround, until I can make the call to the combined view.

@ajwerner
Copy link
Contributor

ajwerner commented Feb 7, 2023

Great. This issue does seem fitting as written for point 4. We need to have more advanced predicate push down into virtual tables and virtual indexes to realize the vision.

@yuzefovich
Copy link
Member

In a recent escalation, we had a system.statement_statistics table with 2.9TiB of data, and queries that power the SQL Activity page were performing a full scan of that table. Fixing this issue would have (probably) helped there and perhaps made the escalation be resolved faster.

@mgartner
Copy link
Collaborator

@maryliag When we spoke IRL I believe you mentioned that you are not longer combining the in-memory data with persisted data because fanning out to each node to retrieve the in-memory data was too slow. I'm going to close this issue, because I don't think there's anything to be done here, but let me know if I'm mistaken.

@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team
Projects
Archived in project
Development

No branches or pull requests

6 participants