Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pass filter/limit/order to virtual tables #95766

Closed
maryliag opened this issue Jan 24, 2023 · 8 comments
Closed

pass filter/limit/order to virtual tables #95766

maryliag opened this issue Jan 24, 2023 · 8 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. O-postmortem Originated from a Postmortem action item. T-sql-queries SQL Queries Team

Comments

@maryliag
Copy link
Contributor

maryliag commented Jan 24, 2023

Currently values for order/filter/limit are not passed on virtual tables, making the queries on them to be inefficient.
We use virtual tables on the console on several locations (e.g. crdb_internal.statement_statistics, crdb_internal.transaction_statistics), but selects on them can take a long time to be executed since filters and limit are not being passed on.
This change would allow all of sql observability queries to be more efficient and consequently the tables would take less time to load (it can take minutes when users have a lot of data).

Jira issue: CRDB-23723

@maryliag maryliag added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team labels Jan 24, 2023
@michae2 michae2 added O-postmortem Originated from a Postmortem action item. C-performance Perf of queries or internals. Solution not expected to change functional behavior. labels Feb 16, 2023
@mgartner
Copy link
Collaborator

mgartner commented Mar 2, 2023

@michae2 can you add a link in the JIRA ticket to the post-mortem document related to this issue? Can you also assign a priority and due date based on how important you think this is? If there is a status update, please leave a comment in the JIRA ticket, too.

@rytaft
Copy link
Collaborator

rytaft commented Mar 2, 2023

@maryliag my impression from recent discussions with you was that these virtual tables are not dependent on a system table, as there is a view that combines the virtual table and system table for use by other queries. Is this correct? If so, I'm not sure what SQL Queries can do to optimize purely in-memory virtual tables (i.e., we can't push filters down to an index, etc).

@maryliag
Copy link
Contributor Author

maryliag commented Mar 2, 2023

We have a view that combines [in-memory view] + [system table]
We would like to pass limits/order to that combined view.

For example

SELECT
 fingerprint_id,
 transaction_fingerprint_id,
 app_name,
 max(aggregated_ts) as aggregated_ts,
 metadata,
 crdb_internal.merge_statement_stats(array_agg(statistics)) AS statistics,
 max(sampled_plan) AS sampled_plan,
 aggregation_interval
FROM crdb_internal.statement_statistics   <--- this is the combined view with in-memory and system
WHERE app_name NOT LIKE '$ internal%' AND app_name NOT LIKE '$$ %' 
AND aggregated_ts >= $1     <--- I believe the filter is already getting passed on, so we don't do a full scan because of this
AND aggregated_ts <= $2"
GROUP BY
 fingerprint_id,
 transaction_fingerprint_id,
 app_name,
 metadata,
 aggregation_interval
ORDER BY aggregated_ts DESC LIMIT $3   <----- this here is what I'm referring to

@michae2
Copy link
Collaborator

michae2 commented Mar 2, 2023

@maryliag is this related to #95770?

@rytaft
Copy link
Collaborator

rytaft commented Mar 2, 2023

Right, so this is a different problem, which has nothing to do with virtual tables. Even if you were to remove any reference to the virtual table from the above query, we would still not be able to push the limit down. Here I've re-written this query over the system table, and you can see that we cannot push the limit down (note the top-k at the top of the query plan):

[email protected]:26257/movr> explain SELECT                                                                                    
                        ->  fingerprint_id,                                                                                  
                        ->  transaction_fingerprint_id,                                                                      
                        ->  app_name,                                                                                        
                        ->  max(aggregated_ts) as aggregated_ts,                                                             
                        ->  metadata,                                                                                        
                        ->  crdb_internal.merge_statement_stats(array_agg(statistics)) AS statistics,                        
                        ->  max(plan) AS sampled_plan,                                                                       
                        ->  agg_interval                                                                                     
                        -> FROM system.statement_statistics                                                                  
                        -> WHERE app_name NOT LIKE '$ internal%' AND app_name NOT LIKE '$$ %'                                
                        -> AND aggregated_ts >= now() - interval '1 day'                                                     
                        -> AND aggregated_ts <= now()                                                                        
                        -> GROUP BY                                                                                          
                        ->  fingerprint_id,                                                                                  
                        ->  transaction_fingerprint_id,                                                                      
                        ->  app_name,                                                                                        
                        ->  metadata,                                                                                        
                        ->  agg_interval                                                                                     
                        -> ORDER BY aggregated_ts DESC LIMIT 10  ;                                                           
                                                                                                                                                                     info
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • render
  │
  └── • top-k
      │ estimated row count: 10
      │ order: -max
      │ k: 10
      │
      └── • group (hash)
          │ estimated row count: 20
          │ group by: fingerprint_id, transaction_fingerprint_id, app_name, agg_interval, metadata
          │
          └── • filter
              │ estimated row count: 20
              │ filter: (app_name NOT LIKE '$ internal%') AND (app_name NOT LIKE '$$ %')
              │
              └── • scan
                    estimated row count: 180 (100% of the table; stats collected 2 seconds ago)
                    table: statement_statistics@primary
                    spans: [/0/'2023-03-01 22:12:00.227443+00' - /0/'2023-03-02 22:12:00.227443+00'] [/1/'2023-03-01 22:12:00.227443+00' - /1/'2023-03-02 22:12:00.227443+00'] [/2/'2023-03-01 22:12:00.227443+00' - /2/'2023-03-02 22:12:00.227443+00'] [/3/'2023-03-01 22:12:00.227443+00' - /3/'2023-03-02 22:12:00.227443+00'] … (4 more)
(22 rows)

There are a couple of reasons for this:

  1. Due to the ORDER BY clause over the aggregated column, we cannot push the limit below the GROUP BY (i.e., there is no way to know which rows must be used to satisfy the LIMIT until we have performed the aggregation max(aggregated_ts)).
  2. Even if the ORDER BY were removed, we still cannot push the LIMIT below the GROUP BY since none of the grouping columns form a prefix on the index scanned, so it's not possible to create a streaming GROUP BY.

Maybe someone else on the team has a suggestion, but I currently don't see a way for us to better optimize this query.

My suggestion would be to change the query so it can be more easily optimized, or if this is not possible, asynchronously load the data into a materialized view so that it's ready when needed.

@maryliag
Copy link
Contributor Author

maryliag commented Mar 2, 2023

is this related to #95770?

different problems, that one is for the query SELECT * FROM crdb_internal.statement_statistics WHERE "1@1"::jsonb <@ indexes_usage, which is currently not possible

@maryliag
Copy link
Contributor Author

maryliag commented Mar 2, 2023

Thank you for the explanation Becca. We will look into other improvements we can make on the query

@rytaft
Copy link
Collaborator

rytaft commented Mar 7, 2023

Great, thanks. In that case I'm going to close this issue, but please let us know if we can help brainstorm how to rewrite the query so it's possible to optimize (or feel free to open another issue).

@rytaft rytaft closed this as completed Mar 7, 2023
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. O-postmortem Originated from a Postmortem action item. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

No branches or pull requests

4 participants