adapter.get_columns_in_relation
has extremely poor performance as the number of glue databases and tables increase
#2854
Labels
bug
Something isn't working
good_first_issue
Straightforward + self-contained changes, good for new contributors!
redshift
Describe the bug
Background
The Redshift
svv_external_*
views for describing Redshift Spectrum are essentially a wrapper over API calls to the AWS Glue Catalog. Theget-tables
endpoint requires adatabase-name
, and optionally takes a parameter,expression
REGEX pattern to match table names. Thesearch-tables
endpoint takes a JMESPathfilter
param. In either case, a query that does not have a filter onschemaname
would require Redshift to query for and retrieve every glue table for every schema available to the user querying thesvv_external_columns
view.Issue
When doing some testing to update to 0.18.1, so I'd get better external table support, I noticed some pretty significant performance issues with our dbt project. The addition of external columns support in the
adapter.get_columns_in_relation
macro is what I found to be the culprit. I discovered that the `table_schema = '{{ adapter }}' filter does not push down further than the union CTE. This means that redshift would have to make API calls searching all of the glue catalog databases, then bring them over, before unioning them to filter on the schema name. In cases where folks have a bunch of external tables, this is a pretty heavy performance concern.Steps To Reproduce
svv_external_schemas
Expected behavior
System table queries should be performant and not cause significant overhead on a dbt run. More specifically, when querying for columns in
svv_external_columns
, the where clause should filter against thesvv_external_columns.schemaname
column to ensure the number of api calls are minimized.Screenshots and log output
Explain plan of the
adapter.get_columns_in_relation
macro. Notice that the filter is only applied AFTER the union. Thesvv_external_columns
table is a wrapper over Glue API calls, so the filter needs to be put inside the CTE.System information
Which database are you using dbt with?
The output of
dbt --version
:The operating system you're using:
The output of
python --version
:Additional context
Explain Plan
Notice the
table_schema='some'
filter. It just hits the union, and never gets passed down into thesvv_external_columns
table.The text was updated successfully, but these errors were encountered: