[CT-202] Workaround for some limitations due to `list_relations_without_caching` method #513

danfran · 2021-09-21T15:29:28Z

Describe the feature

I am currently facing an issue using DBT with Spark on AWS/Glue/EMR environment as discussed already in dbt-labs/dbt-spark#215 (but already raised here dbt-labs/dbt-spark#93).

The current issue is about the adapter's method list_relations_without_caching:

https://github.com/dbt-labs/dbt/blob/HEAD/core/dbt/include/global_project/macros/adapters/common.sql#L240

which in the Spark Adapter implementation is:

https://github.com/dbt-labs/dbt-spark/blob/a8a85c54d10920af1c5efcbb4d2a51eb7cfcad11/dbt/include/spark/macros/adapters.sql#L133-L139

In this case you can see that the command show table extended in {{ relation }} like '*' is executed. It will force Spark to go through all the tables info in the schema (as Spark has not Information Schema Layer) in a sort of "discovery mode" and this approach produces two main issues:

Bad performance: some environments can have hundreds or even thousands of tables generated not only by DBT but also by other processes in the same schema. In that case this operation can be very costly, especially when you have different DBT processes that run some updates at different times on a few tables.
Instability, as I verified in AWS/Glue/EMR environment, where you can have views without "S3 Location" defined, like an Athena/Presto view, that will make crash a DBT process running SparkSql on EMR with errors like:

show table extended in my_schema like '*'
  ' with 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:00:22 INFO SparkExecuteStatementOperation: Running query with 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:01:03 INFO DAGScheduler: Asked to cancel job group 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:01:03 ERROR SparkExecuteStatementOperation: Error executing query with 62bb6394-b13a-4b79-92dd-2e0918831cf3, currentState RUNNING,
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Can not create a Path from an empty string

Describe alternatives you've considered

I do not see the reason why DBT process should care of the "rest of the world" like the Athena views from before or tables created from other processes that are in the same schema.

So I can think ideally to replace the method:

show table extended in <schema> like '*'

with something like:

show table extended in <schema> like ('<table1>|<table2>|…')

where my <table1>, <table2>, etc. are determined automatically when I run a command like

dbt run --models my_folder

where my_folder contains the files: table1.sql, table2.sql, etc

but from the current method interface, only the schema params can be passed.

Two questions here:
How can I infer automatically the name of the tables involved when a command like dbt run --models my_folder run and how can I pass them eventually to the list_relations_without_caching?

Additional context

I found it relevant for Spark on AWS environment but can be potentially a similar issue for other implementations.

Who will this benefit?

On DBT's slack channel I talked to another used "affected" by similar issue, but probably whoever is going to use Spark in distributed environment can be affected by this (AWS and non).

Are you interested in contributing this feature?

Sure, both coding and testing.

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2021-10-13T10:00:25Z

@danfran Thanks for opening the issue, and apologies for getting back to you here (and over in dbt-labs/dbt-spark#215).

Background

The reason we use show table extended in <schema> like '*' is to build the adapter cache at the start of every run. This is how dbt can know which models already exist, and (crucially) whether they exist as a view or table, so that subsequent calls to get_relation within materializations (e.g. here and here) can be cache lookups rather than redundant, slow introspective queries.

We found we need to use show table extended in <schema> like '*', instead of the much less verbose (and more performant) show tables in <schema>, because show tables does not include the relation type (view or table), only whether the object is temporary. On other databases, this kind of metadata is easy and fast to access, but Apache Spark lacks a real information schema.

(A neat side effect of having to use show table extended in <schema> like '*' is that we can resolve get_columns_in_relation from the adapter cache, when available, which we cannot do on other adapters. This is purely a bonus, though, and it would be worth it to give up this capability in exchange for faster, more reliable cache construction.)

I know that show table extended in <schema> like '*' can be very slow, especially when accessing metadata for schemas with thousands of objects. This feels especially frustrating when most of those objects are not relevant to dbt! As a best practice, we highly recommend that you create dbt objects in a dedicated schema, separate from other processes. I understand it isn't always possible to isolate source tables, but this only impacts catalog querying (docs generate) rather than cache construction (run etc), since schemas with sources are catalogued but not cached.

Proposed change

Okay, now that we've gotten that background out of the way—I think you're onto something here:

So I can think ideally to replace the method:

show table extended in <schema> like '*'

with something like:

show table extended in <schema> like ('<table1>|<table2>|…')

I think this might just be doable!

There's a base adapter method, _get_cache_schemas, that the adapter uses to identify which schemas it ought to cache. That list is derived from the enabled, non-ephemeral relations in the project. Normally, the list includes identifier-less relations, because we just need to run a single metadata query for the entire schema. The list of schema-only identifiers is then passed into the list_relations_without_caching method, i.e. the list_relations_without_caching macro, i.e. show table extended in <schema> like '*'.

But what if we instead passed in a schema relation that did have an identifier? That identifier could look like '<table1>|<table2>|…', and the list_relations_without_caching macro could use it to run a more targeted metadata query.

Any methods defined in BaseAdapter can be overridden within SparkAdapter, so even though _get_cache_schemas isn't defined there now, we can reimplement it.

Here's my rough attempt:

    def _get_cache_schemas(self, manifest: Manifest) -> Set[BaseRelation]:
        """Get the set of schema relations that the cache logic needs to
        populate. This means only executable nodes are included.
        """
        # the cache only cares about executable nodes
        relations = [
            self.Relation.create_from(self.config, node)  # keep the identifier
            for node in manifest.nodes.values()
            if (
                node.is_relational and not node.is_ephemeral_model
            )
        ]
        # group up relations by common schema
        import collections
        relmap = collections.defaultdict(list)
        for r in relations:
            relmap[r.schema].append(r)
        # create a single relation for each schema
        # set the identifier to a '|' delimited string of relation names, or '*'
        schemas = [
            self.Relation.create(
                schema=schema,                
                identifier=(
                    '|'.join(r.identifier for r in rels)
                    # there's probably some limit to how many we can include by name
                    if len(rels) < 100 else '*'
                )
            ) for schema, rels in relmap.items()
        ]
        return schemas

Then, the macro becomes:

{% macro spark__list_relations_without_caching(relation) %}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    show table extended in {{ relation.schema }} like '{{ relation.identifier or "*" }}'
  {% endcall %}

  {% do return(load_result('list_relations_without_caching').table) %}
{% endmacro %}

That works for me locally:

show table extended in dev_jerco like 'my_first_dbt_model|my_second_dbt_model'

Next steps

What do you think of an approach like that? Would you be interested in taking that and running with it?
Should we also try to do something similar for _get_catalog_schemas, to speed up docs generate?
I'm going to transfer this issue to the dbt-spark repo, since (I believe) that's where we can make all the code changes we need.
Semi-related: In the process of teasing apart this functionality, I noticed that the add_schema_to_cache method doesn't seem to be called/used anywhere. It might be a holdover from a long time ago. It's always a good idea to remove totally unused code.

danfran · 2021-10-14T13:11:48Z

Hi @jtcohen6 thank you for the insights, very useful! I am more than happy to try what you suggested as seems working around the issue in a good way. I will also try to apply your other suggested points.

ssimeonov · 2022-01-12T00:24:27Z

@danfran I am curious to know whether this worked for you.

danfran · 2022-01-17T11:49:01Z

@ssimeonov sorry for the late answer. Unfortunately my work conditions are changed since then and I was not able to test it in my original environment. However the solution proposed by @jtcohen6 still seems valid. If you are in an AWS environment and you want to test it (included the Athena view / EMR-Hive conflict that makes this fix even more useful) you need just a few tables in order to reproduce the issue and create a valid environment.

jeremyyeo · 2022-01-18T08:50:28Z

@jtcohen6 is there a strictly macro override way (without getting into python) to make the macro do as you suggested? I plonked something like this into the /macros folder but didn't quite work (did a few variations, including having if execute around {% for node in).

{% macro spark__list_relations_without_caching(relation) %}
  {% set rels = [] %}
  {% for node in graph.nodes.values() | selectattr("resource_type", "equalto", "model") %}
      {% do rels.append(node.fqn[1]) %}
  {% endfor %}

  {% if rels | length > 1 %}  
    {% set suffix = rels | join('|') %}
  {% else %}
    {% set suffix = '*' %}
  {% endif %}

  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    show table extended in {{ relation }} like {{ suffix }}
  {% endcall %}
  {% do return(load_result('list_relations_without_caching').table) %}
{% endmacro %}

I'm guessing surely not because you provided a _get_cache_schemas() function + macro method 😁

jtcohen6 · 2022-01-18T09:32:14Z

@jeremyyeo I think this approach would indeed require implementing _get_cache_schemas(), i.e. a change to the python methods in dbt-spark—so there's no way to test this out with purely user-space code in the meantime.

An alternative approach that either I just remembered, or which just occurred to me, in the threads where we're discussing this internally:

The reason we initially opted (back in Saner approaches to getting metadata for Relations dbt-spark#49) for show table extended (verbose) over show tables (concise) is because the concise version lacks a field/specifier that enables us to tell between views and tables. (Contrary to the nomenclature, show tables includes both views + tables.)
We could look into running two concise queries rather than one verbose one, by running (for each relevant database) show views in <database> + show tables in <database>. Then, we can reasonably infer that every object returned by the former is a view, and every object returned by the latter and not in former is a table.
This bare-bones approach would be enough for the initial adapter cache that powers materializations—though it wouldn't let us do incidentally clever things like returning get_columns_in_relation from the cache if available.

It might be worth experimenting with both approaches, and seeing which one yields greater benefits. Does the slowness have more to do with the verbosity of show table extended (contextual information we don't strictly need)? Or more to do with trying to return metadata for thousands of extraneous objects that dbt doesn't care about, but which happen to live alongside relevant objects in the same database/schema?

ssimeonov · 2022-01-19T04:28:37Z

Does the slowness have more to do with the verbosity of show table extended (contextual information we don't strictly need)?

Yes. SHOW TABLES runs quickly.

jtcohen6 · 2022-02-04T14:44:41Z

After speaking a bit more about this with @superdupershant:

show tables should indeed be very fast, but show views is likely to be significantly slower at scale
Even though show table extended requires pulling out more metadata than the simple show command, filtering on the table name (like 'table_one|table_two|...') is more likely to yield a significant speedup

There are other performance boosts that the dbt-databricks team is planning to take advantage of, enabled by the specificity of their underlying architecture. That feels like an appropriate approach for our two plugins going forward.

ssimeonov · 2022-02-04T15:11:38Z

This aligns with our experience @jtcohen6.

Forgive my unfamiliarity with DBT architecture: how do capabilities "degrade" based on the quality of metadata available about assets a data production depends on?

I can think of (at least) several levels:

Existence (SHOW TABLES can solve this)
Schema information
Lineage information (view V depends on table W and view X, which in turn depends on table Y)
Data freshness information (things like last data update time)

The reason for the question is that, while the Spark metastore is quite dumb, Delta tables have queryable history and recognizable versioned identifiers (which could be used as an indication of immutable data state).

Further, one could design a simple open metadata standard (say, a single property containing some JSON, implemented via managed table properties in Spark, for example) for getting the metadata DBT needs for advanced functionality.

jtcohen6 · 2022-02-04T15:33:47Z

@ssimeonov It's a really fair question!

Existence (+ relation type) is the only information dbt needs for all relations before running, to populate its runtime cache. As each model is materialized, it updates the cache.
Schema information is pulled for post-run metadata ("catalog"), and it can be accessed during the run via well-understood methods (get_columns_in_relation) if needed to power dynamically templated model SQL
dbt infers lineage information by parsing a project — those ref() + source() calls. This doesn't require a database connection at all. However, for some databases (such as "bound" views in Postgres), dbt also queries for inter-object dependencies, so that its runtime cache can correctly handle the implications of drop ... cascade. That's less relevant on most analytical databases, including Spark/Databricks.
Data freshness information, for sources, is currently powered by a separate task — dbt source freshness. This is comparable to docs generate, which generates the "catalog" described in (2). It also checks freshness naively (select max(timestamp_col)) — I know that, for Delta tables, it would make a lot of sense to access native metadata to answer the same question.

In the future, I could see integrating both types of information (freshness + catalog) into other dbt tasks. Overall, I think dbt only stands to benefit from Delta tables that can provide all types of metadata more readily, quickly, and reliably.

ssimeonov · 2022-02-04T15:38:55Z

Then the path forward seems to be the implementation of your original idea: to generate the like 'table_one|table_two|...' from DBT's project information, is that right?

jtcohen6 · 2022-02-05T12:23:25Z

Yes, I think that's right. Let's aim to include this change in a v1.1 beta, and ask folks to test it out in their own projects to ensure it yields the desired speedups.

cc @VersusFacit: This is another cache-related performance question, similar to dbt-labs/dbt-snowflake#83, where benchmarking (and anticipating edge cases) will be the name of the game.

The change proposed in this issue will significantly help in cases where dbt manages some objects, in large databases that also contain thousands of non-dbt objects. If we want to speed up cases where dbt manages thousands of objects, and a given invocation is only selecting a handful to run, we'll need to tackle the larger question of selection-based caching: dbt-labs/dbt-core#4688

crystalro0 · 2022-03-10T22:59:02Z

Adding another customer onto this issue. Seeing dbt docs generate taking up to an hour or more on job runs.
Please msg if more info is required.

ryanClift-sd · 2022-03-14T19:04:57Z

Hoping this change can get added into 1.1 as part of the base code!

It would be a significant benefit for us as we are using Managed Workflows for Apache Airflow (MWAA) and are unable to modify the core code as referenced here #513.

With AWS MWAA we provide the requirements.txt (dbt-core==1.0.0) but have no access to the shell to modify the core code after the fact and show table extended in <schema> like '*' continues to take longer and longer and with our implementation show table extended is run for every model step.

Our implementation means we don't do a single dbt run --select <project> as we wanted control over re-running failed models and that control to be handled via restarting the task inside airflow.
See https://www.astronomer.io/blog/airflow-dbt-1/ "A Better, Better Way"

Anyways, looking forward to changes that speed up "show table extended" for spark schema objects.

jtcohen6 · 2022-03-18T09:13:19Z

dbt uses show table extended like '*' for two purposes today:

To cache relations at the start of each run, whether for one model or many
To access rich metadata during catalog generation (docs generate)

Cache population: We're tracking the investigation into the two Spark-specific possibilities noted above in dbt-labs/dbt-spark#296. There are two branches with sample/experimental code, if you're able to try them out and let us know what you find. We're also thinking about cache speedups via dbt-core changes, e.g. dbt-labs/dbt-core#4688.

Catalog generation: As a separate question, I think it would make a lot of sense to pass specific table info (show table extended like 'one|two|three|...) when grabbing metadata from the catalog. The stakes of missing an object are lower, and source tables/schemas are included. This will require code changes different from the ones sketched out above / in the branches linked by dbt-labs/dbt-spark#296. I've opened a new issue to track that work: dbt-labs/dbt-spark#300

ryanClift-sd · 2022-03-21T10:55:22Z

@jtcohen6
Thank you for the quick adjustments and 2 branch options!
Sorry for the delay, testing took longer than I thought.

I was able to complete testing and below are my findings and some related context around the testing.
Additional Context & Final Thoughts to follow below testing/findings.

TL;DR

I'm definitely leaning towards Option2 as well
- For everyone not interested in the entire thing haha
  - I know, holy wall of text, batman

Testing/Findings

All Testing / Findings do not include cluster startup time.
- For every test of DBT Compile & DBT Run we manually started the cluster
  - once it was started & available then testing commenced
    - This was so no cluster startup time is included in the metadata calls timing test
All testing was done on an i3.2xlarge cluster size (61GB of memory, 4 CPUs)
- 1 Worker
- 1 Driver

dbt-spark v1.0.0 Code

show table extended in <schema> like '*'

In schemas that contain a few objects, performance concerns aren't really existent; overhead is negligible
- As objects in the schema grow, even if all are related to a DBT Project, this command starts to take longer
- Spark Cluster size makes no difference here for execution time of show table extended in <schema> like '*'
Executing a single command at the beginning of DBT Run to populate the cache or catalog
- Cache
  - As schemas grow that are quite large definitely slow down
  - schemas may contain objects not relevant to the current DBT Project or any DBT Project at all
    - In some cases many more objects, thereby limiting the efficacy of this command/process
- Catalog
  - dbt docs generate can suffer from this single call, especially in the schema/project cases above
  - Testing on spark shows the show table extended in <schema> like '*' can also be quite lengthy inside schemas that are quite large and/or contain many objects not related to the project/model docs is being generated for/on
I have attached a Screenshot showing the execution timings of show table extended in <schema> lilke '*' from the spark cluster as well as DBT command line for both DBT Compile & DBT Run
- click to enlarge for better quality and readability

dbt-spark wip-faster-caching-option1

show table extended in <schema> like <relation object|relation object|relation object|....>
or as I will refer to it
show table extended in <schema> like <list>

For any project in any schema this is a great step & more targeted metadata query
- Only models related to the DBT Project are added to the like <list> for enumeration
Executing a subset of models or even a single model in the project
- all models in the project are still added to the like <list> enumeration
  - Meaning, if we executed just our staging folder or a single model.sql
    - all 1,578 models are added to the like <list>
like <list> "limit" testing - Tested with 1,578 Models
- No rejection from the spark cluster that the like <list> was to large
  - This is our current project size, it doesn't mean that 1,579 will error, but it could
Execution timing on this command was less than 3 minutes for the 1,578 models
- For Smaller Projects inside schemas that have objects not related to the DBT Project, this would be significantly faster metadata query
- The schema this was tested in has 2,048 Tables and ~1,578 are related to the DBT Project
Definitely faster and more targeted than Current v1.0 Code Base
- Again, executing a subsection of the Project, or a single model still incurs the lookup of all objects
I have attached a Screenshot showing the execution timings of show table extended in <schema> like <list> from the spark cluster as well as DBT command line for both DBT Compile and DBT Run
- click to enlarge for better quality and readability

dbt-spark wip-faster-caching-option2

show tables in <schema>
show views in <schema>

followed by describe extended <objects> for any "refs" inside the model.sql

Show Tables & Show Views in <schema> is lightning fast
- Both commands on the schema in question execute in less than 10 seconds combined
- describe extended <object> is quick as well
  - milliseconds execution for any single object
DBT Run though suffers in terms of overall execution compared to Option1 due to the describe event occurring every model
DBT Compile is the command that suffers the most with this approach
- DBT Docs would as well due to needing to run describe extended 1 by 1 for these commands compared to like <list>|'*'
- describe extended <object> doesn't provide whether the column is nullable or not
  - data availability inconsistency between show table extended in <schema> like <list>|'*' and describe extended <object>
In regards to some principal concerns here: [CT-302] [Spike] Benchmark perf for show tables, show views dbt-spark#296 (comment)
- Tested/Verified External Object creation to make sure it still created Delta objects "as "
- Tested/Verified that "is_incremental()" models still executed properly
- Tested/Verified that a merge statement was used
  - Our dbt_project.yml absolutely has +file_format: delta set though, maybe that's entirely why this passed my testing
I have attached a Screenshot showing the execution timings of show tables in <schema> & show views in <schema> from the spark cluster as well as DBT command line for both DBT Compile and DBT Run

Additional Context

How we are executing DBT Projects/Models at Slickdeals

I understand this might not be the recommended or normal approach.
Below is an example Screenshot of our implementation and notes following

We have farmed out the execution of every single model to a single Airflow Task where every task calls DBT Run --select model.sql
- model.sql is every individual model in the graph
- Allows us to let Airflow handle execution state tracking
  - I know you are all looking at this/state tracking and re-running the DAG
    - Add pseudo selectors that select models based on artifact states dbt-core#2465
  - Leaves DBT to handle code creation and not have to be the orchestrator as well
    - You all support executing dbt commands via the bash operator, just maybe not as granular as we are
      - Running dbt in Production - Airflow
- Allows us to parallelize execution of the entire graph at a very low cost/overhead via MWAA
  - other than the core calls that are executed for every "DBT Run" call
    - This is absolutely what is crushing our execution times & environment
    - we end up having to call show table extended in <schema> like '*' for every model (nearly 900+ times currently)
We have a limited DAG capacity though using MWAA
- why this project isn't entirely broken down into many projects
- plus all models in this project handle applying replicated data transactions to our Datalake from on premise via the same code template used to generate all the models

I totally understand as well that our implementation might not be normal or recommended and is part, if not entirely, the reason we are experiencing problems. Of course, none of these are your problems but I greatly appreciate the work being put into making caching faster overall

Reasons why we went the above Implementation route

Execution State Tracking to restart failed models
- There may be changes upstream or unexpected data that flows through the system or schema changes that we are not aware of and we wanted the ability to restart the DAG where it failed while letting other tasks to continue to execute and eventually complete their execution path.
Parallelization of task execution
- As we are migrating towards cloud infrastructure, new tools and building an entirely new platform, we needed to be able to execute 100's of steps in parallel as a lot of our DAGs only have limited dependencies and take the view of eventual consistency
  - I love FK's and ensuring referential consistency but as items get more and more distributed that becomes a "secondary" task - say DBT Test
  - Current on premise "DAGs" run 100's of steps in parallel and we needed like for like

Final Thoughts

show table extended in <schema> like '*'

Because our implementation calls DBT Run every time, we incur this cost every time
This command run by itself this call can take 5-8+ minutes, which isn't terrible really
- our implementation makes it terrible sadly
Regardless of implementation though, this could be more a more targeted cache query, especially for others use cases as well as schemas that contain objects unrelated to DBT DAG processing

show table extended in <schema> like <list>

I like it overall for DBT Compile & DBT Docs
Potential serious draw back due to the could get too long to be executed and error
For DBT Run it's still faster than like '*'
I think this fits the methodology that is recommended for executing DBT DAGs
- probably how it is normally executed by most using DBT
  - Could be wrong though
Eventually the list will get so long where performance could "suffer" again?
- Not sold on this, but maybe

show tables in <schema>
show views in <schema>
followed by describe extended <objects> for any "refs" inside the model.sql

I love this method overall, especially for our implementation!
- In fact I'm trying to get it into a python wheel and uploaded to pypi so we can install it to fix some of our current production issues
- I understand that this is purely a work in progress and there might be other drawbacks were missing or enabling by using this methodology, especially asap.
  - I understand all code is "as is" etc
This is of course the most targeted metadata query gathering and ideal for how we have implemented DBT
As you've outlined here [CT-302] [Spike] Benchmark perf for show tables, show views dbt-spark#296 (comment) , this method is faster in larger models and slower in small models
- I don't want you to have to go back to a single describe extended table approach indicated in your last paragraph here [CT-279] Delta tables missing column schema in show table extended dbt-spark#295 (comment)
  - You do see a slow down vs Option1 with a single call to populate the cache and that call being more targeted
In that regards, you also mention in [CT-302] [Spike] Benchmark perf for show tables, show views dbt-spark#296 (comment) Principal risk of approach 2: is that you'd be missing is_delta
- describe extended does have a Provider attribute available, maybe that can be used to reduce this risk?
- Not saying that you don't have to run an additional query describe extended <object> to get these results, but could those then be cached?
  - Thereby allowing you to still do cache lookup checks?
    - Just thinking out loud here, maybe you still cannot and must run the describe extended <object> again

I don't want you to have to finagle support for both Option1 and Option2

Overall I'm leaning towards Option2 heavily.
- Slightly biased by the significant benefit it would have for our implementation of course
- I still feel that option 2 overall is more targeted
  - sadly the describe extended call must be accepted and that might be a sort of return back to the slower approach that you mention in your last paragraph here [CT-279] Delta tables missing column schema in show table extended dbt-spark#295 (comment)
Our implementation would still be faster with Option1
- Option2 is exactly what we'd want and would like to with from our perspective
  - Open Source being what it is and all
I feel like Option 2 is the most performant
I'm definitely leaning towards Option2 as well

Again, thank you very much for the two options, and to everyone if you got this far kudos!

spenaustin · 2022-11-01T21:55:39Z

Very interested in this fix -- @jtcohen6 @TalkWIthKeyboard do you think there's anything I could do to help with dbt-labs/dbt-spark#433, or do you think it's best to let y'all handle it?

boxysean · 2022-11-09T10:11:15Z

I'm working with a dbt Cloud customer who have an existing, long-lived Spark cluster and are keen to try dbt Core/Cloud with it. They have 10,000+ objects in Spark and are experiencing this issue. They are following along to see its resolution!

theromsh · 2023-02-01T16:30:56Z

Is there any progress with a fix?
The solution jtcohen6 suggested really benefits our use-case...
Is there any estimations?

ssimeonov · 2023-02-01T17:36:28Z

Per @smtaylorslc @theromsh @boxysean I'll add that our team has moved away from dbt Cloud because of this problem.

tinolyu · 2023-02-14T01:23:23Z

I ran into the same issue, it took ~1hr to run dbt docs generate. Is there any fix for this?

VShkaberda · 2023-07-28T14:35:20Z

While waiting for this to be implemented I decided to use @ryanClift-sd 's dbt-spark wip-faster-caching-option1 and stuck with the further problem: it seems that dbt-core logic is broken and schema list isn't being generated for sources. While _get_cache_schemas() correctly generate list from manifest, we end up in list_relation() for the sources
claiming that

we can't build the relations cache because we don't have a manifest so we can't run any operations.

even thought the manifest definitely exists because it is being used by _get_cache_schemas().
After adding some logging:

I'm not sure whom this problem should be addressed to.

ryanClift-sd · 2023-07-28T17:49:58Z

I assume it's probably related to changes in made after dbt-core/dbt-spark 1.0. When we had implemented this solution we were only using dbt-core 1.0 and dbt-spark 1.0 & never tested anything above that version. We stuck with dbt-core/dbt-spark 1.0 until very recently.

We have since upgraded to dbt-core 1.5 and luckily were able to move to dbt-databricks vs dbt-spark.
Our implementation wasn't a self hosted spark instance, or aws glue spark, but databricks proper. At the time of dbt-core 1.0 the additional dbt-databricks package didn't exist.

dbt-databricks solved our issues with the relation retrieval and supported unity catalog which we were starting to leverage and needed to use in our dbt projects

pan3793 · 2023-11-20T02:33:29Z

I would like to report another issue on show table extended in schema_a like '*'.

We use the OSS HMS with Spark3, and there are some Kudu, HBase tables registered in HMS, but there are no correspondingly serde jars under the Spark application classpath, thus the above query would cause failure like

Spark adapter: Database Error
  Error operating ExecuteStatement: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.kudu.KuduStorageHandler
        at org.apache.hadoop.hive.ql.metadata.Table.getStorageHandler(Table.java:297)
        at org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$convertHiveTableToCatalogTable(HiveClientImpl.scala:482)
        at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$3(HiveClientImpl.scala:434)
        at scala.Option.map(Option.scala:230)
        at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$1(HiveClientImpl.scala:434)
        at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:298)
        at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
        at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
        at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:278)
        at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:432)
        at org.apache.spark.sql.hive.client.HiveClient.getTable(HiveClient.scala:95)
        at org.apache.spark.sql.hive.client.HiveClient.getTable$(HiveClient.scala:94)
        at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:92)
        at org.apache.spark.sql.hive.HiveExternalCatalog.getRawTable(HiveExternalCatalog.scala:122)
        at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$getTable$1(HiveExternalCatalog.scala:734)
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:101)
        at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:734)
        at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableRawMetadata(SessionCatalog.scala:515)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:500)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTempViewOrPermanentTableMetadata(SessionCatalog.scala:710)
        at org.apache.spark.sql.execution.command.ShowTablesCommand.$anonfun$run$43(tables.scala:859)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at org.apache.spark.sql.execution.command.ShowTablesCommand.run(tables.scala:854)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
        at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)

So, alongside the performance, I think the new approach should consider supporting table exclusion list

despewerba · 2024-01-09T17:20:21Z

We have the same issue, but our problem is not focused on performance but rather on the loss of information.
We are using iceberg's upsert functionality and, when this error occurs, dbt assumes that the models do not exist and instead of doing an upsert it executes a drop and create resulting in the loss of the previous information.
Has anyone experienced the same problem as us? @jtcohen6 do you have any thoughs on this?

vinjai · 2024-01-15T16:27:28Z

We are also facing the same issue with iceberg.
Ideally it should explicitly fail if the show tables command fails instead of retrying.

The failure is an intermittent one and get resolved in the rerun. However, there is an explicit loss of data because of the create or replace command.

We are facing this issue in dbt-spark 1.6.0

MatatatDespe · 2024-01-23T13:33:17Z

What @vinjai and @despewerba are telling its true, the same thing happened on databricks and they fixed it.

VShkaberda · 2024-02-02T11:13:41Z

The same issue leading to data loss because of failed show tables: dbt-labs/dbt-spark#950

Fokko · 2024-02-02T11:21:01Z

Iceberg does not support efficient listing because it is a V2 datasource in Spark. This will be fixed with the new 3.6 release. Until then, there is not much you can do (except making sure that you don't have too many tables in your schema).

VShkaberda · 2024-02-02T14:21:33Z

@Fokko could you please provide some additional info (did't find any similar issues at https://issues.apache.org/)? I am wondering if the problem with iceberg the same as the problem with delta tables mentioned in the linked issue above.

daniel-gomez-mng · 2024-02-19T13:26:32Z

Hello guys! I'm very interested in this topic. We have a very similar use, like the one @ryanClift-sd commented last year in #513, where we run every dbt model in an Airflow task, so for every run we need to wait 1-2 minutes in order to discover the catalog. Do you have any news on how this behavior is going to improve? Do you know if this is better with Databricks Unity Catalog?

ryanClift-sd · 2024-02-23T16:54:25Z

@daniel-gomez-mng I know for us, moving to the DBT Databricks adapter fixed our underlying spark adapter performance issues whether unity catalog or hive_metastore databricks tables. I understand this might not be an option for everyone though depending upon their environments.

daniel-gomez-mng · 2024-02-26T12:19:07Z

Hi @ryanClift-sd , we are already using the Databricks adapter and we have the same performance issues. Just to give you more info, each dbt run is executing the following queries:

Listing tables
select current_catalog()
show views (one query per schema)
describe extended table

In our case, every show views query take 40-50 seconds. So, if we take into account all models, we are spending around 16 hours per day discovering this metadata. This is very expensive and 'unnecessary'.

We are analyzing this cache options in order to avoid recovering all metadata in all models, but please, if anyone knows how to improve this it would be very helpful.

AlanCortesK · 2024-05-09T15:30:34Z

Is there any update on this issue? For me show views takes around 80 seconds, and since we are using 1-model-per-task strategy in Airflow we are having performance issues

daniel-gomez-mng · 2024-05-09T16:07:22Z

Hi @AlanCortesK , we improved this behavior by adding this config to our Spark jobs. We use SQL warehouses in Databricks and a Hive Metastore in Glue, I don't know if it also your case

spark.databricks.hive.metastore.glueCatalog.enabled true
spark.hadoop.aws.glue.cache.table.enable true
spark.hadoop.aws.glue.cache.table.size 1000
spark.hadoop.aws.glue.cache.table.ttl-mins 60
spark.hadoop.aws.glue.cache.db.enable true
spark.hadoop.aws.glue.cache.db.size 1000
spark.hadoop.aws.glue.cache.db.ttl-mins 60

rahulgoyal2987 · 2024-09-12T07:53:39Z

Recently I have fixed this issue. Improved the performance of dbt-spark performance much more faster.
Let me know if you want to share PR.
Below is the summary of implementation:
dbt-spark used to get metadata(column) of all the tables now i have made it limit to tables which are required while running the model.

MatatatDespe · 2024-09-12T12:20:09Z

Recently I have fixed this issue. Improved the performance of dbt-spark performance much more faster. Let me know if you want to share PR. Below is the summary of implementation: dbt-spark used to get metadata(column) of all the tables now i have made it limit to tables which are required while running the model.

I'd Love to see this implemented!

Mariana-Ferreiro · 2024-11-12T18:34:53Z

@rahulgoyal2987 how did you solve it? We have the same issue. Thanks

rahulgoyal2987 · 2024-12-09T06:56:37Z

@Mariana-Ferreiro
Please find pull request https://github.com/dbt-labs/dbt-spark/pull/1152/files
I found similar implementation in dbt-databricks.
Show extended tables query behavior doesn't work in parallel running model. So i have completely remove show tables extended.
I have changed it show tables for getting table and schema names. and setting table details using describe extended query in later state of function.

jtcohen6 transferred this issue from dbt-labs/dbt-core Oct 13, 2021

jtcohen6 added type:enhancement New feature request good_first_issue Good for newcomers labels Oct 13, 2021

jtcohen6 mentioned this issue Oct 13, 2021

When a table gets dropped elsewhere, analysis fails causing triggers of multiple repetitions of create table dbt-labs/dbt-spark#215

Closed

5 tasks

This was referenced Jan 26, 2022

[CT-94] Holistic tracking for perceived start-up performance dbt-labs/dbt-core#4625

Closed

Show table extended is slow dbt-labs/dbt-spark#269

Closed

How to support SQL Standard Catalogs? dbt-labs/dbt-spark#281

Closed

jtcohen6 mentioned this issue Feb 5, 2022

[CT-168] Cache objects for selected resources only? dbt-labs/dbt-core#4688

Closed

github-actions bot changed the title ~~Workaround for some limitations due to list_relations_without_caching method~~ [CT-202] Workaround for some limitations due to list_relations_without_caching method Feb 9, 2022

leahwicz mentioned this issue Feb 28, 2022

[CT-302] [Spike] Benchmark perf for show tables, show views dbt-labs/dbt-spark#296

Closed

jtcohen6 self-assigned this Mar 15, 2022

jtcohen6 removed their assignment Mar 18, 2022

jtcohen6 mentioned this issue Mar 18, 2022

[CT-385] Limit catalog generation to specific relations dbt-labs/dbt-spark#300

Closed

ryanClift-sd mentioned this issue Mar 21, 2022

Changes to the metadata query gathering for spark ryanClift-sd/dbt-spark#1

Merged

4 tasks

jtcohen6 mentioned this issue Aug 19, 2022

Dramatically faster caching dbt-labs/dbt-spark#433

Closed

4 tasks

twcardenas mentioned this issue Aug 31, 2022

[CT-1109] ParseException - mismatch input table dbt-labs/dbt-spark#446

Closed

daniel-gomez-mng mentioned this issue Feb 19, 2024

SUPER SLOW dbt run -m model_name databricks/dbt-databricks#556

Closed

luisarce83 mentioned this issue Nov 12, 2024

show table extended in {{relation.shema}} like * fails dbt-labs/dbt-spark#1099

Closed

2 tasks

mikealfare added the pkg:dbt-spark Issue affects dbt-spark label Jan 13, 2025

mikealfare transferred this issue from dbt-labs/dbt-spark Jan 13, 2025

VanTudor pushed a commit that referenced this issue Jan 14, 2025

dbt run fails with spark-testing profile of examples (#513)

c87b6b2

[CT-202] Workaround for some limitations due to list_relations_without_caching method #513

[CT-202] Workaround for some limitations due to list_relations_without_caching method #513

Comments

danfran commented Sep 21, 2021

Describe the feature

Describe alternatives you've considered

Additional context

Who will this benefit?

Are you interested in contributing this feature?

jtcohen6 commented Oct 13, 2021

Background

Proposed change

Next steps

danfran commented Oct 14, 2021

ssimeonov commented Jan 12, 2022

danfran commented Jan 17, 2022 • edited Loading

jeremyyeo commented Jan 18, 2022

jtcohen6 commented Jan 18, 2022

ssimeonov commented Jan 19, 2022

jtcohen6 commented Feb 4, 2022

ssimeonov commented Feb 4, 2022 • edited Loading

jtcohen6 commented Feb 4, 2022 • edited Loading

ssimeonov commented Feb 4, 2022

jtcohen6 commented Feb 5, 2022

crystalro0 commented Mar 10, 2022

ryanClift-sd commented Mar 14, 2022

jtcohen6 commented Mar 18, 2022 • edited Loading

ryanClift-sd commented Mar 21, 2022

TL;DR

Testing/Findings

dbt-spark v1.0.0 Code

dbt-spark wip-faster-caching-option1

dbt-spark wip-faster-caching-option2

Additional Context

How we are executing DBT Projects/Models at Slickdeals

Reasons why we went the above Implementation route

Final Thoughts

I don't want you to have to finagle support for both Option1 and Option2

spenaustin commented Nov 1, 2022

boxysean commented Nov 9, 2022

theromsh commented Feb 1, 2023

ssimeonov commented Feb 1, 2023

tinolyu commented Feb 14, 2023

VShkaberda commented Jul 28, 2023 • edited Loading

ryanClift-sd commented Jul 28, 2023 • edited Loading

pan3793 commented Nov 20, 2023

despewerba commented Jan 9, 2024

vinjai commented Jan 15, 2024

MatatatDespe commented Jan 23, 2024

VShkaberda commented Feb 2, 2024

Fokko commented Feb 2, 2024

VShkaberda commented Feb 2, 2024

daniel-gomez-mng commented Feb 19, 2024

ryanClift-sd commented Feb 23, 2024

daniel-gomez-mng commented Feb 26, 2024

AlanCortesK commented May 9, 2024

daniel-gomez-mng commented May 9, 2024 • edited Loading

rahulgoyal2987 commented Sep 12, 2024 • edited Loading

MatatatDespe commented Sep 12, 2024 • edited Loading

Mariana-Ferreiro commented Nov 12, 2024

rahulgoyal2987 commented Dec 9, 2024 • edited Loading

[CT-202] Workaround for some limitations due to `list_relations_without_caching` method #513

[CT-202] Workaround for some limitations due to `list_relations_without_caching` method #513

danfran commented Jan 17, 2022 •

edited

Loading

ssimeonov commented Feb 4, 2022 •

edited

Loading

jtcohen6 commented Feb 4, 2022 •

edited

Loading

jtcohen6 commented Mar 18, 2022 •

edited

Loading

VShkaberda commented Jul 28, 2023 •

edited

Loading

ryanClift-sd commented Jul 28, 2023 •

edited

Loading

daniel-gomez-mng commented May 9, 2024 •

edited

Loading

rahulgoyal2987 commented Sep 12, 2024 •

edited

Loading

MatatatDespe commented Sep 12, 2024 •

edited

Loading

rahulgoyal2987 commented Dec 9, 2024 •

edited

Loading