
BUG - Too many partitions produced by query, allowed 4000, query produces at least 4272 partitions #39

Closed
csaroff opened this issue Sep 2, 2021 · 19 comments
Labels
bug Something isn't working

Comments


csaroff commented Sep 2, 2021

Are you a current Fivetran customer?
Yes; Chaskin Saroff, Data Engineer at Bstock

Describe the bug
Too many partitions produced by int_zendesk__field_calendar_spine and int_zendesk__field_history_pivot.

Database Error in model int_zendesk__field_calendar_spine (models/ticket_history/int_zendesk__field_calendar_spine.sql)
  Too many partitions produced by query, allowed 4000, query produces at least 4272 partitions
  compiled SQL at target/run/zendesk/models/ticket_history/int_zendesk__field_calendar_spine.sql

Database Error in model int_zendesk__field_history_pivot (models/ticket_history/int_zendesk__field_history_pivot.sql)
  Too many partitions produced by query, allowed 4000, query produces at least 4022 partitions
  compiled SQL at target/run/zendesk/models/ticket_history/int_zendesk__field_history_pivot.sql

Steps to reproduce
Have more than 4,000 days' worth of field history data in Zendesk while using BigQuery as your data warehouse.

Expected behavior
No error should occur.

Project variables configuration

# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'bstock'
version: '0.0.1'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'default'

# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
source-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
models:
  +persist_docs:
    relation: true
    columns: true
target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
    - "target"
    - "dbt_modules"
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the marketplace/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
  marketplace:
      # Applies to all files under models/marketplace/
      marketplace:
          materialized: table


vars:
    'dbt_date:time_zone': 'America/Los_Angeles'

Package Version

packages:
  - package: fishtown-analytics/dbt_utils
    version: 0.6.4
  - package: fivetran/zendesk
    version: 0.4.0
  - package: fivetran/salesforce
    version: 0.3.1
  - package: fishtown-analytics/spark_utils
    version: 0.1.0
  - package: calogica/dbt_date
    version: 0.2.5

Warehouse

  • BigQuery
  • Redshift
  • Snowflake
  • Postgres
  • Databricks
  • Other (provide details below)

Please indicate the level of urgency
Not urgent, but the bug is frustrating because my build fails constantly if I leave this package installed.

Are you interested in contributing to this package?

  • Yes, I can do this and open a PR for your review.
  • Possibly, but I'm not quite sure how to do this. I'd be happy to do a live coding session with someone to get this fixed.
  • No, I'd prefer if someone else fixed this. I don't have the time and/or don't know what the root cause of the problem is.
@csaroff csaroff added the bug Something isn't working label Sep 2, 2021

csaroff commented Sep 2, 2021

It might not hurt to simply ignore data that is more than 4,000 days old when the data warehouse is BigQuery.
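A filter along the lines of this suggestion could be sketched as follows (purely illustrative -- the column name `date_day` is an assumption, not the package's actual code):

```sql
-- hypothetical guard to stay under BigQuery's 4,000-partition limit
-- by dropping rows older than 4,000 days
where date_day >= date_sub(current_date(), interval 4000 day)
```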


fivetran-jamie commented Sep 2, 2021

hey there @csaroff -- you can expect a more formal solution released in the next week or so, but for the time being, this branch of the package should hopefully work for you without having to exclude any data

# packages.yml
  - git: https://github.com/fivetran/dbt_zendesk.git
    revision: bq-partitions
    warn-unpinned: false  

basically, i just changed the granularity of the partition_by config clause to week. the final solution may be different, but let me know if this works for you!

fivetran-jamie commented:

^ did this work for you btw? @csaroff


csaroff commented Sep 7, 2021

@fivetran-jamie I haven't gotten a chance to try it out yet, but will try soon and let you know. Thanks!


csaroff commented Sep 8, 2021

@fivetran-jamie Looks like there might be a bug with this.

Here's what I'm seeing when I do a dbt run

Completed with 2 errors and 0 warnings:

Database Error in model int_zendesk__field_calendar_spine (models/ticket_history/int_zendesk__field_calendar_spine.sql)
  PARTITION BY expression must be DATE(<timestamp_column>), DATE(<datetime_column>), DATETIME_TRUNC(<datetime_column>, DAY/HOUR/MONTH/YEAR), a DATE column, TIMESTAMP_TRUNC(<timestamp_column>, DAY/HOUR/MONTH/YEAR), DATE_TRUNC(<date_column>, MONTH/YEAR), or RANGE_BUCKET(<int64_column>, GENERATE_ARRAY(<int64_value>, <int64_value>[, <int64_value>]))
  compiled SQL at target/run/zendesk/models/ticket_history/int_zendesk__field_calendar_spine.sql

Database Error in model int_zendesk__field_history_pivot (models/ticket_history/int_zendesk__field_history_pivot.sql)
  PARTITION BY expression must be DATE(<timestamp_column>), DATE(<datetime_column>), DATETIME_TRUNC(<datetime_column>, DAY/HOUR/MONTH/YEAR), a DATE column, TIMESTAMP_TRUNC(<timestamp_column>, DAY/HOUR/MONTH/YEAR), DATE_TRUNC(<date_column>, MONTH/YEAR), or RANGE_BUCKET(<int64_column>, GENERATE_ARRAY(<int64_value>, <int64_value>[, <int64_value>]))
  compiled SQL at target/run/zendesk/models/ticket_history/int_zendesk__field_history_pivot.sql

Looks like the same error is showing on your circleci build for that branch.

fivetran-jamie commented:

@csaroff ack, somehow missed the CircleCI error 🤦 it should be working now -- week is not actually a supported granularity level, so i switched it to month
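For reference, the granularity change described here is expressed in a dbt BigQuery model config roughly like this (a sketch; the column name is illustrative, not necessarily what the package uses):

```sql
{{ config(
    materialized='incremental',
    partition_by={
        "field": "date_day",      -- illustrative column name
        "data_type": "date",
        "granularity": "month"    -- for date columns, BigQuery supports month/year via DATE_TRUNC; "week" is rejected
    }
) }}
```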


csaroff commented Sep 9, 2021

No worries. I'll try again today. Thanks!


csaroff commented Sep 9, 2021

Looks like it cleared up the previous error, but now I'm getting:

Database Error in model int_zendesk__field_calendar_spine (models/ticket_history/int_zendesk__field_calendar_spine.sql)
  Query exceeded resource limits. This query used 19447 CPU seconds but would charge only 15M Analysis bytes. This exceeds the ratio supported by the on-demand pricing model. Please consider moving this workload to the flat-rate reservation pricing model, which does not have this limit. 19447 CPU seconds were used, and this query must use less than 3800 CPU seconds.
  compiled SQL at target/run/zendesk/models/ticket_history/int_zendesk__field_calendar_spine.sql

fivetran-jamie commented:

Ah shoot -- and are you trying to run a full refresh?

Also it might be more efficient to troubleshoot this over a call if you'd like! You can book an office hours session here if that sounds good

fivetran-jamie commented:

oh one idea .... this may be because int_zendesk__calendar_spine is ephemeral by default (so that whole model is just a CTE in int_zendesk__field_calendar_spine). could you try adding the following to your dbt_project.yml?

models: 
    zendesk:
        utils:
            +materialized: table


csaroff commented Sep 10, 2021

I don't think it ever successfully built int_zendesk__field_calendar_spine so I guess it's doing a full refresh even though I'm not using the --full-refresh flag.

I added the +materialized: table config to my dbt_project.yml as you suggested. Tragically, it's still failing :(

Completed with 1 error and 0 warnings:

Database Error in model int_zendesk__field_calendar_spine (models/ticket_history/int_zendesk__field_calendar_spine.sql)
  Query exceeded resource limits. This query used 18210 CPU seconds but would charge only 20M Analysis bytes. This exceeds the ratio supported by the on-demand pricing model. Please consider moving this workload to the flat-rate reservation pricing model, which does not have this limit. 18210 CPU seconds were used, and this query must use less than 5100 CPU seconds.
  compiled SQL at target/run/zendesk/models/ticket_history/int_zendesk__field_calendar_spine.sql

I can confirm that int_zendesk__calendar_spine is being built as a table rather than a view


fivetran-jamie commented:

Ok got a few more tricks up my sleeve....

currently, int_zendesk__field_calendar_spine associates every ticket with every day since the creation of the ticket, regardless of its age or status. i am going to try adapting our code from the jira dbt package, which caps the timeline for resolved issues at the last date they were updated at. that's a better solution overall and hopefully may address the resources-exceeded error, as the data volume can really explode here.

in the meantime, while i develop the above solution, could you try out this branch again? i've added a cluster_by config to the problematic models, which may help optimize those queries further
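For context, a `cluster_by` config on a BigQuery dbt model looks roughly like this (a sketch -- the clustering column is an assumption, not necessarily what the branch uses):

```sql
{{ config(
    materialized='incremental',
    partition_by={"field": "date_day", "data_type": "date", "granularity": "month"},
    cluster_by=["ticket_id"]    -- assumed column; clustering co-locates related rows so queries scan less data
) }}
```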


csaroff commented Sep 15, 2021

Hmm. Still getting Query Exceeded Resource Limits

Database Error in model int_zendesk__field_calendar_spine (models/ticket_history/int_zendesk__field_calendar_spine.sql)
  Query exceeded resource limits. This query used 27133 CPU seconds but would charge only 15M Analysis bytes. This exceeds the ratio supported by the on-demand pricing model. Please consider moving this workload to the flat-rate reservation pricing model, which does not have this limit. 27133 CPU seconds were used, and this query must use less than 3800 CPU seconds.
  compiled SQL at target/run/zendesk/models/ticket_history/int_zendesk__field_calendar_spine.sql

fivetran-jamie commented:

ok that seems to have made it worse lol -- i will get back to ya soon with the other solution...

fivetran-jamie commented:

@csaroff alright wanna try it out again? 🤞


csaroff commented Sep 19, 2021

I ran dbt clean && dbt deps && dbt run --model zendesk

Completed with 1 error and 0 warnings:

Database Error in model int_zendesk__field_calendar_spine (models/ticket_history/int_zendesk__field_calendar_spine.sql)
  Query exceeded resource limits. This query used 7891 CPU seconds but would charge only 30M Analysis bytes. This exceeds the ratio supported by the on-demand pricing model. Please consider moving this workload to the flat-rate reservation pricing model, which does not have this limit. 7891 CPU seconds were used, and this query must use less than 7600 CPU seconds.
  compiled SQL at target/run/zendesk/models/ticket_history/int_zendesk__field_calendar_spine.sql

Sooooo close!

fivetran-jamie commented:

ahhh -- I think at this point it might be best / most efficient to troubleshoot this over a call if you're down! could you book one or two 15 minute slots of our office hours here?

fivetran-jamie commented:

@csaroff if that doesn't work for you tho, i've also gone ahead and introduced a workaround to simply filter out old data. this happens via a ticket_field_history_timeframe variable, which limits ticket field history to however many years you set it to; by default, all data is included (code here).

this change lives on the bq-partitions branch we've been working with - if you would like to try setting ticket_field_history_timeframe to 10 years, i believe that the original too many partitions error will go away 🤞
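Setting that variable would presumably look something like this in `dbt_project.yml` (a sketch -- the variable name comes from the comment above, but the exact scoping should be checked against the branch's documentation):

```yaml
vars:
  zendesk:
    ticket_field_history_timeframe: 10   # keep only the last 10 years of field history
```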

happy to try to troubleshoot other solutions on a call, but i am thinking this is the route we'll probably have to take given the resources exceeded messages we're getting with larger partitions 😢

we are planning to release this at the end of our sprint (end of next week), so if you'd like to try it out/provide any feedback before then that'd be wonderful! thanks and sorry we couldn't come up with a more elegant fix....

fivetran-jamie mentioned this issue Oct 4, 2021
fivetran-joemarkiewicz commented:

@csaroff the latest release v0.7.0 (which includes this fix) has just been cut!! You can expect this release to be live on the dbt hub at the top of the hour.

I will leave this ticket open until the release is live. Once the latest release is live I will close the ticket and please feel free to open another Github issue if you have any further questions or requests for the package. 😄
