
[ADAP-1016] [Regression] [1.7] python models don't work #1006

Closed
2 tasks done
dataders opened this issue Nov 7, 2023 · 10 comments · Fixed by #1014 or #1020
Labels: type:bug (Something isn't working), type:regression

Comments


dataders commented Nov 7, 2023

Is this a regression in a recent version of dbt-bigquery?

  • I believe this is a regression in dbt-bigquery functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

All Python models fail to compile with the following compilation error:

sequence item 2: expected str instance, NoneType found

thread from #db-bigquery community Slack

21:23:46  Compilation Error in model thing (models/thing.py)
  sequence item 2: expected str instance, NoneType found
  
  > in macro materialization_table_bigquery (macros/materializations/table.sql)
  > called by model thing (models/thing.py)

Expected/Previous Behavior

The model should run.

Steps To Reproduce

  1. create the below file in jaffle_shop
  2. dbt seed
  3. dbt run -s +thing

Oddly, `dbt compile -s thing` works without issue.

# thing.py
def model(dbt, session):
    dbt.config(
        submission_method="serverless",
        dataproc_cluster_name="dbt-test-1"
    )

    my_model = dbt.ref("orders")

    return my_model

Relevant log output

No response

Environment

- OS: macOS 13.5
- Python: `3.10.8`
- dbt-core (working version): `1.6.7`
- dbt-bigquery (working version): `1.6.8`
- dbt-core (regression version): `1.7.1`
- dbt-bigquery (regression version): `1.7.0`

Additional Context

perhaps related to #681?

@github-actions github-actions bot changed the title [Regression] [1.7] python models don't work [ADAP-1016] [Regression] [1.7] python models don't work Nov 7, 2023

tanghyd commented Nov 8, 2023

I can confirm this is broken for me running dbt-bigquery with Dataproc for both dbt-core=1.7.1 and dbt-bigquery=1.7.1 as well as dbt-core=1.7.0 and dbt-bigquery=1.7.0. As reported in this issue above, the error appears to be at compile time and a job is never submitted to Dataproc:

12:46:33  Completed with 1 error and 0 warnings:
12:46:33  
12:46:33    Compilation Error in model my_table (models/my_table.py)
  sequence item 2: expected str instance, NoneType found
  
  > in macro materialization_table_bigquery (macros/materializations/table.sql)
  > called by model my_table (models/my_table.py)

I have reverted to an earlier working version of dbt-core=1.6.6 and dbt-bigquery=1.6.7 and the dbt python models complete successfully.


tanghyd commented Nov 11, 2023

Hi! I don't think #1014 completely fixed running python models on Dataproc, by the way.

The first batch job can work, but I'm still having issues on dbt-bigquery=1.7.2 and dbt-core=1.7.1. When I try to run subsequent batch jobs for my python model, dbt errors because Dataproc reports that the batch job already exists: 409 Already exists: Failed to create batch.

I believe this error could be due to the way PR #1014 is written: it tags the batch name with model["created_at"]. The created_at field isn't updated for subsequent batch jobs (perhaps because created_at is a static field?), so the batch job won't submit, Dataproc reports an error, and the dbt pipeline build breaks.
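The failure mode described above can be sketched in a few lines. This is a hypothetical illustration, not the adapter's actual code: if the batch id is derived from a field that stays constant between runs, two submissions produce the same id and Dataproc rejects the second.

```python
# Hypothetical illustration of the duplicate-batch-id failure mode: deriving
# the batch id from a static field (like the model's created_at) yields the
# same id on every submission.
def batch_id_from_created_at(model_name: str, created_at: str) -> str:
    return f"{model_name}-{created_at}"

first = batch_id_from_created_at("thing", "1699300000")
second = batch_id_from_created_at("thing", "1699300000")
# Dataproc would reject the second submission with
# "409 Already exists: Failed to create batch".
print(first == second)  # → True
```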

@colin-rogers-dbt (Contributor) commented:

keeping this open till we've verified it fixes things


tanghyd commented Jan 5, 2024

keeping this open till we've verified it fixes things

Is there any way I can temporarily fix this issue by manually specifying the batch_id in the dbt config for now? That way I could generate the id with something like uuid.uuid4(), and I could unpin my version lock from dbt-bigquery==1.6.9 :)


nickozilla commented Jan 11, 2024

@tanghyd
I'd recommend setting up your batch ID similar to this:

models:
  - name: model_name
    config:
      batch_id: |
        {{ run_started_at.strftime("%Y-%m-%d-%H-%M") }}-modelname-{{ range(0,10000) | random }}


tanghyd commented Jan 12, 2024

      batch_id: |
        {{ run_started_at.strftime("%Y-%m-%d-%H-%M") }}-modelname-{{ range(0,10000) | random }}

Hello! Thank you for your suggestion, but unfortunately that does not reliably work for repeat runs.

Sometimes two subsequent runs of the same python model do not generate two different batch_ids, and the 409 Already exists: Failed to create batch error is raised on the second one.

It seems the batch_id is persistent on the model (even if I wait a couple of minutes between runs): the same batch_id is rendered to the same string under the hood, despite the run_started_at variable and random expressions being used.


As an alternative, I tried assigning this jinja string to the dbt config inside the python file instead of the schema.yml file as follows:

dbt.config(batch_id='{{ run_started_at.strftime("%Y-%m-%d-%H-%M") }}-modelname-{{ range(0,10000) | random }}')

However, that also fails with the following error: No jinja in python model code is allowed
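Since jinja isn't allowed inside the python model file, one possible workaround is to generate the id outside dbt and pass it in on the command line. This is only a sketch under assumptions: it presumes the model's schema.yml sets batch_id: "{{ var('dataproc_batch_id') }}", where dataproc_batch_id is a variable name I made up, not a dbt built-in.

```python
import uuid

# Hypothetical workaround sketch: generate a fresh batch id outside dbt and
# hand it to the run via --vars. Assumes the model's schema.yml contains
#   batch_id: "{{ var('dataproc_batch_id') }}"
# ('dataproc_batch_id' is an invented variable name, not a dbt built-in).
batch_id = f"my-python-model-{uuid.uuid4()}"
cmd = [
    "dbt", "run", "--select", "my_python_model",
    "--vars", f"{{dataproc_batch_id: {batch_id}}}",
]
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```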


tanghyd commented Jan 12, 2024

OK, I've done some further testing here. For subsequent runs after an initially successful first run, I've found two possible scenarios:

  • If no further changes are made to the model's schema.yml file, the batch_id is not re-rendered from the jinja expression.
    • This means that re-running the table within the same session raises an error due to the duplicate batch_id.
    • Perhaps this is caused by the manifest only being re-generated when changes are discovered; running dbt parse without any changes to the file does not generate a new batch_id value on re-run.
  • If I edit the schema.yml file in any way whatsoever after the first job submission, it seems to generate a new batch_id on the second submission.
    • Perhaps dbt has triggered a re-parse and re-created a slightly different manifest, causing the batch_id expression to be re-evaluated with the new time and random number? I'm not sure.

I've also tried the following config, and the same issue comes up (a duplicate invocation_id between runs, and therefore a duplicate batch_id when submitting to Dataproc, which raises an error).

models:
  - name: model_name
    config:
      batch_id: "{{ invocation_id }}"
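One way to see how this could happen is a toy simulation of the suspected caching behaviour (my guess about partial parsing, not dbt's actual implementation): if the rendered config is stored in the manifest and only re-rendered when the file changes, the same batch_id survives across invocations.

```python
import uuid

# Toy simulation of the suspected caching behaviour: the rendered config is
# cached against a fingerprint of the file and only re-rendered on change.
_rendered_configs = {}

def render_batch_id(file_fingerprint: str) -> str:
    if file_fingerprint not in _rendered_configs:  # re-render only on change
        _rendered_configs[file_fingerprint] = str(uuid.uuid4())
    return _rendered_configs[file_fingerprint]

run_1 = render_batch_id("schema.yml@v1")
run_2 = render_batch_id("schema.yml@v1")  # file unchanged -> same batch_id
run_3 = render_batch_id("schema.yml@v2")  # file edited -> fresh batch_id
print(run_1 == run_2, run_1 == run_3)  # → True False
```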


dlubawy commented Jan 23, 2024

Python models are still broken in v1.7.3 due to ADAP-1063.


tanghyd commented Jan 26, 2024

TL;DR: Two subsequent runs of the same python model continue to fail on the second attempt, as the model config's batch_id does not change and Dataproc requires unique batch_ids for separate job submissions.

Python models are still broken in v1.7.3 for the same reason described in my comments above, despite the recently released change introduced in #1020 in dbt/adapters/bigquery/python_submissions.py on lines 128 to 130:

def _get_batch_id(self) -> str:
    model = self.parsed_model
    default_batch_id = str(uuid.uuid4())
    return model["config"].get("batch_id", default_batch_id)

I tried two sequential runs with a python model running dbt-core=1.7.6 and dbt-bigquery=1.7.3 and the second job failed when it should not have. This is ultimately still a regression from dbt-bigquery=1.6.9 and is not related to #1047 (as far as I can tell).

As hypothesised above, if a model has not changed between runs, the batch_id does not refresh (model["config"]["batch_id"] seems to remain static because it is not re-parsed). Clearly, then, the issue is not just with defining default_batch_id = str(uuid.uuid4()) but with the underlying model config; however, I am not familiar with any further details.

See below for example commands and results where a subsequent run has an identical batch job id (details obfuscated for privacy reasons):

(env) user@computer:~/code$ dbt build --select my_python_model
07:06:08  Running with dbt=1.7.6
07:06:08  Registered adapter: bigquery=1.7.3
07:06:08  Unable to do partial parsing because of a version mismatch
07:06:10  Found a models, b seeds, c tests, d sources, e exposures, f metrics, g macros, h groups, i semantic models
07:06:10  
07:06:13  Concurrency: 32 threads (target='dev')
07:06:13  
07:06:13  1 of 1 START python table model my_dataset.my_python_model ............. [RUN]
07:06:14  BigQuery adapter: Submitting batch job with id: c8d6f0e2-ad3d-4139-8a5c-54ceb6f34779
07:12:40  1 of 1 OK created python table model my_dataset.my_python_model ........ [OK in 387.41s]
07:12:40  Finished running 1 table model, 0 tests in 0 hours 6 minutes and 27.41 seconds (387.41).
07:12:40  
07:12:40  Completed successfully
07:12:40  
07:12:40  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

(env) user@computer:~/code$ dbt build --select my_python_model
08:59:46  Running with dbt=1.7.6
08:59:47  Registered adapter: bigquery=1.7.3
08:59:47  Found a models, b seeds, c tests, d sources, e exposures, f metrics, g macros, h groups, i semantic models
08:59:47  
08:59:49  Concurrency: 32 threads (target='dev')
08:59:49  
08:59:49  1 of 1 START python table model my_dataset.my_python_model ............. [RUN]
08:59:51  BigQuery adapter: Submitting batch job with id: c8d6f0e2-ad3d-4139-8a5c-54ceb6f34779
08:59:55  Unhandled error while executing target/run/models/my_dataset/my_python_model.py
409 Already exists: Failed to create batch: Batch projects/xxxxxx/locations/xxxxxx/batches/c8d6f0e2-ad3d-4139-8a5c-54ceb6f34779
08:59:55  1 of 1 ERROR creating python table model my_dataset.my_python_model .... [ERROR in 5.07s]
08:59:55  Finished running 1 table model, 0 tests in 0 hours 0 minutes and 7.75 seconds (7.75s).
08:59:55  
08:59:55  Completed with 1 error and 0 warnings:
08:59:55  
08:59:55    409 Already exists: Failed to create batch: Batch projects/xxxxxx/locations/xxxxxx/batches/c8d6f0e2-ad3d-4139-8a5c-54ceb6f34779
08:59:55  
08:59:55  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1


tanghyd commented Apr 1, 2024

✅ FYI for completeness: my error described above has been resolved after updating to the latest dbt-bigquery and removing the following from the model yaml config (likely no longer needed after #1020):

config:
     batch_id: "{{ invocation_id }}"
