
[ADAP-1016] [Regression] [1.7] python models don't work #1006

Closed
2 tasks done
dataders opened this issue Nov 7, 2023 · 10 comments · Fixed by #1014 or #1020
Labels: type:bug (Something isn't working), type:regression

Comments


dataders commented Nov 7, 2023

Is this a regression in a recent version of dbt-bigquery?

  • I believe this is a regression in dbt-bigquery functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

All Python models fail to compile with the following compilation error:

sequence item 2: expected str instance, NoneType found

thread from #db-bigquery community Slack

21:23:46  Compilation Error in model thing (models/thing.py)
  sequence item 2: expected str instance, NoneType found
  
  > in macro materialization_table_bigquery (macros/materializations/table.sql)
  > called by model thing (models/thing.py)

Expected/Previous Behavior

The model should run.

Steps To Reproduce

  1. create the below file in jaffle_shop
  2. dbt seed
  3. dbt run -s +thing

Oddly, `dbt compile -s thing` works without issue.

# thing.py
def model(dbt, session):
    dbt.config(
        submission_method="serverless",
        dataproc_cluster_name="dbt-test-1"
    )

    my_model = dbt.ref("orders")

    return my_model

Relevant log output

No response

Environment

- OS: macOS 13.5
- Python: `3.10.8`
- dbt-core (working version): `1.6.7`
- dbt-bigquery (working version): `1.6.8`
- dbt-core (regression version): `1.7.1`
- dbt-bigquery (regression version): `1.7.0`

Additional Context

perhaps related to #681?

@github-actions github-actions bot changed the title [Regression] [1.7] python models don't work [ADAP-1016] [Regression] [1.7] python models don't work Nov 7, 2023

tanghyd commented Nov 8, 2023

I can confirm this is broken for me running dbt-bigquery with Dataproc for both dbt-core=1.7.1 and dbt-bigquery=1.7.1 as well as dbt-core=1.7.0 and dbt-bigquery=1.7.0. As reported in this issue above, the error appears to be at compile time and a job is never submitted to Dataproc:

12:46:33  Completed with 1 error and 0 warnings:
12:46:33  
12:46:33    Compilation Error in model my_table (models/my_table.py)
  sequence item 2: expected str instance, NoneType found
  
  > in macro materialization_table_bigquery (macros/materializations/table.sql)
  > called by model my_table (models/my_table.py)

I have reverted to an earlier working version of dbt-core=1.6.6 and dbt-bigquery=1.6.7 and the dbt python models complete successfully.


tanghyd commented Nov 11, 2023

Hi! I don't think #1014 completely fixed running python models on Dataproc, by the way.

The first batch job can work, but I'm still having issues on dbt-bigquery=1.7.2 and dbt-core=1.7.1. When I try to run subsequent batch jobs for my python model, dbt errors because Dataproc reports that the batch job already exists: 409 Already exists: Failed to create batch.

I believe this error could be due to the way PR #1014 is written: it tags the batch name with model["created_at"]. The created_at field isn't updated for subsequent batch jobs (perhaps because created_at is a static field?), so the batch job won't submit, Dataproc reports an error, and the dbt pipeline build breaks.
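The failure mode described above can be sketched in a few lines. This is a hypothetical illustration, not the adapter's actual code: if the batch id is derived from a field that stays constant between runs, two submissions produce the same id and Dataproc rejects the second.

```python
# Hypothetical illustration of the duplicate-batch-id failure mode: deriving
# the batch id from a static field (like the model's created_at) yields the
# same id on every submission.
def batch_id_from_created_at(model_name: str, created_at: str) -> str:
    return f"{model_name}-{created_at}"

first = batch_id_from_created_at("thing", "1699300000")
second = batch_id_from_created_at("thing", "1699300000")
# Dataproc would reject the second submission with
# "409 Already exists: Failed to create batch".
print(first == second)  # → True
```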

@colin-rogers-dbt (Contributor) commented:

keeping this open till we've verified it fixes things


tanghyd commented Jan 5, 2024

keeping this open till we've verified it fixes things

Is there any way I can temporarily fix this issue by manually specifying the batch_id in the dbt config for now? That way I could generate the id with something like uuid.uuid4(), and I could unpin my version lock from dbt-bigquery==1.6.9 :)


nickozilla commented Jan 11, 2024

@tanghyd
I'd recommend setting up your batch ID similar to this:

models:
  - name: model_name
    config:
      batch_id: |
        {{ run_started_at.strftime("%Y-%m-%d-%H-%M") }}-modelname-{{ range(0,10000) | random }}


tanghyd commented Jan 12, 2024

      batch_id: |
        {{ run_started_at.strftime("%Y-%m-%d-%H-%M") }}-modelname-{{ range(0,10000) | random }}

Hello! Thank you for your suggestion, but unfortunately that does not reliably work for repeat runs.

Sometimes two subsequent runs of the same python model do not generate two different batch_ids, and the 409 Already exists: Failed to create batch error is raised on the second one.

It seems the batch_id is persistent on the model (even if I wait a couple of minutes between runs): the same batch_id is rendered to the same string under the hood, despite the run_started_at variable and random expressions being used.


As an alternative, I tried assigning this jinja string to the dbt config inside the python file instead of the schema.yml file as follows:

dbt.config(batch_id='{{ run_started_at.strftime("%Y-%m-%d-%H-%M") }}-modelname-{{ range(0,10000) | random }}')

However, that also fails with the following error: No jinja in python model code is allowed
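Since jinja isn't allowed inside the python model file, one possible workaround is to generate the id outside dbt and pass it in on the command line. This is only a sketch under assumptions: it presumes the model's schema.yml sets batch_id: "{{ var('dataproc_batch_id') }}", where dataproc_batch_id is a variable name I made up, not a dbt built-in.

```python
import uuid

# Hypothetical workaround sketch: generate a fresh batch id outside dbt and
# hand it to the run via --vars. Assumes the model's schema.yml contains
#   batch_id: "{{ var('dataproc_batch_id') }}"
# ('dataproc_batch_id' is an invented variable name, not a dbt built-in).
batch_id = f"my-python-model-{uuid.uuid4()}"
cmd = [
    "dbt", "run", "--select", "my_python_model",
    "--vars", f"{{dataproc_batch_id: {batch_id}}}",
]
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```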


tanghyd commented Jan 12, 2024

OK, I've done some further testing here. For subsequent runs after an initially successful first run, I've found two possible scenarios:

  • If no further changes are made to the model's schema.yml file, the batch_id is not re-rendered from the jinja expression.
    • This means that re-running the table within the same session raises an error due to the duplicate batch_id.
    • Perhaps this is caused by the manifest only being re-generated when changes are discovered; running dbt parse without any changes to the file does not generate a new batch_id value on re-run.
  • If I edit the schema.yml file in any way whatsoever after the first job submission, it seems to generate a new batch_id on the second submission.
    • Perhaps dbt has triggered a re-parse and re-created a slightly different manifest, causing the batch_id expression to be re-evaluated with the new time and random number? I'm not sure.

I've also tried the following config, and the same issue comes up (a duplicate invocation_id between runs, and therefore a duplicate batch_id when submitting to Dataproc, which raises an error).

models:
  - name: model_name
    config:
      batch_id: "{{ invocation_id }}"
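One way to see how this could happen is a toy simulation of the suspected caching behaviour (my guess about partial parsing, not dbt's actual implementation): if the rendered config is stored in the manifest and only re-rendered when the file changes, the same batch_id survives across invocations.

```python
import uuid

# Toy simulation of the suspected caching behaviour: the rendered config is
# cached against a fingerprint of the file and only re-rendered on change.
_rendered_configs = {}

def render_batch_id(file_fingerprint: str) -> str:
    if file_fingerprint not in _rendered_configs:  # re-render only on change
        _rendered_configs[file_fingerprint] = str(uuid.uuid4())
    return _rendered_configs[file_fingerprint]

run_1 = render_batch_id("schema.yml@v1")
run_2 = render_batch_id("schema.yml@v1")  # file unchanged -> same batch_id
run_3 = render_batch_id("schema.yml@v2")  # file edited -> fresh batch_id
print(run_1 == run_2, run_1 == run_3)  # → True False
```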


dlubawy commented Jan 23, 2024

Python models are still broken in v1.7.3 due to ADAP-1063.


tanghyd commented Jan 26, 2024

TL;DR: Two subsequent runs of the same python model continue to fail on the second attempt, as the model config's batch_id does not change and Dataproc requires unique batch_ids for separate job submissions.

Python models are still broken in v1.7.3 for the same reason described in my comments above, despite the recently released change introduced in #1020 in dbt/adapters/bigquery/python_submissions.py on lines 128 to 130:

def _get_batch_id(self) -> str:
    model = self.parsed_model
    default_batch_id = str(uuid.uuid4())
    return model["config"].get("batch_id", default_batch_id)

I tried two sequential runs with a python model running dbt-core=1.7.6 and dbt-bigquery=1.7.3 and the second job failed when it should not have. This is ultimately still a regression from dbt-bigquery=1.6.9 and is not related to #1047 (as far as I can tell).

As hypothesised above, if a model has not changed between runs, the batch_id does not refresh (model["config"]["batch_id"] seems to remain static because it is not re-parsed). Clearly, then, the issue is not just with defining default_batch_id = str(uuid.uuid4()) but with the underlying model config; however, I am not familiar with any further details.

See below for example commands and results where a subsequent run has an identical batch job id (details obfuscated for privacy reasons):

(env) user@computer:~/code$ dbt build --select my_python_model
07:06:08  Running with dbt=1.7.6
07:06:08  Registered adapter: bigquery=1.7.3
07:06:08  Unable to do partial parsing because of a version mismatch
07:06:10  Found a models, b seeds, c tests, d sources, e exposures, f metrics, g macros, h groups, i semantic models
07:06:10  
07:06:13  Concurrency: 32 threads (target='dev')
07:06:13  
07:06:13  1 of 1 START python table model my_dataset.my_python_model ............. [RUN]
07:06:14  BigQuery adapter: Submitting batch job with id: c8d6f0e2-ad3d-4139-8a5c-54ceb6f34779
07:12:40  1 of 1 OK created python table model my_dataset.my_python_model ........ [OK in 387.41s]
07:12:40  Finished running 1 table model, 0 tests in 0 hours 6 minutes and 27.41 seconds (387.41).
07:12:40  
07:12:40  Completed successfully
07:12:40  
07:12:40  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

(env) user@computer:~/code$ dbt build --select my_python_model
08:59:46  Running with dbt=1.7.6
08:59:47  Registered adapter: bigquery=1.7.3
08:59:47  Found a models, b seeds, c tests, d sources, e exposures, f metrics, g macros, h groups, i semantic models
08:59:47  
08:59:49  Concurrency: 32 threads (target='dev')
08:59:49  
08:59:49  1 of 1 START python table model my_dataset.my_python_model ............. [RUN]
08:59:51  BigQuery adapter: Submitting batch job with id: c8d6f0e2-ad3d-4139-8a5c-54ceb6f34779
08:59:55  Unhandled error while executing target/run/models/my_dataset/my_python_model.py
409 Already exists: Failed to create batch: Batch projects/xxxxxx/locations/xxxxxx/batches/c8d6f0e2-ad3d-4139-8a5c-54ceb6f34779
08:59:55  1 of 1 ERROR creating python table model my_dataset.my_python_model .... [ERROR in 5.07s]
08:59:55  Finished running 1 table model, 0 tests in 0 hours 0 minutes and 7.75 seconds (7.75s).
08:59:55  
08:59:55  Completed with 1 error and 0 warnings:
08:59:55  
08:59:55    409 Already exists: Failed to create batch: Batch projects/xxxxxx/locations/xxxxxx/batches/c8d6f0e2-ad3d-4139-8a5c-54ceb6f34779
08:59:55  
08:59:55  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1


tanghyd commented Apr 1, 2024

✅ FYI for completeness: my error described above has been resolved after updating to the latest dbt-bigquery and removing the following from the model yaml config (likely no longer needed after #1020):

config:
     batch_id: "{{ invocation_id }}"
