BigQueryInsertJobOperator will fail if the DagID contains the '.' character #11280

Closed
nathadfield opened this issue Oct 5, 2020 · 7 comments · Fixed by #11287
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:google Google (including GCP) related issues

Comments

@nathadfield
Collaborator

Apache Airflow version: 1.10.*
BackPort Packages version: 2020.10.5rc1

What happened:

BigQueryInsertJobOperator tries to start a BigQuery job by specifying a job_id which is a combination of dag_id, task_id, execution_date and an additional uniqueness_suffix.

https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/operators/bigquery.py#L2072

However, BigQuery only accepts alphanumeric characters, dashes and underscores in a job ID, so if a DagID contains a version number that includes a . character, the task will fail.

https://cloud.google.com/bigquery/docs/running-jobs#generate-jobid

How to reproduce it:
Create a DAG with the name my-dag-v1.0.
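
The failure can be illustrated without calling BigQuery at all. The sketch below assumes a job ID layout of `airflow_{dag_id}_{task_id}_{execution_date}_{suffix}` (hypothetical; the operator's exact layout may differ) and checks it against the character set described in the BigQuery documentation linked above: letters, digits, underscores and dashes. The names `build_job_id` and `is_valid_job_id` are made up for this example.

```python
import re

# Character set BigQuery documents for job IDs: letters, digits, "_" and "-".
VALID_JOB_ID = re.compile(r"[A-Za-z0-9_-]+")


def build_job_id(dag_id: str, task_id: str, exec_date: str, suffix: str) -> str:
    """Hypothetical job ID layout, used only for illustration."""
    return f"airflow_{dag_id}_{task_id}_{exec_date}_{suffix}"


def is_valid_job_id(job_id: str) -> bool:
    """Return True if every character of job_id is in the allowed set."""
    return VALID_JOB_ID.fullmatch(job_id) is not None


bad = build_job_id("my-dag-v1.0", "insert_job", "2020_10_05T00_00_00", "abc123")
good = build_job_id("my-dag-v1-0", "insert_job", "2020_10_05T00_00_00", "abc123")

print(is_valid_job_id(bad))   # False: the "." from the DagID is not allowed
print(is_valid_job_id(good))  # True
```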

@nathadfield nathadfield added the kind:bug This is a clearly a bug label Oct 5, 2020
@bradleydamato

@nathadfield @potiuk what approach do we think should be taken here? Should we scrub the dag_id for the '.' char prior to submitting?

@saisiddhant12

@nathadfield @potiuk
I would like to give it a shot. I think introducing a macro like ts_no_dash, with a hyphen separating each part of the version, should solve the problem. Alternatively,
replacing . with a Unix timestamp would resolve it (if the macro is not used).

@potiuk
Member

potiuk commented Oct 5, 2020

I think we already do a similar thing elsewhere in Airflow: we are replacing . with __dot__, I believe. It is in "views.py", and it is mostly there to prevent some subdag matching. I do not know the BigQuery part well enough, but something like that might be a good idea:

        resp = {
            r.dag_id.replace('.', '__dot__'): {
                'dag_id': r.dag_id,
                'last_run': r.last_run.isoformat(),
            } for r in query
        }
        return wwwutils.json_response(resp)

It's a really typical "escaping" thing :).
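
As a minimal sketch of that escaping idea applied to a dag_id, the helpers below swap the dot for the same `__dot__` token used in the snippet above. The function names `escape_dots` and `unescape_dots` are hypothetical and are not part of Airflow.

```python
DOT_TOKEN = "__dot__"


def escape_dots(dag_id: str) -> str:
    """Replace every "." with a token that is safe in a BigQuery job ID."""
    return dag_id.replace(".", DOT_TOKEN)


def unescape_dots(escaped: str) -> str:
    """Reverse escape_dots, assuming the original never contained the token."""
    return escaped.replace(DOT_TOKEN, ".")


escaped = escape_dots("my-dag-v1.0")
print(escaped)                 # my-dag-v1__dot__0
print(unescape_dots(escaped))  # my-dag-v1.0
```

The round trip is only reliable if no dag_id already contains the literal text `__dot__`; for a job ID that is never decoded again, that limitation probably does not matter.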

@turbaszek
Member

turbaszek commented Oct 5, 2020

Or should we build the string and then use the following code with added .?

exec_date = re.sub(r"\:|-|\+", "_", context['execution_date'].isoformat())
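
One way to read this suggestion, sketched under assumptions: build the whole job ID first, then run a single substitution whose character class also includes the dot. The job ID layout and the name `sanitize_job_id` are hypothetical; only the set of replaced characters comes from the line above.

```python
import re
from datetime import datetime, timezone


def sanitize_job_id(raw_job_id: str) -> str:
    """Replace ":", "-", "+" and "." with "_" in the assembled job ID."""
    return re.sub(r"[:\-+.]", "_", raw_job_id)


execution_date = datetime(2020, 10, 5, 12, 30, tzinfo=timezone.utc)
# Hypothetical layout: prefix, dag_id, task_id, execution date, suffix.
raw = f"airflow_my-dag-v1.0_insert_job_{execution_date.isoformat()}_abc123"

print(sanitize_job_id(raw))
# airflow_my_dag_v1_0_insert_job_2020_10_05T12_30_00_00_00_abc123
```

Sanitizing the assembled string rather than only the execution date means a dot in the dag_id or task_id is handled by the same call.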

@turbaszek
Member

Btw. I'm not sure if using . in dag_id is the best idea. The dot is special for subdags if I'm not mistaken

@turbaszek turbaszek added area:providers provider:google Google (including GCP) related issues labels Oct 5, 2020
@saisiddhant12

Btw. I'm not sure if using . in dag_id is the best idea. The dot is special for subdags if I'm not mistaken

Yes, it creates a sub-DAG if we use . between parent and child. I agree with you, @turbaszek.
@nathadfield, were you describing something else as the bug?

@nathadfield
Collaborator Author

Well, perhaps this is something that I shouldn't have done early on, but we've specified a DAG version as part of the DagID using a '.'. Not had any problems with this until now but then we don't use subdags.

I can change our DAG names quite easily to eliminate this, but I would argue that it shouldn't be possible to create a DAG if the id contains restricted characters.

Perhaps the dag_id should also be constrained to the same criteria as above?
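
A rough sketch of what such a constraint could look like, assuming the allowed set is the same as for BigQuery job IDs (letters, digits, dashes and underscores). The `validate_dag_id` helper is hypothetical and does not describe how Airflow actually validates DAG ids.

```python
import re

# Assumed rule: the same characters BigQuery accepts in a job ID.
ALLOWED_DAG_ID = re.compile(r"[A-Za-z0-9_-]+")


def validate_dag_id(dag_id: str) -> str:
    """Return dag_id unchanged, or raise ValueError if it has other characters."""
    if ALLOWED_DAG_ID.fullmatch(dag_id) is None:
        raise ValueError(
            f"Invalid dag_id {dag_id!r}: only letters, digits, '-' and '_' are allowed"
        )
    return dag_id


print(validate_dag_id("my-dag-v1-0"))  # my-dag-v1-0

try:
    validate_dag_id("my-dag-v1.0")
except ValueError as err:
    print(err)
```

Rejecting the id at DAG creation time would surface the problem early, at the cost of breaking existing DAGs that already use a dot.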

turbaszek added a commit to PolideaInternal/airflow that referenced this issue Oct 5, 2020
turbaszek added a commit that referenced this issue Oct 7, 2020
Make autogenerated job_id more unique by using microseconds and hash of configuration. Replace dots in job_id.
Closes: #11280
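
The commit message describes three ingredients: microsecond precision, a hash of the configuration, and dot replacement. The sketch below combines them in one hypothetical function; it is not the code from the referenced commit, and the layout, the choice of SHA-256 and the name `unique_job_id` are all assumptions.

```python
import hashlib
import json
import re
from datetime import datetime


def unique_job_id(dag_id: str, task_id: str, execution_date: datetime,
                  configuration: dict) -> str:
    """Build a job ID from the task identity, a timestamp and a config hash."""
    # Stable hash: sort_keys makes equal configurations serialize identically.
    config_json = json.dumps(configuration, sort_keys=True)
    config_hash = hashlib.sha256(config_json.encode("utf-8")).hexdigest()
    # Always include microseconds, even when they are zero.
    exec_date = execution_date.isoformat(timespec="microseconds")
    raw = f"airflow_{dag_id}_{task_id}_{exec_date}_{config_hash}"
    # Replace ":", "-", "+" and "." so only letters, digits and "_" remain.
    return re.sub(r"[:\-+.]", "_", raw)


when = datetime(2020, 10, 5, 12, 30, 0, 123)
job_a = unique_job_id("my-dag-v1.0", "insert_job", when, {"query": {"query": "SELECT 1"}})
job_b = unique_job_id("my-dag-v1.0", "insert_job", when, {"query": {"query": "SELECT 2"}})

print(job_a)
print(job_a != job_b)  # True: a different configuration gives a different ID
```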
5 participants