update pattern for dataflow job id extraction #41794

Merged
merged 3 commits into from
Sep 1, 2024

lukas-mi
Contributor

@lukas-mi lukas-mi commented Aug 27, 2024

The Dataflow job id is extracted from the logged output of the Java process that starts the Dataflow job, for example when using BeamRunJavaPipelineOperator.

Currently the job id pattern matches characters until the first " or \n is encountered, which works in the following case:

  • logged line: [2024-08-27 11:20:22,094] INFO Submitted job: 2024-08-27_04_20_21-7947372725816706151
  • extracted job id: 2024-08-27_04_20_21-7947372725816706151

However, if the logger is configured differently, for example so that the message is followed by whitespace and a suffix with additional information, the pattern extracts the id together with the suffix:

  • logged line: [2024-08-27 11:20:22,094] INFO Submitted job: 2024-08-27_04_20_21-7947372725816706151 (org.apache.beam.runners.dataflow.DataflowRunner) (main)
  • extracted job id: 2024-08-27_04_20_21-7947372725816706151 (org.apache.beam.runners.dataflow.DataflowRunner) (main)

In the previous example, the suffix (org.apache.beam.runners.dataflow.DataflowRunner) (main) should not be extracted as part of the job id.

I updated the pattern by adding the whitespace character class \s (alongside the existing " and \n) as a terminator, so the job id now ends at the first whitespace character.
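
The difference can be reproduced with a minimal sketch. The names OLD_PATTERN, NEW_PATTERN and extract_job_id are hypothetical and only illustrate the idea; the actual pattern in the provider may contain additional alternatives and a different group name.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# Old behaviour: the id runs until the first '"' or newline.
OLD_PATTERN = re.compile(r'Submitted job: (?P<job_id>[^"\n]*)')
# New behaviour: any whitespace character also ends the id.
NEW_PATTERN = re.compile(r'Submitted job: (?P<job_id>[^"\n\s]*)')


def extract_job_id(pattern, line):
    """Return the captured job id, or None if the line does not match."""
    match = pattern.search(line)
    return match.group("job_id") if match else None


# The two logged lines from the description above.
plain = (
    "[2024-08-27 11:20:22,094] INFO Submitted job: "
    "2024-08-27_04_20_21-7947372725816706151"
)
suffixed = plain + " (org.apache.beam.runners.dataflow.DataflowRunner) (main)"

for line in (plain, suffixed):
    print("old:", repr(extract_job_id(OLD_PATTERN, line)))
    print("new:", repr(extract_job_id(NEW_PATTERN, line)))
```

With the old pattern the suffixed line yields the id plus the trailing logger information, while the new pattern stops at the first whitespace in both cases.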


@boring-cyborg boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Aug 27, 2024
@lukas-mi lukas-mi force-pushed the dataflow-job-id-pattern branch from 676c264 to 8019d86 Compare August 28, 2024 08:25
@lukas-mi
Contributor Author

@VladaZakharova when will this be merged? :)

@VladaZakharova
Contributor

Hi @potiuk ! Can you please merge it?

@potiuk
Member

potiuk commented Aug 30, 2024

@VladaZakharova when will this be merged? :)

When the tests pass and someone merges it.

Since you are a first-time contributor, we have to manually approve workflows to see if the tests pass, and you have to fix them if they don't. When you submit a new version, you will have to wait for someone to see it and approve it (you can ask in general, without mentioning anyone, to approve your workflows) to signal that you think you fixed all the tests.

Also see the contribution docs that explain the process https://github.com/apache/airflow/tree/main/contributing-docs

@potiuk potiuk merged commit 9a66882 into apache:main Sep 1, 2024
54 checks passed

boring-cyborg bot commented Sep 1, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.
