fix(ingest/gc): reduce logging, remove unnecessary sleeps #12238

Merged
anshbansal merged 3 commits into master from ab-reduce-log-gc-30-dec-2024 on Dec 30, 2024

Conversation

anshbansal
Collaborator

@anshbansal anshbansal commented Dec 30, 2024

  • Reduce logging. Without this, the logs were flooded with useless "Deleted 0 ..." messages.
  • The final report now includes structured details on the number of data jobs and data flows.
  • Skip the sleep when there were 0 deletions; waiting anyway made runs take too long. In testing, this cut the run time from ~1 hour to 15 minutes.
  • Skip querying data flows when the deletion of empty data flows is disabled.
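
To illustrate the last two report-related points, here is a minimal sketch, not the code from this PR. The names `CleanupReport`, `num_data_flows_found`, `collect_flows`, `fetch_flows`, and the `delete_empty_data_flows` flag are assumptions made up for the example. It shows the flow query being skipped entirely when flow deletion is disabled, and a structured counter being filled in otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Flow:
    # Hypothetical stand-in for a data flow entity.
    urn: str


@dataclass
class CleanupReport:
    # Hypothetical structured counter surfaced in the final report.
    num_data_flows_found: int = 0


def collect_flows(
    delete_empty_data_flows: bool,
    fetch_flows: Callable[[], Iterable[Flow]],
    report: CleanupReport,
) -> Dict[str, Flow]:
    """Query flows only when they may actually be deleted later."""
    if not delete_empty_data_flows:
        # Deletion is disabled, so the (potentially slow) query is skipped.
        return {}
    flows = {flow.urn: flow for flow in fetch_flows()}
    report.num_data_flows_found += len(flows)
    return flows
```

With the flag off, `fetch_flows` is never called, so the cost of the query is avoided rather than merely ignored.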

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the "ingestion" label (PR or Issue related to the ingestion of metadata) on Dec 30, 2024
@datahub-cyborg datahub-cyborg bot added the "needs-review" label (Label for PRs that need review from a maintainer.) on Dec 30, 2024
@anshbansal anshbansal changed the title from "fix(ingest/gc): reduce logging" to "fix(ingest/gc): reduce logging, remove unnecessary sleeps" on Dec 30, 2024
@@ -265,13 +267,17 @@ def keep_last_n_dpi(
      self.report.report_failure(
          f"Exception while deleting DPI: {e}", exc=e
      )
- if deleted_count_last_n % self.config.batch_size == 0:
+ if (
+     deleted_count_last_n % self.config.batch_size == 0

Good catch

@datahub-cyborg datahub-cyborg bot added the "pending-submitter-response" label (Issue/request has been reviewed but requires a response from the submitter) and removed the "needs-review" label (Label for PRs that need review from a maintainer.) on Dec 30, 2024
Comment on lines +279 to +280
    if deleted_count_last_n > 0:
        logger.info(f"Deleted {deleted_count_last_n} DPIs from {job.urn}")

It seems we are repeating code here. How about refactoring both ifs into:

Suggested change
- if deleted_count_last_n > 0:
-     logger.info(f"Deleted {deleted_count_last_n} DPIs from {job.urn}")
+ if deleted_count_last_n > 0:
+     logger.info(f"Deleted {deleted_count_last_n} DPIs from {job.urn}")
+     if deleted_count_last_n % self.config.batch_size == 0 and self.config.delay:
+         logger.info(f"Sleeping for {self.config.delay} seconds")
+         time.sleep(self.config.delay)
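
For context on why the `> 0` guard matters: `0 % batch_size == 0` is true in Python, so a bare modulo check also fires when nothing was deleted. The sketch below is a standalone illustration of that guard, not the merged code. The helper name `log_and_maybe_sleep` and the injectable `sleep` parameter are assumptions added so the behavior can be checked without actually waiting.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def log_and_maybe_sleep(
    deleted_count: int,
    batch_size: int,
    delay: float,
    urn: str,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Log only real deletions; pause only after a full, non-empty batch.

    Returns True if the function slept.
    """
    if deleted_count <= 0:
        # Nothing was deleted: no log line and no sleep.
        return False
    logger.info("Deleted %d DPIs from %s", deleted_count, urn)
    if deleted_count % batch_size == 0 and delay:
        logger.info("Sleeping for %s seconds", delay)
        sleep(delay)
        return True
    return False
```

Passing `sleep=some_list.append` in a test records the requested delays instead of blocking.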

Collaborator

@skrydal skrydal left a comment

Great job, left some minor suggestions.

Comment on lines +429 to +430
    for flow in self.get_data_flows():
        dataFlows[flow.urn] = flow

You might consider using a dictionary comprehension to reduce indentation:

          dataFlows = {flow.urn: flow for flow in self.get_data_flows()}
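
As a quick sanity check that the two forms are interchangeable, here is a self-contained sketch. The `Flow` class, the free function `get_data_flows`, and the URN strings are made-up stand-ins for illustration, not the real source class or its method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Flow:
    # Hypothetical stand-in for a data flow entity.
    urn: str


def get_data_flows() -> List[Flow]:
    # Placeholder data; the real method would query the metadata service.
    return [Flow("urn:example:flow-a"), Flow("urn:example:flow-b")]


# Loop form, as in the diff under review.
by_loop: Dict[str, Flow] = {}
for flow in get_data_flows():
    by_loop[flow.urn] = flow

# Comprehension form, as suggested in the review comment.
by_comprehension: Dict[str, Flow] = {flow.urn: flow for flow in get_data_flows()}
```

Both build the same mapping from URN to flow; if two flows shared a URN, the later one would win in either form.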

@datahub-cyborg datahub-cyborg bot added the "pending-submitter-merge" label and removed the "pending-submitter-response" label (Issue/request has been reviewed but requires a response from the submitter) on Dec 30, 2024
@anshbansal anshbansal merged commit 3723a3e into master Dec 30, 2024
138 of 142 checks passed
@anshbansal anshbansal deleted the ab-reduce-log-gc-30-dec-2024 branch December 30, 2024 15:36
Labels
ingestion (PR or Issue related to the ingestion of metadata), pending-submitter-merge
2 participants