Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingestion): Adding ability to ignore users from top users calculation #3735

Merged

Conversation

treff7es
Copy link
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

@treff7es
Copy link
Contributor Author

treff7es commented Dec 14, 2021

Here is an example how it looks like if you add the following to the usage config and only that user queried the table:

  ignore_user_emails_from_usage:
    - [email protected]

Screenshot 2021-12-14 at 9 55 51

Every other stat contains those queries except the top users.

@github-actions
Copy link

github-actions bot commented Dec 14, 2021

Unit Test Results

     45 files  +  1       45 suites  +1   45m 1s ⏱️ + 2m 45s
   650 tests +56     598 ✔️ +56  52 💤 ±0  0 ±0 
1 472 runs  +88  1 398 ✔️ +88  74 💤 ±0  0 ±0 

Results for commit 7045348. ± Comparison against base commit 22cef5f.

♻️ This comment has been updated with latest results.

@treff7es treff7es force-pushed the ignore_emails_from_top_users branch 2 times, most recently from 580b599 to 7045348 Compare December 17, 2021 07:32
@@ -112,6 +116,7 @@ def make_usage_workunit(

class BaseUsageConfig(BaseTimeWindowConfig):
top_n_queries: pydantic.PositiveInt = 10
ignore_user_emails_from_usage: Optional[List[str]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this an AllowDenyPattern type to be consistent with rest of the pattern based filters we have?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I changed it.

@@ -27,6 +27,7 @@
class GenericAggregatedDataset(Generic[ResourceType]):
bucket_start_time: datetime
resource: ResourceType
ignore_user_emails: Optional[List[str]] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change it to an AllowDenyPattern type.

@@ -0,0 +1,174 @@
from datetime import datetime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this added by mistake? Can we merge this with test_usage_common.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it was a mistake. I removed it.

self.userFreq[user_email] += 1
if not self.ignore_user_emails or user_email not in self.ignore_user_emails:
self.userFreq[user_email] += 1

if query:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we not dropping the query made by an ignored user? Should we not just drop the entry altogether, if we don't want the data from the user pollute the usage stats?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add in here, when we first discussed this feature we agreed to not drop them from totals as it was a heavier touch operation. I think we have enough signal to know that it would make sense to drop these users from sum total as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the logic to skip these users from every count.

@github-actions
Copy link

github-actions bot commented Jan 28, 2022

Unit Test Results (build & test)

  66 files  ±0    66 suites  ±0   10m 52s ⏱️ - 2m 1s
581 tests ±0  522 ✔️ ±0  59 💤 ±0  0 ±0 

Results for commit 5748cb9. ± Comparison against base commit ff367c0.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Jan 28, 2022

Unit Test Results (metadata ingestion)

       5 files  ±  0         5 suites  ±0   39m 10s ⏱️ -51s
   279 tests +  2     279 ✔️ +  2    0 💤 ±0  0 ±0 
1 284 runs  +10  1 255 ✔️ +10  29 💤 ±0  0 ±0 

Results for commit 5748cb9. ± Comparison against base commit ff367c0.

♻️ This comment has been updated with latest results.

@treff7es treff7es closed this Jan 28, 2022
@treff7es treff7es reopened this Jan 28, 2022
@@ -27,6 +28,7 @@
class GenericAggregatedDataset(Generic[ResourceType]):
bucket_start_time: datetime
resource: ResourceType
ignore_user_emails: AllowDenyPattern = AllowDenyPattern.allow_all()
Copy link
Contributor

@rslanka rslanka Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ignore_user_emails/user_email_pattern/. The deny list will imply ignore.

  • Corresponding documentation updates.

test_email2 = "[email protected]"
test_query = "select * from test"
test_query2 = "select * from test2"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove unnecessary new lines here.

def test_add_one_query_with_ignored_user():
test_email = "[email protected]"
test_query = "select * from test"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove unnecessary new lines here.

| `start_time` | | Last full day in UTC (or hour, depending on `bucket_duration`) | Earliest date of usage logs to consider. |
| `end_time` | | Last full day in UTC (or hour, depending on `bucket_duration`) | Latest date of usage logs to consider. |
| `top_n_queries` | | `10` | Number of top queries to save to each table. |
| `ignore_user_emails_from_usage.allow` | | | List of regex patterns for user emails to include in usage. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ignore_user_emails_from_usage/user_email_pattern.

AggregatedDataset(
bucket_start_time=floored_ts,
resource=resource,
ignore_user_emails=self.config.ignore_user_emails_from_usage,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't these gotten renamed?

AggregatedDataset(
bucket_start_time=floored_ts,
resource=resource,
ignore_user_emails=self.config.ignore_user_emails_from_usage,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't these gotten renamed?

AggregatedDataset(
bucket_start_time=floored_ts,
resource=resource,
ignore_user_emails=self.config.ignore_user_emails_from_usage,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't these gotten renamed?

bucket_start_time=floored_ts, resource=resource
bucket_start_time=floored_ts,
resource=resource,
ignore_user_emails=self.config.ignore_user_emails_from_usage,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't these gotten renamed?

Copy link
Contributor

@rslanka rslanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@treff7es treff7es force-pushed the ignore_emails_from_top_users branch from 9e4b166 to 5748cb9 Compare January 31, 2022 22:31
Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shirshanka shirshanka merged commit 6871122 into datahub-project:master Feb 1, 2022
@treff7es treff7es deleted the ignore_emails_from_top_users branch February 8, 2023 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants