Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kedro-telemetry: Spike to reduce redundant telemetry events #730

Closed
DimedS opened this issue Jun 14, 2024 · 5 comments · Fixed by #766
Closed

kedro-telemetry: Spike to reduce redundant telemetry events #730

DimedS opened this issue Jun 14, 2024 · 5 comments · Fixed by #766
Assignees

Comments

@DimedS
Copy link
Member

DimedS commented Jun 14, 2024

Description

Currently, when running a kedro new command with telemetry enabled, we send mostly the same data using _send_heap_event() 3! times. This seems excessive, especially as we move to opt-out telemetry with #715, which will definitely increase traffic.

The reason for this redundancy is that we send two identical events after each kedro command in the before_command_run hook:

            _send_heap_event(
                event_name=f"Command run: {main_command}",
                identity=user_uuid,
                properties=cli_properties,
            )

            # send generic event too, so it's easier in data processing
            generic_properties = deepcopy(cli_properties)
            generic_properties["main_command"] = main_command
            _send_heap_event(
                event_name="CLI command",
                identity=user_uuid,
                properties=generic_properties,
            )

I don't know the exact reasons for this duplication, but it seems like the last one is sufficient since it fully contains the first one.

Additionally, if after_catalog_created is triggered, we send one more event with data that was already included in the previous two events, along with additional project properties like the number of nodes and pipelines.

So proposal is to send only one piece of data per command:

  1. If after_catalog_created will be triggered after the command, send all information for that command in that hook.
  2. If not, send all information about that command in the before_command_run hook.

We need to gather requirements about what exactly is needed on the HEAP/ snowflake(?) side.

Additionally, after revising, please address the following:

  1. Update the notification accordingly, as mentioned here.
  2. Review the try-except blocks to ensure they are not excessive, as mentioned here.
@merelcht merelcht changed the title kedro-telemetry: reduce redundant telemetry events kedro-telemetry: Spike to reduce redundant telemetry events Jun 17, 2024
@merelcht merelcht added this to the Telemetry opt-out milestone Jul 3, 2024
@merelcht merelcht moved this to To Do in Kedro Framework Jul 8, 2024
@noklam noklam moved this from To Do to In Progress in Kedro Framework Jul 11, 2024
@noklam
Copy link
Contributor

noklam commented Jul 11, 2024

I have some idea, which may depends on

There are 2 event sent originally.

  1. run command specific
  2. generic event

The 3 was added in after_catalog_created for project statistic and this will be triggered for both CLI/ non-CLI run.

  1. was created mainly because it's easier to analyse the result directly in HEAP. Processing 2 requires some extra data cleaning and is harder to do it on HEAP. As mentioned, 2. is the superset so we can always retrieve those info from Snowflake, but we need to understand what's the impact of removing 1. event.

Is the performance critical? Especially if we can send the event only at the end, I don't think there will be a strong performance hit (I haven't seen any complains about this before).

One thing that I am certain that helps is combine the CLIHook and regualr Hook as one single class. In general we enrich the event and only send it at the end. We also have to consider different entrypoint: kedro run, package project and session.run in a notebook etc.

@noklam
Copy link
Contributor

noklam commented Jul 11, 2024

image

Most of the data are duplicated, the main different is that it parsed the kedro run **** to main command: run. There are some analytic depends on this.

@noklam
Copy link
Contributor

noklam commented Jul 11, 2024

Maybe we can create custom HEAP property to replace this?
image

@noklam
Copy link
Contributor

noklam commented Jul 11, 2024

After checking the data, the first 2 events are almost identical. The only difference is that the 1st get a differnet "event_name" so it could have some ergonomic benefit on HEAP specifically. I go through the charts quickly and couldn't find any usage of it, so I think it's safe to remove.

@noklam
Copy link
Contributor

noklam commented Jul 11, 2024

FYI: https://kedro-org.slack.com/archives/C054U0LJLAC/p1720714959292789

@noklam noklam moved this from In Progress to In Review in Kedro Framework Jul 15, 2024
@github-project-automation github-project-automation bot moved this from In Review to Done in Kedro Framework Jul 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants