
Introduce event aggregator feature #87

Open · gbossert wants to merge 2 commits into main from feat/event_aggregator

Conversation
gbossert (Contributor) commented:

This PR introduces the event aggregator feature in the SDK.
Some work must still be done to find a way to transparently integrate it into the connectors' logic without a major performance impact.

@gbossert gbossert requested a review from Darkheir October 30, 2023 22:51
@gbossert gbossert self-assigned this Oct 30, 2023
github-actions bot commented Oct 30, 2023

Test Results

168 tests (+6) · 167 ✔️ passed (+6) · 1 💤 skipped (±0) · 0 failed (±0)
1 suite (±0) · 1 file (±0) · runtime 1m 33s ⏱️ (-2s)

Results for commit dde0103. ± Comparison against base commit aa1101d.


@Darkheir (Collaborator) left a comment:

Nice work, we will indeed need to find a way to integrate it into the connectors code so we can benefit from those aggregation capabilities.

Comment on lines 142 to 144:

```python
xxhash.xxh64(
    f"{event_dialect_uuid};{fingerprint.build_hash_str_func(event)}"
).hexdigest(),
```

Darkheir (Collaborator) suggested change:

```python
xxhash.xxh3_64_hexdigest(
    f"{event_dialect_uuid};{fingerprint.build_hash_str_func(event)}"
),
```

gbossert (Contributor, Author): fixed

```python
import time
from typing import Callable, Tuple

import xxhash as xxhash
```

Darkheir (Collaborator) suggested change:

```python
import xxhash
```

gbossert (Contributor, Author):

PyCharm, you are drunk. Thank you @Darkheir

gbossert (Contributor, Author): fixed

"""
Returns the hash to fingerprint the specified event and its ttl
"""
event_dialect_uuid = event["sekoiaio"]["intake"]["dialect_uuid"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this attribute is set in the events. The workflow sets it, but in the connectors we usually only have the intake key.
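If the attribute may indeed be absent, one defensive variant is to walk the nested keys with `dict.get` (an illustrative sketch only; `get_dialect_uuid` is a hypothetical helper, not SDK code):

```python
def get_dialect_uuid(event: dict) -> "str | None":
    # Returns None instead of raising KeyError when any level is missing
    return event.get("sekoiaio", {}).get("intake", {}).get("dialect_uuid")


get_dialect_uuid({"sekoiaio": {"intake": {"dialect_uuid": "abc"}}})  # "abc"
get_dialect_uuid({})  # None
```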

gbossert (Contributor, Author):

I removed the need to make the dialect explicit, since an aggregation engine will be configured for each connector with its specific dialects.

Comment on lines 13 to 14:

```python
condition_func: Callable[[dict], bool]
build_hash_str_func: Callable[[dict], str]
```

Darkheir (Collaborator):

Maybe we could have a single function that, if it returns None, will omit aggregation?

The idea behind it is to avoid iterating over the events twice: once to check whether an event can be aggregated, and a second time to calculate the fingerprint.

Suggested change:

```python
build_hash_str_func: Callable[[dict], str | None]
```
gbossert (Contributor, Author): applied
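The applied suggestion could look roughly like this (a minimal sketch; the `Fingerprint` field name is taken from the snippets in this thread, the login-failure rule is a made-up example, not the actual SDK code):

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Fingerprint:
    # Returns the string to hash for an aggregatable event,
    # or None when the event should bypass aggregation entirely.
    build_hash_str_func: Callable[[dict], Optional[str]]


# Hypothetical rule: only aggregate failed logins, keyed by user
login_failures = Fingerprint(
    build_hash_str_func=lambda e: e["user"]
    if e.get("action") == "login-failure"
    else None
)

login_failures.build_hash_str_func({"action": "login-failure", "user": "bob"})  # "bob"
login_failures.build_hash_str_func({"action": "login-success", "user": "bob"})  # None
```

A single callable keeps the "should we aggregate?" check and the fingerprint computation in one pass over each event, which was the point of the suggestion.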


```python
class EventAggregatorTTLThread(threading.Thread):
    f_must_stop: bool  # thread will stop if this flag is active
    ttl: int  # ttl value
```

Darkheir (Collaborator) suggested removing the line:

```python
ttl: int  # ttl value
```
gbossert (Contributor, Author): fixed

Comment on lines 27 to 28:

```python
aggregated_event = copy.deepcopy(self.event)
aggregated_event["sekoiaio"]["repeat"] = {
```

Darkheir (Collaborator) suggested change:

```python
aggregated_event = copy.deepcopy(self.event)
aggregated_event.setdefault("sekoiaio", {})
aggregated_event["sekoiaio"]["repeat"] = {
```

gbossert (Contributor, Author): fixed

```python
def __init__(self, aggregation_definitions: dict[str, list[Fingerprint]]):
    self.aggregation_definitions = aggregation_definitions
    self.aggregations: dict[str, Aggregation] = dict()
    self.lock = threading.Lock()
```

Darkheir (Collaborator):

Improvement that could be added later: if the lock becomes a bottleneck, a common solution is to use shards to avoid lock contention.

Based on its fingerprint, each Aggregation would be stored in one of n shards. Accessing or writing a key would then only require locking one shard, allowing other threads to read and write in the other shards at the same time.
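The sharding idea can be sketched like this (illustrative only; `ShardedAggregations` and its methods are hypothetical, not part of the SDK, and the values stored here are plain counters rather than real Aggregation objects):

```python
import threading
import zlib


class ShardedAggregations:
    """Lock sharding: each fingerprint maps to one of n shards, so
    concurrent threads only contend when they hit the same shard."""

    def __init__(self, nb_shards: int = 16):
        self.nb_shards = nb_shards
        self.shards: list[dict] = [dict() for _ in range(nb_shards)]
        self.locks = [threading.Lock() for _ in range(nb_shards)]

    def _shard_index(self, fingerprint: str) -> int:
        # Any stable hash works; zlib.crc32 is deterministic across runs
        return zlib.crc32(fingerprint.encode()) % self.nb_shards

    def increment(self, fingerprint: str) -> int:
        i = self._shard_index(fingerprint)
        with self.locks[i]:  # only this shard is locked
            shard = self.shards[i]
            shard[fingerprint] = shard.get(fingerprint, 0) + 1
            return shard[fingerprint]


aggs = ShardedAggregations()
aggs.increment("event-a")  # 1
aggs.increment("event-a")  # 2
```

As noted below in the thread, this is an optimization to reach for only if the single lock actually shows up as a bottleneck.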

gbossert (Contributor, Author):

This approach looks very interesting.
The main issue that prevented me from implementing it in this initial version is the complexity that comes with the expectation that every shard must be handled by a dedicated thread, to prevent one locked shard from blocking the others.

But obviously, if you believe this current version will have a major impact on the performance of a collector, I will implement this sharding mechanism in this PR.

Darkheir (Collaborator):

I agree with you, let's not add complexity that may not even be needed.

It was just a thought I got while reviewing your code :)

codecov bot commented Nov 1, 2023

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (aa1101d) 95.83% compared to head (dde0103) 95.74%.
```
@@            Coverage Diff             @@
##             main      #87      +/-   ##
==========================================
- Coverage   95.83%   95.74%   -0.09%
==========================================
  Files          26       27       +1
  Lines        1897     1998     +101
==========================================
+ Hits         1818     1913      +95
- Misses         79       85       +6
```

| File | Coverage | Δ |
| --- | --- | --- |
| sekoia_automation/connector/event_aggregator.py | 94.05% (94.05% of changed lines) | ø |


@gbossert (Contributor, Author) commented Nov 4, 2023

Unfortunately, the connector class doesn't expose a method that gives access to a parsed version of the event; it only has access to the str version of the event.
For this reason, we cannot transparently enable aggregation.

@gbossert gbossert force-pushed the feat/event_aggregator branch from e513150 to dde0103 Compare November 4, 2023 16:39
@gbossert gbossert marked this pull request as ready for review November 4, 2023 17:03
@Darkheir (Collaborator) commented Nov 7, 2023

> Unfortunately, the connector class doesn't expose a method that gives access to a parsed version of the event. It only has access to the str version of the event. For this reason, we cannot transparently enable aggregation.

So maybe we could expose an aggregate method (or any other name) that would allow performing aggregations.
It would take a list of events (dicts) and return the events to send.
If an event belongs to an already existing aggregation bucket, it will not be in the returned data.
We could add the events from the expired TTL buckets to the returned list.
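A deliberately simplified sketch of what that proposed API might look like (all names are hypothetical; this version just swallows repeats inside an open TTL bucket instead of also emitting aggregated repeat events on expiry):

```python
import time


class EventAggregator:
    """Sketch of the proposed aggregate API: takes parsed events (dicts),
    returns the ones to send now; repeats within the TTL are swallowed."""

    def __init__(self, build_hash_str_func, ttl: float = 60.0):
        self.build_hash_str_func = build_hash_str_func
        self.ttl = ttl
        self.buckets: dict[str, float] = {}  # fingerprint -> first-seen time

    def aggregate(self, events: "list[dict]") -> "list[dict]":
        to_send: list = []
        now = time.monotonic()
        # Drop buckets whose TTL has expired so their key can aggregate again
        self.buckets = {k: t for k, t in self.buckets.items() if now - t < self.ttl}
        for event in events:
            key = self.build_hash_str_func(event)
            if key is None:  # event is not aggregatable: always send
                to_send.append(event)
            elif key not in self.buckets:  # first occurrence: send, open a bucket
                self.buckets[key] = now
                to_send.append(event)
            # else: repeat inside an open bucket, swallowed
        return to_send


agg = EventAggregator(lambda e: e.get("msg"), ttl=60.0)
out = agg.aggregate([{"msg": "a"}, {"msg": "a"}, {"msg": "b"}])  # sends a, b once each
```

The connector would call this on each batch before pushing events to the intake, which matches the batch-in, batch-out shape proposed above.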
