🔨 Speedup anomalist #3517

Marigold · 2024-11-08T07:29:34Z

Speed up loading of UN SDG and GHE datasets. Surprisingly, the bottleneck was loading metadata from dictionary with dataclasses_json. This PR replaces it by a custom function that is ~10x faster. This should speed up dataset loading in general, not just for anomalist. It still takes a long time (~1min?) to load these huge datasets (with ~10k variables), but it's manageable when it's run by owidbot.

The new from_dict function also raises errors when the type is invalid. This uncovered an inconsistency in our of our old datasets.

It looks like others have noticed slowness of dataclasses-json too. I'll check what makes it so slow on our use case.

Also fixes issues with multidimensional datasets and filters.

Reduce number of GP anomalies - only save anomalies with >= 3 points and p-value < 0.1. That should get rid of useless anomalies like these from MPI.

owidbot · 2024-11-08T07:35:08Z

Quick links (staging server):

Site	Admin	Wizard	Docs

Login: ssh owid@staging-site-speedup-anomalist

chart-diff: ✅

No charts for review.

data-diff: ❌ Found differences

~ Dataset garden/health/2023-05-04/global_wellbeing
-   -     publication_year: 2020"
    ?                           -
+   +     publication_year: 2020
  = Table global_wellbeing
  = Table global_wellbeing_index


Legend: +New  ~Modified  -Removed  =Identical  Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet

Edited: 2024-11-11 12:45:34 UTC
Execution time: 13.77 seconds

pabloarosado

Thanks @Marigold, I couldn't follow the logic of some parts of the code (maybe it would be good to add a few comments to clarify the intentions), and I haven't tested it myself, but a speed up of 10x when loading variables? Merge this, please! 🎉

apps/anomalist/gp_detector.py

pabloarosado · 2024-11-11T10:29:12Z

apps/anomalist/gp_detector.py

@@ -160,6 +160,10 @@ def get_score_df(self, df: pd.DataFrame, variable_ids: List[int], variable_mappi
        # Convert z-score into p-value
        df_score_long["p_value"] = 2 * (1 - norm.cdf(np.abs(df_score_long["z"])))

+        # Anomalies with p-value < 0.1 are not interesting, drop them. This could be
+        # even stricter
+        df_score_long = df_score_long[df_score_long["p_value"] < 0.1]


I'm not sure if this helps much. In the end, we can always filter in the UI. The important thing would be that high scores always mean anomalous points. But if you prefer to hide p>0.1, I don't think it poses any risks of missing out important anomalies.

In the end, we can always filter in the UI

That's right, but it must feel weird for a new user to see GP "anomalies" at the top that are not anomalous at all. I'd keep them if they were useful for "exploration" mode, but I don't think they're useful at all. I'd experiment with filtering them and if we miss them in the future, we can just put them back.

lib/catalog/owid/catalog/processing_log.py

pabloarosado · 2024-11-11T10:35:11Z

lib/catalog/owid/catalog/utils.py

+                # HACK 2024-11-08: we got invalid type into one of the fields, remove it when we fix it
+                # on all staging servers
+                if "invalid literal for int() with base 10: '2020" in str(e):
+                    init_args[field_name] = int(v.replace('"', ""))


I'm not sure what's going on here. Maybe we should create an issue to remember to fix this properly?

It comes from a typo in one old metadata file. I'll remove it the next time we increase ETL_EPOCH.

Marigold · 2024-11-11T12:39:24Z

After more experimentation, I've found that adding caching to dataclasses-json would get it almost on par with our custom from_dict method. I've also tried https://github.com/cakemanny/fastclasses-json which was fast (again, on par with our custom method), but I was getting tons type hints errors from pylance.

I'll leave a comment about revisiting this and possibly using third-party libraries in the future.

github-actions bot assigned Marigold Nov 8, 2024

Marigold force-pushed the speedup-anomalist branch from a16a609 to 0bd6ae2 Compare November 8, 2024 13:03

Marigold marked this pull request as ready for review November 8, 2024 14:06

Marigold mentioned this pull request Nov 8, 2024

Fix multidimension variable load #3520

Closed

Marigold requested a review from pabloarosado November 8, 2024 15:35

pabloarosado approved these changes Nov 11, 2024

View reviewed changes

Marigold added 16 commits November 11, 2024 13:30

🔨 Speedup anomalist

de19c41

wip

0b10106

wip

e7301e2

wip

6b69120

wip

2b49da3

wip

a394712

wip

9754648

wip

a5cd632

wip

1dc96a2

wip

129d4b5

wip

a649aa9

wip

f98bd71

wip

2e579fe

wip

4c74738

wip

7b6debd

wip

8a60620

Marigold force-pushed the speedup-anomalist branch from b28dab6 to 8a60620 Compare November 11, 2024 12:31

Marigold added 2 commits November 11, 2024 13:43

wip

9f0582a

wip

2ec8412

Marigold merged commit 015d3df into master Nov 11, 2024
2 of 7 checks passed

Marigold deleted the speedup-anomalist branch November 11, 2024 12:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🔨 Speedup anomalist #3517

🔨 Speedup anomalist #3517

Marigold commented Nov 8, 2024 •

edited

Loading

owidbot commented Nov 8, 2024 •

edited

Loading

pabloarosado left a comment

pabloarosado Nov 11, 2024

Marigold Nov 11, 2024

pabloarosado Nov 11, 2024

Marigold Nov 11, 2024

Marigold commented Nov 11, 2024

🔨 Speedup anomalist #3517

🔨 Speedup anomalist #3517

Conversation

Marigold commented Nov 8, 2024 • edited Loading

owidbot commented Nov 8, 2024 • edited Loading

pabloarosado left a comment

Choose a reason for hiding this comment

pabloarosado Nov 11, 2024

Choose a reason for hiding this comment

Marigold Nov 11, 2024

Choose a reason for hiding this comment

pabloarosado Nov 11, 2024

Choose a reason for hiding this comment

Marigold Nov 11, 2024

Choose a reason for hiding this comment

Marigold commented Nov 11, 2024

Marigold commented Nov 8, 2024 •

edited

Loading

owidbot commented Nov 8, 2024 •

edited

Loading