Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🔨 Speedup anomalist #3517

Merged
merged 18 commits into from
Nov 11, 2024
Merged

🔨 Speedup anomalist #3517

merged 18 commits into from
Nov 11, 2024

Conversation

Marigold
Copy link
Collaborator

@Marigold Marigold commented Nov 8, 2024

Speed up loading of UN SDG and GHE datasets. Surprisingly, the bottleneck was loading metadata from dictionary with dataclasses_json. This PR replaces it by a custom function that is ~10x faster. This should speed up dataset loading in general, not just for anomalist. It still takes a long time (~1min?) to load these huge datasets (with ~10k variables), but it's manageable when it's run by owidbot.

The new from_dict function also raises errors when the type is invalid. This uncovered an inconsistency in our of our old datasets.

It looks like others have noticed slowness of dataclasses-json too. I'll check what makes it so slow on our use case.

Also fixes issues with multidimensional datasets and filters.

Reduce number of GP anomalies - only save anomalies with >= 3 points and p-value < 0.1. That should get rid of useless anomalies like these from MPI.

@owidbot
Copy link
Contributor

owidbot commented Nov 8, 2024

Quick links (staging server):

Site Admin Wizard Docs

Login: ssh owid@staging-site-speedup-anomalist

chart-diff: ✅ No charts for review.
data-diff: ❌ Found differences
~ Dataset garden/health/2023-05-04/global_wellbeing
-   -     publication_year: 2020"
    ?                           -
+   +     publication_year: 2020
  = Table global_wellbeing
  = Table global_wellbeing_index


Legend: +New  ~Modified  -Removed  =Identical  Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet

Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included

Edited: 2024-11-11 12:45:34 UTC
Execution time: 13.77 seconds

@Marigold Marigold marked this pull request as ready for review November 8, 2024 14:06
@Marigold Marigold requested a review from pabloarosado November 8, 2024 15:35
Copy link
Contributor

@pabloarosado pabloarosado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Marigold, I couldn't follow the logic of some parts of the code (maybe it would be good to add a few comments to clarify the intentions), and I haven't tested it myself, but a speed up of 10x when loading variables? Merge this, please! 🎉

apps/anomalist/gp_detector.py Show resolved Hide resolved
@@ -160,6 +160,10 @@ def get_score_df(self, df: pd.DataFrame, variable_ids: List[int], variable_mappi
# Convert z-score into p-value
df_score_long["p_value"] = 2 * (1 - norm.cdf(np.abs(df_score_long["z"])))

# Anomalies with p-value < 0.1 are not interesting, drop them. This could be
# even stricter
df_score_long = df_score_long[df_score_long["p_value"] < 0.1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this helps much. In the end, we can always filter in the UI. The important thing would be that high scores always mean anomalous points. But if you prefer to hide p>0.1, I don't think it poses any risks of missing out important anomalies.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end, we can always filter in the UI

That's right, but it must feel weird for a new user to see GP "anomalies" at the top that are not anomalous at all. I'd keep them if they were useful for "exploration" mode, but I don't think they're useful at all. I'd experiment with filtering them and if we miss them in the future, we can just put them back.

lib/catalog/owid/catalog/processing_log.py Show resolved Hide resolved
# HACK 2024-11-08: we got invalid type into one of the fields, remove it when we fix it
# on all staging servers
if "invalid literal for int() with base 10: '2020" in str(e):
init_args[field_name] = int(v.replace('"', ""))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what's going on here. Maybe we should create an issue to remember to fix this properly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It comes from a typo in one old metadata file. I'll remove it the next time we increase ETL_EPOCH.

@Marigold
Copy link
Collaborator Author

After more experimentation, I've found that adding caching to dataclasses-json would get it almost on par with our custom from_dict method. I've also tried https://github.com/cakemanny/fastclasses-json which was fast (again, on par with our custom method), but I was getting tons type hints errors from pylance.

I'll leave a comment about revisiting this and possibly using third-party libraries in the future.

@Marigold Marigold merged commit 015d3df into master Nov 11, 2024
2 of 7 checks passed
@Marigold Marigold deleted the speedup-anomalist branch November 11, 2024 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants