🔨 Speedup anomalist #3517
Conversation
Quick links (staging server): Login

chart-diff: ✅ No charts for review.
data-diff: ❌ Found differences
~ Dataset garden/health/2023-05-04/global_wellbeing
  - publication_year: 2020"
  + publication_year: 2020
  = Table global_wellbeing
  = Table global_wellbeing_index
Legend: + New  ~ Modified  - Removed  = Identical
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet
Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included.
Edited: 2024-11-11 12:45:34 UTC
Force-pushed from a16a609 to 0bd6ae2.
Thanks @Marigold, I couldn't follow the logic of some parts of the code (maybe it would be good to add a few comments to clarify the intentions), and I haven't tested it myself, but a speed up of 10x when loading variables? Merge this, please! 🎉
@@ -160,6 +160,10 @@ def get_score_df(self, df: pd.DataFrame, variable_ids: List[int], variable_mappi
    # Convert z-score into p-value
    df_score_long["p_value"] = 2 * (1 - norm.cdf(np.abs(df_score_long["z"])))

    # Anomalies with p-value < 0.1 are not interesting, drop them. This could be
    # even stricter
    df_score_long = df_score_long[df_score_long["p_value"] < 0.1]
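For reference, a self-contained sketch of this z-score → p-value conversion and filter; the toy values are made up, and only the column names mirror the snippet above:

```python
# Illustrative only: a standalone version of the p-value filter in the hunk above.
import numpy as np
import pandas as pd
from scipy.stats import norm

df_score_long = pd.DataFrame({"z": [0.4, 1.1, 2.6, -3.3]})

# Two-sided p-value: probability of a |z| at least this extreme under N(0, 1).
df_score_long["p_value"] = 2 * (1 - norm.cdf(np.abs(df_score_long["z"])))

# Drop points that are not significant enough to count as interesting anomalies.
df_score_long = df_score_long[df_score_long["p_value"] < 0.1]
```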
I'm not sure if this helps much. In the end, we can always filter in the UI. The important thing would be that high scores always mean anomalous points. But if you prefer to hide p > 0.1, I don't think it poses any risk of missing out on important anomalies.
> In the end, we can always filter in the UI
That's right, but it must feel weird for a new user to see GP "anomalies" at the top that are not anomalous at all. I'd keep them if they were useful for "exploration" mode, but I don't think they're useful at all. I'd experiment with filtering them and if we miss them in the future, we can just put them back.
# HACK 2024-11-08: we got invalid type into one of the fields, remove it when we fix it
# on all staging servers
if "invalid literal for int() with base 10: '2020" in str(e):
    init_args[field_name] = int(v.replace('"', ""))
I'm not sure what's going on here. Maybe we should create an issue to remember to fix this properly?
It comes from a typo in one old metadata file. I'll remove it the next time we increase ETL_EPOCH.
Force-pushed from b28dab6 to 8a60620.
After more experimentation, I've found that adding caching to dataclasses-json would get it almost on par with our custom function. I'll leave a comment about revisiting this and possibly using third-party libraries in the future.
Speed up loading of UN SDG and GHE datasets. Surprisingly, the bottleneck was loading metadata from a dictionary with dataclasses_json. This PR replaces it with a custom function that is ~10x faster. This should speed up dataset loading in general, not just for anomalist. It still takes a long time (~1 min?) to load these huge datasets (with ~10k variables), but it's manageable when it's run by owidbot.

The new from_dict function also raises errors when a type is invalid. This uncovered an inconsistency in one of our old datasets. It looks like others have noticed the slowness of dataclasses-json too; I'll check what makes it so slow for our use case.

Also fixes issues with multidimensional datasets and filters.
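For illustration, here is a minimal sketch of the general idea behind such a custom from_dict: resolve each dataclass's field types once, cache them, and build instances directly from the dict. The class Origin, the helper fast_from_dict, and the type handling are assumptions for the example, not the actual ETL implementation:

```python
# Hypothetical sketch (not the ETL code): a hand-rolled from_dict that caches
# field-type lookups per dataclass so repeated deserialization avoids
# re-resolving annotations every time.
from dataclasses import dataclass, fields, is_dataclass
from functools import lru_cache
from typing import Any, Dict, Type, TypeVar, get_type_hints

T = TypeVar("T")


@lru_cache(maxsize=None)
def _field_types(cls: type) -> Dict[str, Any]:
    # Resolving annotations is comparatively slow, so do it once per class.
    hints = get_type_hints(cls)
    return {f.name: hints[f.name] for f in fields(cls)}


def fast_from_dict(cls: Type[T], d: Dict[str, Any]) -> T:
    """Build a dataclass instance from a plain dict, recursing into nested dataclasses."""
    kwargs: Dict[str, Any] = {}
    for name, typ in _field_types(cls).items():
        if name not in d:
            continue
        value = d[name]
        if is_dataclass(typ) and isinstance(value, dict):
            value = fast_from_dict(typ, value)
        elif typ is int and not isinstance(value, int):
            # Fail loudly on bad values such as '2020"' instead of accepting them.
            value = int(value)
        kwargs[name] = value
    return cls(**kwargs)


# Example usage with a made-up metadata class.
@dataclass
class Origin:
    producer: str
    publication_year: int


origin = fast_from_dict(Origin, {"producer": "WHO", "publication_year": 2020})
```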
Reduce the number of GP anomalies: only save anomalies with >= 3 points and p-value < 0.1. That should get rid of useless anomalies like these from MPI.
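A minimal sketch of that pruning rule, assuming a long-format score table; the column names entity_name and p_value are made up for the example:

```python
# Illustrative sketch of the pruning rule, not the anomalist code.
import pandas as pd

df_score_long = pd.DataFrame(
    {
        "entity_name": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
        "p_value": [0.01, 0.04, 0.08, 0.02, 0.30, 0.01, 0.02, 0.03, 0.50],
    }
)

# Keep only significant points...
significant = df_score_long[df_score_long["p_value"] < 0.1]

# ...and drop anomalies that cover fewer than 3 such points.
counts = significant.groupby("entity_name")["p_value"].transform("size")
pruned = significant[counts >= 3]  # keeps entities A and C, drops B
```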