Fixes #3656 - Top sites priority change #3690
Conversation
Commits b570c47 to 0ef8b4a
@@ -0,0 +1,156 @@
#!/usr/bin/python
This class is based on the code from the original topsites.py
r? @karlcow
Thanks @ksy36
Big update! Impressive.
I'm a bit concerned by the amount of mocking we do for testing the DB. It's inherited code, so it would probably be good to open a new issue on this for the future; that could help us test some weird scenarios.
There's also probably an optimization to be done on the queries: a for loop is used where only one result will be returned.
Requesting changes just to get the discussion rolling and hear your thoughts.
webcompat/webhooks/helpers.py
Outdated
# No host_name in DB, find less-level domain (>2)
# If host_name is lv4.lv3.example.com, find lv3.example.com/example.com
I wonder why we left host_name instead of hostname in there. Left over from a previous version?
Yeah, I've left this part untouched from the previous version. Updated now though :)
webcompat/webhooks/helpers.py
Outdated
for site in regional_site_db.query(SiteRegional).filter_by(url=hostname):  # noqa
    return f'priority-{priorities[site.priority - 1]}'
I don't remember why the code is using a for loop. Maybe an opportunity to fix it.
Do we expect regional_site_db.query(SiteRegional).filter_by(url=hostname) to return more than one domain? Even if there was more than one domain, only the first hit would be taken into consideration, because of the return.
Basically the code is currently the equivalent of:

sites = ['A', 'B', 'C', 'A2', 'C']

def extract(sites):
    for site in sites:
        if site.startswith('A'):
            return site

which would always return 'A'.
Should we just have something like this instead of the for loop:

site = regional_site_db.query(SiteRegional).filter_by(url=hostname).first()
if not site:
    site = global_site_db.query(SiteGlobal).filter_by(url=hostname).first()
if site:
    return f'priority-{priorities[site.priority - 1]}'
# … and so on for the subdomains

.first() returns None if the row doesn't exist.
or something else.
Maybe there is a function to extract here, because the code repeats itself in a loop.
Or maybe I totally misunderstood the code 🍭 It's been a long time since I touched this part.
The domain names are unique, yes, so using .first() will indeed work. Thanks for the suggestion, I've changed the function.
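As a minimal illustration of the .first() semantics discussed above (a pure-Python analogue for demonstration, not SQLAlchemy itself):

```python
def first(iterable):
    """Pure-Python analogue of SQLAlchemy's Query.first():
    return the first item, or None when there are no results."""
    return next(iter(iterable), None)

print(first(['example.com', 'example.org']))  # -> example.com
print(first([]))                              # -> None
```

This is why the `if not site:` fallback to the global DB works: an empty result set yields None rather than raising.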
webcompat/webhooks/helpers.py
Outdated
for site in regional_site_db.query(SiteRegional).filter_by(url=domain):  # noqa
    return f'priority-{priorities[site.priority - 1]}'

for site in global_site_db.query(SiteGlobal).filter_by(url=domain):
The fact that we do it here again makes me think there is an opportunity for a function to reduce it and test it.
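A sketch of the extraction the reviewer suggests: one helper that tries the regional DB first, then the global DB. The function name and the lookup callables here are hypothetical stand-ins for the query code in webcompat/webhooks/helpers.py.

```python
from types import SimpleNamespace

def priority_label(domain, regional_lookup, global_lookup, priorities):
    """Hypothetical helper: check the regional DB first, then the
    global DB, and return a priority label or None if the domain is
    in neither."""
    site = regional_lookup(domain) or global_lookup(domain)
    if site is None:
        return None
    return f'priority-{priorities[site.priority - 1]}'

# Illustrative stand-in data; real code would pass the .first() queries.
priorities = ['critical', 'important', 'normal']
regional = {'example.com': SimpleNamespace(priority=1)}
global_db = {'example.org': SimpleNamespace(priority=3)}

print(priority_label('example.com', regional.get, global_db.get, priorities))
# -> priority-critical
print(priority_label('example.org', regional.get, global_db.get, priorities))
# -> priority-normal
```

A single helper like this would also make the subdomain walk a plain loop over candidate domains, each call being independently testable.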
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.

"""Tests for Siterank class."""
For the future, maybe there should be two mock DBs covering what we want instead of mocking, or a dictionary of data in the setup that creates the mock DB. Something to think about. Nothing to do now.
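The reviewer's idea could look roughly like this: a throwaway in-memory DB built in the test setup from a dictionary of known rows, instead of mocking query objects. The table and column names below are illustrative, not the real schema, and stdlib sqlite3 stands in for the project's actual DB layer.

```python
import sqlite3

def make_test_db(rows):
    """Build a disposable in-memory DB seeded with known rows,
    so tests query real SQL instead of mocks."""
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE TABLE topsites (url TEXT PRIMARY KEY, priority INTEGER)')
    conn.executemany('INSERT INTO topsites VALUES (?, ?)', rows)
    return conn

db = make_test_db([('example.com', 1), ('example.org', 2)])
row = db.execute(
    'SELECT priority FROM topsites WHERE url = ?', ('example.com',)
).fetchone()
print(row)  # -> (1,)
```

Seeding from a plain list of tuples keeps each test's fixture data visible right next to its assertions.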
Yeah, good idea, I'll file an issue for that.
REGIONS = ['US', 'FR', 'IN', 'DE', 'TW', 'ID', 'HK', 'SG', 'PL',
           'GB', 'RU']
Another thought: this was initially created like this, I guess, because of priority markets for Firefox, but we can imagine that other browsers could have a desire to extend this. Nothing to do now.
tools/fetch_topsites.py
Outdated
if args.retrieve_regional:
    print('Warning: Alexa APIs will be deprecated on December 15, 2022.')
Should the code have a failsafe feature? Either based on the date ("Hey, it's past December 15, 2022. We will ignore the request.") or based on the request to Alexa failing.
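A minimal sketch of the date-based failsafe the reviewer describes (function and constant names are hypothetical, not from the actual patch):

```python
from datetime import date

# Deprecation date quoted in the script's warning message.
ALEXA_SHUTDOWN = date(2022, 12, 15)

def alexa_available(today=None):
    """Hypothetical failsafe: skip the regional fetch once the
    Alexa API deprecation date has passed."""
    today = today or date.today()
    return today < ALEXA_SHUTDOWN

print(alexa_available(date(2022, 12, 1)))  # -> True
print(alexa_available(date(2023, 1, 1)))   # -> False
```

In practice this would gate the --retrieve-regional branch, combined with handling a failed Alexa request as the second line of defense.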
Thanks, I've added the check 👍
Could you please take another look @karlcow? I've refactored
Thanks a lot @ksy36 for addressing the changes. 🔝 🥳
Thanks!!
The new script fetches results from both Tranco (global) and Alexa (regional) and saves them in two DBs. The regional DB only includes domains whose priority is higher than in the global DB, so it is much smaller. I decided to store the rankings in two DBs because once Alexa's API is deprecated, we can still update the global DB while preserving the regional one for some time.
The script can be run in this way:
python3 ./tools/fetch_topsites.py --retrieve-regional --ats_access_key=<access_key> --ats_secret_key=<secret_key>
and works as follows: the Alexa data is retrieved only when the --retrieve-regional parameter is passed, and the results are saved in the "regional sites" DB.