
Fixes #3656 - Top sites priority change #3690

Merged: 1 commit merged into main from issue/3656/1 on May 2, 2022

Conversation

ksy36 (Contributor) commented Apr 18, 2022

The new script fetches results from both Tranco (global) and Alexa (regional) and saves the results in two DBs. The regional DB only includes domains whose priority is higher than in the global DB, so it's much smaller. I decided to store the rankings in two DBs because once Alexa's API is deprecated, we can keep updating the global DB and preserve the regional one for some time.

The script can be run this way:
python3 ./tools/fetch_topsites.py --retrieve-regional --ats_access_key=<access_key> --ats_secret_key=<secret_key>

and works as follows (a rough sketch of the flow follows the list):

  1. Gets the top 10k CSV from Tranco and saves it to the "global sites" DB and cache.
  2. Optionally fetches the top 100 sites from the specified regions if the --retrieve-regional parameter is passed and saves the results in the "regional sites" DB.
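For readers skimming the description, here is a minimal sketch of that flow. It is not the actual tools/fetch_topsites.py: the Tranco download URL, the top-10k handling, and the save_to_db() placeholder are assumptions used purely for illustration.

#!/usr/bin/env python3
"""Illustrative sketch of the fetch flow described above (not the real script)."""
import argparse
import csv
import io
import zipfile
from itertools import islice

import requests

TRANCO_URL = 'https://tranco-list.eu/top-1m.csv.zip'  # assumed download endpoint
TOP_N = 10000


def fetch_tranco_top(n=TOP_N):
    """Download the Tranco list and return the first n (rank, domain) rows."""
    resp = requests.get(TRANCO_URL, timeout=60)
    resp.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(resp.content)) as archive:
        with archive.open(archive.namelist()[0]) as fh:
            rows = csv.reader(io.TextIOWrapper(fh, encoding='utf-8'))
            return [(int(rank), domain) for rank, domain in islice(rows, n)]


def main():
    parser = argparse.ArgumentParser(description='Fetch top sites rankings.')
    parser.add_argument('--retrieve-regional', action='store_true')
    parser.add_argument('--ats_access_key')
    parser.add_argument('--ats_secret_key')
    args = parser.parse_args()

    global_rows = fetch_tranco_top()
    print(f'Fetched {len(global_rows)} global rankings')
    # save_to_db() is a placeholder for writing rows into the "global sites" DB
    # and refreshing the cache; the real script does this differently.
    # save_to_db('global_sites', global_rows)

    if args.retrieve_regional:
        # The real script queries the Alexa Top Sites API per region using
        # args.ats_access_key / args.ats_secret_key; omitted in this sketch.
        pass


if __name__ == '__main__':
    main()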

ksy36 force-pushed the issue/3656/1 branch 2 times, most recently from b570c47 to 0ef8b4a on April 21, 2022 at 22:32
@@ -0,0 +1,156 @@
#!/usr/bin/python
ksy36 (Contributor, Author):

This class is based on the code from the original topsites.py

ksy36 marked this pull request as ready for review on April 21, 2022 at 23:07
ksy36 requested a review from karlcow on April 21, 2022 at 23:07
ksy36 (Contributor, Author) commented Apr 21, 2022

r? @karlcow

karlcow (Member) left a comment

Thanks @ksy36
Big update! Impressive.

I'm a bit concerned by the amount of mocking we do for testing the DB. It's inherited code, so it would probably be good to open a new issue on this for the future. That could help us test some weird scenarios.

There's also probably an optimization worth doing: the queries run in a for loop that will only ever return one result.

Requesting changes just to get the discussion rolling and see your thoughts.

Comment on lines 118 to 119
# No host_name in DB, find less-level domain (>2)
# If host_name is lv4.lv3.example.com, find lv3.example.com/example.com
karlcow (Member):

I wonder why we left host_name instead of hostname in there. Left over from a previous version?

ksy36 (Contributor, Author):

Yeah, I've left this part untouched from the previous version. Updated now though :)
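To make the quoted comment concrete, here is a rough illustration of the kind of subdomain fallback it describes; the helper name is made up and this is not the code in the PR.

def candidate_domains(hostname):
    """Yield progressively less specific domains for a hostname.

    For 'lv4.lv3.example.com' this yields 'lv4.lv3.example.com',
    'lv3.example.com' and 'example.com'. Purely illustrative; a real
    implementation would likely rely on the public suffix list rather
    than assuming the registrable domain has two labels.
    """
    parts = hostname.split('.')
    for i in range(len(parts) - 1):
        yield '.'.join(parts[i:])


# list(candidate_domains('lv4.lv3.example.com'))
# -> ['lv4.lv3.example.com', 'lv3.example.com', 'example.com']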

Comment on lines 113 to 114
for site in regional_site_db.query(SiteRegional).filter_by(url=hostname): # noqa
return f'priority-{priorities[site.priority - 1]}'
karlcow (Member):

I don't remember why the code is using a for loop. Maybe an opportunity to fix it.

Do we expect regional_site_db.query(SiteRegional).filter_by(url=hostname) to return more than one domain? Even if there was more than one, only the first hit would be taken into consideration because of the return.

Basically, the code is currently the equivalent of:

sites = ['A', 'B', 'C', 'A2', 'C']
def extract(sites):
    for site in sites:
        if site.startswith('A'):
            return site

which would return 'A' always.

Should we just have something like:

Suggested change:
- for site in regional_site_db.query(SiteRegional).filter_by(url=hostname):  # noqa
-     return f'priority-{priorities[site.priority - 1]}'
+ site = regional_site_db.query(SiteRegional).filter_by(url=hostname).first()
+ if not site:
+     site = global_site_db.query(SiteGlobal).filter_by(url=hostname).first()
+ if site:
+     return f'priority-{priorities[site.priority - 1]}'
+ # … and so on for the subdomains.

.first() returns None if the row doesn't exist.

Or something else.

Maybe there is a function to extract here, because the code repeats the same pattern in a loop.

Or maybe I totally misunderstood the code 🍭 It's been a long time since I touched this part.

ksy36 (Contributor, Author) commented Apr 26, 2022:

The domain names are unique, yes, so using .first() will indeed work. Thanks for the suggestion, I've changed the function.

Comment on lines 125 to 128
for site in regional_site_db.query(SiteRegional).filter_by(url=domain): # noqa
return f'priority-{priorities[site.priority - 1]}'

for site in global_site_db.query(SiteGlobal).filter_by(url=domain):
karlcow (Member):

The fact that we do it here again makes me think there is an opportunity to extract a function, both to reduce the duplication and to make it testable.
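Something like the following sketch, which assumes the SiteRegional/SiteGlobal models and the two DB sessions from this PR are in scope; the helper name and the priority labels are assumptions, not the code that eventually landed:

def lookup_priority(hostname, regional_site_db, global_site_db):
    """Return a priority label for hostname, or None if it is not ranked.

    Illustrative only: checks the regional DB first, then the global one,
    mirroring the duplicated query pattern quoted above.
    """
    priorities = ['critical', 'important', 'normal']  # assumed label names

    site = regional_site_db.query(SiteRegional).filter_by(url=hostname).first()
    if not site:
        site = global_site_db.query(SiteGlobal).filter_by(url=hostname).first()
    if site:
        return f'priority-{priorities[site.priority - 1]}'
    return None

The caller could then apply the same helper to each candidate domain when walking up the subdomain hierarchy, instead of repeating the queries.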

# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.

"""Tests for Siterank class."""
karlcow (Member):

For the future, maybe there should be two test DBs covering what we want instead of mocking, or a dictionary of data in the setup that creates the test DB. Something to think about. Nothing to do now.

ksy36 (Contributor, Author):

Yeah, good idea, I'll file an issue for that.
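As a rough sketch of what that issue could propose, assuming the SiteGlobal/SiteRegional models share a declarative Base in webcompat.db and accept url/priority keyword arguments (all of which are assumptions, not the test code in this PR), a pytest fixture with two in-memory SQLite databases might look like this:

import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from webcompat.db import Base, SiteGlobal, SiteRegional  # assumed import path


@pytest.fixture
def site_dbs():
    """Provide two throwaway in-memory DB sessions seeded with a few rows."""
    def make_session():
        engine = create_engine('sqlite:///:memory:')
        Base.metadata.create_all(engine)
        return sessionmaker(bind=engine)()

    global_db, regional_db = make_session(), make_session()
    # Seed data is made up; the real models may take more columns.
    global_db.add(SiteGlobal(url='example.com', priority=1))
    regional_db.add(SiteRegional(url='example.fr', priority=1))
    global_db.commit()
    regional_db.commit()
    yield global_db, regional_db
    global_db.close()
    regional_db.close()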

Comment on lines +23 to +25
REGIONS = ['US', 'FR', 'IN', 'DE', 'TW', 'ID', 'HK', 'SG', 'PL',
'GB', 'RU']
karlcow (Member):

Another thought: I guess this was initially created like this because of the priority markets for Firefox, but we can imagine that other browsers could want to extend this. Nothing to do now.

Comment on lines 50 to 51
if args.retrieve_regional:
print('Warning: Alexa APIs will be deprecated on December 15, 2022.')
karlcow (Member):

Should the code have a failsafe feature? Either based on the date ("Hey, it's past December 15, 2022. We will ignore the request.") or based on the request to Alexa failing.

ksy36 (Contributor, Author):

Thanks, I've added the check 👍
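For illustration, a date-based failsafe along the lines karlcow describes could look like this; the constant, function name and wiring into the script are assumptions, not necessarily the check that was added:

from datetime import date

ALEXA_DEPRECATION_DATE = date(2022, 12, 15)  # from the warning above


def regional_fetch_allowed(today=None):
    """Return False once the Alexa Top Sites API has been retired."""
    today = today or date.today()
    return today < ALEXA_DEPRECATION_DATE


# Usage inside the script (assuming `args` is the parsed argparse namespace):
# if args.retrieve_regional and not regional_fetch_allowed():
#     print('Alexa APIs were deprecated on December 15, 2022; skipping regional fetch.')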

ksy36 (Contributor, Author) commented Apr 26, 2022

Could you please take another look @karlcow? I've refactored extract_priority_label and added the check for the Alexa deprecation date.

karlcow (Member) left a comment

Thanks a lot @ksy36 for addressing the changes. 🔝 🥳

karlcow merged commit 9da9698 into main on May 2, 2022
karlcow deleted the issue/3656/1 branch on May 2, 2022 at 04:34
ksy36 (Contributor, Author) commented May 2, 2022

Thanks!!
