
Fixes #3656 - Top sites priority change #3690

Merged: 1 commit merged into main from issue/3656/1 on May 2, 2022

Conversation

ksy36 (Contributor) commented Apr 18, 2022

The new script fetches results from both Tranco (global) and Alexa (regional) and saves the results in two DBs. The regional DB only includes domains whose priority is higher than in the global DB, so it's much smaller. I decided to store the rankings in two DBs because once Alexa's API is deprecated, we can keep updating the global DB and preserve the regional one for some time.

The script can be run this way:
python3 ./tools/fetch_topsites.py --retrieve-regional --ats_access_key=<access_key> --ats_secret_key=<secret_key>

and works as follows (a rough sketch of the flow follows the list):

  1. Gets the top 10k CSV from Tranco and saves it to the "global sites" DB and cache.
  2. Optionally fetches the top 100 sites from the specified regions if the --retrieve-regional parameter is passed and saves the results in the "regional sites" DB.
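For readers skimming the description, here is a minimal sketch of that flow. It is not the actual tools/fetch_topsites.py: the Tranco download URL, the top-10k handling, and the save_to_db() placeholder are assumptions used purely for illustration.

#!/usr/bin/env python3
"""Illustrative sketch of the fetch flow described above (not the real script)."""
import argparse
import csv
import io
import zipfile
from itertools import islice

import requests

TRANCO_URL = 'https://tranco-list.eu/top-1m.csv.zip'  # assumed download endpoint
TOP_N = 10000


def fetch_tranco_top(n=TOP_N):
    """Download the Tranco list and return the first n (rank, domain) rows."""
    resp = requests.get(TRANCO_URL, timeout=60)
    resp.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(resp.content)) as archive:
        with archive.open(archive.namelist()[0]) as fh:
            rows = csv.reader(io.TextIOWrapper(fh, encoding='utf-8'))
            return [(int(rank), domain) for rank, domain in islice(rows, n)]


def main():
    parser = argparse.ArgumentParser(description='Fetch top sites rankings.')
    parser.add_argument('--retrieve-regional', action='store_true')
    parser.add_argument('--ats_access_key')
    parser.add_argument('--ats_secret_key')
    args = parser.parse_args()

    global_rows = fetch_tranco_top()
    print(f'Fetched {len(global_rows)} global rankings')
    # save_to_db() is a placeholder for writing rows into the "global sites" DB
    # and refreshing the cache; the real script does this differently.
    # save_to_db('global_sites', global_rows)

    if args.retrieve_regional:
        # The real script queries the Alexa Top Sites API per region using
        # args.ats_access_key / args.ats_secret_key; omitted in this sketch.
        pass


if __name__ == '__main__':
    main()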

ksy36 force-pushed the issue/3656/1 branch 2 times, most recently from b570c47 to 0ef8b4a on April 21, 2022 at 22:32
@@ -0,0 +1,156 @@
#!/usr/bin/python
ksy36 (Contributor, Author):

This class is based on the code from the original topsites.py

ksy36 marked this pull request as ready for review on April 21, 2022 at 23:07
ksy36 requested a review from karlcow on April 21, 2022 at 23:07
ksy36 (Contributor, Author) commented Apr 21, 2022

r? @karlcow

karlcow (Member) left a comment

Thanks @ksy36
Big update! Impressive.

I'm a bit concerned by the amount of mocking we do for testing the DB. It's inherited code, so it would probably be good to open a new issue on this for the future. That could help us test some weird scenarios.

There's also probably an optimization worth doing: the queries run in a for loop that will only ever return one result.

Requesting changes just to get the discussion rolling and see your thoughts.

Comment on lines 118 to 119
# No host_name in DB, find less-level domain (>2)
# If host_name is lv4.lv3.example.com, find lv3.example.com/example.com
karlcow (Member):

I wonder why we left host_name instead of hostname in there. Left over from a previous version?

ksy36 (Contributor, Author):

Yeah, I've left this part untouched from the previous version. Updated now though :)
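To make the quoted comment concrete, here is a rough illustration of the kind of subdomain fallback it describes; the helper name is made up and this is not the code in the PR.

def candidate_domains(hostname):
    """Yield progressively less specific domains for a hostname.

    For 'lv4.lv3.example.com' this yields 'lv4.lv3.example.com',
    'lv3.example.com' and 'example.com'. Purely illustrative; a real
    implementation would likely rely on the public suffix list rather
    than assuming the registrable domain has two labels.
    """
    parts = hostname.split('.')
    for i in range(len(parts) - 1):
        yield '.'.join(parts[i:])


# list(candidate_domains('lv4.lv3.example.com'))
# -> ['lv4.lv3.example.com', 'lv3.example.com', 'example.com']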

Comment on lines 113 to 114
for site in regional_site_db.query(SiteRegional).filter_by(url=hostname): # noqa
return f'priority-{priorities[site.priority - 1]}'
karlcow (Member):

I don't remember why the code is using a for loop. Maybe an opportunity to fix it.

Do we expect regional_site_db.query(SiteRegional).filter_by(url=hostname) to return more than one domain? Even if there was more than one, only the first hit would be taken into consideration because of the return.

Basically, the code is currently the equivalent of:

sites = ['A', 'B', 'C', 'A2', 'C']
def extract(sites):
    for site in sites:
        if site.startswith('A'):
            return site

which would return 'A' always.

Should we just have something like:

Suggested change:
- for site in regional_site_db.query(SiteRegional).filter_by(url=hostname):  # noqa
-     return f'priority-{priorities[site.priority - 1]}'
+ site = regional_site_db.query(SiteRegional).filter_by(url=hostname).first()
+ if not site:
+     site = global_site_db.query(SiteGlobal).filter_by(url=hostname).first()
+ if site:
+     return f'priority-{priorities[site.priority - 1]}'
+ # … and so on for the subdomains.

.first() returns None if the row doesn't exist.

Or something else.

Maybe there is a function to extract here, because the code repeats the same pattern in a loop.

Or maybe I totally misunderstood the code 🍭 It's been a long time since I touched this part.

ksy36 (Contributor, Author) commented Apr 26, 2022:

The domain names are unique, yes, so using .first() will indeed work. Thanks for the suggestion, I've changed the function.

Comment on lines 125 to 128
for site in regional_site_db.query(SiteRegional).filter_by(url=domain): # noqa
return f'priority-{priorities[site.priority - 1]}'

for site in global_site_db.query(SiteGlobal).filter_by(url=domain):
karlcow (Member):

The fact that we do it here again makes me think there is an opportunity to extract a function, both to reduce the duplication and to make it testable.
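Something like the following sketch, which assumes the SiteRegional/SiteGlobal models and the two DB sessions from this PR are in scope; the helper name and the priority labels are assumptions, not the code that eventually landed:

def lookup_priority(hostname, regional_site_db, global_site_db):
    """Return a priority label for hostname, or None if it is not ranked.

    Illustrative only: checks the regional DB first, then the global one,
    mirroring the duplicated query pattern quoted above.
    """
    priorities = ['critical', 'important', 'normal']  # assumed label names

    site = regional_site_db.query(SiteRegional).filter_by(url=hostname).first()
    if not site:
        site = global_site_db.query(SiteGlobal).filter_by(url=hostname).first()
    if site:
        return f'priority-{priorities[site.priority - 1]}'
    return None

The caller could then apply the same helper to each candidate domain when walking up the subdomain hierarchy, instead of repeating the queries.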

# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.

"""Tests for Siterank class."""
karlcow (Member):

For the future, maybe there should be two test DBs covering what we want instead of mocking, or a dictionary of data in the setup that creates the test DB. Something to think about. Nothing to do now.

ksy36 (Contributor, Author):

Yeah, good idea, I'll file an issue for that.
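As a rough sketch of what that issue could propose, assuming the SiteGlobal/SiteRegional models share a declarative Base in webcompat.db and accept url/priority keyword arguments (all of which are assumptions, not the test code in this PR), a pytest fixture with two in-memory SQLite databases might look like this:

import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from webcompat.db import Base, SiteGlobal, SiteRegional  # assumed import path


@pytest.fixture
def site_dbs():
    """Provide two throwaway in-memory DB sessions seeded with a few rows."""
    def make_session():
        engine = create_engine('sqlite:///:memory:')
        Base.metadata.create_all(engine)
        return sessionmaker(bind=engine)()

    global_db, regional_db = make_session(), make_session()
    # Seed data is made up; the real models may take more columns.
    global_db.add(SiteGlobal(url='example.com', priority=1))
    regional_db.add(SiteRegional(url='example.fr', priority=1))
    global_db.commit()
    regional_db.commit()
    yield global_db, regional_db
    global_db.close()
    regional_db.close()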

Comment on lines +23 to +25
REGIONS = ['US', 'FR', 'IN', 'DE', 'TW', 'ID', 'HK', 'SG', 'PL',
'GB', 'RU']
karlcow (Member):

Another thought: I guess this was initially created like this because of the priority markets for Firefox, but we can imagine that other browsers could want to extend this. Nothing to do now.

Comment on lines 50 to 51
if args.retrieve_regional:
print('Warning: Alexa APIs will be deprecated on December 15, 2022.')
karlcow (Member):

Should the code have a failsafe feature? Either based on the date ("Hey, it's past December 15, 2022. We will ignore the request.") or based on the request to Alexa failing.

ksy36 (Contributor, Author):

Thanks, I've added the check 👍
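For illustration, a date-based failsafe along the lines karlcow describes could look like this; the constant, function name and wiring into the script are assumptions, not necessarily the check that was added:

from datetime import date

ALEXA_DEPRECATION_DATE = date(2022, 12, 15)  # from the warning above


def regional_fetch_allowed(today=None):
    """Return False once the Alexa Top Sites API has been retired."""
    today = today or date.today()
    return today < ALEXA_DEPRECATION_DATE


# Usage inside the script (assuming `args` is the parsed argparse namespace):
# if args.retrieve_regional and not regional_fetch_allowed():
#     print('Alexa APIs were deprecated on December 15, 2022; skipping regional fetch.')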

ksy36 (Contributor, Author) commented Apr 26, 2022

Could you please take another look @karlcow? I've refactored extract_priority_label and added the check for the Alexa deprecation date.

karlcow (Member) left a comment

Thanks a lot @ksy36 for addressing the changes. 🔝 🥳

karlcow merged commit 9da9698 into main on May 2, 2022
karlcow deleted the issue/3656/1 branch on May 2, 2022 at 04:34
ksy36 (Contributor, Author) commented May 2, 2022

Thanks!!
