
Fix: GitHub request to increase perpage values to 100 #94

Closed
lwasser opened this issue Feb 23, 2024 · 7 comments · Fixed by #119
Labels: bug (Something isn't working), enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

lwasser (Member) commented Feb 23, 2024

Currently, I think we went down a more difficult path by using only requests to parse GitHub issues. Right now our workflow will fail once we hit 30 packages, because pagination is not handled in our API calls.

Using PyGithub, you can easily grab issues from a specific repository; it handles pagination and also makes it easy to grab metadata for each issue using built-in methods.

The code below is one example of us quickly parsing through issues. I'm thinking it's worth considering this seriously, and soonish, to ensure our build doesn't break. We have 13 packages in review now, so I suspect we will hit 30 rather quickly in the upcoming months.

This is both a bug that isn't realized (yet) and an enhancement needed to the API.

import os

from dotenv import load_dotenv
from github import Github

# Load the GitHub token from a local .env file
load_dotenv()
github_token = os.environ.get("GITHUB_TOKEN")

g = Github(github_token)

# Get the repository
repo = g.get_repo("pyopensci/software-submission")

# Fetch issues; PyGithub returns a PaginatedList that handles pagination for us
issues = repo.get_issues(labels=["New Submission!"])

count = 0
for issue in issues:
    print(issue.title)
    count += 1

print("there are", count, "total issues")
lwasser added the bug, enhancement, and help wanted labels on Feb 23, 2024
pllim (Contributor) commented Mar 5, 2024

Where is your current code that will break?

https://pypi.org/project/PyGithub/ is definitely one of the packages that would suit your needs if you want a Python interface.

Otherwise, you can go straight to GitHub's GraphQL API as well. Here is an example: https://github.com/scientific-python/devstats/blob/main/devstats/query.py (its output is then ingested by https://github.com/scientific-python/devstats.scientific-python.org).

pllim (Contributor) commented Mar 5, 2024

p.s. If you use the GitHub REST API directly, the response usually includes pagination information telling you how many pages there are in total, so you can send subsequent queries with an increasing page number. I am less sure about GraphQL output since I have never implemented it myself.
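
A minimal sketch of that page-by-page approach with requests, assuming a GITHUB_TOKEN environment variable and the same repository used in the snippet above. Note that for the issues endpoint the page links actually arrive in the Link response header rather than the JSON body; this sketch sidesteps that by simply requesting pages until an empty one comes back.

import os

import requests

url = "https://api.github.com/repos/pyopensci/software-submission/issues"
headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
params = {"labels": "New Submission!", "state": "all", "per_page": 100, "page": 1}

issues = []
while True:
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    page = response.json()
    if not page:  # an empty page means we have walked past the last one
        break
    issues.extend(page)
    params["page"] += 1  # request the next page (the Link header could be parsed instead)

print(len(issues), "issues fetched across all pages")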

pllim (Contributor) commented Mar 5, 2024

p.p.s. You might want to put in a sleep timer too, to avoid being blocked as spam.

lwasser (Member, Author) commented Mar 7, 2024

@pllim my knowledge of working with APIs and pagination is a big work in progress.

I have methods here that make the request and return the response.
But what I realized the other day is that no matter what I request, I am capped at 30 responses per page (I didn't realize the results were paginated).

Right now I'm only grabbing accepted reviews, but I also want to add other steps so we can document our entire review process (how many packages are under review, etc.) on our website! And we will be above 30 accepted soon at the rate we're going :) we have ~13 in review now.

So essentially the options here would be to:

  1. write more code that handles pagination (plus tests), or
  2. use a tool like PyGithub to do it for me, which maybe has rate limiting / a sleep timer built in (I will check that now!). That just seems a bit simpler to me, since they've already done all the work of creating the code.

Are there benefits to the GraphQL approach? I did play with devstats (and want to use it for our work here too!!).

Thank you so much for this input!!

pllim (Contributor) commented Mar 7, 2024

FWIW, Option 2 would be easier on you in the long run. I don't think it has a timer, but its PaginatedList object seems to support multiple pages natively and is iterable, so theoretically you would loop through it and then, if you want, put a timer in yourself on each iteration. I hope that makes sense.

https://github.com/PyGithub/PyGithub/blob/96ad19aec782c879d72f2bea80fb8a3932761be9/github/PaginatedList.py#L61

List of issues: https://pygithub.readthedocs.io/en/stable/examples/Repository.html#get-list-of-open-issues

List of PRs: https://pygithub.readthedocs.io/en/stable/github_objects/Repository.html#github.Repository.Repository.get_pulls
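
A minimal sketch of that pattern, reusing the repo object from the PyGithub snippet earlier in this thread. PaginatedList fetches additional pages lazily as you iterate, so a small per-item sleep keeps the underlying page requests spaced out; the delay value here is arbitrary.

import time

issues = repo.get_issues(labels=["New Submission!"])  # returns a PaginatedList

for issue in issues:
    print(issue.number, issue.title)
    time.sleep(0.5)  # PyGithub has no built-in sleep timer, so pause manually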

lwasser (Member, Author) commented Mar 7, 2024

OK, update: I'm going to update the header of this issue with more specifics.
I read a bit more about this last night!

  1. PyGithub seems to have some maintenance challenges. It may be rebuilding its maintainer team, but I'm worried about depending on it.
  2. There is an EASY quick fix NOW: we can use per_page= (values of up to 100 items returned). The default is 30, but you can request up to 100 at a time. 100 packages is WAY down the road for us, so that would buy some time. That is the quick, easy fix that I can do now in about 10 minutes (with some testing); see the sketch just after this list.
  3. Adding pagination and a sleep / rate limiter isn't as hard as I suspected either. So I think the best option would be to pull all of the GitHub authentication pieces out into a new object that only handles GitHub things. Then we can either use inheritance or some other class magic (or just perform the calls separately and use the existing objects for the cleaning steps).
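
A minimal sketch of the quick fix in item 2, assuming the existing code already calls the REST issues endpoint with requests; the REST query parameter is spelled per_page, and GitHub caps it at 100 items per page.

import os

import requests

url = "https://api.github.com/repos/pyopensci/software-submission/issues"
response = requests.get(
    url,
    headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
    params={"labels": "New Submission!", "state": "all", "per_page": 100},
)
response.raise_for_status()
print(len(response.json()), "issues returned in a single page")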

#3 will take a bit more effort, but we have plenty of time to implement it, so I can begin to set us up for that approach!
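
A rough, hypothetical sketch of what option 3 could look like: a small object that owns all of the GitHub concerns (auth, pagination, rate limiting) so the existing parsing/cleaning objects can stay focused on their own jobs. The class and method names here are illustrative and not taken from the existing codebase.

import os
import time

import requests


class GitHubAPI:
    """Handle GitHub authentication and paginated issue requests."""

    def __init__(self, org: str, repo: str, token: str | None = None):
        self.org = org
        self.repo = repo
        self.token = token or os.environ.get("GITHUB_TOKEN")

    @property
    def headers(self) -> dict:
        return {"Authorization": f"token {self.token}"}

    def get_issues(self, labels: str) -> list[dict]:
        """Return all matching issues, walking every page of results."""
        url = f"https://api.github.com/repos/{self.org}/{self.repo}/issues"
        params = {"labels": labels, "state": "all", "per_page": 100, "page": 1}
        all_issues: list[dict] = []
        while True:
            response = requests.get(url, headers=self.headers, params=params)
            response.raise_for_status()
            page = response.json()
            if not page:
                break
            all_issues.extend(page)
            params["page"] += 1
            time.sleep(1)  # simple politeness delay between page requests
        return all_issues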

I'm not sure what the benefits of GraphQL are over REST calls at this point.

lwasser changed the title from "Use pygithub to handle pagination and issue retrieval" to "Fix: GitHub request to increase perpage values to 100" on Mar 7, 2024
pllim (Contributor) commented Mar 7, 2024

Re: GraphQL -- https://docs.github.com/en/graphql/overview/about-the-graphql-api

They advertise it as having a smaller footprint because you get exactly what you ask for, whereas the REST API returns everything.
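
A minimal sketch of the same issue query against the GraphQL endpoint, assuming a GITHUB_TOKEN environment variable; the field selection and cursor handling below are illustrative, not required.

import os

import requests

# Ask for exactly the fields we need: issue titles plus pagination info.
query = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 100, labels: ["New Submission!"], after: $cursor) {
      nodes { number title state }
      pageInfo { hasNextPage endCursor }
    }
  }
}
"""

response = requests.post(
    "https://api.github.com/graphql",
    json={
        "query": query,
        "variables": {"owner": "pyopensci", "name": "software-submission", "cursor": None},
    },
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
)
response.raise_for_status()
issues = response.json()["data"]["repository"]["issues"]
print(len(issues["nodes"]), "issues; more pages:", issues["pageInfo"]["hasNextPage"])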

(Glad you found a solution.)
