
Fix: GitHub request to increase perpage values to 100 #94

Closed
lwasser opened this issue Feb 23, 2024 · 7 comments · Fixed by #119
Labels: bug (Something isn't working), enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

lwasser (Member) commented Feb 23, 2024

Currently, I think we went down a more difficult path by using only requests to parse GitHub issues. Right now our workflow will fail once we hit 30 packages, because pagination is not handled in our API calls.

Using PyGithub, you can easily grab issues from a specific repository; it handles pagination and also makes it easy to grab metadata for each issue using built-in methods.

The code below is one example of us quickly parsing through issues. I'm thinking it's worth considering this seriously, and soonish, to ensure our build doesn't break. We have 13 packages in review now, so I suspect we will hit 30 rather quickly in the upcoming months.

This is both a bug that isn't realized (yet) and an enhancement needed to the API.

import os

from dotenv import load_dotenv
from github import Github

# Load the GitHub token from a local .env file
load_dotenv()
github_token = os.environ.get("GITHUB_TOKEN")

g = Github(github_token)

# Get the repository
repo = g.get_repo("pyopensci/software-submission")

# Fetch issues; PyGithub returns a PaginatedList that handles pagination for us
issues = repo.get_issues(labels=["New Submission!"])

count = 0
for issue in issues:
    print(issue.title)
    count += 1

print("there are", count, "total issues")
lwasser added the bug, enhancement, and help wanted labels on Feb 23, 2024
pllim (Contributor) commented Mar 5, 2024

Where is your current code that will break?

https://pypi.org/project/PyGithub/ is definitely one of the packages that would suit your needs if you want a Python interface.

Otherwise, you can go straight to GitHub's GraphQL API as well. Here is an example: https://github.com/scientific-python/devstats/blob/main/devstats/query.py (its output is then ingested by https://github.com/scientific-python/devstats.scientific-python.org).

pllim (Contributor) commented Mar 5, 2024

p.s. If you use the GitHub REST API directly, the response usually includes pagination information telling you how many pages there are in total, so you can send subsequent queries with an increasing page number. I am less sure about GraphQL output since I have never implemented it myself.
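
A minimal sketch of that page-by-page approach with requests, assuming a GITHUB_TOKEN environment variable and the same repository used in the snippet above. Note that for the issues endpoint the page links actually arrive in the Link response header rather than the JSON body; this sketch sidesteps that by simply requesting pages until an empty one comes back.

import os

import requests

url = "https://api.github.com/repos/pyopensci/software-submission/issues"
headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
params = {"labels": "New Submission!", "state": "all", "per_page": 100, "page": 1}

issues = []
while True:
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    page = response.json()
    if not page:  # an empty page means we have walked past the last one
        break
    issues.extend(page)
    params["page"] += 1  # request the next page (the Link header could be parsed instead)

print(len(issues), "issues fetched across all pages")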

pllim (Contributor) commented Mar 5, 2024

p.p.s. You might want to put in a sleep timer too, to avoid being blocked as spam.

lwasser (Member, Author) commented Mar 7, 2024

@pllim my knowledge of working with APIs and pagination is a big work in progress.

I have methods here that make the request and return the response.
But what I realized the other day is that no matter what I request, I am capped at 30 responses per page (I didn't realize the results were paginated).

Right now I'm only grabbing accepted reviews, but I also want to add other steps so we can document our entire review process (how many packages are under review, etc.) on our website! And we will be above 30 accepted soon at the rate we're going :) we have ~13 in review now.

So essentially the options here would be to:

  1. write more code that handles pagination (plus tests), or
  2. use a tool like PyGithub to do it for me, which maybe has rate limiting / a sleep timer built in (I will check that now!). That just seems a bit simpler to me, since they've already done all the work of creating the code.

Are there benefits to the GraphQL approach? I did play with devstats (and want to use it for our work here too!!).

Thank you so much for this input!!

pllim (Contributor) commented Mar 7, 2024

FWIW, Option 2 would be easier on you in the long run. I don't think it has a timer, but its PaginatedList object seems to support multiple pages natively and is iterable, so theoretically you would loop through it and then, if you want, put a timer in yourself on each iteration. I hope that makes sense.

https://github.com/PyGithub/PyGithub/blob/96ad19aec782c879d72f2bea80fb8a3932761be9/github/PaginatedList.py#L61

List of issues: https://pygithub.readthedocs.io/en/stable/examples/Repository.html#get-list-of-open-issues

List of PRs: https://pygithub.readthedocs.io/en/stable/github_objects/Repository.html#github.Repository.Repository.get_pulls
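
A minimal sketch of that pattern, reusing the repo object from the PyGithub snippet earlier in this thread. PaginatedList fetches additional pages lazily as you iterate, so a small per-item sleep keeps the underlying page requests spaced out; the delay value here is arbitrary.

import time

issues = repo.get_issues(labels=["New Submission!"])  # returns a PaginatedList

for issue in issues:
    print(issue.number, issue.title)
    time.sleep(0.5)  # PyGithub has no built-in sleep timer, so pause manually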

lwasser (Member, Author) commented Mar 7, 2024

OK, update: I'm going to update the header of this issue with more specifics.
I read a bit more about this last night!

  1. PyGithub seems to have some maintenance challenges. It may be rebuilding its maintainer team, but I'm worried about depending on it.
  2. There is an EASY quick fix NOW: we can use per_page= (values of up to 100 items returned). The default is 30, but you can request up to 100 at a time. 100 packages is WAY down the road for us, so that would buy some time. That is the quick, easy fix that I can do now in about 10 minutes (with some testing); see the sketch just after this list.
  3. Adding pagination and a sleep / rate limiter isn't as hard as I suspected either. So I think the best option would be to pull all of the GitHub authentication pieces out into a new object that only handles GitHub things. Then we can either use inheritance or some other class magic (or just perform the calls separately and use the existing objects for the cleaning steps).
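
A minimal sketch of the quick fix in item 2, assuming the existing code already calls the REST issues endpoint with requests; the REST query parameter is spelled per_page, and GitHub caps it at 100 items per page.

import os

import requests

url = "https://api.github.com/repos/pyopensci/software-submission/issues"
response = requests.get(
    url,
    headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
    params={"labels": "New Submission!", "state": "all", "per_page": 100},
)
response.raise_for_status()
print(len(response.json()), "issues returned in a single page")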

#3 will take a bit more effort, but we have plenty of time to implement it, so I can begin to set us up for that approach!
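
A rough, hypothetical sketch of what option 3 could look like: a small object that owns all of the GitHub concerns (auth, pagination, rate limiting) so the existing parsing/cleaning objects can stay focused on their own jobs. The class and method names here are illustrative and not taken from the existing codebase.

import os
import time

import requests


class GitHubAPI:
    """Handle GitHub authentication and paginated issue requests."""

    def __init__(self, org: str, repo: str, token: str | None = None):
        self.org = org
        self.repo = repo
        self.token = token or os.environ.get("GITHUB_TOKEN")

    @property
    def headers(self) -> dict:
        return {"Authorization": f"token {self.token}"}

    def get_issues(self, labels: str) -> list[dict]:
        """Return all matching issues, walking every page of results."""
        url = f"https://api.github.com/repos/{self.org}/{self.repo}/issues"
        params = {"labels": labels, "state": "all", "per_page": 100, "page": 1}
        all_issues: list[dict] = []
        while True:
            response = requests.get(url, headers=self.headers, params=params)
            response.raise_for_status()
            page = response.json()
            if not page:
                break
            all_issues.extend(page)
            params["page"] += 1
            time.sleep(1)  # simple politeness delay between page requests
        return all_issues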

I'm not sure what the benefits of GraphQL are over REST calls at this point.

lwasser changed the title from "Use pygithub to handle pagination and issue retrieval" to "Fix: GitHub request to increase perpage values to 100" on Mar 7, 2024
pllim (Contributor) commented Mar 7, 2024

Re: GraphQL -- https://docs.github.com/en/graphql/overview/about-the-graphql-api

They advertise it as having a smaller footprint because you get exactly what you ask for, whereas the REST API returns everything.
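
A minimal sketch of the same issue query against the GraphQL endpoint, assuming a GITHUB_TOKEN environment variable; the field selection and cursor handling below are illustrative, not required.

import os

import requests

# Ask for exactly the fields we need: issue titles plus pagination info.
query = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 100, labels: ["New Submission!"], after: $cursor) {
      nodes { number title state }
      pageInfo { hasNextPage endCursor }
    }
  }
}
"""

response = requests.post(
    "https://api.github.com/graphql",
    json={
        "query": query,
        "variables": {"owner": "pyopensci", "name": "software-submission", "cursor": None},
    },
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
)
response.raise_for_status()
issues = response.json()["data"]["repository"]["issues"]
print(len(issues["nodes"]), "issues; more pages:", issues["pageInfo"]["hasNextPage"])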

(Glad you found a solution.)
