-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add scripts for PR metrics from github API #13
base: main
Are you sure you want to change the base?
Changes from all commits
b10b809
c531cae
a715053
bae9af3
00d8499
28dffa7
3d7880c
37844d4
cf9e41d
f06becf
cc05d6a
ed1adea
08c0b7c
b2ee775
3feb297
1d58093
5f6d268
94533e1
b7f7f76
e69fb3a
cd9c1f6
4d58ba0
45fa6ce
f1b54e1
ac21a51
e095fc3
ce08049
c86237c
b7a02f6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
__pycache__ | ||
pr-data.p | ||
*.png | ||
*.csv |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
These scripts collect some metrics about mbed TLS PRs over time. | ||
|
||
Usage | ||
----- | ||
|
||
1. `./get-pr-data.py` - this takes a long time and requires the environment | ||
variable `GITHUB_API_TOKEN` to be set to a valid [github API | ||
token](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token) (unauthenticated access to the API has a limit on the number or requests that is too low for our number of PRs). It generates `pr-data.p` with pickled data. | ||
2. `./do.sh` - this works offline from the data in | ||
`pr-data.p` and generates a bunch of png and csv files. | ||
|
||
For example, the report for the last quarter can be generated with: | ||
``` | ||
./get-pr-data.py # assuming GITHUB_API_TOKEN is set in the environement | ||
./do.sh | ||
``` | ||
Note that the metric "median lifetime" is special in that it can't always be | ||
computed right after the quarter is over, it sometimes need more time to pass | ||
and/or more PRs from that quarter to be closed. In that case, the uncertain | ||
quarter(s) will shown with an error bar the png graph, and in the csv file an | ||
interval will be reported for the value(s) that can't be determined yet. | ||
|
||
By default, data extends from start of 2020 to end of the previous quarter. It | ||
is possible to change that range using environment variables, for example: | ||
``` | ||
PR_FIRST_DATE=2016-01-01 PR_LAST_DATE=2022-12-32 ./do.sh | ||
``` | ||
gives date from 2016 to 2022 included. | ||
|
||
Requirements | ||
------------ | ||
|
||
These scripts require: | ||
|
||
- Python >= 3.6 (required by recent enough matplotlib) | ||
- matplotlib >= 3.1 (3.0 doesn't work) | ||
- PyGithub >= 1.43 (any version should work, that was just the oldest tested) | ||
|
||
### Ubuntu 20.04 (and probaly 18.04) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also works on 22.04 (tested by me) in case you want to note that for the future's sake. |
||
|
||
A simple `apt install python3-github python3-matplotlib` is enough. | ||
|
||
### Ubuntu 16.04 | ||
|
||
On Ubuntu 16.04, by default only Python 3.5 is available, which doesn't | ||
support a recent enough matplotlib to support those scripts, so the following | ||
was used to run those scripts on 16.04: | ||
|
||
sudo add-apt-repository ppa:deadsnakes/ppa | ||
sudo apt update | ||
sudo apt install python3.6 python3.6-venv | ||
python3.6 -m venv 36env | ||
source 36env/bin/activate | ||
pip install --upgrade pip | ||
pip install matlplotlib | ||
pip install pygithub | ||
|
||
See `requirements.txt` for an example of a set of working versions. | ||
|
||
Note: if you do this, I strongly recommend uninstalling python3.6, | ||
python3.6-venv and all their dependencies, then removing the deadsnakes PPA | ||
before any upgrade to 18.04. Failing to do so will result in | ||
dependency-related headaches as some packages in 18.04 depend on a specific | ||
version of python3.6 but the version from deadsnakes is higher, so apt won't | ||
downgrade it and manual intervention will be required. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
#!/bin/sh | ||
|
||
set -eu | ||
|
||
for topic in created closed pending lifetime backlog; do | ||
echo "PRs $topic..." | ||
rm -f prs-${topic}.png prs-${topic}.csv | ||
./pr-${topic}.py > prs-${topic}.csv | ||
done |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
#!/usr/bin/env python3 | ||
# coding: utf-8 | ||
|
||
"""Get PR data from github and pickle it.""" | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. General suggestion, not a blocker: all scripts should support |
||
import pickle | ||
import os | ||
|
||
from github import Github | ||
|
||
if "GITHUB_API_TOKEN" in os.environ: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Suggestion for later: make the token API (command line options, file storage, …) compatible with the official command line tool |
||
token = os.environ["GITHUB_API_TOKEN"] | ||
else: | ||
print("You need to provide a GitHub API token") | ||
|
||
g = Github(token) | ||
r = g.get_repo("ARMMbed/mbedtls") | ||
|
||
prs = list() | ||
for p in r.get_pulls(state="all"): | ||
print(p.number) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd be tempted to make this more informative - maybe changing it from Might not be necessary for such a simple script though |
||
# Accessing p.mergeable forces completion of PR data (by default, only | ||
# basic info such as status and dates is available) but makes things | ||
# slower (about 10x). Only do that for open PRs; we don't need the extra | ||
# info for old PRs (only the dates which are part of the basic info). | ||
if p.state == 'open': | ||
dummy = p.mergeable | ||
prs.append(p) | ||
|
||
# After a branch has been updated, github doesn't immediately go and recompute | ||
# potential conflicts for all open PRs against this branch; instead it does | ||
# that when the info is requested and even then it's done asynchronously: the | ||
# first request might return no data, but if we come back after we've done all | ||
# the other PRs, the info should have become available in the meantime. | ||
for p in prs: | ||
if p.state == 'open' and p.mergeable is None: | ||
print(p.number, 'update') | ||
p.update() | ||
|
||
with open("pr-data.p", "wb") as f: | ||
pickle.dump(prs, f) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
#!/usr/bin/env python3 | ||
# coding: utf-8 | ||
|
||
"""Produce summary or PRs pending per branch and their mergeability status.""" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please document the requirements to run this script (run |
||
|
||
import pickle | ||
from datetime import datetime | ||
from collections import Counter | ||
|
||
with open("pr-data.p", "rb") as f: | ||
prs = pickle.load(f) | ||
|
||
c_open = Counter() | ||
c_mergeable = Counter() | ||
c_recent = Counter() | ||
c_recent2 = Counter() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Non-blocker: is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd prefer |
||
|
||
for p in prs: | ||
if p.state != "open": | ||
continue | ||
|
||
branch = p.base.ref | ||
c_open[branch] += 1 | ||
if p.mergeable: | ||
c_mergeable[branch] += 1 | ||
days = (datetime.now() - p.updated_at).days | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since Also applies to other scripts that call |
||
if days < 31: | ||
c_recent[branch] += 1 | ||
if days < 8: | ||
c_recent2[branch] += 1 | ||
|
||
|
||
print(" branch: open, mergeable, <31d, <8d") | ||
for b in sorted(c_open, key=lambda b: c_open[b], reverse=True): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd traditionally define lambda functions as |
||
print("{:>20}: {: 10}, {: 10}, {: 10}, {:10}".format( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This produces hard to read tables if a branch name is > 20 chars long (see below). However, I can't immediately see a good/simple way to fix this, so it might just have to be accepted as a limitation
|
||
b, c_open[b], c_mergeable[b], c_recent[b], c_recent2[b])) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
#!/usr/bin/env python3 | ||
# coding: utf-8 | ||
|
||
"""Produce analysis of PR backlog over time""" | ||
|
||
from prs import pr_dates, first, last, quarter | ||
|
||
from datetime import datetime, timedelta | ||
from collections import Counter | ||
from itertools import chain | ||
|
||
import matplotlib.pyplot as plt | ||
|
||
new_days = 90 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a bit of a push to standardise ages for what is considered recent/old in OSS a while back. The thresholds picked were 15 and 90. It might be useful to have a few extra ranges, e.g. <15, 15-90, 90-365, >365 to align better with this? If I have time I'll push an update. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, thinking more, this is a lot like the median lifetime graph, but with a couple of thresholds which are based on age rather than percentiles. Would it be better to have a graph that shows e.g., median, 75th percentile, 95th percentile? Pros and cons either way I think, maybe worth exploring it if it's quick/easy to do but I'm not sure if it would be better or not? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, I didn't know about the push for standardised thresholds, indeed it would be good to align with that (and add some of our own if needed). Unfortunately the way the script is structured currently makes it a very manual change (you can't just give a list of threesholds at the top and have everything else work automagically). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Regarding adding other percentiles to the median one, that's something I've been considering for a while, but the problem is over what set. Currently it's over "PRs created this quarter", which means, considering Q1 for the sake of concreteness we can only compute the median (or an upper bound for it) after we've closed at least 50% of PRs created in Q1. Fortunately, that's usually the case at the very beginning of Q2 when we prepare our report. (Even then, we might get only a range, as the value could still get lower if we closed a lot of PRs created near the end of Q1 at the very beginning of Q2.) But for the 4th quartile, resp. 95th percentile, we'd need to have closed 75%, resp. 95% of PRs created in Q1 by the time we produce our report in early Q2, which realistically is not going to happen most of the time. We could avoid the problem with incomplete data entirely by considering instead the set of PRs we closed this quarter - there all the lifetimes are known for sure and we can do stats without uncertainty. But I doesn't really tell the same thing: for example, doing at lot of historical review one quarter would raise the median age of PRs closed this quarter, but that's still a good thing. So the data might become more difficult to interpret. So, I think there are basically three ways to select PRs over which we make stats / grouping / etc for one quarter:
Wdyt? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agree, there isn't an easy answer or obvious better way. So let's leave as-is for now. |
||
old_days = 365 | ||
|
||
new = Counter() | ||
med = Counter() | ||
old = Counter() | ||
|
||
for beg, end, com in pr_dates(): | ||
if end is None: | ||
tomorrow = datetime.now().date() + timedelta(days=1) | ||
n_days = (tomorrow - beg).days | ||
else: | ||
n_days = (end - beg).days | ||
for i in range(n_days): | ||
q = quarter(beg + timedelta(days=i)) | ||
q1 = quarter(beg + timedelta(days=i+1)) | ||
# Only count on each quarter's last day | ||
if q == q1: | ||
continue | ||
if i <= new_days: | ||
new[q] += 1 | ||
elif i <= old_days: | ||
med[q] += 1 | ||
else: | ||
old[q] += 1 | ||
|
||
first_q = quarter(first) | ||
last_q = quarter(last) | ||
|
||
quarters = (q for q in chain(new, med, old) if first_q <= q <= last_q) | ||
quarters = tuple(sorted(set(quarters))) | ||
|
||
new_y = tuple(new[q] for q in quarters) | ||
med_y = tuple(med[q] for q in quarters) | ||
old_y = tuple(old[q] for q in quarters) | ||
sum_y = tuple(old[q] + med[q] for q in quarters) | ||
|
||
old_name = "older than {} days".format(old_days) | ||
med_name = "medium" | ||
new_name = "recent (less {} days old)".format(new_days) | ||
|
||
width = 0.9 | ||
fig, ax = plt.subplots() | ||
ax.bar(quarters, old_y, width, label=old_name) | ||
ax.bar(quarters, med_y, width, label=med_name, bottom=old_y) | ||
ax.bar(quarters, new_y, width, label=new_name, bottom=sum_y) | ||
ax.legend(loc="upper left") | ||
ax.grid(True) | ||
ax.set_xlabel("quarter") | ||
ax.set_ylabel("Number or PRs pending") | ||
ax.tick_params(axis="x", labelrotation=90) | ||
fig.suptitle("State of the PR backlog at the end of each quarter") | ||
fig.set_size_inches(12.8, 7.2) # default 100 dpi -> 720p | ||
fig.savefig("prs-backlog.png") | ||
|
||
print("Quarter,recent,medium,old,total") | ||
for q in quarters: | ||
print("{},{},{},{},{}".format(q, new[q], med[q], old[q], | ||
new[q] + med[q] + old[q])) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
#!/usr/bin/env python3 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Several scripts, especially |
||
# coding: utf-8 | ||
|
||
"""Produce graph of PRs closed by time period.""" | ||
|
||
from prs import pr_dates, quarter, first, last | ||
|
||
from collections import Counter | ||
|
||
import matplotlib.pyplot as plt | ||
|
||
first_q = quarter(first) | ||
last_q = quarter(last) | ||
|
||
cnt_all = Counter() | ||
cnt_com = Counter() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I could guess that (Obviously by reading the rest of the code I can tell that it's “community”. But I wouldn't have guessed.) And actually, as I wasn't familiar with |
||
|
||
for beg, end, com in pr_dates(): | ||
if end is None: | ||
continue | ||
q = quarter(end) | ||
cnt_all[q] += 1 | ||
if com: | ||
cnt_com[q] += 1 | ||
|
||
quarters = tuple(sorted(q for q in cnt_all if first_q <= q <= last_q)) | ||
|
||
prs_com = tuple(cnt_com[q] for q in quarters) | ||
prs_team = tuple(cnt_all[q] - cnt_com[q] for q in quarters) | ||
|
||
width = 0.9 | ||
fig, ax = plt.subplots() | ||
ax.bar(quarters, prs_com, width, label="community") | ||
ax.bar(quarters, prs_team, width, label="core team", bottom=prs_com) | ||
ax.legend(loc="upper left") | ||
ax.grid(True) | ||
ax.set_xlabel("quarter") | ||
ax.set_ylabel("Number or PRs closed") | ||
ax.tick_params(axis="x", labelrotation=90) | ||
fig.suptitle("Number of PRs closed per quarter") | ||
fig.set_size_inches(12.8, 7.2) # default 100 dpi -> 720p | ||
fig.savefig("prs-closed.png") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is the world ready for svg yet? |
||
|
||
print("Quarter,community closed,total closed") | ||
for q in quarters: | ||
print("{},{},{}".format(q, cnt_com[q], cnt_all[q])) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
#!/usr/bin/env python3 | ||
# coding: utf-8 | ||
|
||
"""Produce graph of PRs created by time period.""" | ||
|
||
from prs import pr_dates, quarter, first, last | ||
|
||
from collections import Counter | ||
|
||
import matplotlib.pyplot as plt | ||
|
||
first_q = quarter(first) | ||
last_q = quarter(last) | ||
|
||
cnt_all = Counter() | ||
cnt_com = Counter() | ||
|
||
for beg, end, com in pr_dates(): | ||
q = quarter(beg) | ||
cnt_all[q] += 1 | ||
if com: | ||
cnt_com[q] += 1 | ||
|
||
quarters = tuple(sorted(q for q in cnt_all if first_q <= q <= last_q)) | ||
|
||
prs_com = tuple(cnt_com[q] for q in quarters) | ||
prs_team = tuple(cnt_all[q] - cnt_com[q] for q in quarters) | ||
|
||
width = 0.9 | ||
fig, ax = plt.subplots() | ||
ax.bar(quarters, prs_com, width, label="community") | ||
ax.bar(quarters, prs_team, width, label="core team", bottom=prs_com) | ||
ax.legend(loc="upper left") | ||
ax.grid(True) | ||
ax.set_xlabel("quarter") | ||
ax.set_ylabel("Number or PRs created") | ||
ax.tick_params(axis="x", labelrotation=90) | ||
fig.suptitle("Number of PRs created per quarter") | ||
fig.set_size_inches(12.8, 7.2) # default 100 dpi -> 720p | ||
fig.savefig("prs-created.png") | ||
|
||
print("Quarter,community created,total created") | ||
for q in quarters: | ||
print("{},{},{}".format(q, cnt_com[q], cnt_all[q])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: probaly -> probably