Add scripts for PR metrics from github API #13

Open
wants to merge 29 commits into
base: main
b10b809
Add scripts for PR metrics from github API
mpg Jun 30, 2020
c531cae
Update requirements & allow use from venv
mpg Sep 30, 2020
a715053
Add comments about Ubuntu 20.04
mpg Dec 24, 2020
bae9af3
Make get-pr-data 10x faster
mpg Dec 24, 2020
00d8499
Avoid potential better-than-reality lifetime figures
mpg Dec 30, 2020
28dffa7
Adjust pr_dates() to reduce risk of misuse
mpg Dec 30, 2020
3d7880c
Adapt detection of community PRs
mpg Apr 2, 2021
37844d4
Add warning about making this work on 16.04
mpg Apr 2, 2021
cf9e41d
Avoid repeating the start date in many places
mpg Apr 2, 2021
f06becf
Update outdated comment
mpg Apr 2, 2021
cc05d6a
Make first and last date configurable
mpg Apr 2, 2021
ed1adea
Fix flake8 warnings
mpg Apr 2, 2021
08c0b7c
Rotate labels for quarters
mpg Apr 2, 2021
b2ee775
Clarify community detection
mpg May 19, 2021
3feb297
Smarter handling of p.mergeable in get-pr-data
mpg May 20, 2021
1d58093
Update pending-mergeability
mpg May 20, 2021
5f6d268
We no longer use labels for community PRs
mpg Sep 30, 2022
94533e1
Update list of core contributors
mpg Oct 12, 2022
b7f7f76
Update Readme (PR last date)
mpg Oct 12, 2022
e69fb3a
Shift one month for quarterly PR lifetime
mpg Jan 11, 2023
cd9c1f6
Update list of team member
mpg Jan 11, 2023
4d58ba0
Revert "Shift one month for quarterly PR lifetime"
mpg Jan 11, 2023
45fa6ce
Use statistics.median
mpg Jan 11, 2023
f1b54e1
Handle uncertainty about lifetimes
mpg Jan 11, 2023
ac21a51
Update Readme about incomplete results
mpg Jan 12, 2023
e095fc3
Update team members with current reviewers
mpg Apr 6, 2023
ce08049
Draw error bars, don't skip uncertain quarters
mpg Apr 6, 2023
c86237c
New script pr-backlog.py
mpg Apr 6, 2023
b7a02f6
Cosmetic adjustments
mpg Apr 6, 2023
4 changes: 4 additions & 0 deletions pr-metrics/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
__pycache__
pr-data.p
*.png
*.csv
65 changes: 65 additions & 0 deletions pr-metrics/Readme.md
@@ -0,0 +1,65 @@
These scripts collect some metrics about mbed TLS PRs over time.

Usage
-----

1. `./get-pr-data.py` - this takes a long time and requires the environment
variable `GITHUB_API_TOKEN` to be set to a valid [github API
token](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token) (unauthenticated access to the API has a limit on the number of requests that is too low for our number of PRs). It generates `pr-data.p` with pickled data.
2. `./do.sh` - this works offline from the data in
`pr-data.p` and generates a bunch of png and csv files.

For example, the report for the last quarter can be generated with:
```
./get-pr-data.py # assuming GITHUB_API_TOKEN is set in the environment
./do.sh
```
Note that the metric "median lifetime" is special in that it can't always be
computed right after the quarter is over: it sometimes needs more time to pass
and/or more PRs from that quarter to be closed. In that case, the uncertain
quarter(s) will be shown with an error bar in the png graph, and in the csv file
an interval will be reported for the value(s) that can't be determined yet.

By default, data extends from start of 2020 to end of the previous quarter. It
is possible to change that range using environment variables, for example:
```
PR_FIRST_DATE=2016-01-01 PR_LAST_DATE=2022-12-31 ./do.sh
```
gives data from 2016 to 2022 inclusive.
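
As a rough illustration of how such a range could be resolved (this is a sketch, not necessarily how these scripts do it): the helper names `previous_quarter_end` and `date_range` are hypothetical, and the `YYYY-MM-DD` format is assumed from the example above.

```python
import os
from datetime import date, datetime, timedelta


def previous_quarter_end(today):
    """Return the last day of the quarter before the one containing today."""
    first_month = 3 * ((today.month - 1) // 3) + 1  # 1, 4, 7 or 10
    return date(today.year, first_month, 1) - timedelta(days=1)


def parse_day(text):
    """Parse YYYY-MM-DD; raises ValueError on impossible dates."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def date_range(environ=None, today=None):
    """Resolve (first, last) from the environment, falling back to defaults."""
    environ = os.environ if environ is None else environ
    today = date.today() if today is None else today
    first = environ.get("PR_FIRST_DATE")
    last = environ.get("PR_LAST_DATE")
    first = parse_day(first) if first else date(2020, 1, 1)
    last = parse_day(last) if last else previous_quarter_end(today)
    return first, last
```

With no variables set and a run on 2023-04-06, this would give the range 2020-01-01 to 2023-03-31.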

Requirements
------------

These scripts require:

- Python >= 3.6 (required by recent enough matplotlib)
- matplotlib >= 3.1 (3.0 doesn't work)
- PyGithub >= 1.43 (any version should work, that was just the oldest tested)

### Ubuntu 20.04 (and probaly 18.04)
Copy link

Nit: probaly -> probably

Suggested change
### Ubuntu 20.04 (and probaly 18.04)
### Ubuntu 20.04 (and probably 18.04)

Copy link

Also works on 22.04 (tested by me) in case you want to note that for the future's sake.


A simple `apt install python3-github python3-matplotlib` is enough.

### Ubuntu 16.04

On Ubuntu 16.04, by default only Python 3.5 is available, which doesn't
support a recent enough matplotlib to support those scripts, so the following
was used to run those scripts on 16.04:

    sudo add-apt-repository ppa:deadsnakes/ppa
    sudo apt update
    sudo apt install python3.6 python3.6-venv
    python3.6 -m venv 36env
    source 36env/bin/activate
    pip install --upgrade pip
    pip install matplotlib
    pip install pygithub

See `requirements.txt` for an example of a set of working versions.

Note: if you do this, I strongly recommend uninstalling python3.6,
python3.6-venv and all their dependencies, then removing the deadsnakes PPA
before any upgrade to 18.04. Failing to do so will result in
dependency-related headaches as some packages in 18.04 depend on a specific
version of python3.6 but the version from deadsnakes is higher, so apt won't
downgrade it and manual intervention will be required.
9 changes: 9 additions & 0 deletions pr-metrics/do.sh
@@ -0,0 +1,9 @@
#!/bin/sh

set -eu

for topic in created closed pending lifetime backlog; do
    echo "PRs $topic..."
    rm -f prs-${topic}.png prs-${topic}.csv
    ./pr-${topic}.py > prs-${topic}.csv
done
41 changes: 41 additions & 0 deletions pr-metrics/get-pr-data.py
@@ -0,0 +1,41 @@
#!/usr/bin/env python3
# coding: utf-8

"""Get PR data from github and pickle it."""

Copy link
Contributor

General suggestion, not a blocker: all scripts should support --help, and should have most of their code inside functions or classes so that they can be called from another script bypassing the command line interface.
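
One possible shape for this, as a minimal sketch: the option names `--repo` and `--output` and the `collect()` placeholder are hypothetical and not part of this PR. `argparse` provides `--help` automatically, and `main(argv)` can be called from another script without going through the command line.

```python
import argparse


def build_parser():
    """Build the command line parser; argparse adds --help automatically."""
    parser = argparse.ArgumentParser(
        description="Get PR data from github and pickle it.")
    parser.add_argument("--repo", default="ARMMbed/mbedtls",
                        help="repository to query (default: %(default)s)")
    parser.add_argument("--output", default="pr-data.p",
                        help="pickle file to write (default: %(default)s)")
    return parser


def collect(repo, output):
    """Placeholder for the fetch-and-pickle logic (hypothetical)."""
    return "would fetch {} into {}".format(repo, output)


def main(argv=None):
    """Entry point usable from the command line or from another script."""
    args = build_parser().parse_args(argv)
    return collect(args.repo, args.output)
```

A script built this way would end with the usual `if __name__ == "__main__":` guard calling `main()`.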

import pickle
import os

from github import Github

if "GITHUB_API_TOKEN" in os.environ:
Copy link
Contributor

Suggestion for later: make the token API (command line options, file storage, …) compatible with the official command line tool gh. (This is not the markedly less popular python gh.) I'd like to make my own scripts compatible too — we (you, me, and I guess at least @bensze01 as well) should coordinate and write a python module for that, if there isn't already one. (I didn't find one in a cursory search but “python github gh” are not very specific search terms.)
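
For illustration only, a lookup order compatible with `gh` could look like the sketch below. It assumes that `gh` reads `GH_TOKEN` and `GITHUB_TOKEN` from the environment and that `gh auth token` prints the stored token; the function name `find_token` is hypothetical.

```python
import os
import shutil
import subprocess


def find_token(environ=None):
    """Return a GitHub token from the environment or from gh, else None."""
    environ = os.environ if environ is None else environ
    # Variable used by this PR first, then the ones gh is assumed to honour.
    for name in ("GITHUB_API_TOKEN", "GH_TOKEN", "GITHUB_TOKEN"):
        if environ.get(name):
            return environ[name]
    # Fall back to asking the gh command line tool, if it is installed.
    if shutil.which("gh"):
        result = subprocess.run(["gh", "auth", "token"],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL,
                                universal_newlines=True)
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout.strip()
    return None
```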

    token = os.environ["GITHUB_API_TOKEN"]
else:
    # Stop here: continuing without a token would fail with a NameError below.
    raise SystemExit("You need to provide a GitHub API token")

g = Github(token)
r = g.get_repo("ARMMbed/mbedtls")

prs = list()
for p in r.get_pulls(state="all"):
    print(p.number)
Copy link

I'd be tempted to make this more informative - maybe changing it from p.number to "Fetching #" + str(p.number)

Might not be necessary for such a simple script though

    # Accessing p.mergeable forces completion of PR data (by default, only
    # basic info such as status and dates is available) but makes things
    # slower (about 10x). Only do that for open PRs; we don't need the extra
    # info for old PRs (only the dates which are part of the basic info).
    if p.state == 'open':
        dummy = p.mergeable
    prs.append(p)

# After a branch has been updated, github doesn't immediately go and recompute
# potential conflicts for all open PRs against this branch; instead it does
# that when the info is requested and even then it's done asynchronously: the
# first request might return no data, but if we come back after we've done all
# the other PRs, the info should have become available in the meantime.
for p in prs:
    if p.state == 'open' and p.mergeable is None:
        print(p.number, 'update')
        p.update()

with open("pr-data.p", "wb") as f:
    pickle.dump(prs, f)
36 changes: 36 additions & 0 deletions pr-metrics/pending-mergeability.py
@@ -0,0 +1,36 @@
#!/usr/bin/env python3
# coding: utf-8

"""Produce summary of PRs pending per branch and their mergeability status."""
Copy link
Contributor

Please document the requirements to run this script (run get-pr-data.py to produce pr-data.p). Goes for the other scripts as well.


import pickle
from datetime import datetime
from collections import Counter

with open("pr-data.p", "rb") as f:
    prs = pickle.load(f)

c_open = Counter()
c_mergeable = Counter()
c_recent = Counter()
c_recent2 = Counter()
Copy link
Contributor

Non-blocker: is recent2 more recent than recent or less? Are they exclusive or staggered? Better names would help.

Copy link

I'd prefer cnt_open or even count_open (etc), which I think help readability quite a bit


for p in prs:
    if p.state != "open":
        continue

    branch = p.base.ref
    c_open[branch] += 1
    if p.mergeable:
        c_mergeable[branch] += 1
    days = (datetime.now() - p.updated_at).days
Copy link
Contributor

Since now() is called on every element, if you run this script twice at the same time, you might get inconsistently different day counts because prs is traversed in a different order. Ok, this is not critical in this reporting script, but it would be better practice to call now() only once and work from that reference time.

Also applies to other scripts that call now in a loop.
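
A minimal sketch of that practice, assuming naive `datetime` values as in the script; the helper name `days_since_update` is hypothetical. The reference time is taken once and reused for every element, so all day counts are measured against the same instant.

```python
from datetime import datetime


def days_since_update(updated_at_values, now=None):
    """Return whole days elapsed for each timestamp, from one reference time."""
    # Take the reference time once, not once per element.
    now = datetime.now() if now is None else now
    return [(now - updated).days for updated in updated_at_values]
```

Passing `now` explicitly also makes the result reproducible, for example in tests.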

    if days < 31:
        c_recent[branch] += 1
    if days < 8:
        c_recent2[branch] += 1


print(" branch: open, mergeable, <31d, <8d")
for b in sorted(c_open, key=lambda b: c_open[b], reverse=True):
Copy link

I'd traditionally define lambda functions as lambda x: etc, notable here since this makes it seem like there's some relationship between the lambda b and for b syntactically, which there isn't.
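
As an aside, one way to avoid the lambda altogether is `Counter.most_common()`, which yields `(key, count)` pairs from the highest count to the lowest. A small sketch with made-up counts:

```python
from collections import Counter

# Made-up counts, for illustration only.
c_open = Counter({"mbedtls-2.28": 6, "development": 177, "mbedtls-2.16": 1})

# Branch names ordered from most to fewest open PRs.
ordered = [branch for branch, _count in c_open.most_common()]
```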

    print("{:>20}: {: 10}, {: 10}, {: 10}, {:10}".format(
Copy link

This produces hard to read tables if a branch name is > 20 chars long (see below). However, I can't immediately see a good/simple way to fix this, so it might just have to be accepted as a limitation

              branch:       open,  mergeable,       <31d,        <8d
         development:        177,         91,         51,         41
        mbedtls-2.28:          6,          6,          3,          3
dev/gilles-peskine-arm/psa-test-op-fail:          1,          1,          1,          1
        mbedtls-2.16:          1,          0,          0,          0
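
One possible approach, sketched here under the assumption that all rows are available before printing, is to size the first column from the longest branch name. The function name `format_table` is hypothetical.

```python
def format_table(rows):
    """Format (branch, open, mergeable, recent, recent2) rows as aligned lines."""
    header = ("branch", "open", "mergeable", "<31d", "<8d")
    all_rows = [header] + list(rows)
    # Width of the first column: the longest branch name (or the header).
    width = max(len(row[0]) for row in all_rows)
    template = "{:>{width}}: {:>10}, {:>10}, {:>10}, {:>10}"
    return [template.format(*row, width=width) for row in all_rows]
```

The cost is that the table gets wider for everyone as soon as one long branch name appears.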

        b, c_open[b], c_mergeable[b], c_recent[b], c_recent2[b]))
72 changes: 72 additions & 0 deletions pr-metrics/pr-backlog.py
@@ -0,0 +1,72 @@
#!/usr/bin/env python3
# coding: utf-8

"""Produce analysis of PR backlog over time"""

from prs import pr_dates, first, last, quarter

from datetime import datetime, timedelta
from collections import Counter
from itertools import chain

import matplotlib.pyplot as plt

new_days = 90
Copy link
Contributor

There was a bit of a push to standardise ages for what is considered recent/old in OSS a while back. The thresholds picked were 15 and 90.

It might be useful to have a few extra ranges, e.g. <15, 15-90, 90-365, >365 to align better with this? If I have time I'll push an update.

Copy link
Contributor

Actually, thinking more, this is a lot like the median lifetime graph, but with a couple of thresholds which are based on age rather than percentiles. Would it be better to have a graph that shows e.g., median, 75th percentile, 95th percentile? Pros and cons either way I think, maybe worth exploring it if it's quick/easy to do but I'm not sure if it would be better or not?

Copy link
Contributor Author

Oh, I didn't know about the push for standardised thresholds, indeed it would be good to align with that (and add some of our own if needed). Unfortunately the way the script is structured currently makes it a very manual change (you can't just give a list of thresholds at the top and have everything else work automagically).
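
For what it's worth, a threshold-driven version could be sketched as below. This is not code from the PR: the names `bucket_labels`, `bucket_index` and `count_by_bucket` are hypothetical, and an age equal to a threshold is assumed to fall in the younger bucket, matching the `<=` comparisons in the current script.

```python
from bisect import bisect_left
from collections import Counter

THRESHOLDS = (15, 90, 365)  # days; the ranges suggested above


def bucket_labels(thresholds):
    """Return one label per age range, youngest first."""
    bounds = [0] + list(thresholds)
    labels = ["{}-{} days".format(lo, hi) for lo, hi in zip(bounds, bounds[1:])]
    labels.append(">{} days".format(thresholds[-1]))
    return labels


def bucket_index(age_days, thresholds):
    """Index of the range containing age_days (age == threshold stays younger)."""
    return bisect_left(thresholds, age_days)


def count_by_bucket(ages, thresholds):
    """Count how many ages fall in each range."""
    counts = Counter(bucket_index(age, thresholds) for age in ages)
    return [counts[i] for i in range(len(thresholds) + 1)]
```

The plotting code would then loop over the buckets, stacking one bar series per label.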

Copy link
Contributor Author

Regarding adding other percentiles to the median one, that's something I've been considering for a while, but the problem is over what set. Currently it's over "PRs created this quarter", which means, considering Q1 for the sake of concreteness we can only compute the median (or an upper bound for it) after we've closed at least 50% of PRs created in Q1. Fortunately, that's usually the case at the very beginning of Q2 when we prepare our report. (Even then, we might get only a range, as the value could still get lower if we closed a lot of PRs created near the end of Q1 at the very beginning of Q2.)

But for the 4th quartile, resp. 95th percentile, we'd need to have closed 75%, resp. 95% of PRs created in Q1 by the time we produce our report in early Q2, which realistically is not going to happen most of the time.

We could avoid the problem with incomplete data entirely by considering instead the set of PRs we closed this quarter - there all the lifetimes are known for sure and we can do stats without uncertainty. But it doesn't really tell the same thing: for example, doing a lot of historical review one quarter would raise the median age of PRs closed this quarter, but that's still a good thing. So the data might become more difficult to interpret.

So, I think there are basically three ways to select PRs over which we make stats / grouping / etc for one quarter:

  • PRs created this quarter;
  • PRs still open at the end of the quarter (or any specific date);
  • PRs closed this quarter;

and each set will give slightly different information.

Wdyt?
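
To make the incomplete-data problem concrete, here is a sketch of bounding a percentile over "PRs created this quarter" when some are still open. It is an illustration under stated assumptions, not the script's actual method: it uses a nearest-rank quantile (unlike `statistics.median`, it does not average the two middle values), and `quantile_bounds` is a hypothetical name. The lower bound pretends every open PR closes today; the upper bound pretends they never close.

```python
import math


def quantile_bounds(closed_lifetimes, open_ages, fraction=0.5):
    """Return (low, high) bounds on the nearest-rank quantile of lifetimes.

    high is math.inf when too few PRs are closed to bound the quantile.
    """
    total = len(closed_lifetimes) + len(open_ages)
    if total == 0:
        raise ValueError("no PRs to compute a quantile over")
    rank = max(math.ceil(fraction * total) - 1, 0)
    # Optimistic case: each open PR closes now, so its lifetime is its age.
    low = sorted(list(closed_lifetimes) + list(open_ages))[rank]
    # Pessimistic case: open PRs stay open forever.
    high = sorted(list(closed_lifetimes) + [math.inf] * len(open_ages))[rank]
    return low, high
```

With 6 of 10 PRs closed, the median gets a finite interval, while the 95th percentile has no finite upper bound, which is the situation described above.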

Copy link
Contributor

Agree, there isn't an easy answer or obvious better way. So let's leave as-is for now.

old_days = 365

new = Counter()
med = Counter()
old = Counter()

for beg, end, com in pr_dates():
    if end is None:
        tomorrow = datetime.now().date() + timedelta(days=1)
        n_days = (tomorrow - beg).days
    else:
        n_days = (end - beg).days
    for i in range(n_days):
        q = quarter(beg + timedelta(days=i))
        q1 = quarter(beg + timedelta(days=i+1))
        # Only count on each quarter's last day
        if q == q1:
            continue
        if i <= new_days:
            new[q] += 1
        elif i <= old_days:
            med[q] += 1
        else:
            old[q] += 1

first_q = quarter(first)
last_q = quarter(last)

quarters = (q for q in chain(new, med, old) if first_q <= q <= last_q)
quarters = tuple(sorted(set(quarters)))

new_y = tuple(new[q] for q in quarters)
med_y = tuple(med[q] for q in quarters)
old_y = tuple(old[q] for q in quarters)
sum_y = tuple(old[q] + med[q] for q in quarters)

old_name = "older than {} days".format(old_days)
med_name = "medium"
new_name = "recent (less than {} days old)".format(new_days)

width = 0.9
fig, ax = plt.subplots()
ax.bar(quarters, old_y, width, label=old_name)
ax.bar(quarters, med_y, width, label=med_name, bottom=old_y)
ax.bar(quarters, new_y, width, label=new_name, bottom=sum_y)
ax.legend(loc="upper left")
ax.grid(True)
ax.set_xlabel("quarter")
ax.set_ylabel("Number of PRs pending")
ax.tick_params(axis="x", labelrotation=90)
fig.suptitle("State of the PR backlog at the end of each quarter")
fig.set_size_inches(12.8, 7.2) # default 100 dpi -> 720p
fig.savefig("prs-backlog.png")

print("Quarter,recent,medium,old,total")
for q in quarters:
    print("{},{},{},{},{}".format(q, new[q], med[q], old[q],
                                  new[q] + med[q] + old[q]))
46 changes: 46 additions & 0 deletions pr-metrics/pr-closed.py
@@ -0,0 +1,46 @@
#!/usr/bin/env python3
Copy link
Contributor

Several scripts, especially pr-closed and pr-created, have a lot of code in common and should be factored into one with multiple outputs or options to choose between outputs.
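
A sketch of what the shared part might look like, assuming `pr_dates()` yields `(created, closed, is_community)` tuples as the loops in these scripts suggest. `count_by_quarter` is a hypothetical name, and `demo_quarter` is only a stand-in for the real `quarter()` from `prs.py`, whose label format is not shown in this diff.

```python
from collections import Counter
from datetime import date


def demo_quarter(day):
    """Stand-in for prs.quarter(); the real label format may differ."""
    return "{:02}q{}".format(day.year % 100, (day.month - 1) // 3 + 1)


def count_by_quarter(pr_records, quarter, use_end):
    """Count all and community PRs per quarter, by creation or closing date."""
    cnt_all, cnt_com = Counter(), Counter()
    for beg, end, com in pr_records:
        when = end if use_end else beg
        if when is None:  # still open: nothing to count for "closed"
            continue
        q = quarter(when)
        cnt_all[q] += 1
        if com:
            cnt_com[q] += 1
    return cnt_all, cnt_com


records = [(date(2023, 1, 5), date(2023, 4, 2), True),
           (date(2023, 2, 1), None, False)]
created_all, created_com = count_by_quarter(records, demo_quarter, use_end=False)
closed_all, closed_com = count_by_quarter(records, demo_quarter, use_end=True)
```

The plotting and csv output could then take the two counters plus a word ("created" or "closed") for the labels.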

# coding: utf-8

"""Produce graph of PRs closed by time period."""

from prs import pr_dates, quarter, first, last

from collections import Counter

import matplotlib.pyplot as plt

first_q = quarter(first)
last_q = quarter(last)

cnt_all = Counter()
cnt_com = Counter()
Copy link
Contributor

I could guess that cnt stands for “count” or “counter” even without seeing = Counter(), but what's com? “Command”? “Communication”?

(Obviously by reading the rest of the code I can tell that it's “community”. But I wouldn't have guessed.)

And actually, as I wasn't familiar with collections.Counter, I first thought that these were counters, but they're actually dictionaries of counters, or multisets. (I consider this a bad choice of name in the Python standard library. It's fine for this script to follow this naming choice.) It would help to know what the keys are. This goes in other scripts that use Counter as well.
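
For readers in the same position, a tiny illustration of how `collections.Counter` behaves here: it maps each key to a count, and keys that were never incremented read as 0 without being added. The quarter labels below are made up; the real keys are whatever `quarter()` returns.

```python
from collections import Counter

cnt_all = Counter()
for q in ["22q4", "23q1", "23q1"]:  # one made-up quarter label per PR
    cnt_all[q] += 1

seen = cnt_all["23q1"]      # count for a quarter that has PRs
missing = cnt_all["21q1"]   # a quarter with no PRs reads as 0
```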


for beg, end, com in pr_dates():
    if end is None:
        continue
    q = quarter(end)
    cnt_all[q] += 1
    if com:
        cnt_com[q] += 1

quarters = tuple(sorted(q for q in cnt_all if first_q <= q <= last_q))

prs_com = tuple(cnt_com[q] for q in quarters)
prs_team = tuple(cnt_all[q] - cnt_com[q] for q in quarters)

width = 0.9
fig, ax = plt.subplots()
ax.bar(quarters, prs_com, width, label="community")
ax.bar(quarters, prs_team, width, label="core team", bottom=prs_com)
ax.legend(loc="upper left")
ax.grid(True)
ax.set_xlabel("quarter")
ax.set_ylabel("Number of PRs closed")
ax.tick_params(axis="x", labelrotation=90)
fig.suptitle("Number of PRs closed per quarter")
fig.set_size_inches(12.8, 7.2) # default 100 dpi -> 720p
fig.savefig("prs-closed.png")
Copy link
Contributor

Is the world ready for svg yet?


print("Quarter,community closed,total closed")
for q in quarters:
    print("{},{},{}".format(q, cnt_com[q], cnt_all[q]))
44 changes: 44 additions & 0 deletions pr-metrics/pr-created.py
@@ -0,0 +1,44 @@
#!/usr/bin/env python3
# coding: utf-8

"""Produce graph of PRs created by time period."""

from prs import pr_dates, quarter, first, last

from collections import Counter

import matplotlib.pyplot as plt

first_q = quarter(first)
last_q = quarter(last)

cnt_all = Counter()
cnt_com = Counter()

for beg, end, com in pr_dates():
    q = quarter(beg)
    cnt_all[q] += 1
    if com:
        cnt_com[q] += 1

quarters = tuple(sorted(q for q in cnt_all if first_q <= q <= last_q))

prs_com = tuple(cnt_com[q] for q in quarters)
prs_team = tuple(cnt_all[q] - cnt_com[q] for q in quarters)

width = 0.9
fig, ax = plt.subplots()
ax.bar(quarters, prs_com, width, label="community")
ax.bar(quarters, prs_team, width, label="core team", bottom=prs_com)
ax.legend(loc="upper left")
ax.grid(True)
ax.set_xlabel("quarter")
ax.set_ylabel("Number of PRs created")
ax.tick_params(axis="x", labelrotation=90)
fig.suptitle("Number of PRs created per quarter")
fig.set_size_inches(12.8, 7.2) # default 100 dpi -> 720p
fig.savefig("prs-created.png")

print("Quarter,community created,total created")
for q in quarters:
    print("{},{},{}".format(q, cnt_com[q], cnt_all[q]))