This repository has been archived by the owner on Aug 16, 2023. It is now read-only.

Add configuration options to allow filtering of self-owned or employer-owned repositories #10

Closed
DuaneOBrien opened this issue Aug 20, 2019 · 11 comments

@DuaneOBrien
Collaborator

DuaneOBrien commented Aug 20, 2019

Currently Starfish does not filter out events in personal repositories, or in repositories owned by specific organizations. Future users may want to exclude contributions to repositories that their employer owns (key point: one employer may own many GitHub Organizations), or contributions to personal repositories (defined here as "any repository owned by the user who triggered the event"). We should add configuration options that allow future users to easily implement their own policies regarding where events will be counted.

The simplest version of this would be to:

Add variables to the .env.template:

  • Filter Out Events In User-Owned Repos? (true or false, default to false)
  • Filter Out Events From These Owners ([List,Of,Owners], default to an empty list)
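A minimal sketch of how these two settings might be read at startup. The variable names `FILTER_USER_OWNED_REPOS` and `FILTER_OUT_OWNERS` and the comma-separated list format are assumptions for illustration; the issue does not specify them.

```javascript
// Hypothetical variable names; the issue does not fix them.
function parseFilterConfig(env = process.env) {
  // Only the literal string "true" (any case) enables the filter; the default is false.
  const filterUserOwned =
    String(env.FILTER_USER_OWNED_REPOS || '').trim().toLowerCase() === 'true';

  // Accept "a,b,c" or "[a,b,c]"; the default is an empty list.
  const rawOwners = String(env.FILTER_OUT_OWNERS || '')
    .trim()
    .replace(/^\[|\]$/g, '');
  const excludedOwners = rawOwners
    .split(',')
    .map((owner) => owner.trim().toLowerCase())
    .filter((owner) => owner.length > 0);

  return { filterUserOwned, excludedOwners };
}
```

Owner names are lowercased here because GitHub logins are case-insensitive, so later comparisons can be done on normalized strings.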

Add new filterResponseFor functions that filter the events based off these variables.
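As a rough sketch of what such filters could look like: the function names below are hypothetical, and the code assumes each event has the shape returned by GitHub's user events endpoint, where `event.repo.name` is `"owner/repo"` and `event.actor.login` is the user who triggered the event.

```javascript
// The owner is the part of "owner/repo" before the slash, compared case-insensitively.
function repoOwnerOf(event) {
  return String(event.repo.name).split('/')[0].toLowerCase();
}

// Hypothetical name: drops events in repositories owned by the user who triggered them.
function filterResponseForUserOwnedRepos(events) {
  return events.filter(
    (event) => repoOwnerOf(event) !== String(event.actor.login).toLowerCase()
  );
}

// Hypothetical name: drops events in repositories owned by any listed user or organization.
function filterResponseForExcludedOwners(events, excludedOwners) {
  const excluded = new Set(excludedOwners.map((owner) => owner.toLowerCase()));
  return events.filter((event) => !excluded.has(repoOwnerOf(event)));
}
```

Keeping the two filters separate lets each be switched on independently, which matches the two separate settings proposed above.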

Add documentation in the README for using these env variables.

@DuaneOBrien DuaneOBrien changed the title from "Add configuration options to filter out repos by owner/org" to "Add configuration options to allow filtering of self-owned or employer-owned repositories" on Sep 26, 2019
@hnarasaki

I'd like to try this one!

@danisyellis
Collaborator

Sounds good :-)
Before you submit your PR, make sure you have the most recent version of the codebase merged in with your code (i.e. do a pull from upstream and handle any merge conflicts). Since there's someone else working on another issue right now, it's possible there will be new code to merge.

@hnarasaki

hnarasaki commented Oct 4, 2019

@danisyellis Thanks for the heads up! :) I'm working on my local setup.

@hnarasaki

Finished setting up my dev env today. Thank you for all the help! Will probably resume working on this around Tues. or so.

@danisyellis
Collaborator

FYI, while researching something else, I noticed that each event has an "author_association", for example "author_association": "NONE" or "author_association": "OWNER".

This might be the best way to find out if it's the person's own repo or not.

(I checked: Starfish, for example, is owned by indeedeng, not danisyellis, so I'm a "collaborator" there, not an "owner". Even if we filtered out repos owned by the author of the event, I would still get credit for my contributions to Starfish, a project I maintain, which is the behavior we're looking for.)
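A small sketch of that check, under some assumptions: the field appears to live on the issue, pull request, comment, or review object inside `event.payload` rather than on the event itself, and some event types (for example `PushEvent`) do not seem to carry it at all. The function names are hypothetical.

```javascript
// Payload objects that may carry author_association (an assumption, checked in this order
// so that a comment's own association wins over that of the issue or PR it belongs to).
const ASSOCIATION_SOURCES = ['comment', 'review', 'pull_request', 'issue'];

function authorAssociationOf(event) {
  const payload = event.payload || {};
  for (const key of ASSOCIATION_SOURCES) {
    const value = payload[key] && payload[key].author_association;
    if (value) return value;
  }
  return null; // no association available for this event type
}

// True only when GitHub reports the author as the owner of the repository.
function isInOwnRepo(event) {
  return authorAssociationOf(event) === 'OWNER';
}
```

One caveat: the value describes the author of the issue, pull request, or comment, who is not always the user who triggered the event (for example, when a maintainer closes someone else's pull request), so comparing the repo owner with the actor's login may still be needed as a fallback.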

API documentation is here if you need it https://developer.github.com/v3/

@jeffwilcox

I've implemented some of this in my franken-version of Starfish. It relies heavily on maintaining an ETag cache of GitHub responses to avoid rate-limiting problems.
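For readers unfamiliar with the technique, here is a minimal sketch of an ETag cache built on conditional requests. It is not the implementation described above: the `cachedGet` name, the in-memory `Map`, and the injected `fetchFn` parameter are all assumptions made for illustration.

```javascript
// cache: Map from URL to { etag, body }.
// fetchFn: any fetch-compatible function (injected so the sketch can be tested offline).
async function cachedGet(url, cache, fetchFn = fetch) {
  const cached = cache.get(url);
  // Send the stored ETag so the server can answer 304 Not Modified.
  const headers = cached ? { 'If-None-Match': cached.etag } : {};
  const response = await fetchFn(url, { headers });

  if (response.status === 304 && cached) {
    return cached.body; // unchanged since the last fetch; reuse the stored body
  }
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }

  const body = await response.json();
  const etag = response.headers.get('etag');
  if (etag) cache.set(url, { etag, body });
  return body;
}
```

As far as I know, GitHub does not count a 304 response to a properly authenticated conditional request against the primary rate limit, which is why this kind of cache helps. A long-running job would likely want a persistent store rather than an in-memory `Map`.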

The most expensive operations for me have come from having to walk forks: if someone contributes to a fork (or a fork of a fork) of a MSFT project, I need to exclude that event from the eligible events for the user.

Our approach included having:

  • a connection to an API of ours that lists all of our corporate orgs
  • an override list of corporate orgs we still want to include (confusing, I know)
  • an eligibility function that also reviews repo forks and upstreams
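To illustrate how those three pieces could fit together, here is a hedged sketch of an eligibility check that walks a repository's fork chain. It is not the actual implementation: the repo shape (`{ owner, parent }`, with `parent` being the repository this one was forked from) and the function name are hypothetical, and in practice each parent would have to be fetched from the GitHub API, which is where the expense comes from.

```javascript
// repo: { owner: string, parent?: repo } (hypothetical, simplified shape).
// corporateOrgs: owners whose repos should not count.
// includedOverrides: corporate orgs whose repos should count anyway.
function isEligibleRepo(repo, corporateOrgs, includedOverrides = []) {
  const corporate = new Set(corporateOrgs.map((org) => org.toLowerCase()));
  const overrides = new Set(includedOverrides.map((org) => org.toLowerCase()));

  // Walk from the repo itself up through every upstream it was forked from.
  for (let current = repo; current; current = current.parent) {
    const owner = String(current.owner).toLowerCase();
    if (corporate.has(owner) && !overrides.has(owner)) {
      return false; // the repo, or one of its upstreams, is company-controlled
    }
  }
  return true;
}
```

With this shape, a personal fork of a fork of a corporate project is rejected because the walk eventually reaches the corporate owner.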

The reason I had to spend the time on forks was because we were looking at over 20K employees on GitHub doing public activity (which is "great"), but then, turns out a much smaller # (four figures) of folks were contributing to open source not controlled by our company.

I wish we had a more efficient way of collecting the data. Our rolling job getting this data takes about 16 hours a day right now. (It's chock full of sleeps to avoid abuse limits with GitHub).

@danisyellis
Collaborator

Jeff, I just want to make sure that you know about Microsoft's GHCrawler. I know you're at Microsoft, so you probably already know about it, but I believe that the crawler (and the ghrequestor project it uses) is robust enough that it could crawl all the data you need in much less than 16 hours?

I have a PR in to the crawler to make it crawl user events, though I haven't heard anything back about it and it's been a few months.

@jeffwilcox

Thanks, appreciate it! My team shipped the crawler, but recently shifted the charter to a team that is dedicated to this sort of data collection, and unfortunately I feel like the project as it exists may not be getting too much love... it was essentially a hard internal fork they made. :/

We are working with them to eventually hook up to more of these events, thanks for the reminder.

Due to the way we have it internally organized, however, we'd have to build a way to partition the public events for non-corporate repos, which is part of why we've hacked things with these long-running jobs for the time being.

I still just want to find a way to dump a bag of cash for better real-time data in our GitHub account somehow vs having to page through APIs or use graphql.

@danisyellis
Collaborator

Thanks for the crawler! It's been really useful to us - especially back when I was the only engineer on our open source team!
Ahhhh, there's been an internal fork. That makes sense. I figured no one was using the open source version over there any more because ghrequestor hasn't been fixed to deal with GitHub's API deprecation in July.
No worries. The crawler is a more robust solution than we need for our user event gathering at the moment, so we're going to switch to using something more basic over here (might even use octokit).

This may be a super-naive suggestion because I don't know all the complexities of your system, but could you use the crawler to get all the contribution data efficiently and then once the data is in the crawler's database, do your filtering from there?
That's how we've been using ghcrawler at Indeed to gather all of our persistent employee event data (we use Starfish to determine voting eligibility in the moment and a project we call Goby for creating all of our contribution dashboards with persistent data).

  1. api call to users/exampleuser/events using ghcrawler
  2. grab all events from ghcrawler's database's events collection where element.type is one of the events we care about (PR, IssueComment, etc.)
  3. filter that by date (this is where you could filter by orgs/repos?)
  4. send the resulting data to another database, from which we do all of our dashboard making and other data analysis.
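Steps 2 and 3 could be sketched roughly as below, assuming the stored events keep the shape GitHub's events API returns (`type`, `created_at`, `repo.name`). The set of counted types, the function name, and the optional owner filter are assumptions for illustration, not Goby's actual code.

```javascript
// Assumed list of counted event types; the steps above only name "PR, IssueComment, etc."
const COUNTED_TYPES = new Set([
  'PullRequestEvent',
  'IssueCommentEvent',
  'IssuesEvent',
  'CommitCommentEvent',
  'PullRequestReviewCommentEvent',
]);

// Keeps events of a counted type, created in [start, end), outside the excluded owners.
function selectCountedEvents(events, { start, end, excludedOwners = [] }) {
  const excluded = new Set(excludedOwners.map((owner) => owner.toLowerCase()));
  const startMs = new Date(start).getTime();
  const endMs = new Date(end).getTime();

  return events.filter((event) => {
    if (!COUNTED_TYPES.has(event.type)) return false;
    const createdMs = new Date(event.created_at).getTime();
    if (createdMs < startMs || createdMs >= endMs) return false;
    const owner = String(event.repo.name).split('/')[0].toLowerCase();
    return !excluded.has(owner);
  });
}
```

Filtering by owner in the same pass as the date filter keeps the org/repo policy in one place, which is the spot step 3 suggests for it.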

Just a thought. It might not make sense for your use case.

I too wish GH was making it easier to grab user events. They don't want to make a webhook (which I can understand from a user security perspective, but still makes me sad). And we were experimenting with switching to GraphQL for Goby but, as far as we can tell, there's no way in the GH GraphQL API to efficiently grab all of a given user's events. We'd have to switch to counting only the things GH thinks of as a "contribution" and we're not willing to lose Issue Comments, Commit Comments, and PR Review Comments. To us, those are valuable contributions people can make.

@danisyellis
Collaborator

UPDATE: For anyone interested in this issue, it's currently not assigned. Feel free to leave a comment if you'd like me to assign it to you.

@danisyellis
Collaborator

I'm closing this issue because:
a) I think the conversation chain in here might be overwhelming to people new to Starfish who are looking to contribute
b) I split this issue into two issues (#65 and #66): one for self-owned repos and one for employer-owned repos


4 participants