Add configuration options to allow filtering of self-owned or employer-owned repositories #10
Comments
I'd like to try this one! |
Sounds good :-) |
@danisyellis Thanks for the heads up! :) I'm working on my local setup. |
Finished setting up my dev env today. Thank you for all the help! Will probably resume working on this around Tues. or so. |
FYI, while researching something else, I noticed that each event has an "author_association". This might be the best way to find out whether it's the person's own repo or not. (I checked: for Starfish, because it's owned by indeedeng, not danisyellis, I'm a "collaborator", not an "owner", so even if we filtered out repos owned by the author of the event, I would still get credit for my contributions to Starfish, a project I maintain, which is the behavior we're looking for.) API documentation is here if you need it: https://developer.github.com/v3/ |
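A minimal sketch of the "author_association" idea, in TypeScript. The event shape below is a trimmed-down assumption: in the payloads I have looked at, the field sits on the comment, review, pull request, or issue object inside the payload rather than on the event itself, so the helper checks those in order. The names `authorAssociation` and `isNotInOwnRepo` are made up for illustration and are not part of Starfish.

```typescript
// Hypothetical, trimmed-down event shape; real payloads carry many more fields.
interface HasAssociation {
  author_association?: string; // e.g. "OWNER", "COLLABORATOR", "MEMBER", "NONE"
}

interface GitHubEvent {
  type: string;
  payload: {
    comment?: HasAssociation;
    review?: HasAssociation;
    pull_request?: HasAssociation;
    issue?: HasAssociation;
  };
}

// Return the association found on the most specific object in the payload.
// A comment or review is checked before the pull request or issue it belongs to,
// because the parent object describes its own author, not the commenter.
function authorAssociation(event: GitHubEvent): string | undefined {
  const p = event.payload;
  return (
    p.comment?.author_association ??
    p.review?.author_association ??
    p.pull_request?.author_association ??
    p.issue?.author_association
  );
}

// Keep an event unless its author is reported as the owner of the repository.
// Events without the field are kept; that default is an assumption of this sketch.
function isNotInOwnRepo(event: GitHubEvent): boolean {
  return authorAssociation(event) !== "OWNER";
}
```

With this, something like `events.filter(isNotInOwnRepo)` would drop activity in a person's own repositories while keeping a maintainer's work in an organization-owned project.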
I've implemented some of this in my franken-version of Starfish. It relies heavily on maintaining an ETag cache of GitHub responses to avoid rate-limiting problems. The most expensive operations for me have come from having to walk forks: if someone contributes to a fork (or a fork of a fork) of an MSFT project, I need to exclude that event from the eligible events for the user. Our approach included having:
The reason I had to spend the time on forks was because we were looking at over 20K employees on GitHub doing public activity (which is "great"), but then, turns out a much smaller # (four figures) of folks were contributing to open source not controlled by our company. I wish we had a more efficient way of collecting the data. Our rolling job getting this data takes about 16 hours a day right now. (It's chock full of sleeps to avoid abuse limits with GitHub). |
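To make the fork-walking step concrete, here is a rough TypeScript sketch. It is not the implementation described above; it assumes repository metadata has already been fetched and cached in a map keyed by "owner/name", with a `parent` entry for forks (GitHub's repository API reports a parent for forked repositories). `RepoInfo`, `ownersInForkChain`, and `isForkOfExcludedOwner` are hypothetical names.

```typescript
// Hypothetical cached repository record, keyed elsewhere by "owner/name".
interface RepoInfo {
  owner: string;
  parent?: string; // full name of the repo this one was forked from, if any
}

// Follow parent links from a repo up to its root and collect every owner on the way.
// The `seen` set guards against cycles in bad cache data.
function ownersInForkChain(fullName: string, repos: Map<string, RepoInfo>): string[] {
  const owners: string[] = [];
  const seen = new Set<string>();
  let current: string | undefined = fullName;
  while (current !== undefined && !seen.has(current)) {
    seen.add(current);
    const info: RepoInfo | undefined = repos.get(current);
    if (info === undefined) break; // repo not in the cache: stop walking
    owners.push(info.owner);
    current = info.parent;
  }
  return owners;
}

// True if the repo, or anything it was forked from, belongs to an excluded owner.
// `excluded` is expected to hold lowercase owner names.
function isForkOfExcludedOwner(
  fullName: string,
  repos: Map<string, RepoInfo>,
  excluded: Set<string>
): boolean {
  return ownersInForkChain(fullName, repos).some((o) => excluded.has(o.toLowerCase()));
}
```

The expensive part in practice is filling the map, since each unknown parent costs another API request; the walk itself is cheap once the data is cached.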
Jeff, I just want to make sure that you know about Microsoft's GHCrawler. I know you're at Microsoft, so you probably already know about it, but I believe the crawler (and the ghrequestor project it uses) is robust enough that it could crawl all the data you need in much less than 16 hours. I have a PR in to the crawler to make it crawl user events, though I haven't heard anything back about it and it's been a few months. |
Thanks, appreciate it! My team shipped the crawler, but recently shifted the charter to a team that is dedicated to this sort of data collection, and unfortunately I feel like the project as it exists may not be getting too much love... it was essentially a hard internal fork they made. :/ We are working with them to eventually hook up to more of these events, thanks for the reminder. Due to the way we have it internally organized, however, we'd have to build a way to partition the public events for non-corporate repos, which is part of why we've hacked things with these long-running jobs for the time being. I still just want to find a way to dump a bag of cash for better real-time data in our GitHub account somehow vs having to page through APIs or use graphql. |
Thanks for the crawler! It's been really useful to us - especially back when I was the only engineer on our open source team! This may be a super-naive suggestion because I don't know all the complexities of your system, but could you use the crawler to get all the contribution data efficiently and then once the data is in the crawler's database, do your filtering from there?
Just a thought. It might not make sense for your use case. I too wish GH was making it easier to grab user events. They don't want to make a webhook (which I can understand from a user security perspective, but still makes me sad). And we were experimenting with switching to GraphQL for Goby but, as far as we can tell, there's no way in the GH GraphQL API to efficiently grab all of a given user's events. We'd have to switch to counting only the things GH thinks of as a "contribution" and we're not willing to lose Issue Comments, Commit Comments, and PR Review Comments. To us, those are valuable contributions people can make. |
UPDATE: For anyone interested in this issue, it's currently not assigned. Feel free to leave a comment if you'd like me to assign it to you. |
Currently Starfish does not filter out events in personal repositories or in repositories owned by specific organizations. Future users may want to exclude contributions to repositories that their employer owns (key point: one employer may own many GitHub organizations), or contributions to personal repositories (defined here as "any repository owned by the user who triggered the event"). We should add configuration options that let users easily implement their own policies about which events are counted.
The simplest version of this would be to:
Add variables to the .env.template:
Add new filterResponseFor functions that filter the events based on these variables.
Add documentation in the README for using these env variables.
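The steps above could be sketched roughly as follows, in TypeScript. Everything named here is an assumption for illustration: the variable names `EXCLUDE_SELF_OWNED_REPOS` and `EXCLUDED_REPO_OWNERS`, the function `filterResponseForRepoOwnership` (modelled on the existing `filterResponseFor` naming), and the trimmed event shape. It relies only on each event carrying the actor's login and a repo name of the form "owner/repo".

```typescript
// Hypothetical settings derived from env variables.
interface FilterConfig {
  excludeSelfOwned: boolean;
  excludedOwners: Set<string>; // lowercase user or organization names
}

// Parse the settings from an env-like record (for example process.env in Node.js).
// EXCLUDED_REPO_OWNERS is a comma-separated list, so one employer can list many organizations.
function parseFilterConfig(env: Record<string, string | undefined>): FilterConfig {
  const owners = (env.EXCLUDED_REPO_OWNERS ?? "")
    .split(",")
    .map((s) => s.trim().toLowerCase())
    .filter((s) => s.length > 0);
  return {
    excludeSelfOwned: (env.EXCLUDE_SELF_OWNED_REPOS ?? "").trim().toLowerCase() === "true",
    excludedOwners: new Set(owners),
  };
}

// Trimmed-down event shape: only the fields this filter reads.
interface RepoEvent {
  actor: { login: string };
  repo: { name: string }; // "owner/repo"
}

// Drop events in repositories owned by the actor or by any excluded owner.
function filterResponseForRepoOwnership<T extends RepoEvent>(events: T[], config: FilterConfig): T[] {
  return events.filter((event) => {
    const owner = (event.repo.name.split("/")[0] ?? "").toLowerCase();
    if (config.excludeSelfOwned && owner === event.actor.login.toLowerCase()) {
      return false; // personal repository of the user who triggered the event
    }
    return !config.excludedOwners.has(owner);
  });
}
```

Comparing owner names in lowercase avoids mismatches, since GitHub logins are case-insensitive. With both variables unset, the filter keeps every event, which preserves the current behavior.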