Release Tested. #1549

Merged 23 commits on Jan 10, 2022

Commits
d8095e3
Bump tensorflow from 2.5.1 to 2.5.2 in /workers/message_insights_worker
dependabot[bot] Nov 10, 2021
b12ba0f
Testing up to date contributor_interface
IsaacMilarky Dec 26, 2021
5c2ef41
updating message analysis worker queries.
sgoggins Dec 31, 2021
bcc2579
Merge pull request #1513 from chaoss/dependabot/pip/workers/message_i…
sgoggins Dec 31, 2021
c131b92
message insights worker updates.
sgoggins Dec 31, 2021
7c1ed6a
Merge remote-tracking branch 'origin/sean-patch-aam' into sean-patch-aam
sgoggins Dec 31, 2021
0131096
CUDA docs update.
sgoggins Dec 31, 2021
8dd5503
Debugging halting on email resolution repeat. Keeps printing 's@gogg…
IsaacMilarky Jan 3, 2022
f4ead9d
More debugging
IsaacMilarky Jan 3, 2022
3e4c80d
I don't want to use pdb to debug the workers again if I don't have to…
IsaacMilarky Jan 3, 2022
51fbdb9
debugging rate limit stall on heavily taxed hardware
sgoggins Jan 4, 2022
ccc365a
version update
sgoggins Jan 4, 2022
4961922
version update
sgoggins Jan 4, 2022
643ce8d
sql statement updated
IsaacMilarky Jan 4, 2022
8353f97
login_json being subscripted on NoneType fixed
IsaacMilarky Jan 4, 2022
2395b56
remove debug logging
IsaacMilarky Jan 4, 2022
b44be59
Fix no longer in scope variable ref
IsaacMilarky Jan 4, 2022
f90ec68
Typo
IsaacMilarky Jan 5, 2022
09eb4b0
Unneccessary print
IsaacMilarky Jan 5, 2022
d1fa200
Cleaner logging
IsaacMilarky Jan 5, 2022
57f996a
Merge pull request #1548 from chaoss/isaac-commit-resolution
sgoggins Jan 6, 2022
680b10c
metadata.py update
sgoggins Jan 6, 2022
1c08c43
sentiment file
sgoggins Jan 10, 2022
@@ -2,6 +2,16 @@
Message Insights Worker
=======================

.. note::
- If you have an NVIDIA GPU available, you can install the CUDA drivers to make this worker run faster.
- On Ubuntu 20.04, open a terminal and run the following commands to add the NVIDIA CUDA repository and install the drivers:
- `wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin`
- `sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600 && sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub`
- `sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"`
- `sudo apt-get update && sudo apt-get install -y nvidia-kernel-source-460`
- `sudo apt-get -y install cuda`
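
After installation (and a reboot, if prompted), a quick way to confirm the GPU is visible is to check it from TensorFlow, which this worker already depends on. This is a minimal sketch, assuming TensorFlow is installed in the worker's environment:

# Minimal check that TensorFlow can see the CUDA-enabled GPU.
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print(f"GPU(s) detected: {gpus}")
else:
    print("No GPU detected; TensorFlow will fall back to the CPU.")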

This worker analyzes the comments and text messages corresponding to all the issues and pull requests in a repository and performs two tasks:

- **Identifies novel messages** - Detects if a new message is semantically different from past messages in a repo
8 changes: 4 additions & 4 deletions metadata.py
@@ -3,10 +3,10 @@
__slug__ = "augur"
__url__ = "https://github.com/chaoss/augur"

-__short_description__ = "Python 3 package for free/libre and open-source software community metrics & data collection"
+__short_description__ = "Python 3 package for free/libre and open-source software community metrics, models & data collection"

-__version__ = "0.23.0"
-__release__ = "v0.23.0"
+__version__ = "0.23.5"
+__release__ = "v0.23.5"

__license__ = "MIT"
-__copyright__ = "CHAOSS & Augurlabs 2021"
+__copyright__ = "University of Missouri, University of Nebraska-Omaha, CHAOSS & Augurlabs 2022"
@@ -185,6 +185,40 @@ def initialize_logging(self):
self.tool_version = '\'0.2.0\''
self.data_source = '\'Git Log\''



def create_endpoint_from_commit_sha(self,commit_sha, repo_id):
self.logger.info(f"Trying to create endpoint from commit hash: {commit_sha}")

#https://api.github.com/repos/chaoss/augur/commits/53b0cc122ac9ecc1588d76759dc2e8e437f45b48

select_repo_path_query = s.sql.text("""
SELECT repo_path, repo_name from repo
WHERE repo_id = :repo_id_bind
""")

# Bind parameter
select_repo_path_query = select_repo_path_query.bindparams(
repo_id_bind=repo_id)
result = self.db.execute(select_repo_path_query).fetchall()

# if not found
if not len(result) >= 1:
raise LookupError

# Else put into a more readable local var
self.logger.info(f"Result: {result}")
repo_path = result[0]['repo_path'].split(
"/")[1] + "/" + result[0]['repo_name']

url = "https://api.github.com/repos/" + repo_path + "/commits/" + commit_sha

self.logger.info(f"Url: {url}")

return url



# Try to construct the best url to ping GitHub's API for a username given an email.
"""
I changed this because of the following note on the API site: With the in qualifier you can restrict your search to the username (login), full name, public email, or any combination of these. When you omit this qualifier, only the username and email address are searched. For privacy reasons, you cannot search by email domain name.
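
For illustration, the kind of search endpoint this refers to looks like the sketch below; the email address is made up, and the `in:email` qualifier restricts the match to the public email field:

# Hypothetical example of a user-search URL restricted to the email field.
from urllib.parse import quote_plus

email = "jane.doe@example.com"  # made-up address
search_url = "https://api.github.com/search/users?q=" + quote_plus(email) + "+in:email"
# -> https://api.github.com/search/users?q=jane.doe%40example.com+in:email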
@@ -394,10 +428,11 @@ def update_contributor(self, cntrb, max_attempts=3):

canonical_email = contributor_table_data[0]['cntrb_canonical']
#check if the contributor has a NULL canonical email or not
self.logger.info(f"The value of the canonical email is : {canonical_email}")
#self.logger.info(f"The value of the canonical email is : {canonical_email}")

if canonical_email is not None:
del cntrb["cntrb_canonical"]
self.logger.info("Existing canonical email found in database and will not be overwritten.")

while attempts < max_attempts:
try:
@@ -482,6 +517,70 @@ def fetch_username_from_email(self, commit):
# failure condition returns None
return login_json

# Method to return the login for a commit, using the supplemental data in the commit:
#   - email
#   - name
def get_login_with_supplemental_data(self, commit_data):

# Try to get login from all possible emails
# Is None upon failure.
login_json = self.fetch_username_from_email(commit_data)

# Check if the email result got anything, if it failed try a name search.
if login_json == None or 'total_count' not in login_json or login_json['total_count'] == 0:
self.logger.info(
"Could not resolve the username from the email. Trying a name only search...")

try:
url = self.create_endpoint_from_name(commit_data)
except Exception as e:
self.logger.info(
f"Couldn't resolve name url with given data. Reason: {e}")
return None

login_json = self.request_dict_from_endpoint(
url, timeout_wait=30)

# total_count is the count of username's found by the endpoint.
if login_json == None or 'total_count' not in login_json:
self.logger.info(
"Search query returned an empty response, moving on...\n")
return None
if login_json['total_count'] == 0:
self.logger.info(
"Search query did not return any results; the commit's cntrb_id will remain null...\n")

return None

# Grab first result and make sure it has the highest match score
match = login_json['items'][0]
for item in login_json['items']:
if item['score'] > match['score']:
match = item

self.logger.info("When searching for a contributor, we found the following users: {}\n".format(match))

return match['login']

def get_login_with_commit_hash(self, commit_data, repo_id):

#Get endpoint for login from hash
url = self.create_endpoint_from_commit_sha(commit_data['hash'], repo_id)

#Send api request
login_json = self.request_dict_from_endpoint(url)

if login_json is None or 'sha' not in login_json:
self.logger.info("Search query returned empty data. Moving on")
return None

try:
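# 'author' can be null when the commit's email is not linked to a GitHub account,
# in which case the lookup below fails and we fall back to None.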
match = login_json['author']['login']
except:
match = None

return match

# Update the contributors table with the data that facade has gathered.

def insert_facade_contributors(self, repo_id):
@@ -492,7 +591,8 @@ def insert_facade_contributors(self, repo_id):
# in the contributors table or the contributors_aliases table.
new_contrib_sql = s.sql.text("""
SELECT DISTINCT
commits.cmt_author_name AS NAME,--commits.cmt_id AS id,
commits.cmt_author_name AS NAME,
commits.cmt_commit_hash AS hash,
commits.cmt_author_raw_email AS email_raw,
'not_unresolved' as resolution_status
FROM
@@ -504,10 +604,12 @@
AND ( commits.cmt_author_name ) IN ( SELECT C.cmt_author_name FROM commits AS C WHERE C.repo_id = :repo_id GROUP BY C.cmt_author_name ))
GROUP BY
commits.cmt_author_name,
commits.cmt_commit_hash,
commits.cmt_author_raw_email
UNION
SELECT DISTINCT
commits.cmt_author_name AS NAME,--commits.cmt_id AS id,
commits.cmt_commit_hash AS hash,
commits.cmt_author_raw_email AS email_raw,
'unresolved' as resolution_status
FROM
@@ -518,6 +620,7 @@
AND ( commits.cmt_author_name ) IN ( SELECT C.cmt_author_name FROM commits AS C WHERE C.repo_id = :repo_id GROUP BY C.cmt_author_name )
GROUP BY
commits.cmt_author_name,
commits.cmt_commit_hash,
commits.cmt_author_raw_email
ORDER BY
NAME
@@ -526,10 +629,14 @@
'repo_id': repo_id}).to_json(orient="records"))
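
For orientation, each record in new_contribs follows the column aliases selected above; a hypothetical example (all values are made up):

# Hypothetical shape of one new_contribs record; keys follow the SQL column aliases.
example_contributor = {
    "name": "Jane Doe",
    "hash": "abcdef1234567890",             # cmt_commit_hash
    "email_raw": "jane.doe@example.com",    # cmt_author_raw_email
    "resolution_status": "not_unresolved",  # or 'unresolved'
}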

# Try to get GitHub API user data from each unique commit email.

#self.logger.info(
# f"DEBUG: The data to process looks like this: {new_contribs}"
#)

for contributor in new_contribs:

# Get list of all emails in the commit data.
# Start with the fields we know that we can start with
# Get the email from the commit data
email = contributor['email_raw'] if 'email_raw' in contributor else contributor['email']

# check the email to see if it already exists in contributor_aliases
@@ -549,46 +656,17 @@
self.logger.info(
f"alias table query failed with error: {e}")

# Try to get login from all possible emails
# Is None upon failure.
login_json = self.fetch_username_from_email(contributor)

# Check if the email result got anything, if it failed try a name search.
if login_json == None or 'total_count' not in login_json or login_json['total_count'] == 0:
self.logger.info(
"Could not resolve the username from the email. Trying a name only search...")

try:
url = self.create_endpoint_from_name(contributor)
except Exception as e:
self.logger.info(
f"Couldn't resolve name url with given data. Reason: {e}")
continue

login_json = self.request_dict_from_endpoint(
url, timeout_wait=30)

# total_count is the count of username's found by the endpoint.
if login_json == None or 'total_count' not in login_json:
self.logger.info(
"Search query returned an empty response, moving on...\n")
continue
if login_json['total_count'] == 0:
self.logger.info(
"Search query did not return any results, adding commit's table remains null...\n")

#Try to get the login from the commit sha
login = self.get_login_with_commit_hash(contributor, repo_id)

if login == None or login == "":
#Try to get the login from supplemental data if not found with the commit hash
login = self.get_login_with_supplemental_data(contributor)

if login == None:
continue

# Grab first result and make sure it has the highest match score
match = login_json['items'][0]
for item in login_json['items']:
if item['score'] > match['score']:
match = item

self.logger.info("When searching for a contributor with info {}, we found the following users: {}\n".format(
contributor, match))

url = ("https://api.github.com/users/" + match['login'])
url = ("https://api.github.com/users/" + login)

user_data = self.request_dict_from_endpoint(url)

@@ -684,11 +762,14 @@ def insert_facade_contributors(self, repo_id):
self.logger.info(
f"Deleting now resolved email failed with error: {e}")

#self.logger.info("DEBUG: Got through the new_contribs")

# sql query used to find corresponding cntrb_id's of emails found in the contributor's table
# i.e., if a contributor already exists, we use it!
resolve_email_to_cntrb_id_sql = s.sql.text("""
SELECT DISTINCT
cntrb_id,
contributors.cntrb_login AS login,
contributors.cntrb_canonical AS email,
commits.cmt_author_raw_email
FROM
Expand All @@ -699,21 +780,29 @@ def insert_facade_contributors(self, repo_id):
AND commits.repo_id = :repo_id
UNION
SELECT DISTINCT
cntrb_id,
contributors_aliases.cntrb_id,
contributors.cntrb_login as login,
contributors_aliases.alias_email AS email,
commits.cmt_author_raw_email
FROM
contributors,
contributors_aliases,
commits
WHERE
contributors_aliases.alias_email = commits.cmt_author_raw_email
AND contributors.cntrb_id = contributors_aliases.cntrb_id
AND commits.repo_id = :repo_id
""")

#self.logger.info("DEBUG: got passed the sql statement declaration")
# Get a list of dicts that contain the emails and cntrb_id's of commits that appear in the contributor's table.
existing_cntrb_emails = json.loads(pd.read_sql(resolve_email_to_cntrb_id_sql, self.db, params={
'repo_id': repo_id}).to_json(orient="records"))

#self.logger.info("DEBUG: got passed the sql statement's execution")

#self.logger.info(f"DEBUG: Here are the existing emails: {existing_cntrb_emails}")

# iterate through all the commits with emails that appear in contributors and give them the relevant cntrb_id.
for cntrb_email in existing_cntrb_emails:
self.logger.info(
@@ -754,7 +843,7 @@ def create_endpoint_from_repo_id(self, repo_id):
# Create endpoint for committers in a repo.
url = "https://api.github.com/repos/" + repo_path + "/contributors?state=all&direction=asc&per_page=100&page={}"

self.logger.info(f"Url: {url}")
#self.logger.info(f"Url: {url}")

return url

@@ -778,10 +867,6 @@ def grab_committer_list(self, repo_id, platform="github"):

#Prepare for pagination and insertion into the contributor's table with an action map
# TODO: this might be github specific

## SPG 12/1/2021: I think we need to update as well. I am not sure this is happening. If the contributor is
## already in the database without github stuff, are we updating the additional info in the contributor
## record?
committer_action_map = {
'insert': {
'source': ['login'],