Reconciling Moved Repositories #2890

Open
sgoggins opened this issue Aug 9, 2024 · 3 comments
Assignees
Labels
bug Documents unexpected/wrong/buggy behavior critical-fix Should be addressed before any other issue/PRs database Related to Augur's unifed data model deployed version Live problems with deployed versions python Pull requests that update Python code server Related to the Augur server workers Related to data workers

Comments

@sgoggins
Member

sgoggins commented Aug 9, 2024

A sequence of events occurring out of order can cause two Augur features to add a renamed repository in addition to the originally named repository. After significant investigation on our largest instance, what seems to occur is:

  1. A repository from the Apache Foundation (the largest set cleaned up so far) is created with incubator- at the start of its name. When it graduates from the incubator, the incubator- prefix is removed.
  2. If the GitHub organization repository list is updated before our renaming process runs, that update does not appear to check for moved repositories, so a duplicate is created.
  3. The result is a set of error messages for duplicate PRs and issues, because those URLs must be unique. This is intended behavior: it ensures that Augur cannot silently create misleading data sets, and the errors alert us to a new condition in the data that we need to address.

So, Augur is working as expected, and we need to adjust a bit for the way data is evolving.

Here is an example error message:

Traceback (most recent call last):
  File "/home/ubuntu/github/virtualenvs/hosted/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1969, in _exec_single_context
    self.dialect.do_execute(
  File "/home/ubuntu/github/virtualenvs/hosted/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "pull-request-insert-unique"
DETAIL:  Key (pr_url)=(https://api.github.com/repos/GSA/datagov-harvester/pulls/88) already exists.

In this case, the repo first existed at https://github.com/gsa/datagov-harvesting-logic (repo_id 128320) and was later "added" under the GSA/datagov-harvester URL (repo_id 149077) before the original entry could be renamed. So, these two entries are in conflict in perpetuity.
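
When triaging errors like the one above, it can help to pull the owner and repository name out of the DETAIL line so the conflicting pair of repo_ids can be looked up. The following is a minimal sketch that assumes the message format shown in the traceback; `repo_from_violation` and `DETAIL_RE` are hypothetical names, not part of Augur.

```python
import re

# Matches the DETAIL line reported for the pr_url unique constraint,
# assuming the message format shown in the traceback above.
DETAIL_RE = re.compile(
    r"Key \(pr_url\)=\(https://api\.github\.com/repos/([^/]+)/([^/]+)/pulls/(\d+)\)"
)

def repo_from_violation(message):
    """Return (owner, repo, pr_number) from a UniqueViolation message, or None."""
    match = DETAIL_RE.search(message)
    if match is None:
        return None
    owner, repo, number = match.groups()
    return owner, repo, int(number)

error_text = (
    'duplicate key value violates unique constraint "pull-request-insert-unique"\n'
    "DETAIL:  Key (pr_url)=(https://api.github.com/repos/GSA/datagov-harvester/pulls/88) already exists."
)
print(repo_from_violation(error_text))
```

The extracted owner and name can then be compared, ignoring case, against stored repository URLs to find both entries that claim the same pull requests.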

This appears to be by far the most common way this occurs. Our previous analysis found that divergent commit logs are also a source of some issues, but this seems like the primary cause to address.

@sgoggins sgoggins converted this from a draft issue Aug 9, 2024
@sgoggins sgoggins added bug Documents unexpected/wrong/buggy behavior deployed version Live problems with deployed versions server Related to the Augur server workers Related to data workers database Related to Augur's unifed data model critical-fix Should be addressed before any other issue/PRs python Pull requests that update Python code labels Aug 9, 2024
@sgoggins
Member Author

sgoggins commented Aug 9, 2024

This issue currently affects approximately 30 out of 106,341 repositories in the Augur public instance. FYI.

@FaridMalekpour

To handle the duplicate repository issue caused by renaming, I recommend adding a check in your repository synchronization process to handle repository moves more effectively. This involves:

  1. Check for Renamed Repositories: Modify the repository synchronization logic to detect and handle renamed repositories by checking whether a repository has moved. GitHub’s API signals a rename or move by answering requests for the old URL with a 301 Moved Permanently response whose Location header points to the new URL.

  2. Merge Duplicate Repositories: When a renamed repository is detected, ensure that its data is merged with the existing repository entry to prevent duplication. This can be achieved by:

    • Updating the repository URL and ID to reflect the new repository.
    • Ensuring that associated PRs and Issues are not duplicated.

    Example:

    • Original repository: repo_id 128320 (GSA/datagov-harvesting-logic)
    • New repository after renaming: repo_id 149077 (GSA/datagov-harvester)

    You need to reconcile these two entries so that only one repository exists, merging any associated data (PRs, Issues, etc.).

  3. Update Augur’s Repository Handling Logic: Adjust Augur's repository handling logic to better handle repository renames and moves. This can be done by:

    • Running a pre-check before inserting new repositories to see if they were previously listed under a different name.
    • Avoiding the creation of a new repository if the URL or repository ID already exists in your database but with a different name.
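
The pre-check in step 3 could be sketched roughly as follows. This is an illustration under stated assumptions, not Augur's actual logic: `known_repos` is a hypothetical mapping of stored URL to repo_id, and `resolve_move` is a hypothetical callable that returns the new URL for a moved repository (for example, a 301 lookup) or None. A stand-in resolver is used so the example runs offline.

```python
def normalize(url):
    """Lower-case a GitHub URL and drop a trailing slash or .git suffix."""
    url = url.strip().lower().rstrip("/")
    return url[:-4] if url.endswith(".git") else url

def find_existing_repo_id(candidate_url, known_repos, resolve_move):
    """Return the repo_id a candidate URL already maps to, or None.

    known_repos: dict of stored repo URL -> repo_id (hypothetical shape).
    resolve_move: callable returning the new URL for a moved repo, or None.
    """
    candidate = normalize(candidate_url)
    for stored_url, repo_id in known_repos.items():
        if normalize(stored_url) == candidate:
            return repo_id  # same URL, possibly differing only in case
        moved_to = resolve_move(stored_url)
        if moved_to is not None and normalize(moved_to) == candidate:
            return repo_id  # stored repo was renamed to the candidate URL
    return None

# Stand-in for a real move lookup so the example runs without network access.
def fake_resolver(url):
    moves = {
        "https://github.com/gsa/datagov-harvesting-logic":
            "https://github.com/GSA/datagov-harvester",
    }
    return moves.get(url)

known = {"https://github.com/gsa/datagov-harvesting-logic": 128320}
result = find_existing_repo_id(
    "https://github.com/GSA/datagov-harvester", known, fake_resolver
)
print(result)  # the existing repo_id, so no second row should be inserted
```

A real implementation would need to avoid one lookup per stored repository, for instance by resolving moves only for repositories in the same organization as the candidate.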

Example Code Snippet for GitHub API Check

The following function sketches one way to detect a repository rename through the GitHub API:

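import requests

def check_repo_moved(repo_url):
    """Return the new URL if repo_url answers with a 301 redirect, else None."""
    # requests follows redirects by default, which would hide the 301,
    # so redirect handling is disabled explicitly.
    response = requests.get(repo_url, allow_redirects=False, timeout=10)
    if response.status_code == 301:  # 301 Moved Permanently
        return response.headers.get("Location")
    return None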

# Example usage
old_repo_url = "https://api.github.com/repos/GSA/datagov-harvesting-logic"
new_repo_url = check_repo_moved(old_repo_url)
if new_repo_url:
    print(f"Repository has been moved to {new_repo_url}")

@Ulincsys
Contributor

@FaridMalekpour We have logic to detect a 301 already. You can see this in the detect_move core task.

We discussed last week how to check if a repo has moved since the last core collection when someone tries to add the same repo from the new URL. There are a few issues with relying solely on 301 logic for this specific scenario.

  1. GitHub sometimes arbitrarily returns a 404 when a repo has moved, so we may need another source of truth in that case.
  2. A malicious actor may use repo-jacking to replace the existing repo with an unexpected data source.
    • This is admittedly not much of a concern for us at the moment, but it is still important to be mindful of it.
  3. If a user renames an existing repo and then creates a new repo with the old name, a 301 is no longer returned for the old URL.

Currently, we are considering using GitHub's repo source IDs to assist in determining the identity of a moved repo, but we are always open to new and better solutions for ensuring our data is consistent and complete 😁
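
As a rough illustration of the source-ID idea, the sketch below groups locally stored repositories by GitHub's numeric repository ID, which is generally understood to stay the same across renames, and reports any ID that maps to more than one local row. The row shape, the function name, the source ID values 1001 and 2002, and the third row are all made up for the example; only the two repo_ids and URLs come from this issue.

```python
from collections import defaultdict

def find_duplicate_repos(rows):
    """Group local repo rows by GitHub source ID; return groups with more than one row.

    rows: iterable of (repo_id, repo_url, source_id) tuples (hypothetical shape).
    """
    by_source = defaultdict(list)
    for repo_id, repo_url, source_id in rows:
        if source_id is not None:  # rows without a known source ID are skipped
            by_source[source_id].append((repo_id, repo_url))
    return {sid: sorted(group) for sid, group in by_source.items() if len(group) > 1}

rows = [
    # Source IDs below are invented placeholders, not real GitHub IDs.
    (128320, "https://github.com/gsa/datagov-harvesting-logic", 1001),
    (149077, "https://github.com/GSA/datagov-harvester", 1001),
    (150000, "https://github.com/example/unrelated", 2002),
]
duplicates = find_duplicate_repos(rows)
print(duplicates)
```

Because a new repository created under a vacated name would receive a different source ID, this comparison would also distinguish the third scenario in the list above from a genuine rename.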

@ABrain7710 ABrain7710 moved this from In Progress to Dev Testing in Augur TSC Dec 4, 2024