Identifying Files and Workflows Contributing to Technical Debt in GitHub Repositories Using Data Mining and Natural Language Processing Techniques
- Fetching data from GitHub Actions Workflows
Fetch data from GitHub Actions workflows for further analysis.
- Data Cleaning
Clean and preprocess the fetched data to ensure consistency and accuracy for downstream tasks.
- Automatic Text Classification with NLP
Use a pre-trained NLP model to automatically classify text as TD or Not_TD instances.
- Technical Debt (TD) Visualization
Plots generated with Matplotlib, Pandas, and Seaborn show how technical debt is distributed across different aspects of the project.
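As a rough illustration of what the data cleaning stage might involve, the sketch below strips URLs, collapses whitespace, and drops empty or duplicate entries. The function names, the `text` field, and the cleaning rules are assumptions made for this example; the actual steps live in the repository's notebook and may differ.

```python
import re

# Illustrative patterns only; the project's notebook may apply other rules.
URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove URLs and collapse runs of whitespace into single spaces."""
    text = URL_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def clean_records(records, text_key="text"):
    """Clean each record's text, dropping empty and duplicate entries.

    `records` is assumed to be a list of dicts with a text field
    (hypothetical schema, not taken from the repository).
    """
    seen = set()
    cleaned = []
    for record in records:
        text = clean_text(record.get(text_key) or "")
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append({**record, text_key: text})
    return cleaned
```

Deduplicating after normalization, rather than before, means two entries that differ only in spacing or in an attached link are treated as the same text.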
Ensure that you have the following installed:
First, clone the project to your local machine:
git clone https://github.com/Aqila-Farahmand/MasterThesis
cd MasterThesis
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # For Linux/macOS
venv\Scripts\activate # For Windows
Use pip to install all the required packages from requirements.txt:
pip install -r requirements.txt
After installing the dependencies, you can run the project with:
- Fetching Data
To fetch the required data, run the following command:
python -m data_fetching.__main__
- Data Cleaning
For data cleaning, you can use Google Colab or Jupyter Notebook. Import the file data_cleaning/clean_data.ipynb and run the code there.
- Text Classification
The text classification process uses an NLP model trained on a large dataset of GitHub issues. Because of GitHub's file size limits, only the inference code is provided in this repository.
- Technical Debt (TD) Visualization
You can generate simple plots to visualize your data using the script located at data_visualization/workflow_analysis.py. Run the file to create visualizations based on your dataset.
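For readers who want to adapt the visualization step, here is a minimal sketch of one possible plot: a stacked bar chart of TD and Not_TD counts per workflow. It is not the code in workflow_analysis.py. The record fields (`workflow`, `label`), the function names, and the output file name are assumptions for illustration, and Matplotlib must be installed for the plotting function.

```python
from collections import Counter


def count_labels_per_workflow(records):
    """Return {workflow: Counter({label: count})} for classified records.

    Each record is assumed to be a dict with "workflow" and "label" keys
    (hypothetical schema), where label is "TD" or "Not_TD".
    """
    counts = {}
    for record in records:
        counts.setdefault(record["workflow"], Counter())[record["label"]] += 1
    return counts


def plot_label_counts(counts, output_path="td_per_workflow.png"):
    """Save a stacked bar chart of TD vs Not_TD counts per workflow."""
    # Imported here so the counting helper works without Matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    workflows = sorted(counts)
    td = [counts[w]["TD"] for w in workflows]
    not_td = [counts[w]["Not_TD"] for w in workflows]

    fig, ax = plt.subplots()
    ax.bar(workflows, td, label="TD")
    ax.bar(workflows, not_td, bottom=td, label="Not_TD")
    ax.set_xlabel("Workflow")
    ax.set_ylabel("Number of instances")
    ax.legend()
    fig.savefig(output_path, bbox_inches="tight")
    plt.close(fig)
```

Keeping the aggregation separate from the plotting makes it easy to check the counts on their own before producing any figures.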
To fetch data, you'll need to configure authentication for making GitHub API requests.
- Generate a Personal Access Token (GITHUB_TOKEN) for authenticated API requests.
- Set GITHUB_TOKEN as an environment variable:
env:
  GITHUB_TOKEN: ${{your_github_token}}
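To show how the token is typically used once it is set, the following sketch reads GITHUB_TOKEN from the environment and builds an authenticated request to the GitHub REST endpoint that lists workflow runs for a repository. It uses only the Python standard library. The function names and the `per_page` default are assumptions for this example, and the project's own data_fetching module may be organized differently.

```python
import json
import os
import urllib.request

API_ROOT = "https://api.github.com"


def build_runs_request(owner, repo, token, per_page=30):
    """Build an authenticated request for a repository's workflow runs."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/actions/runs?per_page={per_page}"
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def fetch_workflow_runs(owner, repo):
    """Fetch one page of workflow runs using GITHUB_TOKEN from the environment."""
    # Raises KeyError if the variable is not set, which surfaces a
    # missing token early instead of sending an unauthenticated request.
    token = os.environ["GITHUB_TOKEN"]
    request = build_runs_request(owner, repo, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["workflow_runs"]
```

Reading the token from the environment keeps it out of the source code and the repository history.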
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create a new branch (git checkout -b feature-branch)
- Commit your changes (git commit -m 'Add new feature')
- Push to the branch (git push origin feature-branch)
- Create a new pull request