Pipeline to classify what kind of resource a Code Repository linked to Throughput is:
- educational
- miscellaneous
- software development
- storage
This is an open project, and contributions are welcome from any individual. All contributors are bound by a code of conduct; please review and follow it as part of your contribution.
Issues and bug reports are always welcome. Code clean-up and feature additions can be made through branches.
All products of the Throughput Annotation Project are licensed under an MIT License unless otherwise noted.
For now, the repository contains three notebooks that outline the process.
The first notebook retrieves the repositories' README files, using Neo4j to identify the repositories in Throughput and GitHub's API (with a developer key) to fetch each file.
The GitHub API returns README content encoded in base64, so decoding is also necessary before our NLP procedures can run.
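The fetch-and-decode step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code; the endpoint is GitHub's documented `GET /repos/{owner}/{repo}/readme`, and the function names are our own.

```python
import base64
import json
from urllib.request import Request, urlopen

GITHUB_API = "https://api.github.com"

def fetch_readme_b64(owner: str, repo: str, token: str) -> str:
    """Fetch a repository's README via the GitHub API.

    The API returns JSON whose 'content' field is base64-encoded.
    """
    req = Request(
        f"{GITHUB_API}/repos/{owner}/{repo}/readme",
        headers={
            "Authorization": f"token {token}",  # the developer key
            "Accept": "application/vnd.github.v3+json",
        },
    )
    with urlopen(req) as resp:
        return json.load(resp)["content"]

def decode_readme(content_b64: str) -> str:
    """Decode base64 README content into plain text for NLP.

    b64decode tolerates the newlines GitHub embeds in the content field.
    """
    return base64.b64decode(content_b64).decode("utf-8", errors="replace")
```

For example, `decode_readme("SGVsbG8=")` returns `"Hello"`.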
This project uses the Throughput Graph Database as input from Neo4j:
- neotoma: TSV file
These files are used as input for building a Recommender System that can:
- Predict whether a code repository is educational, miscellaneous, software development, or storage.
This project is developed using Python and Neo4j.
This project will need Neo4j installed.
It runs on macOS.
Continuous integration uses TravisCI.
The project pulls data from the Throughput database and requires a GitHub API secret. Labels were provided manually by Morgan Wofford, but we expect to obtain them from annotations in the Throughput DB in the near future.
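Pulling repository records from the Throughput database might look like the sketch below. The Cypher query, node label, and property names are illustrative assumptions, not the project's actual schema; the `neo4j` driver import is deferred so the module still loads without it installed.

```python
# Hypothetical query: find resource nodes that point at GitHub.
REPO_QUERY = """
MATCH (r:RESOURCE)
WHERE r.url CONTAINS 'github.com'
RETURN r.url AS url
"""

def fetch_repo_urls(uri: str, user: str, password: str) -> list[str]:
    """Run REPO_QUERY against a Neo4j instance and collect the URLs."""
    from neo4j import GraphDatabase  # deferred: requires `pip install neo4j`
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            return [record["url"] for record in session.run(REPO_QUERY)]
```

A local instance would typically be reached as `fetch_repo_urls("bolt://localhost:7687", "neo4j", "<password>")`.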
This project will generate a structured dataset that provides the following information:
- Whether a code repository (CR) belongs to a certain class.
Currently, the model's accuracy is poor due to the small quantity of labeled data. Test performance is at best 66%, and the model still overfits badly.
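For comparison against the 66% figure, a trivial keyword baseline over decoded README text can be sketched in pure Python. The keyword lists below are made-up illustrations, not learned features, and this is not the project's actual model; only the four class names come from this README.

```python
LABELS = ["educational", "miscellaneous", "software development", "storage"]

# Illustrative keyword sets -- assumptions for the sketch only.
KEYWORDS = {
    "educational": {"tutorial", "course", "lesson", "teaching"},
    "software development": {"api", "library", "install", "build"},
    "storage": {"dataset", "archive", "data"},
}

def classify_readme(text: str) -> str:
    """Score each class by keyword hits; fall back to 'miscellaneous'."""
    tokens = set(text.lower().split())
    scores = {label: len(tokens & words) for label, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "miscellaneous"

def accuracy(pairs):
    """Fraction of (readme_text, gold_label) pairs classified correctly."""
    correct = sum(classify_readme(text) == label for text, label in pairs)
    return correct / len(pairs)
```

A baseline like this makes it easy to see whether the real model is beating chance on the small labeled set.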
TODO:
[include workflow chart]
View the notebooks in the following order: