Pipeline to classify what kind of resource a Code Repository linked to Throughput is:
- educational
- miscellaneous
- software development
- storage
This is an open project, and contributions are welcome from any individual. All contributors are bound by a code of conduct; please review and follow it as part of your contribution.
Issues and bug reports are always welcome. Code clean-up and feature additions can be made through branches.
All products of the Throughput Annotation Project are licensed under an MIT License unless otherwise noted.
For now, the repository contains three notebooks that outline the process.
The first notebook retrieves the repositories' README files, using Neo4j to identify the repositories in Throughput and GitHub's API (with a developer key) to fetch each file.
The GitHub API returns README content encoded in base64, so decoding is also necessary before our NLP procedures can run.
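The fetch-and-decode step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code; the endpoint is GitHub's documented `GET /repos/{owner}/{repo}/readme`, and the function names are our own.

```python
import base64
import json
from urllib.request import Request, urlopen

GITHUB_API = "https://api.github.com"

def fetch_readme_b64(owner: str, repo: str, token: str) -> str:
    """Fetch a repository's README via the GitHub API.

    The API returns JSON whose 'content' field is base64-encoded.
    """
    req = Request(
        f"{GITHUB_API}/repos/{owner}/{repo}/readme",
        headers={
            "Authorization": f"token {token}",  # the developer key
            "Accept": "application/vnd.github.v3+json",
        },
    )
    with urlopen(req) as resp:
        return json.load(resp)["content"]

def decode_readme(content_b64: str) -> str:
    """Decode base64 README content into plain text for NLP.

    b64decode tolerates the newlines GitHub embeds in the content field.
    """
    return base64.b64decode(content_b64).decode("utf-8", errors="replace")
```

For example, `decode_readme("SGVsbG8=")` returns `"Hello"`.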
This project uses the Throughput Graph Database as input from Neo4j:
- neotoma: TSV file
These files are used as input for building a Recommender System that can:
- Predict whether a code repository is educational, miscellaneous, software development, or storage.
This project is developed using Python and Neo4j.
This project will need Neo4j installed.
It runs on macOS.
Continuous integration uses TravisCI.
The project pulls data from the Throughput database and requires a GitHub API secret. Labels were provided manually by Morgan Wofford, but we expect to obtain them from annotations in the Throughput DB in the near future.
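Pulling repository records from the Throughput database might look like the sketch below. The Cypher query, node label, and property names are illustrative assumptions, not the project's actual schema; the `neo4j` driver import is deferred so the module still loads without it installed.

```python
# Hypothetical query: find resource nodes that point at GitHub.
REPO_QUERY = """
MATCH (r:RESOURCE)
WHERE r.url CONTAINS 'github.com'
RETURN r.url AS url
"""

def fetch_repo_urls(uri: str, user: str, password: str) -> list[str]:
    """Run REPO_QUERY against a Neo4j instance and collect the URLs."""
    from neo4j import GraphDatabase  # deferred: requires `pip install neo4j`
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            return [record["url"] for record in session.run(REPO_QUERY)]
```

A local instance would typically be reached as `fetch_repo_urls("bolt://localhost:7687", "neo4j", "<password>")`.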
This project will generate a structured dataset that provides the following information:
- Whether a code repository (CR) belongs to a certain class.
Currently, the model's accuracy is poor due to the small quantity of labeled data. Test performance is at best 66%, and the model still overfits badly.
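For comparison against the 66% figure, a trivial keyword baseline over decoded README text can be sketched in pure Python. The keyword lists below are made-up illustrations, not learned features, and this is not the project's actual model; only the four class names come from this README.

```python
LABELS = ["educational", "miscellaneous", "software development", "storage"]

# Illustrative keyword sets -- assumptions for the sketch only.
KEYWORDS = {
    "educational": {"tutorial", "course", "lesson", "teaching"},
    "software development": {"api", "library", "install", "build"},
    "storage": {"dataset", "archive", "data"},
}

def classify_readme(text: str) -> str:
    """Score each class by keyword hits; fall back to 'miscellaneous'."""
    tokens = set(text.lower().split())
    scores = {label: len(tokens & words) for label, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "miscellaneous"

def accuracy(pairs):
    """Fraction of (readme_text, gold_label) pairs classified correctly."""
    correct = sum(classify_readme(text) == label for text, label in pairs)
    return correct / len(pairs)
```

A baseline like this makes it easy to see whether the real model is beating chance on the small labeled set.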
TODO:
[include workflow chart]
View the notebooks in the following order: