crawl repositories

Notes

https://github.com/octokit/rest.js#custom-requests

TODO

Installation of the project

1. Git Clone

git clone git@github.com:Londane/git-data-loader.git

2. NPM stuff

Prerequisites:

Node and npm must be installed: https://nodejs.org/en/download/
Optionally, install TypeScript globally by running this in your terminal:
```
npm install -g typescript
```

After all prerequisites are installed, go into the project directory and run these commands:

Let npm do its install "magic":
```
npm install
```
Create your very own .env file
(you will need it later for your API token):
```
npm run init-env
```
(If you already have a .env file, this command does nothing, which is also what happens if you run it twice. If in doubt, check your existing .env file.)

3. GitHub Personal API Token

Without authentication, the GitHub API allows only 60 calls per hour.
With authentication, the limit is 5000 calls per hour.
See GitHub's documentation on REST API rate limits.

The script therefore needs a valid API token, and it will use up all of your remaining calls to the GitHub API. But do not fear, the rate limit resets after an hour!

Do:

- Create a personal API token (see the GitHub Blog).
- Mind that you only see your token once, so write it down or copy it (ctrl + c)!
- To see your private repos, you have to grant full control in the repo section.
- Paste (ctrl + v) your API token into the .env file, using the GITHUB_PERSONAL_TOKEN variable.
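
To illustrate how the token from the .env file ends up in a request, here is a minimal sketch. The helper names `parseEnv` and `githubHeaders` are hypothetical and not taken from this project, which may well load the file with a library instead. Only the variable name GITHUB_PERSONAL_TOKEN comes from the steps above.

```typescript
// Hypothetical helpers for illustration; not the project's actual code.

/** Parse KEY=VALUE lines from the text of a .env file, skipping blanks and # comments. */
function parseEnv(text: string): Record<string, string> {
  const env: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq < 1) continue; // no key in front of "=", ignore the line
    const key = line.slice(0, eq).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const value = line.slice(eq + 1).trim().replace(/^(["'])(.*)\1$/, "$2");
    env[key] = value;
  }
  return env;
}

/** Build request headers; without a token the request is sent unauthenticated. */
function githubHeaders(env: Record<string, string>): Record<string, string> {
  const headers: Record<string, string> = { Accept: "application/vnd.github+json" };
  const token = env["GITHUB_PERSONAL_TOKEN"];
  if (token) headers["Authorization"] = `Bearer ${token}`;
  return headers;
}
```

If the variable is missing or empty, no Authorization header is set, so requests fall back to the unauthenticated limit of 60 calls per hour.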

crawl repositories

The script uses a two-step procedure to crawl the desired repositories.

First, you generate an input list which holds the identifiers of all repositories you want to crawl.
- At the moment you have to configure the desired search query manually in the code, in generateInputFile (index.ts, line 299).
- Then use the following command to append the repositories found by the search to your input file:
```
npm run gen-input
```
Once you have generated the input file, place it in the input folder and start the script with
```
npm run go
```
- The script will now load all repositories into the output folder. You can leave the script running: it saves a temporary result every 200 loaded repositories and also waits and restarts when your GitHub API rate limit runs out.
- The script also writes all repositories that have not been loaded yet to the todo file. That way the original input file is not changed.
- Some repositories cannot be loaded. These end up in the naughtyRepos.json list.
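
The behaviour described in the list above can be sketched roughly as follows. This is not the project's code: the function `crawl`, its callbacks `loadRepo` and `saveCheckpoint`, and the result shape are hypothetical names chosen for illustration, and waiting for the rate limit is left out to keep the sketch short.

```typescript
// Hypothetical sketch of the crawl loop; names and structure are assumptions.

interface CrawlResult<T> {
  loaded: T[];        // successfully loaded repositories
  naughty: string[];  // identifiers whose load failed
}

async function crawl<T>(
  ids: string[],
  loadRepo: (id: string) => Promise<T>,
  saveCheckpoint: (loaded: T[], todo: string[]) => Promise<void>,
  checkpointEvery: number = 200,
): Promise<CrawlResult<T>> {
  const todo = ids.slice(); // work on a copy so the original input list stays unchanged
  const loaded: T[] = [];
  const naughty: string[] = [];

  while (todo.length > 0) {
    const id = todo.shift() as string;
    // Turn a rejected load into a value so one bad repository does not stop the crawl.
    const outcome = await loadRepo(id).then(
      (repo) => ({ ok: true as const, repo }),
      () => ({ ok: false as const }),
    );
    if (!outcome.ok) {
      naughty.push(id);
      continue;
    }
    loaded.push(outcome.repo);
    // Save an intermediate result, together with what is still to do.
    if (loaded.length % checkpointEvery === 0) {
      await saveCheckpoint(loaded, todo);
    }
  }
  return { loaded, naughty };
}
```

Passing the remaining `todo` identifiers to the checkpoint callback is what would allow a later run to resume without touching the original input.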

generate datasets model

The script saves the repository data to the output.json file. Then you can use

npm run gen-analyticData

to generate the matrices for the apriori rule extraction. It computes the config.MAX_HEADER_TOPICS most used topics and the config.MAX_HEADER_LANG most used languages, and then checks for each repository which of these topics / languages it contains. It creates a data.zip which you can then analyse with the R scripts.
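
As an illustration of the matrix generation, the sketch below picks the most used topics and builds one 0/1 row per repository. The `Repo` shape and the function names are assumptions made for this example, and the `max` parameter stands in for config.MAX_HEADER_TOPICS; the same idea would apply to languages.

```typescript
// Hypothetical sketch; the Repo shape and function names are assumptions.

interface Repo {
  name: string;
  topics: string[];
}

/** Return the `max` most used topics, most frequent first (ties in alphabetical order). */
function topTopics(repos: Repo[], max: number): string[] {
  const counts = new Map<string, number>();
  repos.forEach((repo) => {
    // Count each topic at most once per repository.
    Array.from(new Set(repo.topics)).forEach((topic) => {
      counts.set(topic, (counts.get(topic) ?? 0) + 1);
    });
  });
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1))
    .slice(0, max)
    .map((entry) => entry[0]);
}

/** One row per repository, one 0/1 column per header topic. */
function toMatrix(repos: Repo[], header: string[]): number[][] {
  return repos.map((repo) =>
    header.map((topic) => (repo.topics.includes(topic) ? 1 : 0)),
  );
}
```

Each row of the resulting matrix can be read as one transaction for the rule extraction, with the header topics as the possible items.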

Util

Use the following command to see your GitHub API rate limit and when it will reset:

npm run limit
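
For orientation, the following sketch shows how the remaining calls and the reset time can be interpreted. It assumes the `core` entry of GitHub's rate limit response, whose `reset` field is a UTC timestamp in seconds; the function names are hypothetical and this is not necessarily how the command above is implemented.

```typescript
// Hypothetical sketch; function names are assumptions.

/** The "core" entry of GitHub's rate limit response. */
interface CoreRateLimit {
  limit: number;      // calls allowed per hour
  remaining: number;  // calls left in the current window
  reset: number;      // time of the next reset, in UTC epoch seconds
}

/** Milliseconds to wait before the next call; 0 while calls remain. */
function msUntilReset(core: CoreRateLimit, nowMs: number = Date.now()): number {
  if (core.remaining > 0) return 0;
  return Math.max(0, core.reset * 1000 - nowMs);
}

/** Human-readable summary of the current limit. */
function describeLimit(core: CoreRateLimit): string {
  const resetAt = new Date(core.reset * 1000).toISOString();
  return `${core.remaining}/${core.limit} calls left, resets at ${resetAt}`;
}
```

A crawl loop could sleep for `msUntilReset(...)` milliseconds before retrying once the limit is exhausted.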
