UniTSyn

Multilingual Unit Test and Function Source Synchronization for CodeLLM. This repository contains the code for our ISSTA 2024 paper: https://arxiv.org/abs/2402.03396.

Requirements

Python 3.10+
Python packages listed in requirements.txt
rustfmt, required to use frontend/rust/collect_fuzz.py

Language Server

To run this script on a new project, you need to install the corresponding language server:

Language	Language Server	Frontend	Backend
Python	pylsp	✔	✔
Java	java-language-server*	✔	✔
JavaScript	typescript-language-server	✔	✔
Go	gopls	✔	✔
C/C++	clangd	✔	✔

*NOTE: you need to git clone the java-language-server repo into the workdir of this project, then follow the instructions in that repo to install the language server.

You can find language servers for other languages at language-server-protocol/implementors/servers. Other languages are not supported yet, but will be as the research progresses. To support a new language, you need a frontend that does the following:

Collect the locations of the unit tests and focal functions in the repo (see scripts/collect_test.py and scripts/collect_focal.py for the Python frontend).
Given the Location of a function declaration, extract the function's source code (see unitsyncer/source_code.py).

Setup

mkdir -p data/focal data/repos data/repos_tarball data/tests
source ./scripts/env.sh
cd frontend/parser && python3 build.py
cd ../..

Run

python3 scripts/download_repos.py
python3 scripts/decompress_repos.py

python3 frontend/<language>/collect_all.py
python3 main.py

Automated Repo Mining

Automatic repo mining is supported through scripts/find_repos.py.
Note: run source ./scripts/env.sh from the root of the repo before mining.

The currently supported checks are:

"stars"
"latest commit"
"language"
"fuzzers"

The corresponding value in reqs to check against should be at the same index as the check in checks_list.

# Command template
python3 scripts/find_repos.py --language='<language>' --checks_list='[<checks>]' --reqs='[<values>]' --num_searches='<num_searches>'

# Rust example
python3 scripts/find_repos.py --language='Rust' --checks_list='["stars", "latest commit", "language", "fuzzers"]' --reqs='["10", "2020-1-1", "Rust", None]' --num_searches='1'

# Python example
python3 scripts/find_repos.py --language='Python' --checks_list='["stars", "latest commit", "language"]' --reqs='["10", "2020-1-1", "Python"]' --num_searches='1'
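
The index correspondence between checks_list and reqs can be sketched as below. How find_repos.py actually parses its arguments is not stated here, so the use of ast.literal_eval and the helper name pair_checks are assumptions made for illustration; literal_eval is chosen because the Rust example passes a bare None, which is a Python literal rather than JSON.

```python
import ast


def pair_checks(checks_list: str, reqs: str) -> list[tuple[str, object]]:
    """Pair each check name with the requirement at the same index."""
    checks = ast.literal_eval(checks_list)  # e.g. '["stars", "language"]'
    values = ast.literal_eval(reqs)         # e.g. '["10", "Rust"]'
    if len(checks) != len(values):
        raise ValueError(
            f"checks_list has {len(checks)} entries but reqs has {len(values)}"
        )
    return list(zip(checks, values))


pairs = pair_checks(
    '["stars", "latest commit", "language", "fuzzers"]',
    '["10", "2020-1-1", "Rust", None]',
)
for check, value in pairs:
    print(f"{check!r} is checked against {value!r}")
```

A mismatch in list lengths is reported early in this sketch, since a missing or extra entry would otherwise silently shift every later check onto the wrong value.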

Cursors representing where the search left off are saved to data/repo_cursors/<language>_cursor.txt. find_repos.py will automatically use and update this cursor to avoid mining duplicate repos.

Reference

Please cite our work in your publications if it helps your research:

@inproceedings{he2024unitsyn,
    author = {He, Yifeng and Huang, Jiabo and Rong, Yuyang and Guo, Yiwen and Wang, Ethan and Chen, Hao},
    title = {UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing},  
    booktitle = {International Symposium on Software Testing and Analysis (ISSTA)},
    date = {2024-09-16/2024-09-20},
    address = {Vienna, Austria},
}
