Siamese Network for Code Clone Detection

This project implements a Siamese Network with a Bidirectional LSTM backbone for detecting semantic similarity in source code. The network predicts whether two code snippets are functionally identical, enabling use cases like code clone detection, plagiarism detection, and software refactoring.

Features

Dataset: Utilizes the BigCloneBench dataset from Hugging Face's datasets library.
Model:
- Embedding layer for converting tokenized code into dense vectors.
- Bidirectional LSTM to capture sequential dependencies from both directions.
- Fully connected layers for feature extraction.
- Cosine similarity to compute similarity between code snippet embeddings.
Loss Function: Implements contrastive loss, ensuring that embeddings for similar code snippets are close while dissimilar embeddings are far apart.
Evaluation:
- Metrics include accuracy, F1-score, precision, and recall.
- Reports these metrics on the held-out test dataset.

Installation

Prerequisites

Ensure the following tools and libraries are installed:

- Python 3.8 or higher
- PyTorch 1.9 or newer
- Additional Python dependencies listed in requirements.txt

Installation Steps

Clone the repository:

git clone https://github.com/shivbera18/CloneCodeDetection.git
cd CloneCodeDetection

Install dependencies:

pip install -r requirements.txt

Verify installation by running:

python --version
python -c "import torch; print(torch.__version__)"

Dataset

The BigCloneBench dataset is used for training and testing. It is a large benchmark dataset for detecting functionally similar code snippets.

Dataset Details:
- Source: BigCloneBench on Hugging Face.
- Classes: Binary labels (1 for similar code pairs, 0 for dissimilar).
- Preprocessing: Code snippets are tokenized with Keras' Tokenizer, then padded to a uniform input length.
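
The preprocessing step can be illustrated with a small, dependency-free sketch. It is not the repository's code: the project uses Keras' Tokenizer, whereas the `tokenize`, `build_vocab`, and `encode` helpers below are hypothetical stand-ins, and the regex, the reserved ids (0 for padding, 1 for unknown tokens), and the choice to pad at the end are all assumptions made for illustration.

```python
import re
from collections import Counter

PAD_ID = 0  # assumed id reserved for padding
OOV_ID = 1  # assumed id for tokens not in the vocabulary


def tokenize(code):
    """Split code into identifiers, integer literals, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def build_vocab(snippets, max_words=10000):
    """Map the most frequent tokens to integer ids, starting at 2."""
    counts = Counter(tok for snippet in snippets for tok in tokenize(snippet))
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_words))}


def encode(code, vocab, max_len):
    """Convert a snippet to exactly max_len ids (truncate, then pad at the end)."""
    ids = [vocab.get(tok, OOV_ID) for tok in tokenize(code)][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


snippets = ["int a = b + 1;", "return a;"]
vocab = build_vocab(snippets)
padded = [encode(s, vocab, max_len=10) for s in snippets]
```

Every encoded snippet has the same length, so a batch can be stacked into a single tensor before it reaches the embedding layer.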

Model Architecture

The Siamese network consists of two identical sub-networks for feature extraction. Key components:

1. Embedding Layer: Maps tokenized code into dense vector representations. Embedding size: 128.

2. Bidirectional LSTM: Captures contextual relationships in code from both directions. Hidden size: 128. Two layers with dropout for regularization.

3. Fully Connected Layers: Reduces feature dimensionality. Includes Batch Normalization and Dropout.

4. Cosine Similarity: Measures the similarity between embeddings of code snippets.

5. Contrastive Loss: Encourages similar pairs to have high cosine similarity and dissimilar pairs to have low similarity.
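
The components above could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the contents of models.py: the class name `SiameseBiLSTM`, the output size of 64, the dropout rate of 0.3, and the use of the final hidden states as the sequence summary are all hypothetical choices. Only the embedding size (128), hidden size (128), and two LSTM layers come from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiameseBiLSTM(nn.Module):
    """Shared encoder applied to both snippets, compared by cosine similarity."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128,
                 out_dim=64, dropout=0.3):  # out_dim and dropout are assumed values
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            batch_first=True, bidirectional=True, dropout=dropout)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim),
        )

    def encode(self, tokens):
        embedded = self.embedding(tokens)            # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)            # h_n: (layers * 2, batch, hidden)
        # Concatenate the top layer's forward (h_n[-2]) and backward (h_n[-1]) states.
        features = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(features)                     # (batch, out_dim)

    def forward(self, tokens_a, tokens_b):
        # The same weights encode both inputs, which is what makes the network Siamese.
        return F.cosine_similarity(self.encode(tokens_a), self.encode(tokens_b), dim=1)


model = SiameseBiLSTM(vocab_size=1000)
model.eval()  # disable dropout and use BatchNorm running statistics
batch_a = torch.randint(1, 1000, (4, 20))
batch_b = torch.randint(1, 1000, (4, 20))
with torch.no_grad():
    similarity = model(batch_a, batch_b)  # one score in [-1, 1] per pair
```

Because both inputs pass through the same `encode` method, the two branches share every parameter, and a pair of identical snippets scores a similarity of 1 in evaluation mode.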

Training

To train the model, run:

python train.py
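
For orientation, the contrastive objective optimized during training could look like the sketch below. The exact formulation in train.py is not shown here, so this is one common cosine-based variant and an assumption: the function name `cosine_contrastive_loss` and the margin of 0.5 are hypothetical. Similar pairs (label 1) are penalized for falling below a similarity of 1, and dissimilar pairs (label 0) are penalized only when their similarity exceeds the margin.

```python
import torch


def cosine_contrastive_loss(similarity, labels, margin=0.5):
    """Contrastive loss on cosine similarities (margin is an assumed value).

    similarity: tensor of cosine similarities in [-1, 1], shape (batch,)
    labels:     tensor of 1 (similar pair) or 0 (dissimilar pair), shape (batch,)
    """
    labels = labels.float()
    # Similar pairs: pull the similarity toward 1.
    positive_term = labels * (1.0 - similarity).pow(2)
    # Dissimilar pairs: push the similarity below the margin, then stop penalizing.
    negative_term = (1.0 - labels) * torch.clamp(similarity - margin, min=0.0).pow(2)
    return (positive_term + negative_term).mean()


similarity = torch.tensor([0.9, 0.2, 0.8], requires_grad=True)
labels = torch.tensor([1, 0, 0])
loss = cosine_contrastive_loss(similarity, labels)
loss.backward()  # gradients flow back to whatever produced the similarities
```

The second pair contributes nothing to the loss because its similarity is already below the margin, which keeps training focused on pairs the model still gets wrong.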

Testing

Evaluate the model's performance on the test dataset:

python evaluate.py

Adjust hyperparameters (e.g., learning rate, batch size, number of epochs) in the respective scripts (train.py, evaluate.py).
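
As a reference for reading the evaluation output, the sketch below shows how accuracy, precision, recall, and F1-score follow from binary predictions. It is illustrative rather than the code in evaluate.py: the helper names `predict_labels` and `binary_metrics` and the decision threshold of 0.5 on the similarity score are assumptions.

```python
def predict_labels(similarities, threshold=0.5):
    """Turn similarity scores into 0/1 predictions (threshold is an assumed value)."""
    return [1 if score >= threshold else 0 for score in similarities]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 0, 0, 1]
y_pred = predict_labels([0.9, 0.3, 0.1, 0.7, 0.8])
metrics = binary_metrics(y_true, y_pred)
```

Precision and recall are worth reading together on clone detection data: a model that predicts "clone" for every pair reaches perfect recall while its precision only matches the share of true clones.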

