
## How to Fine-tune LLM4Decompile (Based on Ghidra Pseudo-Code)

We provide the script `finetune.py`, adapted from the deepseek-coder repository.

The script supports training with DeepSpeed. First, install the required packages:

```bash
pip install -r requirements.txt
```

If you want to leverage FlashAttention to accelerate training, install it via:

```bash
pip install flash-attn
```

Please download the decompile-ghidra-100k dataset to your workspace and process it into JSON Lines format: each line is a JSON-serialized object with two required fields, `instruction` and `output`.
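For reference, here is a minimal sketch of writing one training record in this format; the field contents below are hypothetical placeholders, not real dataset entries:

```python
import json

# Hypothetical example record: `instruction` holds the Ghidra pseudo-code
# prompt and `output` holds the reference C source. Real entries come from
# the decompile-ghidra-100k dataset.
record = {
    "instruction": "# This is the pseudo-code:\nundefined4 func0(int param_1) { ... }\n# What is the source code?\n",
    "output": "int func0(int x) { ... }",
}

# Append one JSON-serialized record per line (JSON Lines).
with open("decompile-ghidra-100k.json", "a") as f:
    f.write(json.dumps(record) + "\n")
```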

After preparing the data, you can use the sample shell script below to fine-tune the LLM4Decompile model. Remember to set `DATA_PATH` and `OUTPUT_PATH`.

```bash
WORKSPACE="/workspace"
DATA_PATH="${WORKSPACE}/decompile-ghidra-100k.json"
OUTPUT_PATH="${WORKSPACE}/output_models/llm4decompile-ref"
MODEL_PATH="deepseek-ai/deepseek-coder-1.3b-base"

CUDA_VISIBLE_DEVICES=0 deepspeed finetune.py \
    --model_name_or_path $MODEL_PATH \
    --data_path $DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --num_train_epochs 2 \
    --model_max_length 1024 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --use_flash_attention \
    --save_steps 100 \
    --save_total_limit 100 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --weight_decay 0.1 \
    --warmup_ratio 0.025 \
    --logging_steps 1 \
    --lr_scheduler_type "cosine" \
    --gradient_checkpointing True \
    --report_to "tensorboard" \
    --bf16 True
```
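With these settings, each optimizer step processes `per_device_train_batch_size × gradient_accumulation_steps = 16 × 16 = 256` sequences per GPU; if you train on more GPUs and want to keep the same effective batch size, reduce `gradient_accumulation_steps` accordingly.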

## Simple Demo on Constructing the Training Data (Based on Objdump Assembly)

Note that we use ExeBench as our final dataset.

Before compiling, please clone the AnghaBench dataset:

```bash
git clone https://github.com/brenocfg/AnghaBench
```

Then use the following script to compile AnghaBench:

```bash
python compile.py --root Anghabench_path --output AnghaBench_compile.jsonl
```
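For intuition, here is a minimal sketch of the compile-and-disassemble step such a script performs. The paths, optimization level, and output record layout are assumptions for illustration, not the exact behavior of `compile.py`:

```python
import json
import subprocess
import tempfile
from pathlib import Path

def compile_and_disassemble(c_file: str, opt: str = "O0") -> dict:
    """Compile one C file and pair its source with objdump assembly (illustrative only)."""
    with tempfile.TemporaryDirectory() as tmp:
        obj = Path(tmp) / "func.o"
        # Compile to an object file at the given optimization level.
        subprocess.run(
            ["gcc", f"-{opt}", "-c", c_file, "-o", str(obj)],
            check=True,
        )
        # Disassemble with objdump to obtain the assembly text.
        asm = subprocess.run(
            ["objdump", "-d", str(obj)],
            check=True, capture_output=True, text=True,
        ).stdout
    return {"source": Path(c_file).read_text(), "assembly": asm}

if __name__ == "__main__":
    # Hypothetical input path; in practice you would walk the cloned
    # AnghaBench tree and emit one JSONL record per file.
    record = compile_and_disassemble("AnghaBench/some_dir/file.c", "O0")
    with open("AnghaBench_compile.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```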