CodeGPT: A Code-Related Dialogue Dataset Generated by GPT and for GPT

[中文版]

Programming is one of the key capabilities of ChatGPT and its replicas, such as the LLaMA family. Improving the programming ability of Large Language Models (LLMs) requires a large amount of code-related training data during fine-tuning. However, most existing code-related datasets are too difficult for LLMs to understand, which leads to poor performance, and Chinese datasets are scarce. We therefore introduce "CodeGPT-Dataset-V1-CN", a Chinese dataset of 32K code-related dialogues. Some of these dialogues are generated automatically by existing LLMs and the rest are collected from the Internet. We use this data to improve the programming ability of GPT.

Method

We collect four kinds of code-related dialogues of different difficulty:

  1. Code-Basic: Use existing LLMs, e.g., paid APIs, to generate code-related dialogues about the basic library functions of various programming languages. The prompt is akin to "Please generate a very simple programming question-answer pair for the XX function in C". The API then generates a question-answer pair about that library function. We use the question as the user input and the answer as the model output (in Markdown format), which together form a dialogue.
  2. Code-Medium: Use existing LLMs, e.g., paid APIs, to generate programming question-answer pairs of medium difficulty, such as writing to a file, finding substrings, or implementing linked lists. As before, the question is the user input and the answer is the model output (in Markdown format).
  3. Code-LeetCode: Collect question-answer pairs from LeetCode. These questions are often of high difficulty. The question is the user input and the answer is the model output (in Markdown format).
  4. Code-Text: Collect interview question-answer pairs from the Internet. These questions often do not involve programming; they focus instead on understanding concepts from different fields of computer science. The question is the user input and the answer is the model output (in text format).
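
The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the function names `build_basic_prompt` and `make_dialogue` are hypothetical, the prompt wording only mirrors the example quoted in item 1, and the call to the LLM API is left out entirely. The record fields follow the released samples.

```python
import json
from typing import Optional

# System prompt copied from the released samples (Chinese). It tells the
# assistant to act as a chatbot, answer each question, and use no plugins.
SYSTEM_PROMPT = "Assistant扮演一名对话机器人。User每次输入一个问题,Assistant做出相应回答。不使用任何插件。"

FENCE = "`" * 3  # Markdown code fence, built this way to avoid nesting fences


def build_basic_prompt(function_name: str, language: str) -> str:
    """Return a Code-Basic style prompt (illustrative wording)."""
    return (
        "Please generate a very simple programming question-answer pair "
        f"for the {function_name} function in {language}."
    )


def make_dialogue(question: str, answer: str,
                  code_language: Optional[str] = None,
                  thought: str = "") -> dict:
    """Wrap one question-answer pair as a dialogue record.

    If code_language is given, the answer is wrapped in a Markdown code
    fence; otherwise it is stored as plain text (the Code-Text case).
    """
    if code_language is not None:
        answer = f"{FENCE}{code_language}\n{answer}\n{FENCE}"
    return {
        "system": SYSTEM_PROMPT,
        "source": "CodeGPT",
        "language": "Chinese",
        "user": question,
        "assistant": answer,
        "thought": thought,
        "command": "",
        "plugin": "",
    }


prompt = build_basic_prompt("abs", "C")
# In a real pipeline the prompt would be sent to an LLM; here the
# question-answer pair is hard-coded for illustration.
record = make_dialogue(
    question="Read an integer n and print its absolute value.",
    answer='#include <stdlib.h>\n/* ... */',
    code_language="c",
)
print(json.dumps(record, ensure_ascii=False, indent=4))
```

Keeping the fence wrapping in one place means code answers and plain-text answers share the same record builder.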

Released Datasets

Using the method above, we have collected a large number of code-related dialogues. We are releasing part of this data as "CodeGPT-Dataset-V1-CN", which consists of 32K Chinese code-related dialogues. Sample records are shown below.

Example 1-Basic

{
    "system": "Assistant扮演一名对话机器人。User每次输入一个问题,Assistant做出相应回答。不使用任何插件。",
    "source": "CodeGPT",
    "language": "Chinese",
    "user": "这是关于C中abs的问题。编写一个程序,输入一个整数n,并输出n的绝对值。",
    "assistant": "```c\n#include <stdio.h>\n#include <stdlib.h>\n\nint main() {\n    int n;\n    scanf(\"%d\", &n);\n    printf(\"%d\", abs(n));\n    return 0;\n}\n```",
    "thought": "这是关于C中abs的问题,该函数的函数描述为:返回一个整数的绝对值。",
    "command": "",
    "plugin": ""
}

Example 2-Medium

{
    "system": "Assistant扮演一名对话机器人。User每次输入一个问题,Assistant做出相应回答。不使用任何插件。",
    "source": "CodeGPT",
    "language": "Chinese",
    "user": "用C语言将整型数组按从小到大的顺序排序;",
    "assistant": "```c\n#include <stdio.h>\n\nvoid sort(int arr[], int n) {\n   int i, j, temp;\n   for (i = 0; i < n - 1; i++) {\n      for (j = 0; j < n - i - 1; j++) {\n         if (arr[j] > arr[j + 1]) {\n            temp = arr[j];\n            arr[j] = arr[j + 1];\n            arr[j + 1] = temp;\n         }\n      }\n   }\n}\n\nint main() {\n   int arr[] = {5, 2, 8, 4, 7};\n   int n = sizeof(arr) / sizeof(int);\n\n   sort(arr, n);\n\n   for (int i = 0; i < n; i++) {\n      printf(\"%d \", arr[i]);\n   }\n\n   return 0;\n}\n```",
    "thought": "这是关于C的问题",
    "command": "",
    "plugin": ""
}

Example 3-LeetCode

{
    "system": "Assistant扮演一名对话机器人。User每次输入一个问题,Assistant做出相应回答。不使用任何插件。",
    "source": "CodeGPT",
    "language": "Chinese",
    "user": "请用java完成以下题目:\n给定一个整数数组 nums 和一个整数目标值 target,请你在该数组中找出 和为目标值 target  的那 两个 整数,并返回它们的数组下标。\n\n你可以假设每种输入只会对应一个答案。但是,数组中同一个元素在答案里不能重复出现。\n\n你可以按任意顺序返回答案。\n\n \n\n示例 1:\n\n\n输入:nums = [2,7,11,15], target = 9\n输出:[0,1]\n解释:因为 nums[0] + nums[1] == 9 ,返回 [0, 1] 。\n\n\n示例 2:\n\n\n输入:nums = [3,2,4], target = 6\n输出:[1,2]\n\n\n示例 3:\n\n\n输入:nums = [3,3], target = 6\n输出:[0,1]\n\n\n \n\n提示:\n\n\n\t2 <= nums.length <= 104\n\t-109 <= nums[i] <= 109\n\t-109 <= target <= 109\n\t只会存在一个有效答案\n\n\n \n\n进阶:你可以想出一个时间复杂度小于 O(n2) 的算法吗?\n",
    "assistant": "```java\nclass Solution {\n\n    public int[] twoSum(int[] nums, int target) {\n\n        Map<Integer, Integer> map = new HashMap<>();\n\n        for (int i = 0; i< nums.length; i++){\n\n            int complement = target - nums[i];\n\n            if (map.containsKey(complement)){\n\n                return new int[]{map.get(complement), i};\n\n            }\n\n            map.put(nums[i], i);\n\n        }\n\n        return new int[]{};\n\n    }\n\n    \n\n}\n\n```",
    "thought": "这是关于数组、哈希表的问题",
    "command": "",
    "plugin": ""
}

Example 4-Text

{
    "system": "Assistant扮演一名对话机器人。User每次输入一个问题,Assistant做出相应回答。不使用任何插件。",
    "source": "CodeGPT",
    "language": "Chinese",
    "user": "变量的声明和定义有什么区别?",
    "assistant": "在C语言中声明变量:\n变量声明仅在编译时向编译器提供以给定类型和名称存在的变量的确定性,以便编译器继续进行进一步编译,而不需要该变量的所有细节。 在 C 语言中,当我们声明一个变量时,我们只将信息提供给编译器,但没有为它保留内存。 它只是一个引用,通过它我们只向编译器保证这个变量可以定义在函数内部或函数外部。\n注意:我们可以多次声明一个变量,但只定义一次。\n例如:\nextern int data;\nextern int foo(int, int);\nint fun(int, char); // extern can be omitted for function declarations\nC语言中变量的定义:\n定义是为变量分配存储空间的动作。 换句话说,可以说变量定义是一种方式来说明编译器在哪里以及为变量创建多少存储通常定义和声明同时发生但不几乎同时发生。\nint data;\nint foo(int, int) { }\n注意:当定义变量时,无需声明它,反之亦然不适用。",
    "thought": "",
    "command": "",
    "plugin": ""
}
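
The four samples share one schema, and in the code-oriented samples the `assistant` field holds a single Markdown code fence. As a rough sketch of how such a record might be consumed, the snippet below separates the fence language from the code body. The helper name `split_code_fence` and the abbreviated inline record are assumptions for illustration; nothing is assumed here about how the released files are laid out on disk.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built this way to avoid nesting fences

# One fenced block: opening fence, optional language tag, body, closing fence.
FENCE_RE = re.compile(rf"^{FENCE}(\w*)\n(.*?)\n?{FENCE}\s*$", re.DOTALL)


def split_code_fence(assistant: str):
    """Return (language, code) for a single fenced block.

    Plain-text answers (the Code-Text case) come back as (None, text).
    """
    match = FENCE_RE.match(assistant.strip())
    if match is None:
        return None, assistant
    return match.group(1) or None, match.group(2)


# Abbreviated stand-in for one serialized record (hypothetical content).
sample_line = json.dumps({
    "user": "Read an integer n and print its absolute value.",
    "assistant": FENCE + "c\n#include <stdlib.h>\n" + FENCE,
})

record = json.loads(sample_line)
language, code = split_code_fence(record["assistant"])
print(language)
print(code)
```

Returning `None` for the language lets a caller route plain-text answers separately from code answers without a second pass over the data.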

Remark

  1. The released data, "CodeGPT-Dataset-V1-CN", is in Chinese only. You can follow our method to generate English data, or translate the existing dataset into English.
  2. "CodeGPT-Dataset-V1-CN" has not been manually verified, so its accuracy and security cannot be strictly guaranteed. Users are responsible for any results produced with this data.
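
Since the data has not been manually verified, a light structural check before fine-tuning may be worthwhile. The sketch below shows one possible check, assuming each record is a dict with the eight keys seen in the samples. The name `check_record` and its rules are hypothetical, and the check says nothing about whether the code inside a record is correct or safe to run.

```python
from typing import List

REQUIRED_KEYS = {
    "system", "source", "language", "user",
    "assistant", "thought", "command", "plugin",
}

FENCE = "`" * 3  # Markdown code fence, built this way to avoid nesting fences


def check_record(record: dict) -> List[str]:
    """Return a list of structural problems found in one record."""
    problems = []

    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    # The question and the answer must both be non-empty strings.
    for key in ("user", "assistant"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"empty or non-string field: {key}")

    # An odd number of fences means a code block was opened but not closed.
    if str(record.get("assistant", "")).count(FENCE) % 2 != 0:
        problems.append("unbalanced Markdown code fence in assistant")

    return problems


good = dict.fromkeys(REQUIRED_KEYS, "")
good.update(user="question", assistant=FENCE + "c\nint x;\n" + FENCE)
print(check_record(good))
```

Records that return a non-empty list can be dropped or set aside for manual review.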

Contributors

  1. Xiaoxuan Zhu, [email protected]
  2. Zhuozhi Xiong, [email protected]
  3. Lin Zhang, [email protected]
  4. Haoning Ye, [email protected]
  5. Zhouhong Gu, [email protected]
  6. Zihan Li, [email protected]
  7. Sihang Jiang, [email protected]
  8. Hongwei Feng, [email protected]
  9. Yanghua Xiao, [email protected]
  10. Zili Wang, [email protected]
  11. Dongjie Yang, [email protected]
  12. Shusen Wang, [email protected]

Reference

Should your work take inspiration from or make use of our method or data, we kindly request that you acknowledge and cite our GitHub repository as a reference.

@misc{codegpt,
  author = {Zhu, Xiaoxuan and Xiong, Zhuozhi and Zhang, Lin and Ye, Haoning and Gu, Zhouhong and Li, Zihan and Jiang, Sihang and Feng, Hongwei and Xiao, Yanghua and Wang, Zili and Yang, Dongjie and Wang, Shusen},
  title = {CodeGPT: A Code-Related Dialogue Dataset Generated by GPT and for GPT},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/zxx000728/CodeGPT}},
}