Commit

Merge branch 'master' of github.com:hamelsmu/code_search
hamelsmu committed May 18, 2018
2 parents 3fb4257 + 62d5c96 commit 5d52981
Showing 1 changed file with 20 additions and 2 deletions.
22 changes: 20 additions & 2 deletions notebooks/1 - Preprocess Data.ipynb
@@ -30,6 +30,22 @@
"EN = spacy.load('en')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download raw python files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
@@ -107,7 +123,7 @@
" ' '.join(tokenize_code(function)),\n",
" ' '.join(tokenize_docstring(docstring.split('\\n\\n')[0]))\n",
" ))\n",
" except (SyntaxError, MemoryError, UnicodeEncodeError):\n",
" except (AssertionError, MemoryError, SyntaxError, UnicodeEncodeError):\n",
" pass\n",
" return pairs"
]
@@ -251,7 +267,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Output each set to train/valid/test.function/docstrings/lineage files"
"## Output each set to train/valid/test.function/docstrings/lineage files\n",
"Original functions are also written to compressed json files. (Raw functions contain `,`, `\\t`, `\\n`, etc., it is less error-prone using json format)"
]
},
{
@@ -264,6 +281,7 @@
"source": [
"def write_to(df, filename):\n",
" df.function_tokens.to_csv('{}.function'.format(filename), index=False)\n",
" df.original_function.to_json('{}_original_function.json.gz'.format(filename), orient='values', compression='gzip')\n",
" if filename != 'without_docstrings':\n",
" df.docstring_tokens.to_csv('{}.docstring'.format(filename), index=False)\n",
" df.url.to_csv('{}.lineage'.format(filename), index=False)"