From 9f9c43d0d344d6ab9b31fd7c3efd918eed26e1d2 Mon Sep 17 00:00:00 2001
From: Vladimir Kovalenko
Date: Wed, 4 Aug 2021 20:24:00 +0200
Subject: [PATCH 1/7] edit README.md

---
 README.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index f94b5b08..d6c9f57c 100644
--- a/README.md
+++ b/README.md
@@ -26,9 +26,9 @@ Currently, it supports extraction of:
It is designed to be very easily extensible to new languages.

-`astminer` lets you create end2end pipeline of data processing.
-It allows convert source code, cloned from VCS to suitable for training datasets.
-To do that, `astminer` provides multiple steps to handle data:
+`astminer` lets you create an end-to-end pipeline to process code for machine learning models.
+It allows you to convert source code cloned from VCS to formats suitable for training.
+To achieve that, `astminer` caters for multiple data processing steps:
- [filters](./docs/filters.md) to remove redundant samples from data
- [label extractors](./docs/label_extractors.md) to create label for each tree
- [storages](./docs/storages.md) to define storage format.
@@ -41,12 +41,12 @@ There are two ways to use `astminer`.

### Using `astminer` cli

-Define config (examples of them in [configs](./configs) directory) and pass it shell script:
+Define config (examples of them in [configs](./configs) directory) and pass it to shell script:
```shell
./cli.sh 
```
-For details about config format and other navigate to [docs/cli](./docs/cli.md).
+For details on CLI configuration, see [docs/cli](./docs/cli.md).

### Using `astminer` as a dependency

@@ -88,10 +88,10 @@ After that, add `mavenLocal()` into the `repositories` section in your gradle co

If you want to use `astminer` as a library in your Java/Kotlin based data mining tool, check the following:

-* A few simple [examples](src/examples) of `astminer` usage in Java and Kotlin.
+* A few simple [examples](src/examples) of using `astminer` in Java and Kotlin.
* Using `astminer` as a part of another mining tool — [psiminer](https://github.com/JetBrains-Research/psiminer).

-Please consider trying Kotlin for your data mining pipelines: from our experience, it is much better suited for data collection and transformation instruments.
+Please consider trying Kotlin for your data mining pipelines: from our experience, it is much better suited for building data collection and transformation tools than Java.

## Contribution


From f789c7b5ac46ec86044476f57844f9927f3c47b6 Mon Sep 17 00:00:00 2001
From: Vladimir Kovalenko
Date: Wed, 4 Aug 2021 20:34:08 +0200
Subject: [PATCH 2/7] edit cli.sh

---
 docs/cli.md | 37 +++++++++++++++++++++----------------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/docs/cli.md b/docs/cli.md
index e309fd0e..1d5ef81d 100644
--- a/docs/cli.md
+++ b/docs/cli.md
@@ -1,39 +1,44 @@
# `astminer` CLI usage

-You can run `astminer` through command-line interface.
-CLI allow to run the tool on any implemented parser with specifying filtering, label extracting and storage options.
+You can run `astminer` through a command line interface (CLI).
+The CLI allows you to run the tool on any implemented parser with specified options for filtering, label extraction, and storage of the results.

## How to

-You can prepare and run CLI on any branch you want. Just navigate to it and do follow steps:
-1. Build shadow jar for `astminer`:
+You can build and run the CLI with any version of `astminer`:
+1. Check out the relevant version of `astminer` sources (for example, the `master-dev` branch)
+2. Build a shadow jar for `astminer`:
```shell
gradle shadowJar
```
-2. [Optionally] Pull docker image with all parsers dependencies installed:
+3. [optional] Pull a docker image with all parser dependencies installed:
```shell
docker pull voudy/astminer
```
-3. Run `astminer` with specified config:
+4. Run `astminer` with the specified config:
```shell
./cli.sh 
```

## Config

-CLI usage of the `astminer` completely configured by YAML config.
+The CLI of `astminer` is fully configured by a YAML config.
The config should contain the following values:
-- `inputDir` — path to directory with input data
-- `outputDir` — path to output directory
+- `inputDir` — path to the directory with input data
+- `outputDir` — path to the output directory
- `parser` — parser name and list of target languages
-- `filters` — list of filters with their parameters
-- `label` — label extractor strategy
+- `filters` — list of filters and their parameters
+- `label` — label extraction strategy
- `storage` — storage format

-[configs](../configs) already contain some config examples, look at them for more structure details.
+[configs](../configs) contains some config examples that can be used as a reference for the YAML structure.

## Docker

-Since some parsers have additional dependencies,
-e.g. G++ must be installed for Fuzzy parser (see [parsers](./parsers.md)).
-We introduce Docker image with already installed parser dependencies.
-To use this image you should only pull this image from DockerHub and run CLI by `./cli.sh`.
+Some parsers have non-trivial environment requirements.
+For example, g++ must be installed for the Fuzzy parser (see [parsers](./parsers.md)).
+
+To ease dealing with such cases, we provide a Docker image with all parser dependencies.
+This image can be pulled from DockerHub:
+```shell
+docker pull voudy/astminer
+```

From 171b16a7c0438b1733ddd363556b31e3ca7c1299 Mon Sep 17 00:00:00 2001
From: Vladimir Kovalenko
Date: Wed, 4 Aug 2021 20:43:19 +0200
Subject: [PATCH 3/7] edit filters.md

---
 docs/filters.md | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/docs/filters.md b/docs/filters.md
index ab20199f..6a9c3448 100644
--- a/docs/filters.md
+++ b/docs/filters.md
@@ -1,20 +1,18 @@
# Filters

-Each filter dedicate to remove *bad* trees from data, e.g. too large trees.
-Also, each filter works only for certain levels of granulaity.
-Here we describe all implemented filters.
-Each description contains corresponding YAML config.
-
-Since filters may be language or parser specific, `astminer` should support all this zoo.
-And since we **do not** use any of intermediate representation it is impossible to unify filtering.
-Therefore some languages or parsers may not support needed filter
+Each filter is dedicated to removing *bad* trees from the data, e.g. trees that are too big.
+Moreover, each filter works only for certain levels of granularity.
+Here we describe all filters provided by `astminer`.
+Each description contains the corresponding YAML config.
+
+Filters can be specific to a language or a parser.
+Therefore, some languages or parsers may not support the needed filter
(`FunctionInfoPropertyNotImplementedException` appears).
+To handle this, the user might need to manually add specific logic for parsing the AST to get the desired information about functions or code.
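To give a sense of how these filters are used in practice, several of them can be listed together under the `filters` key of a CLI config. The snippet below is only an illustrative sketch — the exact `name` values and nesting are assumptions and should be checked against the example configs in [configs](../configs):

```yaml
# Hypothetical `filters` section of a CLI config combining two of the filters described below.
# The `name` values and nesting are assumed here; refer to the real example configs before use.
filters:
  - name: by tree size
    minTreeSize: 10
    maxTreeSize: 100
  - name: no constructors
```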
Filter config classes are defined in [FilterConfigs.kt](../src/main/kotlin/astminer/config/FilterConfigs.kt). -## by tree size +## Filter by tree size **granularity**: files, functions Exclude ASTs that are too small or too big. @@ -25,7 +23,7 @@ Exclude ASTs that are too small or too big. maxTreeSize: 100 ``` -## by words number +## Filter by words count **granularity**: files, functions Exclude ASTs that have too many words in any token. @@ -35,7 +33,7 @@ Exclude ASTs that have too many words in any token. maxTokenWordsNumber: 10 ``` -## by function name length +## Filter by function name length **granularity**: functions Exclude functions that have too many words in their name. @@ -45,7 +43,7 @@ Exclude functions that have too many words in their name. maxWordsNumber: 10 ``` -## no constructors +## Exclude constructors **granularity**: functions Exclude constructors @@ -54,7 +52,7 @@ Exclude constructors name: no constructors ``` -## by annotations +## Filter by annotation **granularity**: functions Exclude functions that have certain annotations (e.g. `@Override`) @@ -64,7 +62,7 @@ Exclude functions that have certain annotations (e.g. `@Override`) annotations: [ override ] ``` -## by modifiers +## Filter by modifiers **granularity**: functions Exclude functions with certain modifiers (e.g. `private` functions) From b3f5d78d0bb87692868d2ff47a79be2bba17d0ac Mon Sep 17 00:00:00 2001 From: Vladimir Kovalenko Date: Wed, 4 Aug 2021 20:54:59 +0200 Subject: [PATCH 4/7] edit label_extractors.md --- docs/label_extractors.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/label_extractors.md b/docs/label_extractors.md index ec41845e..fbad4f8e 100644 --- a/docs/label_extractors.md +++ b/docs/label_extractors.md @@ -1,15 +1,15 @@ # Label extractors -Label extractors are required for correct extracting of labels from raw ASTs. -Inside themselves they extract label from tree and process tree to avoid data leak. -Also, label extractors define granularity level for the whole pipeline. +Label extractors are required for correct extraction of labels from raw ASTs. +Internally, they extract labels from the tree and process the tree to avoid data leaks. +Also, label extractors define the granularity level for the whole pipeline. Label extractor config classes are defined in [LabelExtractorConfigs.kt](src/main/kotlin/astminer/config/LabelExtractorConfigs.kt). ## file name **granularity**: files -Use file name of source file as label. +Use file name of source file as a label. ```yaml name: file name @@ -18,7 +18,7 @@ Use file name of source file as label. ## folder name **granularity**: files -Use name of the parent folder of source file as label. +Use the name of the parent folder of source file as a label. May be useful for code classification datasets, e.g., POJ-104. ```yaml @@ -28,8 +28,8 @@ May be useful for code classification datasets, e.g., POJ-104. ## function name **granularity**: functions -Use name of each function as label. -This label extractor will also hide the function name in the AST and all recursive calls. +Use name of each function as a label. +This label extractor will also hide the function name in the AST and all recursive calls to prevent data leaks. 
```yaml name: function name From 5452f07c5365ff7a773915a0fd375efc8cbd970b Mon Sep 17 00:00:00 2001 From: Vladimir Kovalenko Date: Wed, 4 Aug 2021 20:55:07 +0200 Subject: [PATCH 5/7] edit parsers.md --- docs/parsers.md | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/docs/parsers.md b/docs/parsers.md index bfbd52c1..42bc9d7b 100644 --- a/docs/parsers.md +++ b/docs/parsers.md @@ -1,31 +1,30 @@ # Parsers -`astminer` supports multiple parsers for a large wide of programming languages. -Here we describe integrated parsers and their peculiarities. +`astminer` supports multiple parsers for various programming languages. +Here we describe the integrated parsers and their peculiarities. ## ANTLR -ANother Tool for Language Recognition from [antlr.org](https://www.antlr.org). -It provides lexer and parsers for languages that can be generated into Java code. -For now, `astminer` supports Java, Python, JS, and PHP. +[ANTLR](https://www.antlr.org) provides an infrastructure to generate lexers and parsers for languages based on grammars. +For now, `astminer` supports ANTLR-based parsers for Java, Python, JS, and PHP. ## GumTree [GumTree](https://github.com/GumTreeDiff/gumtree) -framework to work with source code as trees and to compute difference between them. -It also builds language-agnostic representation. -For now, `astminer` supports Java and Python. +is a framework to work with source code as trees and to compute differences of trees between different versions of code. +It also builds language-agnostic representations of code. +For now, `astminer` supports GumTree-based parsers for Java and Python. ### python-parser -You should install python-parser to run GumTree with Python. -There is instruction of how to do it: +Running GumTree with Python requires `python-parser`. +It can be set up through the following steps: 1. Download sources from [GitHub](https://github.com/JetBrains-Research/pythonparser/blob/master/) 2. Install dependencies ```shell pip install -r requirements.txt ``` -3. Make python parser script executable +3. Make the `python-parser` script executable ```shell chmod +x src/main/python/pythonparser/pythonparser_3.py ``` @@ -37,9 +36,9 @@ export PATH="/src/main/python/pythonparser/pythonparser:${PATH}" ## Fuzzy -Originally [fuzzyc2cpg](https://github.com/ShiftLeftSecurity/fuzzyc2cpg) -and now part of [codepropertygraph](https://github.com/ShiftLeftSecurity/codepropertygraph/). -`astminer`uses it C/C++ parser from that. `G++`required for this parser. +Originally [fuzzyc2cpg](https://github.com/ShiftLeftSecurity/fuzzyc2cpg), Fuzzy is +now part of [codepropertygraph](https://github.com/ShiftLeftSecurity/codepropertygraph/). +`astminer`uses it to parse C/C++ code. `g++` is required for this parser. ## Other languages and parsers From 7090f2708d44d1346cb160825c18cfde79348b4b Mon Sep 17 00:00:00 2001 From: Vladimir Kovalenko Date: Wed, 4 Aug 2021 21:00:19 +0200 Subject: [PATCH 6/7] edit storages.md --- docs/storages.md | 35 +++++++++++++++++------------------ 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/docs/storages.md b/docs/storages.md index 6fd28031..058bd426 100644 --- a/docs/storages.md +++ b/docs/storages.md @@ -1,7 +1,7 @@ # Storages -Storages defines how ASTs should be saved on a disk. -For now, `astminer` support saving in tree and path-based formats. +The storage defines how the ASTs should be saved on disk. 
+For now, `astminer` supports tree-based and path-based storage formats.

Storage config classes are defined in [StorageConfigs.kt](../src/main/kotlin/astminer/config/StorageConfigs.kt).

### CSV

-Save trees with labels in comma-separated file.
-Each tree encodes into line using sequence of parenthesis.
+Saves the trees with labels to a comma-separated file.
+Each tree is encoded into a single line using a sequence of parentheses.

```yaml
name: csv AST
```

### Dot

-Save each tree in separate file using [dot](https://graphviz.org/doc/info/lang.html) syntax.
-Along with dot files, storage also saves `description.csv` with matching between files, source file, and label.
+Saves each tree in a separate file using [dot](https://graphviz.org/doc/info/lang.html) syntax.
+Along with dot files, this storage also saves `description.csv` with the mapping between files, source files, and labels.

```yaml
name: dot AST
```

### Json lines

-Save each tree with label in Json Lines format.
-Json format of AST inspired by Python-150k dataset.
+Saves each tree with its label in the Json Lines format.
+The Json format of the AST is inspired by the [150k Python](https://www.sri.inf.ethz.ch/py150) dataset.

```yaml
name: json AST
```

## Path-based representations

Path-based representation was introduced by [Alon et al.](https://arxiv.org/abs/1803.09544).
-It uses in models like code2vec or code2seq.
+It is used in popular code representation models such as `code2vec` and `code2seq`.

### Code2vec

Extract paths from each AST. Output is 4 files:
1. `node_types.csv` contains numeric ids and corresponding node types with directions (up/down, as described in [paper](https://arxiv.org/pdf/1803.09544.pdf));
2. `tokens.csv` contains numeric ids and corresponding tokens;
3. `paths.csv` contains numeric ids and AST paths in form of space-separated sequences of node type ids;
-4. `path_contexts.c2s` contains labels and sequences of path contexts (triples of two tokens and a path between them).
+4. `path_contexts.c2s` contains the labels and sequences of path-contexts (each representing two tokens and a path between them).

-Each line in `path_contexts.c2s` starts with a label,
-then it contains a sequence of space-separated triples. Each triple contains start token id, path id, end token id, separated with commas.
+Each line in `path_contexts.c2s` starts with a label, followed by a sequence of space-separated triples. Each triple contains a start token id, a path id, and an end token id, separated by commas.

```yaml
name: code2vec
```

### Code2seq

-Extract paths from each AST and save in code2seq format.
-Output is `path_context.c2s` file,
-each line in it starts with a label, then it contains a sequence of space-separated triples.
-Each triple contains start token, path node types, end token id, separated with commas.
+Extract paths from each AST and save them in the code2seq format.
+The output is a `path_context.c2s` file.
+Each line starts with a label, followed by a sequence of space-separated triples.
+Each triple contains the start token, path node types, and end token id, separated with commas.
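Schematically, a single line of the `path_context.c2s` file described above can be pictured as follows; the angle-bracket entries are placeholders for illustration only, not actual tool output:

```
<label> <start token>,<path node types>,<end token> <start token>,<path node types>,<end token> ...
```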
-To reduce memory usage you can enable `nodesToNumber` option.
-If it is `true` then all types are converted into numbers and `node_types.csv` would be added to output files.
+To reduce memory usage, you can enable the `nodesToNumber` option.
+If `nodesToNumber` is set to `true`, all types are converted into numbers and `node_types.csv` is added to the output files.

```yaml
name: code2seq
```

From 455f33d23bd8cc5529b19ae0f359a837af00abf2 Mon Sep 17 00:00:00 2001
From: ElenaErratic <33476575+ElenaErratic@users.noreply.github.com>
Date: Thu, 5 Aug 2021 16:28:47 +0300
Subject: [PATCH 7/7] Update README.md

review additions
---
 README.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index d6c9f57c..3a3b32d4 100644
--- a/README.md
+++ b/README.md
@@ -28,20 +28,20 @@ It is designed to be very easily extensible to new languages.

`astminer` lets you create an end-to-end pipeline to process code for machine learning models.
It allows you to convert source code cloned from VCS to formats suitable for training.
-To achieve that, `astminer` caters for multiple data processing steps:
-- [filters](./docs/filters.md) to remove redundant samples from data
-- [label extractors](./docs/label_extractors.md) to create label for each tree
-- [storages](./docs/storages.md) to define storage format.
+To achieve that, `astminer` incorporates the following processing modules:
+- [Filters](./docs/filters.md) to remove redundant samples from data.
+- [Label extractors](./docs/label_extractors.md) to create a label for each tree.
+- [Storages](./docs/storages.md) to define the storage format.

## Usage

-There are two ways to use `astminer`.
+There are two ways to use `astminer`:

-- [As a standalone CLI tool](#using-astminer-cli) with pre-implemented logic for common processing and mining tasks
+- [As a standalone CLI tool](#using-astminer-cli) with pre-implemented logic for common processing and mining tasks.
- [Integrated](#using-astminer-as-a-dependency) into your Kotlin/Java mining pipelines as a Gradle dependency.

### Using `astminer` cli

-Define config (examples of them in [configs](./configs) directory) and pass it to shell script:
+Specify a config (see examples in the [configs](./configs) directory) and pass it to the shell script:
```shell
./cli.sh 
```
@@ -78,7 +78,7 @@ dependencies {

#### Local development

-To use a specific version of the library, navigate to the required branch and build local version of `astminer`:
+To use a specific version of the library, navigate to the required branch and build a local version of `astminer`:
```shell
./gradlew publishToMavenLocal
```
After that, add `mavenLocal()` into the `repositories` section in your gradle co

#### Examples

-If you want to use `astminer` as a library in your Java/Kotlin based data mining tool, check the following:
+If you want to use `astminer` as a library in your Java/Kotlin-based data mining tool, check the following:

* A few simple [examples](src/examples) of using `astminer` in Java and Kotlin.
* Using `astminer` as a part of another mining tool — [psiminer](https://github.com/JetBrains-Research/psiminer).
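As an illustration of the local development setup touched by the last hunk above (publishing with `./gradlew publishToMavenLocal` and adding `mavenLocal()` to the `repositories` section), a minimal Gradle Kotlin DSL sketch could look like the following; the dependency coordinates are placeholders and should be replaced with the ones listed in the README's dependency section:

```kotlin
// build.gradle.kts — a hypothetical sketch of consuming a locally published astminer build.
// The coordinates below are placeholders; take the real ones from the README's dependency section.
repositories {
    mavenLocal()   // resolves the artifact produced by `./gradlew publishToMavenLocal`
    mavenCentral()
}

dependencies {
    implementation("<astminer-group>:<astminer-artifact>:<locally-published-version>")
}
```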