Edit descriptions #173

Merged
merged 7 commits into from
Aug 5, 2021
14 changes: 7 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
@@ -26,9 +26,9 @@ Currently, it supports extraction of:

It is designed to be very easily extensible to new languages.

`astminer` lets you create end2end pipeline of data processing.
It allows convert source code, cloned from VCS to suitable for training datasets.
To do that, `astminer` provides multiple steps to handle data:
`astminer` lets you create an end-to-end pipeline for processing code for machine learning models.
It allows you to convert source code cloned from a VCS into formats suitable for training.
To achieve that, `astminer` provides multiple data processing steps:
- [filters](./docs/filters.md) to remove redundant samples from data
- [label extractors](./docs/label_extractors.md) to create label for each tree
- [storages](./docs/storages.md) to define storage format.
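
For example, a CLI config wiring these steps together might look like the following sketch (the exact keys here are illustrative; see the [configs](./configs) directory for real examples):

```yaml
inputDir: "./my-dataset"        # directory with source code to mine
outputDir: "./output"           # directory for the extracted data
parser:
  name: antlr                   # parser to use (assumed key layout)
  languages: [ java ]           # target languages for this parser
filters:
  - name: by tree size          # drop trees that are too big
    maxTreeSize: 1000
label:
  name: function name           # use function names as labels
storage:
  name: code2vec                # save data in code2vec format
```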
@@ -41,12 +41,12 @@ There are two ways to use `astminer`.

### Using `astminer` cli

Define config (examples of them in [configs](./configs) directory) and pass it shell script:
Define a config (examples are in the [configs](./configs) directory) and pass it to the shell script:
```shell
./cli.sh <path-to-YAML-config>
```

For details about config format and other navigate to [docs/cli](./docs/cli.md).
For details on CLI configuration, see [docs/cli](./docs/cli.md).

### Using `astminer` as a dependency

@@ -88,10 +88,10 @@ After that, add `mavenLocal()` into the `repositories` section in your gradle config

If you want to use `astminer` as a library in your Java/Kotlin based data mining tool, check the following:

* A few simple [examples](src/examples) of `astminer` usage in Java and Kotlin.
* A few simple [examples](src/examples) of using `astminer` in Java and Kotlin.
* Using `astminer` as a part of another mining tool — [psiminer](https://github.com/JetBrains-Research/psiminer).

Please consider trying Kotlin for your data mining pipelines: from our experience, it is much better suited for data collection and transformation instruments.
Please consider trying Kotlin for your data mining pipelines: from our experience, it is much better suited than Java for building data collection and transformation tools.

## Contribution

37 changes: 21 additions & 16 deletions docs/cli.md
@@ -1,39 +1,44 @@
# `astminer` CLI usage

You can run `astminer` through command-line interface.
CLI allow to run the tool on any implemented parser with specifying filtering, label extracting and storage options.
You can run `astminer` through a command line interface (CLI).
The CLI allows you to run the tool on any implemented parser with specified options for filtering, label extraction, and storage of the results.

## How to
You can prepare and run CLI on any branch you want. Just navigate to it and do follow steps:
1. Build shadow jar for `astminer`:
You can build and run the CLI with any version of `astminer`:
1. Check out the relevant version of `astminer` sources (for example, the `master-dev` branch)
2. Build a shadow jar for `astminer`:
```shell
gradle shadowJar
```
2. [Optionally] Pull docker image with all parsers dependencies installed:
3. [optional] Pull a docker image with all parser dependencies installed:
```shell
docker pull voudy/astminer
```
3. Run `astminer` with specified config:
4. Run `astminer` with specified config:
```shell
./cli.sh <path-to-yaml-config>
```

## Config

CLI usage of the `astminer` completely configured by YAML config.
The `astminer` CLI is fully configured by a YAML config.
The config should contain the following values:
- `inputDir` — path to directory with input data
- `outputDir` — path to output directory
- `inputDir` — path to the directory with input data
- `outputDir` — path to the output directory
- `parser` — parser name and list of target languages
- `filters` — list of filters with their parameters
- `label` — label extractor strategy
- `filters` — list of filters and parameters
- `label` — label extraction strategy
- `storage` — storage format

[configs](../configs) already contain some config examples, look at them for more structure details.
[configs](../configs) contains example configs that can be used as a reference for the YAML structure.
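
For illustration, a minimal config might look like this sketch (the exact option names are assumptions based on the filter, label extractor, and storage docs; check the [configs](../configs) directory for authoritative examples):

```yaml
inputDir: "./dataset/java"      # path to the directory with input data
outputDir: "./output"           # path to the output directory
parser:
  name: antlr                   # parser name (assumed key layout)
  languages: [ java ]           # target languages
filters:
  - name: by tree size          # filter out trees outside the size range
    maxTreeSize: 1000
label:
  name: function name           # label extraction strategy
storage:
  name: code2seq                # storage format
```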

## Docker

Since some parsers have additional dependencies,
e.g. G++ must be installed for Fuzzy parser (see [parsers](./parsers.md)).
We introduce Docker image with already installed parser dependencies.
To use this image you should only pull this image from DockerHub and run CLI by `./cli.sh`.
Some parsers have non-trivial environment requirements.
For example, g++ must be installed for Fuzzy parser (see [parsers](./parsers.md)).

To ease dealing with such cases, we provide a Docker image with all parser dependencies.
This image can be pulled from DockerHub:
```shell
docker pull voudy/astminer
```
30 changes: 14 additions & 16 deletions docs/filters.md
@@ -1,20 +1,18 @@
# Filters

Each filter dedicate to remove *bad* trees from data, e.g. too large trees.
Also, each filter works only for certain levels of granulaity.
Here we describe all implemented filters.
Each description contains corresponding YAML config.

Since filters may be language or parser specific, `astminer` should support all this zoo.
And since we **do not** use any of intermediate representation it is impossible to unify filtering.
Therefore some languages or parsers may not support needed filter
Each filter is dedicated to removing *bad* trees from the data, e.g. trees that are too big.
Moreover, each filter works only for certain levels of granularity.
Here we describe all filters provided by `astminer`.
Each description contains the corresponding YAML config.

Filters can be specific to a language or a parser.
Therefore, some languages or parsers may not support the needed filter
(`FunctionInfoPropertyNotImplementedException` appears).
To handle this user should manually add specific logic of parsing AST to get info about function or code at
all.
To handle this, the user might manually add parser-specific logic for extracting the desired information about a function or a file from the AST.

Filter config classes are defined in [FilterConfigs.kt](../src/main/kotlin/astminer/config/FilterConfigs.kt).

## by tree size
## Filter by tree size
**granularity**: files, functions

Exclude ASTs that are too small or too big.
@@ -25,7 +23,7 @@ Exclude ASTs that are too small or too big.
maxTreeSize: 100
```

## by words number
## Filter by word count
**granularity**: files, functions

Exclude ASTs that have too many words in any token.
@@ -35,7 +33,7 @@ Exclude ASTs that have too many words in any token.
maxTokenWordsNumber: 10
```

## by function name length
## Filter by function name length
**granularity**: functions

Exclude functions that have too many words in their name.
@@ -45,7 +43,7 @@ Exclude functions that have too many words in their name.
maxWordsNumber: 10
```

## no constructors
## Exclude constructors
**granularity**: functions

Exclude constructors
@@ -54,7 +52,7 @@ Exclude constructors
name: no constructors
```

## by annotations
## Filter by annotation
**granularity**: functions

Exclude functions that have certain annotations (e.g. `@Override`)
@@ -64,7 +62,7 @@ Exclude functions that have certain annotations (e.g. `@Override`)
annotations: [ override ]
```

## by modifiers
## Filter by modifiers
**granularity**: functions

Exclude functions with certain modifiers (e.g. `private` functions)
14 changes: 7 additions & 7 deletions docs/label_extractors.md
@@ -1,15 +1,15 @@
# Label extractors

Label extractors are required for correct extracting of labels from raw ASTs.
Inside themselves they extract label from tree and process tree to avoid data leak.
Also, label extractors define granularity level for the whole pipeline.
Label extractors are required for correct extraction of labels from raw ASTs.
Internally, they extract labels from the tree and process the tree to avoid data leaks.
Also, label extractors define the granularity level for the whole pipeline.

Label extractor config classes are defined in [LabelExtractorConfigs.kt](src/main/kotlin/astminer/config/LabelExtractorConfigs.kt).

## file name
**granularity**: files

Use file name of source file as label.
Use file name of source file as a label.

```yaml
name: file name
@@ -18,7 +18,7 @@ Use file name of source file as label.
## folder name
**granularity**: files

Use name of the parent folder of source file as label.
Use the name of the parent folder of source file as a label.
May be useful for code classification datasets, e.g., POJ-104.

```yaml
@@ -28,8 +28,8 @@ May be useful for code classification datasets, e.g., POJ-104.
## function name
**granularity**: functions

Use name of each function as label.
This label extractor will also hide the function name in the AST and all recursive calls.
Use name of each function as a label.
This label extractor will also hide the function name in the AST and all recursive calls to prevent data leaks.

```yaml
name: function name
27 changes: 13 additions & 14 deletions docs/parsers.md
@@ -1,31 +1,30 @@
# Parsers

`astminer` supports multiple parsers for a large wide of programming languages.
Here we describe integrated parsers and their peculiarities.
`astminer` supports multiple parsers for various programming languages.
Here we describe the integrated parsers and their peculiarities.

## ANTLR

ANother Tool for Language Recognition from [antlr.org](https://www.antlr.org).
It provides lexer and parsers for languages that can be generated into Java code.
For now, `astminer` supports Java, Python, JS, and PHP.
[ANTLR](https://www.antlr.org) provides infrastructure for generating lexers and parsers from language grammars.
For now, `astminer` supports ANTLR-based parsers for Java, Python, JS, and PHP.

## GumTree

[GumTree](https://github.com/GumTreeDiff/gumtree)
framework to work with source code as trees and to compute difference between them.
It also builds language-agnostic representation.
For now, `astminer` supports Java and Python.
is a framework for working with source code as trees and computing tree differences between versions of code.
It also builds language-agnostic representations of code.
For now, `astminer` supports GumTree-based parsers for Java and Python.

### python-parser

You should install python-parser to run GumTree with Python.
There is instruction of how to do it:
Running GumTree with Python requires `python-parser`.
It can be set up through the following steps:
1. Download sources from [GitHub](https://github.com/JetBrains-Research/pythonparser/blob/master/)
2. Install dependencies
```shell
pip install -r requirements.txt
```
3. Make python parser script executable
3. Make the `python-parser` script executable
```shell
chmod +x src/main/python/pythonparser/pythonparser_3.py
```
@@ -37,9 +36,9 @@ export PATH="<path>/src/main/python/pythonparser/pythonparser:${PATH}"

## Fuzzy

Originally [fuzzyc2cpg](https://github.com/ShiftLeftSecurity/fuzzyc2cpg)
and now part of [codepropertygraph](https://github.com/ShiftLeftSecurity/codepropertygraph/).
`astminer`uses it C/C++ parser from that. `G++`required for this parser.
Originally [fuzzyc2cpg](https://github.com/ShiftLeftSecurity/fuzzyc2cpg), Fuzzy is
now part of [codepropertygraph](https://github.com/ShiftLeftSecurity/codepropertygraph/).
`astminer` uses it to parse C/C++ code. `g++` is required for this parser.

## Other languages and parsers

35 changes: 17 additions & 18 deletions docs/storages.md
@@ -1,25 +1,25 @@
# Storages

Storages defines how ASTs should be saved on a disk.
For now, `astminer` support saving in tree and path-based formats.
The storage defines how the ASTs should be saved on disk.
For now, `astminer` supports tree-based and path-based storage formats.

Storage config classes are defined in [StorageConfigs.kt](../src/main/kotlin/astminer/config/StorageConfigs.kt).

## Tree formats

### CSV

Save trees with labels in comma-separated file.
Each tree encodes into line using sequence of parenthesis.
Saves the trees with labels to a comma-separated file.
Each tree is encoded into a single line using a sequence of parentheses.

```yaml
name: csv AST
```

### Dot

Save each tree in separate file using [dot](https://graphviz.org/doc/info/lang.html) syntax.
Along with dot files, storage also saves `description.csv` with matching between files, source file, and label.
Saves each tree in a separate file using [dot](https://graphviz.org/doc/info/lang.html) syntax.
Along with dot files, this storage also saves `description.csv` with mapping between files, source files, and labels.


```yaml
@@ -28,8 +28,8 @@ Along with dot files, storage also saves `description.csv` with matching between

### Json lines

Save each tree with label in Json Lines format.
Json format of AST inspired by Python-150k dataset.
Saves each tree with its label in the Json Lines format.
The Json format of the AST is inspired by the [150k Python](https://www.sri.inf.ethz.ch/py150) dataset.

```yaml
name: json AST
@@ -38,18 +38,17 @@ Json format of AST inspired by Python-150k dataset.
## Path-based representations

Path-based representation was introduced by [Alon et al.](https://arxiv.org/abs/1803.09544).
It uses in models like code2vec or code2seq.
It is used in popular code representation models such as `code2vec` and `code2seq`.

### Code2vec

Extract paths from each AST. Output is 4 files:
1. `node_types.csv` contains numeric ids and corresponding node types with directions (up/down, as described in [paper](https://arxiv.org/pdf/1803.09544.pdf));
2. `tokens.csv` contains numeric ids and corresponding tokens;
3. `paths.csv` contains numeric ids and AST paths in form of space-separated sequences of node type ids;
4. `path_contexts.c2s` contains labels and sequences of path contexts (triples of two tokens and a path between them).
4. `path_contexts.c2s` contains the labels and sequences of path-contexts (each representing two tokens and a path between them).

Each line in `path_contexts.c2s` starts with a label,
then it contains a sequence of space-separated triples. Each triple contains start token id, path id, end token id, separated with commas.
Each line in `path_contexts.c2s` starts with a label, followed by a sequence of space-separated triples. Each triple contains start token id, path id, end token id, separated with commas.
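
For illustration, a hypothetical line with the label `setValue` and three path contexts (all ids made up) could look like:

```
setValue 12,413,27 12,78,30 27,502,30
```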

```yaml
name: code2vec
@@ -63,13 +62,13 @@ then it contains a sequence of space-separated triples. Each triple contains sta

### Code2seq

Extract paths from each AST and save in code2seq format.
Output is `path_context.c2s` file,
each line in it starts with a label, then it contains a sequence of space-separated triples.
Each triple contains start token, path node types, end token id, separated with commas.
Extract paths from each AST and save them in the code2seq format.
The output is a `path_context.c2s` file.
Each line starts with a label, followed by a sequence of space-separated triples.
Each triple contains the start token, the path node types, and the end token, separated with commas.
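
For illustration, a hypothetical line for a function labeled `set|value` could look like this (the subtokens and node types are made up):

```
set|value this,NameExpr^AssignExpr_FieldAccessExpr,field value,NameExpr^AssignExpr_FieldAccessExpr,field
```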

To reduce memory usage you can enable `nodesToNumber` option.
If it is `true` then all types are converted into numbers and `node_types.csv` would be added to output files.
To reduce memory usage, you can enable the `nodesToNumber` option.
If `nodesToNumber` is set to `true`, all types are converted into numbers and `node_types.csv` is added to output files.

```yaml
name: code2seq