Edit descriptions #173

Merged
merged 7 commits into from
Aug 5, 2021
28 changes: 14 additions & 14 deletions README.md
@@ -26,27 +26,27 @@ Currently, it supports extraction of:

It is designed to be very easily extensible to new languages.

`astminer` lets you create end2end pipeline of data processing.
It allows convert source code, cloned from VCS to suitable for training datasets.
To do that, `astminer` provides multiple steps to handle data:
- [filters](./docs/filters.md) to remove redundant samples from data
- [label extractors](./docs/label_extractors.md) to create label for each tree
- [storages](./docs/storages.md) to define storage format.
`astminer` lets you create an end-to-end pipeline for processing code for machine learning models.
It converts source code cloned from a VCS into formats suitable for training.
To achieve that, `astminer` incorporates the following processing modules:
- [Filters](./docs/filters.md) to remove redundant samples from data.
- [Label extractors](./docs/label_extractors.md) to create a label for each tree.
- [Storages](./docs/storages.md) to define the storage format.

## Usage
There are two ways to use `astminer`.
There are two ways to use `astminer`:

- [As a standalone CLI tool](#using-astminer-cli) with pre-implemented logic for common processing and mining tasks
- [As a standalone CLI tool](#using-astminer-cli) with pre-implemented logic for common processing and mining tasks.
- [Integrated](#using-astminer-as-a-dependency) into your Kotlin/Java mining pipelines as a Gradle dependency.

### Using `astminer` cli

Define config (examples of them in [configs](./configs) directory) and pass it shell script:
Specify a config (see the examples in the [configs](./configs) directory) and pass it to the shell script:
```shell
./cli.sh <path-to-YAML-config>
```

For details about config format and other navigate to [docs/cli](./docs/cli.md).
For details on CLI configuration, see [docs/cli](./docs/cli.md).

### Using `astminer` as a dependency

@@ -78,20 +78,20 @@ dependencies {

#### Local development

To use a specific version of the library, navigate to the required branch and build local version of `astminer`:
To use a specific version of the library, navigate to the required branch and build a local version of `astminer`:
```shell
./gradlew publishToMavenLocal
```
After that, add `mavenLocal()` to the `repositories` section of your Gradle configuration.

#### Examples

If you want to use `astminer` as a library in your Java/Kotlin based data mining tool, check the following:
If you want to use `astminer` as a library in your Java/Kotlin-based data mining tool, check the following:

* A few simple [examples](src/examples) of `astminer` usage in Java and Kotlin.
* A few simple [examples](src/examples) of using `astminer` in Java and Kotlin.
* Using `astminer` as a part of another mining tool — [psiminer](https://github.com/JetBrains-Research/psiminer).

Please consider trying Kotlin for your data mining pipelines: from our experience, it is much better suited for data collection and transformation instruments.
Please consider trying Kotlin for your data mining pipelines: from our experience, it is much better suited for data collection and transformation instruments than Java.

## Contribution

37 changes: 21 additions & 16 deletions docs/cli.md
@@ -1,39 +1,44 @@
# `astminer` CLI usage

You can run `astminer` through command-line interface.
CLI allow to run the tool on any implemented parser with specifying filtering, label extracting and storage options.
You can run `astminer` through a command-line interface (CLI).
The CLI lets you run the tool with any implemented parser and with specified options for filtering, label extraction, and storage of the results.

## How to
You can prepare and run CLI on any branch you want. Just navigate to it and do follow steps:
1. Build shadow jar for `astminer`:
You can build and run the CLI with any version of `astminer`:
1. Check out the relevant version of `astminer` sources (for example, the `master-dev` branch)
2. Build a shadow jar for `astminer`:
```shell
gradle shadowJar
```
2. [Optionally] Pull docker image with all parsers dependencies installed:
3. [optional] Pull a Docker image with all parser dependencies installed:
```shell
docker pull voudy/astminer
```
3. Run `astminer` with specified config:
4. Run `astminer` with the specified config:
```shell
./cli.sh <path-to-yaml-config>
```

## Config

CLI usage of the `astminer` completely configured by YAML config.
The `astminer` CLI is fully configured by a YAML config.
The config should contain the following values:
- `inputDir` — path to directory with input data
- `outputDir` — path to output directory
- `inputDir` — path to the directory with input data
- `outputDir` — path to the output directory
- `parser` — parser name and list of target languages
- `filters` — list of filters with their parameters
- `label` — label extractor strategy
- `filters` — list of filters and parameters
- `label` — label extraction strategy
- `storage` — storage format

[configs](../configs) already contain some config examples, look at them for more structure details.
The [configs](../configs) directory contains config examples that can be used as a reference for the YAML structure.
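
As a rough illustration of the value list above, the following Python sketch checks that an already-parsed config contains the six top-level keys named in this document. The helper `missing_config_keys` and the sample values are hypothetical and are not part of `astminer`; in particular, the nested shape of `parser`, `filters`, `label`, and `storage` is an assumption, so the files in the configs directory remain the reference.

```python
# Top-level keys listed in this document.
REQUIRED_KEYS = ("inputDir", "outputDir", "parser", "filters", "label", "storage")


def missing_config_keys(config: dict) -> list:
    """Return the required top-level keys that are absent from a parsed config."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Hypothetical config, as it might look after loading the YAML file into a dict.
# The nested structure below is assumed for illustration only.
example_config = {
    "inputDir": "path/to/input",
    "outputDir": "path/to/output",
    "parser": {"name": "antlr", "languages": ["java"]},
    "filters": [{"name": "by tree size", "maxTreeSize": 100}],
    "label": {"name": "file name"},
    "storage": {"name": "csv AST"},
}

print(missing_config_keys(example_config))
```

A check like this only catches missing sections early; it does not validate the values inside each section.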

## Docker

Since some parsers have additional dependencies,
e.g. G++ must be installed for Fuzzy parser (see [parsers](./parsers.md)).
We introduce Docker image with already installed parser dependencies.
To use this image you should only pull this image from DockerHub and run CLI by `./cli.sh`.
Some parsers have non-trivial environment requirements.
For example, g++ must be installed for the Fuzzy parser (see [parsers](./parsers.md)).

To ease dealing with such cases, we provide a Docker image with all parser dependencies.
This image can be pulled from DockerHub:
```shell
docker pull voudy/astminer
```
30 changes: 14 additions & 16 deletions docs/filters.md
@@ -1,20 +1,18 @@
# Filters

Each filter dedicate to remove *bad* trees from data, e.g. too large trees.
Also, each filter works only for certain levels of granulaity.
Here we describe all implemented filters.
Each description contains corresponding YAML config.

Since filters may be language or parser specific, `astminer` should support all this zoo.
And since we **do not** use any of intermediate representation it is impossible to unify filtering.
Therefore some languages or parsers may not support needed filter
Each filter is dedicated to removing *bad* trees from the data, e.g. trees that are too big.
Moreover, each filter works only for certain levels of granularity.
Here we describe all filters provided by `astminer`.
Each description contains the corresponding YAML config.

Filters can be specific to a language or a parser.
Therefore, some languages or parsers may not support the needed filter
(in that case, `FunctionInfoPropertyNotImplementedException` is thrown).
To handle this, the user can manually add AST-parsing logic that retrieves the desired information about the function or code.

Filter config classes are defined in [FilterConfigs.kt](../src/main/kotlin/astminer/config/FilterConfigs.kt).

## by tree size
## Filter by tree size
**granularity**: files, functions

Exclude ASTs that are too small or too big.
@@ -25,7 +23,7 @@ Exclude ASTs that are too small or too big.
maxTreeSize: 100
```
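
To make the size filter concrete, here is a minimal Python sketch of the idea, assuming that tree size means the number of nodes and that the bounds are inclusive. The `Node` class and both function names are hypothetical; only `maxTreeSize` appears in the config shown here, so the lower bound parameter is an assumed counterpart.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal AST node used only for this illustration."""
    label: str
    children: list = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Count the nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def passes_size_filter(root: Node, min_tree_size: int = 0, max_tree_size: int = 100) -> bool:
    """Keep a tree only if its node count lies within the inclusive bounds."""
    return min_tree_size <= tree_size(root) <= max_tree_size


# A two-node tree: it passes with bounds [1, 100] and is rejected with bounds [3, 100].
small = Node("return", [Node("x")])
print(passes_size_filter(small, min_tree_size=1, max_tree_size=100))
print(passes_size_filter(small, min_tree_size=3, max_tree_size=100))
```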

## by words number
## Filter by word count
**granularity**: files, functions

Exclude ASTs that have too many words in any token.
@@ -35,7 +33,7 @@ Exclude ASTs that have too many words in any token.
maxTokenWordsNumber: 10
```

## by function name length
## Filter by function name length
**granularity**: functions

Exclude functions that have too many words in their name.
@@ -45,7 +43,7 @@ Exclude functions that have too many words in their name.
maxWordsNumber: 10
```

## no constructors
## Exclude constructors
**granularity**: functions

Exclude constructors
@@ -54,7 +52,7 @@ Exclude constructors
name: no constructors
```

## by annotations
## Filter by annotation
**granularity**: functions

Exclude functions that have certain annotations (e.g. `@Override`)
@@ -64,7 +62,7 @@ Exclude functions that have certain annotations (e.g. `@Override`)
annotations: [ override ]
```

## by modifiers
## Filter by modifiers
**granularity**: functions

Exclude functions with certain modifiers (e.g. `private` functions)
14 changes: 7 additions & 7 deletions docs/label_extractors.md
@@ -1,15 +1,15 @@
# Label extractors

Label extractors are required for correct extracting of labels from raw ASTs.
Inside themselves they extract label from tree and process tree to avoid data leak.
Also, label extractors define granularity level for the whole pipeline.
Label extractors are required for correct extraction of labels from raw ASTs.
Internally, they extract labels from the tree and process the tree to avoid data leaks.
Also, label extractors define the granularity level for the whole pipeline.

Label extractor config classes are defined in [LabelExtractorConfigs.kt](src/main/kotlin/astminer/config/LabelExtractorConfigs.kt).

## file name
**granularity**: files

Use file name of source file as label.
Use the file name of the source file as a label.

```yaml
name: file name
@@ -18,7 +18,7 @@ Use file name of source file as label.
## folder name
**granularity**: files

Use name of the parent folder of source file as label.
Use the name of the parent folder of the source file as a label.
May be useful for code classification datasets, e.g., POJ-104.

```yaml
@@ -28,8 +28,8 @@ May be useful for code classification datasets, e.g., POJ-104.
## function name
**granularity**: functions

Use name of each function as label.
This label extractor will also hide the function name in the AST and all recursive calls.
Use the name of each function as a label.
This label extractor will also hide the function name in the AST and all recursive calls to prevent data leaks.
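
The masking step can be sketched in Python as follows. This is a simplified illustration, not `astminer`'s implementation: the `Node` class, the `"name"` node type, and the placeholder tokens `METHOD_NAME` and `SELF` are assumptions, and any other node whose token equals the function name is treated as a recursive call.

```python
from dataclasses import dataclass, field

# Placeholder tokens chosen for this sketch; the values astminer uses may differ.
NAME_PLACEHOLDER = "METHOD_NAME"
SELF_CALL_PLACEHOLDER = "SELF"


@dataclass
class Node:
    """Minimal AST node used only for this illustration."""
    type: str
    token: str = ""
    children: list = field(default_factory=list)


def extract_function_label(function_root: Node) -> str:
    """Return the function name and mask it in the tree (modifies the tree in place)."""
    # Assumption: the function name is stored in a direct child of type "name".
    name_node = next(c for c in function_root.children if c.type == "name")
    label = name_node.token
    name_node.token = NAME_PLACEHOLDER

    # Walk the rest of the tree and hide every other occurrence of the name.
    stack = list(function_root.children)
    while stack:
        node = stack.pop()
        if node is not name_node and node.token == label:
            node.token = SELF_CALL_PLACEHOLDER
        stack.extend(node.children)
    return label


# A function named "factorial" whose body contains a call to itself.
call = Node("call", "factorial")
func = Node("function", "", [Node("name", "factorial"), Node("body", "", [call])])
label = extract_function_label(func)
print(label, func.children[0].token, call.token)
```

Without this step, a model trained to predict the function name could simply read the answer from the tree.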

```yaml
name: function name
27 changes: 13 additions & 14 deletions docs/parsers.md
@@ -1,31 +1,30 @@
# Parsers

`astminer` supports multiple parsers for a large wide of programming languages.
Here we describe integrated parsers and their peculiarities.
`astminer` supports multiple parsers for various programming languages.
Here we describe the integrated parsers and their peculiarities.

## ANTLR

ANother Tool for Language Recognition from [antlr.org](https://www.antlr.org).
It provides lexer and parsers for languages that can be generated into Java code.
For now, `astminer` supports Java, Python, JS, and PHP.
[ANTLR](https://www.antlr.org) provides an infrastructure to generate lexers and parsers for languages based on grammars.
For now, `astminer` supports ANTLR-based parsers for Java, Python, JS, and PHP.

## GumTree

[GumTree](https://github.com/GumTreeDiff/gumtree)
framework to work with source code as trees and to compute difference between them.
It also builds language-agnostic representation.
For now, `astminer` supports Java and Python.
is a framework to work with source code as trees and to compute differences of trees between different versions of code.
It also builds language-agnostic representations of code.
For now, `astminer` supports GumTree-based parsers for Java and Python.

### python-parser

You should install python-parser to run GumTree with Python.
There is instruction of how to do it:
Running GumTree with Python requires `python-parser`.
It can be set up through the following steps:
1. Download sources from [GitHub](https://github.com/JetBrains-Research/pythonparser/blob/master/)
2. Install dependencies
```shell
pip install -r requirements.txt
```
3. Make python parser script executable
3. Make the `python-parser` script executable
```shell
chmod +x src/main/python/pythonparser/pythonparser_3.py
```
@@ -37,9 +36,9 @@ export PATH="<path>/src/main/python/pythonparser/pythonparser:${PATH}"

## Fuzzy

Originally [fuzzyc2cpg](https://github.com/ShiftLeftSecurity/fuzzyc2cpg)
and now part of [codepropertygraph](https://github.com/ShiftLeftSecurity/codepropertygraph/).
`astminer`uses it C/C++ parser from that. `G++`required for this parser.
Originally [fuzzyc2cpg](https://github.com/ShiftLeftSecurity/fuzzyc2cpg), Fuzzy is
now part of [codepropertygraph](https://github.com/ShiftLeftSecurity/codepropertygraph/).
`astminer` uses it to parse C/C++ code. `g++` is required for this parser.

## Other languages and parsers

35 changes: 17 additions & 18 deletions docs/storages.md
@@ -1,25 +1,25 @@
# Storages

Storages defines how ASTs should be saved on a disk.
For now, `astminer` support saving in tree and path-based formats.
The storage defines how the ASTs should be saved on disk.
For now, `astminer` supports tree-based and path-based storage formats.

Storage config classes are defined in [StorageConfigs.kt](../src/main/kotlin/astminer/config/StorageConfigs.kt).

## Tree formats

### CSV

Save trees with labels in comma-separated file.
Each tree encodes into line using sequence of parenthesis.
Saves the trees with labels to a comma-separated file.
Each tree is encoded on a single line as a parenthesized sequence.

```yaml
name: csv AST
```
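
To show what a parenthesized single-line encoding can look like, here is a small Python sketch. The exact layout written by `astminer` is not specified in this document, so the scheme below (node label followed by each child wrapped in parentheses) and the names `Node` and `encode_tree` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal AST node used only for this illustration."""
    label: str
    children: list = field(default_factory=list)


def encode_tree(node: Node) -> str:
    """Serialize a tree to one line: the label, then each child wrapped in parentheses."""
    return node.label + "".join(f"({encode_tree(child)})" for child in node.children)


tree = Node("Add", [Node("x"), Node("Mul", [Node("y"), Node("2")])])
print(encode_tree(tree))  # prints Add(x)(Mul(y)(2))
```

Because the nesting is carried by the parentheses, the whole tree fits in one CSV cell next to its label.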

### Dot

Save each tree in separate file using [dot](https://graphviz.org/doc/info/lang.html) syntax.
Along with dot files, storage also saves `description.csv` with matching between files, source file, and label.
Saves each tree in a separate file using [dot](https://graphviz.org/doc/info/lang.html) syntax.
Along with the dot files, this storage also saves `description.csv` with a mapping between dot files, source files, and labels.


```yaml
@@ -28,8 +28,8 @@ Along with dot files, storage also saves `description.csv` with matching between

### Json lines

Save each tree with label in Json Lines format.
Json format of AST inspired by Python-150k dataset.
Saves each tree with its label in the JSON Lines format.
The JSON format of the AST is inspired by the [150k Python](https://www.sri.inf.ethz.ch/py150) dataset.

```yaml
name: json AST
Expand All @@ -38,18 +38,17 @@ Json format of AST inspired by Python-150k dataset.
## Path-based representations

Path-based representation was introduced by [Alon et al.](https://arxiv.org/abs/1803.09544).
It uses in models like code2vec or code2seq.
It is used in popular code representation models such as `code2vec` and `code2seq`.

### Code2vec

Extract paths from each AST. Output is 4 files:
1. `node_types.csv` contains numeric ids and corresponding node types with directions (up/down, as described in [paper](https://arxiv.org/pdf/1803.09544.pdf));
2. `tokens.csv` contains numeric ids and corresponding tokens;
3. `paths.csv` contains numeric ids and AST paths in form of space-separated sequences of node type ids;
4. `path_contexts.c2s` contains labels and sequences of path contexts (triples of two tokens and a path between them).
4. `path_contexts.c2s` contains the labels and sequences of path-contexts (each representing two tokens and a path between them).

Each line in `path_contexts.c2s` starts with a label,
then it contains a sequence of space-separated triples. Each triple contains start token id, path id, end token id, separated with commas.
Each line in `path_contexts.c2s` starts with a label, followed by a sequence of space-separated triples. Each triple contains a start token id, a path id, and an end token id, separated by commas.
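
The line layout described above can be read with a few lines of Python. This is a sketch under two assumptions: the label contains no spaces and is separated from the first triple by a space. The names `PathContext` and `parse_path_contexts_line` and the sample line are hypothetical.

```python
from typing import NamedTuple


class PathContext(NamedTuple):
    """One triple from a path_contexts.c2s line."""
    start_token_id: int
    path_id: int
    end_token_id: int


def parse_path_contexts_line(line: str):
    """Split one non-empty line into its label and a list of id triples."""
    label, *triples = line.split()
    contexts = [
        PathContext(*(int(part) for part in triple.split(",")))
        for triple in triples
    ]
    return label, contexts


# Made-up line: a label followed by two "start,path,end" triples.
label, contexts = parse_path_contexts_line("getName 1,5,2 2,7,3")
print(label, contexts)
```

The ids can then be resolved against `tokens.csv` and `paths.csv` to recover the tokens and node-type sequences.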

```yaml
name: code2vec
@@ -63,13 +62,13 @@ then it contains a sequence of space-separated triples. Each triple contains sta

### Code2seq

Extract paths from each AST and save in code2seq format.
Output is `path_context.c2s` file,
each line in it starts with a label, then it contains a sequence of space-separated triples.
Each triple contains start token, path node types, end token id, separated with commas.
Extract paths from each AST and save in the code2seq format.
The output is a `path_context.c2s` file.
Each line starts with a label, followed by a sequence of space-separated triples.
Each triple contains the start token, path node types, and end token id, separated with commas.

To reduce memory usage you can enable `nodesToNumber` option.
If it is `true` then all types are converted into numbers and `node_types.csv` would be added to output files.
To reduce memory usage, you can enable the `nodesToNumber` option.
If `nodesToNumber` is set to `true`, all types are converted into numbers and `node_types.csv` is added to the output files.
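
The effect of replacing node type names with numbers can be sketched in Python as follows. The function `number_node_types`, the sample node type names, and the choice to number ids from 1 in order of first appearance are assumptions made for illustration; the returned table plays the role that `node_types.csv` has in the real output.

```python
def number_node_types(paths):
    """Replace node type names with numeric ids; return the encoded paths and the id table."""
    type_ids = {}
    encoded = []
    for path in paths:
        # setdefault assigns the next free id the first time a type is seen.
        encoded.append([type_ids.setdefault(node_type, len(type_ids) + 1) for node_type in path])
    return encoded, type_ids


# Two made-up paths given as sequences of node type names.
paths = [["NameExpr", "MethodCall", "NameExpr"], ["MethodCall", "Block"]]
encoded, type_ids = number_node_types(paths)
print(encoded)
print(type_ids)
```

Long type names repeated across many paths are stored once in the table, and each path keeps only short integers.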

```yaml
name: code2seq