JetBrains-Research/python-change-miner

Python Change Miner

A tool for mining graph-based change patterns in Python code.

What it does

A program dependence graph is a way of representing the code by showing its data dependencies and control dependencies.
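
To make the idea of data dependencies concrete, here is a toy sketch that uses Python's standard ast module to list def-use edges between the top-level statements of a straight-line snippet. It is only an illustration and not how this tool builds its graphs: the function name data_dependencies is made up for the example, and control dependencies, branches, and loops are ignored.

```python
import ast


def data_dependencies(source):
    """Return (defining_stmt, using_stmt, variable) edges for top-level statements.

    Toy model (an assumption for illustration): straight-line code only,
    each statement is one node, and only plain variable names are tracked.
    """
    tree = ast.parse(source)
    last_def = {}  # variable name -> index of the statement that last assigned it
    edges = []
    for idx, stmt in enumerate(tree.body):
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        reads = {n.id for n in names if isinstance(n.ctx, ast.Load)}
        writes = {n.id for n in names if isinstance(n.ctx, ast.Store)}
        # Reads are resolved before writes, so "a = a + 1" depends on the previous "a".
        for name in sorted(reads):
            if name in last_def:
                edges.append((last_def[name], idx, name))
        for name in writes:
            last_def[name] = idx
    return edges


snippet = "a = 1\nb = a + 2\nc = a * b\n"
print(data_dependencies(snippet))
# [(0, 1, 'a'), (0, 2, 'a'), (1, 2, 'b')]
```

Each tuple reads as "statement 1 uses the value of a defined in statement 0", which is the kind of edge a program dependence graph records for data flow.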

A change graph is a program dependence graph for the fragment of code changes (using the versions of code before and after the target change).

Similar code changes will have similar change graphs, which means that any versioned code can be mined for patterns in its changes. This tool does exactly that for Python code: it can build program dependence graphs, build change graphs for changed files, mine such change graphs from Git repositories by traversing their VCS history, and discover patterns in these change graphs.

This functionality can be used for empirical research of coding practices, as well as for mining the candidates for potential IDE inspections.

Getting started

  1. The tool requires Python 3.8+ to run. It has been tested only on Linux and macOS.

  2. Install the required dependencies:

    pip3 install -r requirements.txt
  3. Create the settings file settings.json based on conf/settings.json.example and save it in the same directory. You can find the description of individual settings in conf/help.md.

  4. If you want to use the tool for building change graphs or mining change graphs from local repositories, you need to set up GumTree. You can use the compiled version of GumTree found in the external directory. It is slightly modified and uses the environment variables GUMTREE_PYTHON_BIN (the Python interpreter path for GumTree pyparser calls) and GUMTREE_PYPARSER_PATH (the Python parser script path). In general, they are set up automatically. If you want to set them manually, do it, for instance, as follows:

    GUMTREE_PYTHON_BIN=python3
    GUMTREE_PYPARSER_PATH={project_dir}/external/pythonparser_3.py
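
If you launch the tool from another Python script, the same two variables can be prepared in a copy of the current environment. This is a minimal sketch: the helper name gumtree_env and the project path in the usage line are hypothetical, while the variable names and the external/pythonparser_3.py location come from the step above.

```python
import os
from pathlib import Path


def gumtree_env(project_dir, python_bin="python3"):
    """Return a copy of the current environment with the GumTree variables set."""
    env = dict(os.environ)  # copy, so the parent process environment is untouched
    env["GUMTREE_PYTHON_BIN"] = python_bin
    env["GUMTREE_PYPARSER_PATH"] = str(
        Path(project_dir) / "external" / "pythonparser_3.py"
    )
    return env


# Hypothetical checkout location; replace with your own project directory.
env = gumtree_env("/path/to/python-change-miner")
print(env["GUMTREE_PYPARSER_PATH"])
```

The returned dictionary can be passed as the env argument of subprocess.run when starting main.py.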

How to use

You can run any step of the pipeline by using the following simple command:

python3 main.py <mode> <args>
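
When the pipeline is driven from a script rather than typed by hand, the same command line can be assembled programmatically. The sketch below is an assumption-laden convenience wrapper, not part of the tool: build_command and run_mode are hypothetical names, and the only facts taken from this README are the main.py entry point and the four mode names.

```python
import subprocess

# Mode names as listed in this README.
MODES = ("pfg", "cg", "collect-cgs", "patterns")


def build_command(mode, *args, python="python3"):
    """Build the argv list for `python3 main.py <mode> <args>`."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return [python, "main.py", mode, *args]


def run_mode(mode, *args, cwd="."):
    """Run one pipeline step; cwd should be the directory that contains main.py."""
    return subprocess.run(build_command(mode, *args), cwd=cwd, check=True)


print(build_command("cg", "-s", "examples/0_old.py", "-d", "examples/0_new.py",
                    "-o", "images/cg.dot"))
```

Passing the arguments as a list avoids shell quoting problems, and check=True makes a failed step raise instead of passing silently.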

The tool currently supports four operation modes:

  1. pfg — build a program dependence graph from the Python source.

    Arguments:

    • -i — a path to the source file.
    • -o — a path to the output file. Two files will be created, a .dot file with a graph and a .pdf file with its visualization.
    • --no-closure (optional): if passed, no closure will be built for the graph.
    • --show-deps (optional): if passed, edges with type dep will be present in the graph, indicating the dependence of the vertices on each other.
    • --hide-op-kinds (optional): if passed, the types of operations will be hidden in the graph.
    • --show-data-keys (optional): if passed, IDs of the variables will be present in the graph.

    Typical use:

    python3 main.py pfg -i examples/src.py -o images/pfg.dot
  2. cg — build a change graph from two source files (before and after change).

    Arguments:

    • -s — a path to the source file before changes.
    • -d — a path to the source file after changes.
    • -o — a path to the output file. Two files will be created, a .dot file with a graph and a .pdf file with its visualization.

    Typical use:

    python3 main.py cg -s examples/0_old.py -d examples/0_new.py -o images/cg.dot
  3. collect-cgs — mine change graphs from local repositories.

    All the general settings for this mode are located in the JSON file; see step 3 of Getting started.

    Use:

    python3 main.py collect-cgs <args>

    Arguments:

    • --only-tests (optional): if passed, the tool will build change graphs only for files whose names contain the substring "test".

    The tool uses pickle to save the data, so the output files are serialized and can only be read back with pickle. Running the tool in the patterns mode to detect patterns within the mined change graphs will deserialize them automatically.

  4. patterns — search for patterns in the change graphs.

    This mode can be run in two ways: from the results of the previous step or from the source files. The settings are located in the JSON file; see step 3 of Getting started. If you want to look for patterns in the change graphs obtained from running the tool in the collect-cgs mode, simply run:

    python3 main.py patterns

    and the tool will find the input automatically. Alternatively, you can mine patterns directly from files with the following arguments:

    • -s — a path to the source files before changes.
    • -d — a path to the source files after changes.
    • --fake-mining (optional): if passed, no mining is carried out, and the change graphs as a whole are considered to be the patterns (used for debugging).

    Typical use:

    python3 main.py patterns -s examples/0_old.py examples/1_old.py -d examples/0_new.py examples/1_new.py

    Here, the files are automatically mapped (0_old.py -> 0_new.py, 1_old.py -> 1_new.py), their change graphs are built, and the patterns between them are mined.

    In both usage scenarios, the patterns mode will produce results as shown in the picture below:

    [Figure: example output directory structure produced by the patterns mode]

    The patterns are organized by their size in nodes. In the output directory, a directory is created for each size; in the example, the size is 17. Each of these directories stores the patterns, again as directories with their ID in the name. In the example, 1379 is the ID of a pattern with size 17.

    Within each pattern, we store details.html with its description and a listing of the pattern instances, along with the instances themselves. For each instance, there are three types of files: sample{ID} is the code of the instance (before and after the change), fragment{ID} is the change graph of this specific sample, and graph{ID} is the larger change graph from which this sample came. You can also control the specifics of the output by changing the settings file. contents.html on every level of the structure provides convenient navigation. To understand the structure better, you can browse an example output in survey_patterns.tar.gz.
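
The size/ID layout described above can also be walked from a script, for example to count how many patterns of each size were found. This is a rough sketch under two assumptions that the README does not guarantee: size directories and pattern directories are named with plain integers (such as 17 and 1379), and the output directory path, which is configured in the settings file, is passed in by the caller. The function name list_patterns is hypothetical.

```python
from pathlib import Path


def list_patterns(output_dir):
    """Return {size: [pattern_id, ...]} for a tree laid out as <size>/<pattern_id>/.

    Assumes directory names are plain integers; anything else
    (e.g. contents.html) is skipped.
    """
    result = {}
    for size_dir in sorted(Path(output_dir).iterdir()):
        if not (size_dir.is_dir() and size_dir.name.isdigit()):
            continue
        pattern_ids = sorted(
            int(p.name)
            for p in size_dir.iterdir()
            if p.is_dir() and p.name.isdigit()
        )
        result[int(size_dir.name)] = pattern_ids
    return result
```

Calling list_patterns on an output directory that contains only the pattern from the example would return a dictionary mapping 17 to a list holding 1379, provided the naming assumption holds.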

Contacts

If you have any questions or suggestions, don't hesitate to open an issue or contact the developers at [email protected].
