ISSTA 2018 Artifact Evaluation

Arianna Blasi edited this page Jun 9, 2018 · 41 revisions

IMPORTANT NOTE

For anonymity, the ISSTA submission calls our technique JDoctor. The artifact instead uses the project's real name, Toradocu. What the paper calls Toradocu, that is, the state of the project as of the ISSTA 2016 paper, is called OldToradocu in this document.

System Requirements

Toradocu requires Java JDK 1.8 and Python 2.7+. It has been tested on Ubuntu and macOS.

Building Toradocu

Clone the Toradocu repository and build the fat jar with Gradle:

git clone https://github.com/albertogoffi/toradocu.git 
cd toradocu
./gradlew shadowJar

This creates a jar containing Toradocu and all its dependencies in toradocu/build/libs/toradocu-1.0-all.jar.

NOTE: When you build Toradocu for the first time, the build file downloads the GloVe models from our repository. This takes some time, as the models amount to approximately 1 GB of data.

Reproduce accuracy experiment

These steps run the experiments described in Section 5, and produce Table 2 in the paper.

Run these commands in the toradocu folder.

  1. Run experiments with Toradocu and produce its result file:

    ./stats/precision_recall_summary.sh toradocu_semantics
    

    This creates file results_semantics.csv.

  2. Run experiments with @tComment and produce its result file:

    ./stats/precision_recall_summary.sh tcomment
    

    This creates file results_tcomment.csv. Some of the tests fail, which is expected because the @tComment test suite uses the same oracles as Toradocu's. The failures do not alter the precision/recall numbers.

  3. Run experiments with OldToradocu, produce its result file, and return to the master branch:

    git checkout version0.1
    ./precision_recall_summary.sh
    git checkout master

    This creates file results_toradocu-1.0.csv.
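
Before building the table, it can help to confirm that all three result files are in place. The following Python 3 sketch is not part of the artifact. It assumes the CSV files are written to the directory you pass in (by default the current directory), which this page does not state explicitly; the helper name `missing_result_files` is made up for illustration.

```python
import os

# File names as given in the three steps above.
EXPECTED_RESULT_FILES = (
    "results_semantics.csv",
    "results_tcomment.csv",
    "results_toradocu-1.0.csv",
)


def missing_result_files(directory="."):
    """Return the expected result files that are absent from `directory`."""
    return [
        name
        for name in EXPECTED_RESULT_FILES
        if not os.path.isfile(os.path.join(directory, name))
    ]


if __name__ == "__main__":
    # Assumption: the script is run from the folder that holds the CSV files.
    missing = missing_result_files()
    if missing:
        print("Missing result files: " + ", ".join(missing))
    else:
        print("All result files found.")
```

If a file is reported missing, re-run the corresponding step, or pass the folder where the files actually ended up.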

Once all the CSV files with results are created, run the script that produces the result table:

./stats/latex.sh paper

If you are asked whether to replace the fat jar, answer no.

Once completed, you can inspect file accuracy-table.tex in the latex folder to see the results of Table 2 of the paper.

Note: OldToradocu has slightly worse precision and recall than we reported in our submission: we found a minor bug in our older script. We will update the results while preparing the camera-ready version.

Reproduce Randoop and Randoop+Toradocu experiments

Note: between the ISSTA submission and the camera-ready deadline, both Toradocu and Randoop improved considerably, as did our experimental setup. In particular, Toradocu used to produce more false positives, which influenced the Randoop results (especially for the Commons-Math project). For example, Toradocu now includes a compiler, so it can no longer produce uncompilable specifications. We have therefore re-run the experiments more carefully, manually checking the results one by one. Section 6 of the camera-ready version of the paper is updated with the new results. With the new experiments, we could not confirm the bugs mentioned in Section 6.5 for Commons-Math.

These instructions would allow you to reproduce the results reported in Section 6. However, keep in mind that it took us several weeks of manual effort to produce the results. We provide links to our repositories mainly for you to assess how we ran the evaluation process:

https://gitlab.cs.washington.edu/randoop/toradocu-manual-evaluation-feb-2018.git

The repository has a branch for each project under test. These are the direct links to all the tests and related logs:

Commons-Collections

Commons-Math

Guava

Plume

JGrapht

Graphstream

The table in Section 6 of the paper was derived by manually comparing the logs. To help ourselves, after our manual checks we labeled each test case:

  • [false-alarm] means that the test case was error-revealing according to Randoop/Randoop+Toradocu, but we verified it should instead pass;
  • [true-failure] means that the test case was error-revealing according to Randoop/Randoop+Toradocu, and we verified indeed that it was;
  • [toradocu] means that the test case is re-classified correctly thanks to Toradocu's specifications; [false-positive] means that the specifications lead to a wrong re-classification;
  • [seen-already] means we already saw the test case during the Randoop-only experiment.

A brief explanation of how the directories are organized:

  • each link listed above will lead you to a page with a set of folders and an evaluation-log.md file.
  • each of the folders contains the code of a single test case. To see the description of the test case itself, check the corresponding second-level header in the evaluation-log.md.
  • notice that folders and logs only represent error-revealing test cases generated by Randoop/Randoop+Toradocu. In Table 3 of the paper, the Same column also counts passing test cases. We don't keep them in the repository, but the count is reported at the top of the evaluation-log.md files of the Randoop-only experiments.
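
To get a quick tally of the labels described above, a log can be scanned for the bracketed markers. This Python 3 sketch is not part of the artifact: it assumes only that labels appear literally as bracketed text (such as [false-alarm]) in evaluation-log.md. The sample excerpt and the test names in it are made up for illustration, and the real log layout may differ.

```python
import re
from collections import Counter

# Labels listed on this page.
LABELS = ("false-alarm", "true-failure", "toradocu", "false-positive", "seen-already")
LABEL_PATTERN = re.compile(
    r"\[(" + "|".join(re.escape(label) for label in LABELS) + r")\]"
)


def count_labels(log_text):
    """Count how often each known label occurs in the text of an evaluation log."""
    return Counter(LABEL_PATTERN.findall(log_text))


# Hypothetical excerpt; the real evaluation-log.md layout may differ.
SAMPLE_LOG = """\
## ExampleTest0
[false-alarm] [toradocu]
## ExampleTest1
[true-failure] [seen-already]
## ExampleTest2
[false-alarm] [false-positive]
"""

counts = count_labels(SAMPLE_LOG)
for label in LABELS:
    print("{}: {}".format(label, counts[label]))
```

For a real log, read the file first, for example `count_labels(open("evaluation-log.md").read())`. Such counts are only a sanity check and do not replace the manual comparison described above.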

Toy example

A much easier way to verify how the integration between Randoop and Toradocu works is to run the tools on a toy example:

https://github.com/ariannab/toyproject.git

Running Toradocu on any class

First, follow the instructions for building Toradocu if you haven't already.

To run Toradocu on a class MyClass of a certain project:

java -jar toradocu-1.0-all.jar \
   --target-class mypackage.MyClass \
   --source-dir project/src \
   --class-dir project/bin 

For example:

java -jar build/libs/toradocu-1.0-all.jar \
--target-class org.apache.commons.collections4.map.LRUMap \
--source-dir src/test/resources/src/commons-collections4-4.1-src/src/main/java \
--class-dir src/test/resources/bin/commons-collections4-4.1.jar
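
When running Toradocu on many classes, it may be convenient to assemble the same command line programmatically. The Python 3 sketch below is not part of the artifact; it uses only the three options shown above, and the helper name `build_toradocu_command` is made up for illustration.

```python
import shlex


def build_toradocu_command(jar_path, target_class, source_dir, class_dir):
    """Build the argument list for the invocation documented on this page."""
    return [
        "java", "-jar", jar_path,
        "--target-class", target_class,
        "--source-dir", source_dir,
        "--class-dir", class_dir,
    ]


# Same values as the LRUMap example above (paths relative to the toradocu folder).
command = build_toradocu_command(
    "build/libs/toradocu-1.0-all.jar",
    "org.apache.commons.collections4.map.LRUMap",
    "src/test/resources/src/commons-collections4-4.1-src/src/main/java",
    "src/test/resources/bin/commons-collections4-4.1.jar",
)

# Print the command in a form that can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

To actually launch the tool, the list can be passed to `subprocess.run(command, check=True)` from the toradocu folder, once the fat jar has been built.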

The terminal shows the output in a few seconds. It is formatted as JSON and contains the produced conditions for every method in the class. For each tag category that the method's Javadoc declares (throwsTags, paramTags, returnTag), you find the comment (field "comment") and the translation produced by Toradocu (field "condition").
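
If you save the JSON to a file, the comment/condition pairs can be listed with a few lines of Python 3. This is a sketch under assumptions, not a description of the actual schema: it assumes the top level is a list of method objects, that throwsTags and paramTags are lists of tag objects, and that returnTag is a single tag object or absent. Only the field names throwsTags, paramTags, returnTag, "comment", and "condition" come from this page; the sample data is a placeholder, not real Toradocu output.

```python
import json


def iter_translations(methods):
    """Yield (tag_category, comment, condition) for every tag found."""
    for method in methods:
        # Assumed to be lists of tag objects; missing or null is treated as empty.
        for category in ("throwsTags", "paramTags"):
            for tag in method.get(category) or []:
                yield category, tag.get("comment"), tag.get("condition")
        # Assumed to be a single tag object, or absent/null.
        return_tag = method.get("returnTag")
        if return_tag:
            yield "returnTag", return_tag.get("comment"), return_tag.get("condition")


# Placeholder data in the assumed layout; not real Toradocu output.
SAMPLE_OUTPUT = """
[
  {
    "throwsTags": [
      {"comment": "example throws comment", "condition": "example throws condition"}
    ],
    "paramTags": [
      {"comment": "example param comment", "condition": "example param condition"}
    ],
    "returnTag": {"comment": "example return comment", "condition": "example return condition"}
  },
  {"throwsTags": [], "returnTag": null}
]
"""

rows = list(iter_translations(json.loads(SAMPLE_OUTPUT)))
for category, comment, condition in rows:
    print("{}: {!r} -> {!r}".format(category, comment, condition))
```

The values of "comment" and "condition" are passed through untouched, so the sketch keeps working if they turn out to be nested objects rather than plain strings.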