TO DO List for Diff Annotator package

  • cleanup
    • cleanup annotate.py
    • test annotate.py code
    • test annotate.py script
    • cleanup languages.py
    • cleanup lexer.py
    • move from click to typer for handling CLI parameters
    • add docstring for all files
  • make it a separate package
  • 4 scripts (their names may change in the future) - see the [project.scripts] section in pyproject.toml
  • make it possible to use each script's features from Python (see for example the process_single_bug() function in annotate.py) and document such use (in docs/, in function docstrings, in file docstrings, in doctests)
  • improve common handling of command line parameters for all scripts
    • maybe make it possible to use configuration files to set parameters for CLI (similarly to Hydra) with typer-config (e.g. my-typer-app --config config.yml)
    • maybe implement options common to all scripts, like --version, putting their implementation in __init__.py, and make use of the "Options Anywhere" and "Dependency Injection" capabilities that typer-tools adds
    • maybe implement --log-file (defaults to '.log', supports '-' for stderr) and --log-level options, the latter with the help of click-loglevel and Typer support for Click custom type
  • add logging, save logged information to a *.log (or *.err and *.messages): currently uses logging module from standard library.
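
The --log-file and --log-level items above could be wired to the standard library logging module roughly as follows. This is only a sketch: the helper name setup_logging, the logger name "diffannotator", and the message format are assumptions made for illustration, not part of the package.

```python
import logging
import sys


def setup_logging(log_file: str, log_level: str = "INFO") -> logging.Logger:
    """Hypothetical helper: configure logging from two CLI option values."""
    # '-' selects stderr, mirroring the convention proposed for --log-file
    if log_file == "-":
        handler = logging.StreamHandler(sys.stderr)
    else:
        handler = logging.FileHandler(log_file)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )

    logger = logging.getLogger("diffannotator")  # assumed logger name
    # setLevel() accepts level names such as "DEBUG"; unknown names raise ValueError
    logger.setLevel(log_level.upper())
    logger.addHandler(handler)
    return logger
```

A Typer or Click command would call such a helper once, before any processing starts, passing the parsed option values.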

TO DO List for diff-generate script

This script can be used to generate patches (*.patch and *.diff files) from a given repository, in the format suitable for later analysis: annotating with diff-annotate, and computing statistics with diff-gather-stats.

However, you can also create annotations directly from the repository with diff-annotate from-repo subcommand.

  • improvements and new features for generate_patches.py
    • configure what and where to output
      • --use-fanout (e.g. save result in 'c0/dcf39b046d1b4ff6de14ac99ad9a1b10487512.diff' instead of in '0001-Create-.gitignore-file.patch');
        NOTE: this required switching from using git format-patch to using git log -p, and currently does not save the commit message.
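
The fan-out layout from the --use-fanout example above splits the commit identifier after its first two characters, much like Git's object store does. A minimal sketch of that mapping, where the function name fanout_path and its parameters are hypothetical:

```python
from pathlib import Path


def fanout_path(commit_id: str, suffix: str = ".diff", use_fanout: bool = True) -> Path:
    """Hypothetical helper: map a commit identifier to an output file path."""
    if use_fanout:
        # first two hex digits become the directory, the rest the file name
        return Path(commit_id[:2]) / f"{commit_id[2:]}{suffix}"
    return Path(f"{commit_id}{suffix}")
```

For example, fanout_path("c0dcf39b046d1b4ff6de14ac99ad9a1b10487512") gives the 'c0/dcf39b046d1b4ff6de14ac99ad9a1b10487512.diff' path used in the example above.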

TO DO List for diff-annotate script

This script can be used to annotate an existing dataset (patch files in subdirectories), or a selected subset of commits (that is, of the changes in those commits) in a given repository.

The result of annotation is saved in JSON files, one per patch / commit.

  • improvements and new features for annotate.py
    • subcommands
      • patch - annotate a given single patch file
      • dataset - annotate all patches in a given dataset (directory with directories with patches)
      • from-repo - annotate changesets of given selected revisions in a given Git repository
    • parse whole pre-image and post-image files (only via Git currently; or via GitHub / GitLab / ...)
    • configurable file type
      • global option --ext-to-language (the API it uses already existed)
      • global option --filename-to-language (using new API)
      • global option --glob-to-language (using new API)
      • global option --pattern-to-purpose (using new API)
      • (optionally?) use wcmatch.pathlib to be able to use ** in patterns (with globmatch and pathlib.GLOBSTAR)
    • option to limit analyzing changes to only "production code" changes, for example with --production-code-only, or --file-purpose-limit, etc.
    • support .gitattributes overrides of GitHub Linguist
    • optionally use Python clone of github/linguist, namely retanoj/linguist, installed from GitHub, with --use-pylinguist (note: install requires libmagic-dev and libicu-dev libraries)
      • make it use newer version of languages.yml by default
      • maybe use Language.detect(name=file_name, data=file_contents), or FileBlob(file_name).language.name (deprecated) to detect language based on file contents if extension is not enough to determine it
    • optionally use Python wrapper around github/linguist, namely scivision/linguist-python, with --use-ghlinguist (e.g. via RbCall, or via rython, or other technique)
    • configurable line annotation based on file type purpose
      • PURPOSE_TO_ANNOTATION global variable
      • global option --purpose-to-annotation in annotate.py script
        • do not modify the global variable PURPOSE_TO_ANNOTATION, reuse the code from diff-gather-stats timeline --purpose-to-annotation
    • configurable line annotation based on tokens
    • separate commit metadata, diff metadata (patch size and spread metrics), and changes/diff (parsed), instead of having them intermixed together (in "v2" format).
    • computing patch/diff size and spread, following "Dissection of a bug dataset: Anatomy of 395 patches from Defects4J" (and extending it) - independent implementation
      • patch size counting added ('+'), removed ('-'), and modified ('!') lines, with simplified changed lines detection:
        "Lines are considered modified when sequences of removed lines are straight followed by added lines (or vice versa). Thus, to count each modified line, a pair of added and removed lines is needed."
      • patch spreading - counting number of chunks / groups:
        "A chunk is a sequence of continuous changes in a file, consisting of the combination of addition, removal, and modification of lines."
      • patch spreading - sum of spreading of chunks:
        "number of lines interleaving chunks in a patch", per file
        (counting inter-hunk distances)
      • patch spreading - number of modified source files
      • patch spreading - number of modified classes (not planned)
      • patch spreading - number of modified methods [and functions] (not planned)
      • check the Python (and JavaScript) code used by the work mentioned above, available at https://github.com/program-repair/defects4j-dissection, and maybe use it (copy, or import from PyPI/GitHub, or include as submodule and import): it calls the defects4j binary from https://github.com/rjust/defects4j (Java code, Ant build system, with Perl wrappers - for Java code only)
      • find out which lines were modified, and not only their count with some kind of fuzzy matching between lines (RapidFuzz, thefuzz, maybe regex and orc, maybe SequenceMatcher, get_close_matches from difflib, or maybe the context diff algorithm)
    • retrieving and adding commit metadata
      • from Git repository - for 'from-repo'
      • from *.message files - for 'dataset' (see BugsInPy, HaPy-Bugs)
      • from git log -p generated *.diff files - for 'dataset'
      • from git format-patch generated *.patch/*.diff files - for 'dataset'
      • from Git (or GitHub) repository provided via CLI option - for 'dataset'
    • configuration file (*.toml, *.yaml, *.json, *.ini, *.cfg, or *.py);
      maybe using Hydra (see Using Typer and Hydra together), maybe using typer-config (e.g. my-typer-app --config config.yml), maybe using Dynaconf, maybe using configparser standard library (see also: files read by rcfile package, or better use platformdirs or appdirs)
    • documentation on how to use API, and change behavior
    • configure output format (and what to output)
      • for from-repo subcommand: --use-fanout (e.g. save in 'c0/dcf39b046d1b4ff6de14ac99ad9a1b10487512.json', instead of in 'c0dcf39b046d1b4ff6de14ac99ad9a1b10487512.json')
      • for dataset subcommand: --uses-fanout to process the result of generating patches with --use-fanout
      • for from-repo and dataset: --output-file=<filename> to save everything into single JSON or JSON Lines file
    • maybe configuration options
    • maybe configuration callbacks (in Python), like in git-filter-repo
      • AnnotatedPatchedFile.line_callback static field
      • global option --line-callback in annotate.py script
    • maybe generate skeleton, like a framework, like in Scrapy
    • maybe provide an API to generate processing pipeline, like in SciKit-Learn
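
The simplified patch size and spreading rules quoted above can be sketched for a single hunk as follows. The function name, the input representation (one marker character per diff line) and the returned keys are assumptions made for illustration. The sketch pairs removed and added lines per chunk, which matches the quoted rule as long as each chunk consists of a run of '-' lines next to a run of '+' lines. It only counts context lines between chunks inside one hunk, so inter-hunk distances are not covered, and defining "size" as the sum of the three counts is likewise an assumption.

```python
from itertools import groupby


def hunk_size_and_spread(markers):
    """Hypothetical helper; markers holds ' ', '+' or '-' for each hunk line."""
    added = removed = modified = chunks = spreading = 0
    seen_chunk = False
    gap = 0  # context lines since the previous chunk
    for is_change, run in groupby(markers, key=lambda m: m in ("+", "-")):
        lines = list(run)
        if not is_change:
            gap = len(lines)
            continue
        n_plus = lines.count("+")
        n_minus = len(lines) - n_plus
        paired = min(n_plus, n_minus)  # each added/removed pair is one modified line
        modified += paired
        added += n_plus - paired
        removed += n_minus - paired
        chunks += 1
        if seen_chunk:
            spreading += gap  # lines interleaving this chunk and the previous one
        seen_chunk = True
        gap = 0
    return {
        "added": added,
        "removed": removed,
        "modified": modified,
        "size": added + removed + modified,  # assumed definition of patch size
        "chunks": chunks,
        "spreading": spreading,
    }
```

Leading and trailing context lines are ignored, because they do not separate two chunks.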

TO DO List for diff-gather-stats script

This script and its subcommands can compute various statistics and metrics from patch annotation data generated by the diff-annotate script.

It saves extracted insights in a single file; currently only JSON is supported. Different subcommands use different schemas and save different data.

  • improvements and new features for gather_data.py
    • docstring for common() function
    • purpose-counter subcommand
      • rename to dataset-summary (and include other metrics)
      • draw Venn diagram of patches that contain added, removed and/or modified lines, like on Fig. 1 of "Dissection of a bug dataset: Anatomy of 395 patches from Defects4J"
      • draw Venn / Euler diagram, or upsetplot, of patches whose unidiff contains added and/or removed lines; see above
      • table or DataFrame with descriptive statistics for patch size and spreading, like on Table 1 in "Dissection..."
        • patch size: # Added lines, # Removed lines, # Modified lines, Patch size
        • patch spreading: # Chunks (Groups), Spreading, # Files, # Classes, # Methods
        • statistics: min, 25%, 50%, 75%, 90%, 95%, max
      • if missing, table or DataFrame with statistics of dataset and patch size: # Patches/Bugs/Commits, # Files, # Lines (however the last one is determined: sum of '+' and '-' lines, max, average,...) like in the first third of the table on Fig. 1(b) "dataset characteristics" in unpublished "HaPy-Bug – Human Annotated Python Bug Resolution Dataset" paper
      • maybe number of patches/bugs/commits for each project, like on Table 1 in "BugsInPy:..."
      • maybe with Timeframe, # Bugs, # Commits, like on Table 3 in Herbold et al.
      • statistics of assigned line labels over all data (automatic, human consensus), like in Table 4 in Herbold et al.:
        • labels in rows (bug fix, test, documentation, refactoring,..., no consensus, total),
        • all changes, production code, other code in columns - number of lines, % of lines (% of lines is also used in second third of table in Fig. 1(b), "line annotations", in "HaPy Bug - ..." unpublished paper)
      • robust statistics of assigned line labels over all data (automatic,...) like in table in Fig. 2(a) in Herbold et al.:
        • labels in rows (bug fix, test, documentation, refactoring,..., no consensus, total),
        • overall (all changes), production code in columns - subdivided into median, MAD (Median Absolute Deviation from median), CI (Confidence Interval), >0 count
      • histogram of bug fixing lines percentage per commit (overall, production code) like in Fig. 2(b,c) in Herbold et al.
      • boxplot, (or boxenplot, violin plot, or scatterplot, or beeswarm plot) of percentages of line labels per commit (overall, production code) like in Fig. 2(b,c) in Herbold et al. and in Fig. 1(d) in "HaPy Bug - ..." - "distribution of number of line types divided by all changes made in the bugfix"
      • maybe hexgrid colormap showing relationship between the number of lines changed in production code files and the percentage of bug fixing lines and lines without consensus, like in Fig. 9 in Herbold et al. The plot has
        • percentage of bugfixing lines (or lines without consensus) on X axis (0.0..1.0),
        • # Lines changed on Y axis using logscale (10^0..10^4),
        • and log10(# Commits) or log10(# Issues) as the hue / color (10^0..10^3, mostly),
        • with the regression line for a linear relationship between the variables overlaid,
          and the r-value i.e. Pearson's correlation coefficient
      • maybe the table of observed label combinations; the Table 8 in the appendix of Herbold et al. is for lines without consensus, but we may put lines in a single commit / patch; instead of the table, UpSet Chart / UpSet: Visualizing Intersecting Sets may be used (using upsetplot library/package, or older pyupset for Python)
      • add --output option - currently supports only the JSON format
        • support for - as file name for printing to stdout
    • purpose-per-file subcommand
      • table, horizontal bar plot, or pie chart - of % of file purposes in the dataset, like bar plot in left part of Fig. 1(c) "percentage of lines by annotated file type" in "HaPy Bug - ..." unpublished paper
      • composition of different line labels for different file types, using horizontal stacked bar plot of percentages, or many pie charts, like the stacked bar plot on the right part of Fig. 1(c) "breakdown of line types by file type" in "HaPy Bug - ...";
        though note that for some file types all lines are considered to be specific type, and that this plot might be more interesting for human-generated line types, rather than for line types generated by diff-annotate tool
    • lines-stats subcommand
      • fix handling of 'commit_metadata' field (skip it)
    • timeline subcommand
      • maybe create pandas.DataFrame and save as Parquet, Feather, HDF5, or pickle
      • maybe resample / groupby (see notebooks/)
      • print information about results of --purpose-to-annotation
      • include information about patch size and spread metrics
    • store only basename of the dataset in *.json output, not the full path
    • global option --output-format (json, maybe jsonlines, csv, parquet,...)
    • global options --bugsinpy-layout, --from-repo-layout, --uses-fanout (mutually exclusive), configuring where the script searches for annotation data; print errors if there is a mismatch of expectations vs reality (if detectable)
    • option or subcommand to output flow diagram (here the flow could be from file purpose to line type, or from directory structure (with different steps) to line type or file purpose)
      using:
      • Mermaid diagramming language (optionally wrapped in Markdown block)
      • Plotly (for Python) plotly.graph_objects.Sankey() / plotly.express.parallel_categories() (or plotly.graph_objects.Parcats()), or
        HoloViews holoviews.Sankey() - with Bokeh and matplotlib backends, or
        pySankey - which uses matplotlib, but is limited to simple two divisions flow diagram
    • option or subcommand to generate ASCII-art chart in terminal;
      perhaps using Rich (used by typer by default) or Textual, or just Colorama - perhaps with tabulate or termtables. Possibilities:
      • pure Python: horizontal bar, created by repeating a character N times, like in How to Create Stunning Graphs in the Terminal with Python
      • terminalplot - only XY plot with '*', minimalistic
      • asciichartpy - only XY plot, somewhat configurable, Python port of the Node.js asciichart
      • uniplot - XY plots using Unicode, fast, uses NumPy
      • termplot - XY plots and histograms, somewhat flexible
      • termplotlib - XY plots (using gnuplot), horizontal and vertical histograms
      • termgraph - candle stick graphs drawn using Unicode box drawing characters, with Colorama used for colors
      • plotille - XY plots, scatter plots, histograms and heatmaps in the terminal using braille dots
      • termcharts - bar, pie, and doughnut charts, with Rich compatibility
      • plotext - scatter, line, bar, histogram and date-time plots (including candlestick), with support for error bars and confusion matrices
      • matplotlib-sixel - a matplotlib backend which outputs sixel graphics onto the terminal (matplotlib.use('module://matplotlib-sixel'))
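
The "pure Python" option above, applied to the percentage of file purposes from the purpose-per-file item, could look roughly like this. The function name render_bars, its parameters, and the sample counts are hypothetical; the real subcommand output may be structured differently.

```python
def render_bars(counts: dict[str, int], width: int = 40, char: str = "#") -> list[str]:
    """Hypothetical helper: one text line per label, bar length proportional to share."""
    if not counts:
        return []
    total = sum(counts.values())
    label_width = max(len(label) for label in counts)
    lines = []
    # largest share first
    for label, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        share = count / total if total else 0.0
        bar = char * round(share * width)  # repeat the character N times
        lines.append(f"{label:<{label_width}} {bar} {share:.1%}")
    return lines


# made-up counts, for illustration only
for line in render_bars({"programming": 6, "documentation": 3, "test": 1}, width=20):
    print(line)
```

Returning a list of strings instead of printing directly keeps the helper easy to test and leaves colouring or table layout to Rich, Colorama, or tabulate.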

Other TODOs