This repository has been archived by the owner on Aug 1, 2024. It is now read-only.

InputUsername/iac-cost-linters


Supplementary Code and Data

This repository hosts supplementary code and data for the Master's Thesis titled "Catching Cost Issues in Infrastructure as Code Artifacts using Linters".

The results of the thematic analysis phase have been further refined and extended in a separate repository, search-rug/iac-cost-patterns. In addition, the extensions to Checkov and TFLint are also available in separate repositories.

Structure

This repository is organized as follows:

  1. Updating the original dataset with commits from 2022-2024
    • 1 - <name>.json, 2 - <name>.json: the initial two rounds of individual labeling
    • 3 - merged.json: datasets merged after conflicts were resolved by a third rater
    • 4 - updated.json: dataset updated with additional labels
    • 5 - filtered.json: dataset without cost-unrelated commits
    • agreement.py: compute agreement (Krippendorff's alpha) between two or more raters
    • conflicts.py: highlight cases where raters disagree
  2. Fetching diffs and coding them
    • collect_data.ipynb: notebook that retrieves commit diffs and sets up the dataset for coding. Produces:
      • diffs.json: main dataset of coded diffs
      • errors.json: errors while fetching diffs
      • codes.json: list of codes and descriptions
    • processing.py: helper script to process the commits that are to be coded
  3. Combining codes into themes and patterns
    • analyse_data.ipynb: script to compute statistics and other useful information on the diffs dataset. Produces:
      • codes.csv: codes, descriptions and the number of occurrences
      • codes_grouped.csv: manually grouped codes to find patterns
      • (clouds.csv: distribution of cloud-specific codes)
    • themes_and_patterns.ipynb: combining codes into patterns and collecting their occurrences. Produces:
      • pattern_occurrences.csv: set of <pattern>, <occurrence url> pairs
      • themes.json: list of themes/patterns, occurrences per technology and relevant codes
      • theme_occurrences.json: list of occurrences with associated theme/pattern and codes
    • cooccurrences.py: helper script to compute co-occurrences of patterns within commits and repositories. Produces:
      • cooccurrences_with_commits.csv: set of <pattern>, <repository>, <commit hash> triples for further analysis
  4. Implementation helpers
    • old_gen.py: helper script to process commits that contain the Old generation antipattern
    • old-generation-analysis.csv: analysis of the Old generation commits to find the most common fixes
  5. Evaluation of the implemented linter rules
    • before_after.ipynb: grab repository states before and after commits. Produces:
      • snapshots/<owner>-<repo>-<original commit hash>/
        • before-<commit hash of parent>/ (one per parent commit): repository state before the commit
        • after/: repository state after the commit
        • latest/: repository state after the latest commit
      • errors.json: errors while retrieving repository states
    • benchmark.ipynb: benchmarks Checkov and TFLint. Produces:
      • results/checkov_<timestamp>.json
      • results/tflint_<timestamp>.json
    • evaluation.ipynb: computes statistics (precision/recall, performance) based on benchmark results
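
For readers unfamiliar with the agreement measure mentioned for agreement.py, the following is a minimal sketch of Krippendorff's alpha for nominal labels. It is not the code from agreement.py; the function name and the input layout (one list of labels per commit, with None for a missing rating) are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is an iterable of lists; each list holds the labels that the
    raters assigned to one item (None marks a missing rating).
    This input layout is an assumption, not the repository's format.
    """
    # Coincidence matrix: every ordered pair of labels from different raters
    # within a unit contributes 1 / (m - 1), where m is the number of labels.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit rated by fewer than two raters carries no information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable ratings")

    # Observed and expected disagreement (both scaled by n, which cancels out).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed / expected
```

A value of 1 indicates perfect agreement, and values near 0 indicate agreement no better than chance. Which threshold counts as acceptable is a choice for the study and is not stated here.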

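The precision and recall statistics named for evaluation.ipynb can be illustrated with a short sketch. This is not the notebook's code: representing a finding as a (rule, file) pair and the function name are assumptions, and the README does not state how the thesis defines a true positive.

```python
def precision_recall(reported, expected):
    """Compare linter findings against a labeled ground truth.

    Both arguments are iterables of hashable findings, for example
    (rule, file) pairs. That representation is an assumption.
    """
    reported, expected = set(reported), set(expected)
    true_positives = len(reported & expected)
    # Precision: share of reported findings that are correct.
    precision = true_positives / len(reported) if reported else 0.0
    # Recall: share of expected findings that the linter reported.
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical findings, for illustration only.
reported = [("old-generation", "a.tf"), ("old-generation", "b.tf"), ("rule-x", "c.tf")]
expected = [("old-generation", "a.tf"), ("rule-x", "c.tf"), ("rule-x", "d.tf"), ("rule-y", "e.tf")]
precision, recall = precision_recall(reported, expected)
```

In this made-up example, two of the three reported findings are expected and two of the four expected findings are reported.
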
Licenses

The software in this repository is available under the MIT license.

The data are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
