This repository hosts supplementary code and data for the Master's Thesis titled "Catching Cost Issues in Infrastructure as Code Artifacts using Linters".
The results of the thematic analysis phase have been further refined and extended in a separate repository, search-rug/iac-cost-patterns. The extensions to Checkov and TFLint are also available in separate repositories.
This repository is organized as follows:
- Updating the original dataset with commits from 2022-2024
  - `1 - <name>.json`, `2 - <name>.json`: initial 2 rounds of individual labeling
  - `3 - merged.json`: datasets merged after resolution by third rater
  - `4 - updated.json`: dataset updated with additional labels
  - `5 - filtered.json`: dataset without cost-unrelated commits
  - `agreement.py`: compute agreement (Krippendorff's alpha) between two or more raters
  - `conflicts.py`: highlight cases where raters disagree
- Fetching diffs and coding them
  - `collect_data.ipynb`: script to retrieve commit diffs and set up the dataset for coding. Produces:
    - `diffs.json`: main dataset of coded diffs
    - `errors.json`: errors while fetching diffs
    - `codes.json`: list of codes and descriptions
  - `processing.py`: helper script to process commits to-be-coded
- Combining codes into themes and patterns
  - `analyse_data.ipynb`: script to compute statistics and other useful information on the diffs dataset. Produces:
    - `codes.csv`: codes, descriptions and the number of occurrences
    - `codes_grouped.csv`: manually grouped codes to find patterns
    - (`clouds.csv`: distribution of cloud-specific codes)
  - `themes_and_patterns.ipynb`: combining codes into patterns and collecting their occurrences. Produces:
    - `pattern_occurrences.csv`: set of `<pattern>, <occurrence url>` pairs
    - `themes.json`: list of themes/patterns, occurrences per technology and relevant codes
    - `theme_occurrences.json`: list of occurrences with associated theme/pattern and codes
  - `cooccurrences.py`: helper script to compute co-occurrences of patterns within commits and repositories. Produces:
    - `cooccurrences_with_commits.csv`: set of `<pattern>, <repository>, <commit hash>` triples for further analysis
- Implementation helpers
  - `old_gen.py`: helper script to process commits that contain the Old generation antipattern
  - `old-generation-analysis.csv`: analysis of the Old generation commits to find the most common fixes
- Evaluation of the implemented linter rules
  - `before_after.ipynb`: grab repository states before and after commits. Produces:
    - `snapshots/<owner>-<repo>-<original commit hash>/`
      - `before-<commit hash of parent>/` (one per parent commit): repository state before the commit
      - `after/`: repository state after the commit
      - `latest/`: repository state after the latest commit
    - `errors.json`: errors while retrieving repository states
  - `benchmark.ipynb`: benchmarks Checkov and TFLint. Produces:
    - `results/checkov_<timestamp>.json`
    - `results/tflint_<timestamp>.json`
  - `evaluation.ipynb`: computes statistics (precision/recall, performance) based on benchmark results.
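
For readers unfamiliar with the agreement measure mentioned above, the following is a minimal sketch of Krippendorff's alpha for nominal labels. It is not the code in `agreement.py`. The function name `krippendorff_alpha_nominal` and the input layout (one list of labels per commit, with `None` for a missing rating) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is an iterable of lists; each list holds the labels that the
    raters assigned to one item (None marks a missing rating).
    This input layout is a hypothetical one chosen for the example.
    """
    # Coincidence matrix: every ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of ratings in the unit.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # units with fewer than two ratings are not pairable
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label and the overall number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item rated by two or more raters")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # alpha is undefined when only one label occurs
    return 1 - observed / expected


if __name__ == "__main__":
    ratings = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "a"]]
    print(krippendorff_alpha_nominal(ratings))
```

A value of 1 indicates perfect agreement, while values near 0 indicate agreement no better than chance.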
The software in this repository is available under the MIT license.
The data are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.