
Replace κ with *k* to avoid pandoc font errors in build process #264

Merged 2 commits on Jul 5, 2023
14 changes: 7 additions & 7 deletions doc/md_src_files/10_evaluation_of_draft_trees.md
@@ -97,24 +97,24 @@ The vulnerabilities used as case studies are as follows. All quotes are from the

## Pilot Results

- For each of the nine CVEs, six analysts rated the priority of the vulnerability as both a supplier and deployer. Table 13 summarizes the results by reporting the inter-rater agreement for each decision point. For all measures, agreement (κ) is above zero, which is generally interpreted as some agreement among analysts. Below zero is interpreted as noise or discord. Closer to 1 indicates more or stronger agreement.
+ For each of the nine CVEs, six analysts rated the priority of the vulnerability as both a supplier and deployer. Table 13 summarizes the results by reporting the inter-rater agreement for each decision point. For all measures, agreement (*k*) is above zero, which is generally interpreted as some agreement among analysts. Below zero is interpreted as noise or discord. Closer to 1 indicates more or stronger agreement.

- How close κ should be to 1 before agreement can be considered strong enough or reliable enough is a matter of some debate. The value certainly depends on the number of options among which analysts select. For those decision points with five options (mission and safety impact), agreement is lowest. Although portfolio value has a higher κ than mission or safety impact, it may not actually have higher agreement because portfolio value only has two options. The results for portfolio value are nearly indistinguishable as far as level of statistical agreement from mission impact and safety impact. The statistical community does not have hard and fast rules for cut lines on adequate agreement. We treat κ as a descriptive statistic rather than a test statistic.
+ How close *k* should be to 1 before agreement can be considered strong enough or reliable enough is a matter of some debate. The value certainly depends on the number of options among which analysts select. For those decision points with five options (mission and safety impact), agreement is lowest. Although portfolio value has a higher *k* than mission or safety impact, it may not actually have higher agreement because portfolio value only has two options. The results for portfolio value are nearly indistinguishable as far as level of statistical agreement from mission impact and safety impact. The statistical community does not have hard and fast rules for cut lines on adequate agreement. We treat *k* as a descriptive statistic rather than a test statistic.
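
The document does not show how Fleiss' kappa is computed. As a minimal sketch, the per-decision-point values in Table 13 could be reproduced with statsmodels; the rating matrix below is hypothetical (it is not the pilot data), and the 0/1/2 coding of the options is assumed purely for illustration.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows are the nine CVEs (subjects), columns are the
# six analysts (raters). Entries are the option each analyst chose for one
# decision point, e.g. exploitation coded as 0=none, 1=PoC, 2=active.
ratings = np.array([
    [2, 2, 2, 2, 2, 2],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 2, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [2, 2, 1, 2, 2, 2],
    [1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [2, 2, 2, 1, 2, 2],
    [1, 1, 1, 1, 1, 1],
])

# aggregate_raters converts the subjects-x-raters matrix into the
# subjects-x-categories count table that fleiss_kappa expects.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(table, method='fleiss'):.3f}")
```
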

- Table 13 is encouraging, though not conclusive. κ\<0 is a strong sign of discordance. Although it is unclear how close to 1 constitutes success, κ\<0 would be a clear sign of failure. In some ways, these results may be undercounting the agreement for SSVC as presented. These results are for SSVC prior to the improvements documented in [Improvements Instigated by the Pilot](#improvements-instigated-by-the-pilot), which are implemented in SSVC version 1. On the other hand, the participant demographics may inflate the inter-rater agreement based on shared tacit understanding through the process of authorship. The one participant who was not an author surfaced two places where this was the case, but we expect the organizational homogeneity of the participants has inflated the agreement somewhat. The anecdotal feedback from vulnerability managers at several organizations (including VMware [@akbar2020ssvc] and McAfee) is about refinement and tweaks, not gross disagreement. Therefore, while further refinement is necessary, this evidence suggests the results have some transferability to other organizations and are not a total artifact of the participant organization demographics.
+ Table 13 is encouraging, though not conclusive. *k*\<0 is a strong sign of discordance. Although it is unclear how close to 1 constitutes success, *k*\<0 would be a clear sign of failure. In some ways, these results may be undercounting the agreement for SSVC as presented. These results are for SSVC prior to the improvements documented in [Improvements Instigated by the Pilot](#improvements-instigated-by-the-pilot), which are implemented in SSVC version 1. On the other hand, the participant demographics may inflate the inter-rater agreement based on shared tacit understanding through the process of authorship. The one participant who was not an author surfaced two places where this was the case, but we expect the organizational homogeneity of the participants has inflated the agreement somewhat. The anecdotal feedback from vulnerability managers at several organizations (including VMware [@akbar2020ssvc] and McAfee) is about refinement and tweaks, not gross disagreement. Therefore, while further refinement is necessary, this evidence suggests the results have some transferability to other organizations and are not a total artifact of the participant organization demographics.

Table: Inter-Rater Agreement for Decision Points

| | Safety Impact | Exploitation | Technical Impact | Portfolio Value | Mission Impact | Exposure | Dev Result | Deployer Result |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
- | Fleiss’ κ | 0.122 | 0.807 | 0.679 | 0.257 | 0.146 | 0.480 | 0.226 | 0.295 |
+ | Fleiss’ *k* | 0.122 | 0.807 | 0.679 | 0.257 | 0.146 | 0.480 | 0.226 | 0.295 |
| Disagreement range | 2,4 | 1,2 | 1,1 | 1,1 | 2,4 | 1,2 | 1,3 | 2,3 |

- For all decision points, the presumed goal is for κ to be close or equal to 1. The statistics literature has identified some limited cases in which Fleiss’ k behaves strangely—for example, it is lower than expected when raters are split between 2 of q ratings when q\>2 [@falotico2015fleiss]. This paradox may apply to the safety and mission impact values, in particular. The paradox would bite hardest if the rating for each vulnerability was clustered on the same two values, for example, minor and major. Falotico and Quatto’s proposed solution is to permute the columns, which is safe with unordered categorical data. Since the nine vulnerabilities do not have the same answers as each other (that is, the answers are not clustered on the same two values), we happen to avoid the worst of this paradox, but the results for safety impact and mission impact should be interpreted with some care.
+ For all decision points, the presumed goal is for *k* to be close or equal to 1. The statistics literature has identified some limited cases in which Fleiss’ *k* behaves strangely—for example, it is lower than expected when raters are split between 2 of q ratings when q\>2 [@falotico2015fleiss]. This paradox may apply to the safety and mission impact values, in particular. The paradox would bite hardest if the rating for each vulnerability was clustered on the same two values, for example, minor and major. Falotico and Quatto’s proposed solution is to permute the columns, which is safe with unordered categorical data. Since the nine vulnerabilities do not have the same answers as each other (that is, the answers are not clustered on the same two values), we happen to avoid the worst of this paradox, but the results for safety impact and mission impact should be interpreted with some care.
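
A small sketch of the paradox just described, using the same hypothetical statsmodels setup as above: when every rating on a five-option scale lands on just two adjacent options, Fleiss' kappa can come out negative even though no two raters ever differ by more than one step.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Five-option scale (0=none .. 4=catastrophic), but every rating falls on
# option 2 or 3 -- analogous to the clustered safety/mission impact case.
clustered = np.array([
    [2, 3, 2, 3, 2, 3],
    [3, 2, 3, 2, 3, 2],
    [2, 2, 3, 3, 2, 3],
    [3, 3, 2, 2, 3, 2],
])

table, _ = aggregate_raters(clustered, n_cat=5)
# Prints -0.2: "discord" on the usual reading, despite near-agreement.
print(fleiss_kappa(table, method='fleiss'))
```
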

- This solution identifies another difficulty of Fleiss’ kappa, namely that it does not preserve any order; none and catastrophic are considered the same level of disagreement as none and minor. Table 13 displays a sense of the range of disagreement to complement this weakness. This value is the largest distance between rater selections on a single vulnerability out of the maximum possible distance. So, for safety impact, the most two raters disagreed was by two steps (none to major, minor to hazardous, or major to catastrophic) out of the four possible steps (none to catastrophic). The only values of κ that are reliably comparable are those with the same number of options (that is, the same maximum distance). In other cases, closer to 1 is better, but how close is close enough to be considered “good” changes. In all but one case, if raters differed by two steps then there were raters who selected the central option between them. The exception was mission impact for CVE-2018-14781; it is unclear whether this discrepancy should be localized to a poor test scenario description, or to SSVC’s mission impact definition. Given it is an isolated occurrence, we expect the scenario description is at least partly responsible.
+ This solution identifies another difficulty of Fleiss’ kappa, namely that it does not preserve any order; none and catastrophic are considered the same level of disagreement as none and minor. Table 13 displays a sense of the range of disagreement to complement this weakness. This value is the largest distance between rater selections on a single vulnerability out of the maximum possible distance. So, for safety impact, the most two raters disagreed was by two steps (none to major, minor to hazardous, or major to catastrophic) out of the four possible steps (none to catastrophic). The only values of *k* that are reliably comparable are those with the same number of options (that is, the same maximum distance). In other cases, closer to 1 is better, but how close is close enough to be considered “good” changes. In all but one case, if raters differed by two steps then there were raters who selected the central option between them. The exception was mission impact for CVE-2018-14781; it is unclear whether this discrepancy should be localized to a poor test scenario description, or to SSVC’s mission impact definition. Given it is an isolated occurrence, we expect the scenario description is at least partly responsible.
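
The "disagreement range" row can be computed directly. A minimal sketch, assuming ordinal 0-based codes for each option; the helper name and the example values are illustrative, not the pilot data.

```python
import numpy as np

def disagreement_range(ratings: np.ndarray, n_options: int) -> tuple[int, int]:
    """Widest gap between any two raters on a single vulnerability, and the
    maximum possible gap for a decision point with n_options options."""
    widest = int((ratings.max(axis=1) - ratings.min(axis=1)).max())
    return widest, n_options - 1

# Safety impact on a five-option scale (0=none .. 4=catastrophic).
safety = np.array([
    [0, 1, 2, 0, 1, 0],   # widest gap on this vulnerability: 2 steps
    [3, 3, 4, 3, 2, 3],
    [1, 1, 1, 2, 1, 1],
])
print(disagreement_range(safety, n_options=5))  # -> (2, 4), reported as "2,4"
```
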

- Nonetheless, κ provides some way to measure improvement on this conceptual engineering task. The pilot evaluation can be repeated, with more diverse groups of stakeholders after the descriptions have been refined by stakeholder input, to measure fit to this goal. For a standard to be reliably applied across different analyst backgrounds, skill sets, and cultures, a set of decision point descriptions should ideally achieve κ of 1 for each item in multiple studies with diverse participants. Such a high level of agreement would be difficult to achieve, but it would ensure that when two analysts assign a priority with the system, they get the same answer. Such agreement is not the norm with CVSS currently [@allodi2018effect].
+ Nonetheless, *k* provides some way to measure improvement on this conceptual engineering task. The pilot evaluation can be repeated, with more diverse groups of stakeholders after the descriptions have been refined by stakeholder input, to measure fit to this goal. For a standard to be reliably applied across different analyst backgrounds, skill sets, and cultures, a set of decision point descriptions should ideally achieve *k* of 1 for each item in multiple studies with diverse participants. Such a high level of agreement would be difficult to achieve, but it would ensure that when two analysts assign a priority with the system, they get the same answer. Such agreement is not the norm with CVSS currently [@allodi2018effect].


Table: SSVC pilot scores compared with the CVSS base scores for the vulnerabilities provided by NVD.
2 changes: 1 addition & 1 deletion doc/md_src_files/13_future_work.md
@@ -45,7 +45,7 @@ Future work should provide some examples of how this holistic analysis of multip

More testing with diverse analysts is necessary before the decision trees are reliable. In this context, **reliable** means that two analysts, given the same vulnerability description and decision process description, will reach the same decision. Such reliability is important if scores and priorities are going to be useful. If they are not reliable, they will vary widely over time and among analysts. Such variability makes it impossible to tell whether a difference in scores is really due to one vulnerability being higher priority than another.

- The SSVC version 1 pilot study provides a methodology for measuring and evaluating reliability of the decision process description based on the agreement measure κ.
+ The SSVC version 1 pilot study provides a methodology for measuring and evaluating reliability of the decision process description based on the agreement measure *k*.
This study methodology should be repeated with different analyst groups, from different sectors and with different experience, using the results to make changes in the decision process description until the agreement measure is adequately close to 1.

Internationalization and localization of SSVC will also need to be considered and tested in future work.