
Update 050_evaluationDraftTrees.md
Changed "fixed" to "static".
Changed "fix" to "correct".
Changed "a patch" to "remediation"
laurie-tyz authored Dec 17, 2020
1 parent f3a7771 commit 818c04e
Showing 1 changed file with 5 additions and 5 deletions: doc/version_1/050_evaluationDraftTrees.md
We conducted a pilot test on the adequacy of the hypothesized decision trees. […]

The pilot study tested inter-rater agreement of decisions reached. Each author played the role of an analyst in both stakeholder groups (supplying and deploying) for nine vulnerabilities. We selected these nine vulnerabilities based on expert analysis, with the goal that they cover a useful range of vulnerabilities of interest. Specifically, we selected three vulnerabilities to represent safety-critical cases, three to represent regulated-systems cases, and three to represent general computing cases.

During the pilot, we did not provide guidance on how to assess the values of the decision points. However, we did standardize the set of evidence that was taken to be true for the point in time representing the evaluation. Given this static information sheet, we did not prescribe a shared evaluation process for deciding whether *Exploitation*, for example, should take the value of **none**, **PoC**, or **active**. We agreed on the descriptions of the decision points and the descriptions of their values in (a prior version of) Section 4.4. The goal of the pilot was to test the adequacy of those descriptions by evaluating whether the analysts agreed. We improved the decision point descriptions based on the results of the pilot; our changes are documented in Section 5.3.

In the design of the pilot, we held constant the information available about the vulnerability. This strategy restricted the analyst to decisions based on the framework given that information. That is, it controlled for differences in information search procedure among the analysts. The information search procedure should be controlled because this pilot was about the framework content, not how people answer questions based on that content. After the framework is more stable, a separate study should be devised that shows how analysts should answer the questions in the framework. Our assessment that information search will be an important aspect of using the evaluation framework is based on our experience while developing the framework. During informal testing, disagreements about a result often involved a disagreement about what the scenario actually was; different information search methods and prior experiences led to different understandings of the scenario. This pilot methodology holds the information and scenario constant to test agreement about the descriptions themselves. This strategy makes constructing a prioritization system more tractable by separating problems in how people search for information from problems in how people make decisions. This paper focuses only on the structure of decision making. Advice about how to search for information about a vulnerability is a separate project that will be part of future work. In some domains, namely exploit availability, we have started that work in parallel.


We evaluated inter-rater agreement in two stages. In the first stage, each analyst independently documented their decisions. This stage produced 18 sets of decisions (nine vulnerabilities across each of two stakeholder groups) per analyst. In the second stage, we met to discuss decision points where at least one analyst differed from the others. If any analyst changed their decision, they appended the information and evidence they gained during this meeting to the “supporting evidence” value in their documentation. No changes to decisions were forced, and prior decisions were not erased, just amended. After the second stage, we calculated statistical measures of inter-rater agreement to help guide the analysis of problem areas.

To assess agreement, we calculate Fleiss’ kappa, both for the value in the leaf node reached for each case and for every decision in a case [@fleiss1973equivalence]. Evaluating individual decisions is complicated slightly because the different paths through the tree mean that a different number of analysts may have evaluated certain items, whereas Fleiss’ kappa requires the same number of raters for every rated item. The “leaf node reached” measure identifies the specific path through the tree that the analyst selected and treats that path holistically as a single label. Measuring agreement based on the path has the drawback that ostensibly similar paths, which agree on 3 of 4 decisions for example, are treated as no more similar than paths that agree on 0 of 4 decisions. So the two measures of agreement (per decision and per path) are complementary, and we report both.
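
Since Fleiss’ kappa drives the agreement analysis, a short sketch may make the two measures concrete. The Python below is not from the SSVC project; it is a minimal illustration in which the hypothetical `ratings` matrix has one row per case (vulnerability and stakeholder role), one column per possible “leaf node reached” label, and each cell counts how many analysts chose that label. The per-decision measure works the same way, with the values of a single decision point as the categories.

```python
# Minimal sketch (not from the SSVC repo): Fleiss' kappa for the
# "leaf node reached" measure. counts[i][j] is the number of analysts
# who assigned case i to label j; every case needs the same number of raters.
from typing import Sequence

def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    n_items = len(counts)
    n_raters = sum(counts[0])                  # assumed constant across rows
    n_categories = len(counts[0])

    # Share of all assignments that fell into each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_categories)]

    # Per-item observed agreement: fraction of rater pairs that agree on item i.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    P_bar = sum(P_i) / n_items                 # mean observed agreement
    P_e = sum(p * p for p in p_j)              # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical data: 4 cases, 6 analysts, 4 possible leaf labels.
ratings = [
    [6, 0, 0, 0],   # all six analysts reached the same leaf
    [0, 5, 1, 0],
    [1, 0, 5, 0],
    [0, 1, 0, 5],
]
print(round(fleiss_kappa(ratings), 2))  # roughly 0.67 for this toy data
```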

### Pilot participant details

The following changes are reflected in Section 4, Decision Trees for Vulnerability Management:

- We clarified that “MEF failure” refers to any **one** essential function failing, not failure of all of them. We changed the most severe mission impact value to “mission failure” to better reflect the relationship between MEFs and the organization’s mission.

- We removed the “supplier portfolio value” question since it had poor agreement, and there is no clear way to correct it. We replaced this question with *Utility*, which better captures the relevant kinds of value (namely, to the adversary) of the affected component while remaining amenable to pragmatic analysis.

- We clarified that “proof of concept” (see *Exploitation*) includes cases in which existing tooling counts as a PoC. The examples listed are suggestive, not exhaustive.


In this section, we present ideas we tried but rejected for various reasons. We are not presenting this section as the final word on excluding these ideas, but we hope the reasons for excluding them are instructive, will help prevent others from re-inventing the proverbial wheel, and can guide thinking on future work.

Initially, we brainstormed approximately 12 potential decision points, most of which we removed early in our development process through informal testing. These decision points included adversary’s strategic benefit of exploiting the vulnerability, state of legal or regulatory obligations, cost of developing remediation, patch distribution readiness, financial losses to customers due to potential exploitation, and business losses to the deployer.

Some of these points left marks on other decision points. The “financial losses to customers” decision point led to an amendment of the definition of *Safety* to include “well-being,” such that, for example, bankruptcies of third parties now count as a major safety impact. The “business losses to the deployer” decision point is covered as a mission impact insofar as profit is a mission of publicly traded corporations.

Three of the above decision points left no trace on the current system. “State of legal or regulatory obligations,” “cost of developing remediation,” and “patch distribution readiness” were dropped as either being too vaguely defined, too high level, or otherwise not within the scope of decisions by the defined stakeholders. The remaining decision point, “adversary’s strategic benefit of exploiting the vulnerability,” transmuted to a different sense of system value. In an attempt to be more concrete and not speculate about adversary motives, we considered a different sense of value: supplier portfolio value.

The only decision point that we have removed following the pilot is supplier portfolio value. This notion of value was essentially an over-correction to the flaws identified in the “adversary’s strategic benefit of exploiting the vulnerability” decision point. “Supplier portfolio value” was defined as “the value of the affected component as a part of the developer’s product portfolio. Value is some combination of importance of a given piece of software, number of deployed instances of the software, and how many people rely on each. The developer may also include lifecycle stage (early development, stable release, decommissioning, etc.) as an aspect of value.” It had two possible values: low and high. As Table 13 demonstrates, there was relatively little agreement among the six analysts about how to evaluate this decision point. We replaced this sense of portfolio value with *Utility*, which combines *Value Density* and *Automatability*.
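
As a point of reference for how the replacement combines its inputs, the sketch below treats *Utility* as a simple lookup over *Automatability* and *Value Density*. This is an illustration rather than the project’s code, and the specific value labels (laborious, efficient, super effective) are taken from the published SSVC decision point definitions; check the current version of the document for the authoritative mapping.

```python
# Illustrative only: Utility as a lookup over Automatability and Value Density.
# The output labels follow the published SSVC definitions; verify the mapping
# against the current version of the document before relying on it.
UTILITY = {
    ("no",  "diffuse"):      "laborious",
    ("no",  "concentrated"): "efficient",
    ("yes", "diffuse"):      "efficient",
    ("yes", "concentrated"): "super effective",
}

def utility(automatable: str, value_density: str) -> str:
    """Return the Utility value for an (Automatability, Value Density) pair."""
    return UTILITY[(automatable, value_density)]

print(utility("yes", "concentrated"))  # -> super effective
```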
