I am going to calculate inter-rater reliability with binary, imbalanced data (approximately 90% negative class, 10% positive class). This means asking a second rater to manually code a number of sentences that have already been coded. I will then measure the agreement rate using:
- Cohen's Kappa
- Krippendorff's alpha
The purpose of this repo is to simulate data to establish the optimal number of sentences to test. This means trying to minimise the number of sentences that need to be manually classified twice, while still achieving an acceptable confidence interval for the estimates.
The parameters that I will test at different levels:
- Number of samples, i.e. number of sentences to be manually classified by both raters.
- Imbalance. Although the raw dataset is imbalanced, it may be desirable to select a more balanced subset of the data, as more samples in the positive class should reduce the uncertainty of the estimate.
- Error rate. It is unknown what proportion of sentences the raters will disagree on. Once a sample size and imbalance are decided upon, we will want to make sure that the uncertainty is not sensitive to the error rate.
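To make these parameters concrete, the sketch below shows one way a single simulation run could be set up in base R. The function and argument names (`simulate_raters`, `prop_positive`, `error_rate`) are illustrative only and are not taken from the scripts in this repo.

```r
# Illustrative sketch, not the exact code used in the scripts:
# simulate two raters given a sample size, class balance and error rate.
simulate_raters <- function(n, prop_positive, error_rate, seed = 1) {
  set.seed(seed)
  # Rater 1's labels are drawn with the chosen class balance
  rater_1 <- rbinom(n, size = 1, prob = prop_positive)
  # Rater 2 disagrees with rater 1 on a fixed proportion of both classes
  flip <- rbinom(n, size = 1, prob = error_rate)
  rater_2 <- ifelse(flip == 1, 1 - rater_1, rater_1)
  data.frame(rater_1 = rater_1, rater_2 = rater_2)
}

ratings <- simulate_raters(n = 300, prop_positive = 0.1, error_rate = 0.2)
table(ratings$rater_1, ratings$rater_2)
```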
Although confidence increases with larger samples, Krippendorff's alpha has wider confidence intervals than Cohen's Kappa with imbalanced data.
This discrepancy disappears with balanced data.
It is also notable that with balanced data, the confidence intervals are smaller than with extremely imbalanced data (like a random subset of our data), particularly for Krippendorff's alpha. Further simulations confirm this.
I originally calculated the confidence intervals for Krippendorff's alpha using `kripp.boot`. However, the size of the confidence intervals did not appear to decline as the sample size increased. I found this surprising, although I am not sure if it is expected behaviour, so I raised this as a GitHub issue.
This script calculates alpha using both the `kripp.boot` and the `krippendorffsalpha` packages. In the second case the confidence interval declines as the sample size increases. I discuss this more here. Apart from when comparing the packages, the estimates in this repo use the `krippendorffsalpha` package rather than `kripp.boot`, as the confidence interval should be sensitive to the sample size.
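For reference, here is a minimal sketch of how a single comparison of the two packages might look. The argument names, data orientation and return values are my recollection of the two interfaces rather than verified signatures, so check each package's documentation before relying on them (`kripp.boot` is installed from GitHub rather than CRAN).

```r
# Minimal sketch: Krippendorff's alpha and a bootstrap CI with both packages.
# Argument names and data orientation are assumptions; see each package's docs.
library(kripp.boot)
library(krippendorffsalpha)

set.seed(1)
n <- 300
rater_1 <- rbinom(n, 1, 0.5)                        # balanced classes
flip    <- rbinom(n, 1, 0.2)                        # 0.2 error rate
rater_2 <- ifelse(flip == 1, 1 - rater_1, rater_1)
ratings <- cbind(rater_1, rater_2)

# kripp.boot expects raters in rows and units in columns
boot_fit <- kripp.boot(t(ratings), iter = 1000, method = "nominal")
c(boot_fit$lower, boot_fit$mean.alpha, boot_fit$upper)

# krippendorffsalpha expects units in rows and coders in columns,
# and the fitted object has a confint() method
alpha_fit <- krippendorffs.alpha(ratings, level = "nominal",
                                 confint = TRUE, control = list(bootit = 1000))
confint(alpha_fit, level = 0.95)
```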
I have used the phrase "error rate" to refer to the proportion of true positives and true negatives that are misclassified. For example, with a balanced dataset of 100 samples and perfect agreement between raters, the confusion matrix would be:
|         |       | Rater 2 |    |
|---------|-------|---------|----|
| Rater 1 | Class | 0       | 1  |
|         | 0     | 50      | 0  |
|         | 1     | 0       | 50 |
If we set the error rate at 0.2, the confusion matrix would be:
|         |       | Rater 2 |    |
|---------|-------|---------|----|
| Rater 1 | Class | 0       | 1  |
|         | 0     | 40      | 10 |
|         | 1     | 10      | 40 |
This would lead to Krippendorff's alpha and Cohen's Kappa values of about 0.6.
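A quick check of that figure for Cohen's Kappa (written for this README, not taken from the repo's scripts): observed agreement is 0.8, chance agreement with 50/50 marginals is 0.5, so kappa = (0.8 - 0.5) / (1 - 0.5) = 0.6.

```r
# Cohen's kappa from a confusion matrix (rows = rater 1, columns = rater 2)
cohens_kappa <- function(conf_mat) {
  n <- sum(conf_mat)
  p_observed <- sum(diag(conf_mat)) / n
  p_expected <- sum(rowSums(conf_mat) * colSums(conf_mat)) / n^2
  (p_observed - p_expected) / (1 - p_expected)
}

balanced_with_errors <- matrix(c(40, 10, 10, 40), nrow = 2, byrow = TRUE)
cohens_kappa(balanced_with_errors)  # 0.6
```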
Conversely, in an imbalanced dataset with perfect agreement between raters, we would have this confusion matrix:
|         |       | Rater 2 |    |
|---------|-------|---------|----|
| Rater 1 | Class | 0       | 1  |
|         | 0     | 80      | 0  |
|         | 1     | 0       | 20 |
Applying a 0.2 error rate would lead to:
|         |       | Rater 2 |    |
|---------|-------|---------|----|
| Rater 1 | Class | 0       | 1  |
|         | 0     | 64      | 16 |
|         | 1     | 4       | 16 |
Note that for simplicity I have assumed that the error rate is applied in equal proportions to the negative and positive samples.
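Under that assumption the expected cell counts follow directly: each class keeps a proportion of 1 minus the error rate on the diagonal. The short check below (the function name is illustrative, not from the scripts) reproduces both matrices above.

```r
# Expected confusion matrix when the error rate applies equally to both classes
expected_confusion <- function(n, prop_positive, error_rate) {
  n_neg <- n * (1 - prop_positive)
  n_pos <- n * prop_positive
  matrix(c(n_neg * (1 - error_rate), n_neg * error_rate,
           n_pos * error_rate,       n_pos * (1 - error_rate)),
         nrow = 2, byrow = TRUE,
         dimnames = list(rater_1 = c("0", "1"), rater_2 = c("0", "1")))
}

expected_confusion(100, prop_positive = 0.5, error_rate = 0.2)  # 40, 10 / 10, 40
expected_confusion(100, prop_positive = 0.2, error_rate = 0.2)  # 64, 16 /  4, 16
```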
Clone the project by running in a terminal:

```sh
git clone https://github.com/samrickman/krippendorf-alpha-cohen-kappa-simulation
```
The data is generated in R 4.1, with no additional packages. However, the estimates of alpha and Kappa use packages, as does the reshaping and plotting. The versions are listed in `renv.lock`. The easiest way to ensure you have the same package versions is to use `renv`. The first time you run the project, open an R terminal in this directory and run:

```r
renv::restore()
```
You should then be able to run any of the simulations by running the relevant R scripts. These files are:
- `1__k_alpha_sim_sample_size_balanced.R`: Comparison of Krippendorff's alpha confidence intervals calculated by the `kripp.boot` and `krippendorffsalpha` packages. The simulation holds class balance constant at 0.5 and the error rate constant at 0.2, and changes the number of samples.
- `2_k_alpha_sim_sample_size_imbalanced.R`: Comparison of Krippendorff's alpha confidence intervals calculated by the `kripp.boot` and `krippendorffsalpha` packages. The simulation holds class balance constant at 0.9 and the error rate constant at 0.2, and changes the number of samples.
- `3_k_alpha_sim_error_rate.R`: Comparison of Krippendorff's alpha confidence intervals calculated by the `kripp.boot` and `krippendorffsalpha` packages. The simulation holds class balance constant at 0.5 and the number of samples at 300, and changes the error rate.
- `4_kappa_alpha_balanced_sim.R`: Comparison of Krippendorff's alpha and Cohen's kappa. The simulation holds class balance constant at 0.5 and the number of samples at 300, and changes the error rate.
- `5_kappa_alpha_imbalanced_sim.R`: Comparison of Krippendorff's alpha and Cohen's kappa. The simulation holds class balance constant at 0.9 and the number of samples at 300, and changes the error rate.
- `6_kappa_alpha_balanced_error_rate_sim.R`: Comparison of Krippendorff's alpha and Cohen's kappa. The simulation holds class balance constant at 0.5 and the number of samples at 300, and changes the error rate.
- `7_kappa_alpha_imbalanced_error_rate_sim.R`: Comparison of Krippendorff's alpha and Cohen's kappa. The simulation holds class balance constant at 0.9 and the number of samples at 300, and changes the error rate.
- `8_kappa_alpha_balance_sim.R`: Comparison of Krippendorff's alpha and Cohen's kappa. The simulation holds the error rate constant at 0.2 and the number of samples at 300, and changes the class balance.