
This document contains a statistical analysis of the abstract evaluations for the APSA program. The current dataset is limited to the 418 abstracts submitted for the 2020 meeting; future iterations could be expanded to include data from previous years.

The dataset is stratified by individual rater (one rater per column), so the ‘hawks’ and ‘doves’ analysis5,6 is now possible. A first pass at this analysis is included below; further between-rater calculations are still in progress.

The project was launched in an effort to evaluate, and possibly increase, the overall inter-rater reliability of the APSA abstract review process.


Data Analysis

Descriptive Statistics of the Entire Dataset

Total Records Total Scores No Score Submitted Std Deviation Median Minimum Maximum Skewness Kurtosis Std Error
6282 5978 304 1.72 6 1 10 -0.26 -0.12 0.02
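For reference, these summary statistics can be reproduced with the describe() function from the psych package4. Below is a minimal sketch, assuming the individual scores are available as a numeric vector; the object name all_scores is hypothetical, with NA marking records where no score was submitted.

  library(psych)

  # all_scores: hypothetical numeric vector of all submitted ratings, NA = no score submitted
  describe(all_scores)      # reports n, mean, sd, median, min, max, skew, kurtosis, se (among others)
  sum(is.na(all_scores))    # number of records with no score submitted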

Plot of the Dataset (Histogram with Normal Distribution Overlay)
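A minimal sketch of one way to draw such a histogram in base R, using the same hypothetical all_scores vector as above:

  scores <- all_scores[!is.na(all_scores)]
  hist(scores, breaks = seq(0.5, 10.5, by = 1), freq = FALSE,
       main = "Distribution of Abstract Scores", xlab = "Score")
  # overlay a normal curve using the sample mean and standard deviation
  curve(dnorm(x, mean = mean(scores), sd = sd(scores)), add = TRUE, lwd = 2)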

Descriptive Statistics by Rater

Below is a table with descriptive statistics for each rater; a sketch of how these statistics can be computed follows the table. As the table shows, some raters did not complete all of their assigned evaluations, and a few completed none (their rows show NA throughout).
Rater ID Assigned Abstracts (n) Completed Abstracts (n) Lowest Score Highest Score Median Mean Standard Deviation
1 7 7 1 7 4.0 4.57 1.99
2 6 0 NA NA NA NA NA
3 12 12 2 8 4.5 5.08 1.93
4 53 52 1 10 7.0 6.46 3.02
5 167 166 1 7 6.0 5.69 1.02
6 59 59 5 10 7.0 7.46 1.51
7 157 157 1 10 6.0 5.62 1.81
8 159 158 3 8 6.0 5.64 1.14
9 13 13 4 8 6.0 6.00 1.35
10 154 154 4 8 7.0 6.44 0.89
11 157 154 1 10 4.0 4.52 2.09
12 418 417 1 9 6.0 5.52 1.50
13 53 53 4 7 5.0 5.47 0.91
14 114 0 NA NA NA NA NA
15 32 32 2 9 5.0 5.06 1.68
16 161 160 1 10 5.0 4.69 2.06
17 53 52 2 9 7.0 6.37 1.76
18 159 155 1 9 5.0 4.89 1.53
19 151 150 1 10 5.0 5.36 2.07
20 152 148 1 9 4.0 4.57 2.30
21 148 142 1 8 5.0 4.86 1.57
22 1 1 2 2 2.0 2.00 NA
23 6 6 4 9 7.5 7.00 1.79
24 159 159 3 9 5.0 5.55 1.33
25 159 158 1 10 6.0 5.96 1.99
26 7 7 4 8 7.0 6.43 1.27
27 23 23 4 10 8.0 7.78 1.78
28 148 147 1 9 6.0 6.14 1.62
29 169 169 1 7 4.0 4.11 1.01
30 23 23 3 8 6.0 5.70 1.66
31 158 155 3 8 6.0 5.94 1.15
32 142 138 1 10 5.0 4.91 1.90
33 32 0 NA NA NA NA NA
34 154 153 3 9 6.0 5.69 1.48
35 13 13 1 8 5.0 4.77 2.20
36 3 3 4 9 9.0 7.33 2.89
37 148 147 1 8 4.0 4.46 1.43
38 418 418 2 9 6.0 6.22 1.15
39 7 7 4 8 5.0 5.71 1.38
40 157 157 4 10 7.0 7.30 1.11
41 12 12 3 8 5.5 5.75 1.54
42 3 3 4 7 6.0 5.67 1.53
43 6 6 2 7 3.5 3.83 1.94
44 23 23 3 10 7.0 6.87 2.01
45 160 158 1 10 5.0 4.85 1.89
46 143 140 1 9 7.0 6.24 1.41
47 152 152 2 10 6.0 6.02 1.72
48 156 156 1 8 5.0 4.97 1.25
49 53 53 4 9 6.0 5.92 1.17
50 1 1 3 3 3.0 3.00 NA
51 53 53 6 8 7.0 6.70 0.50
52 53 52 3 10 7.5 7.48 1.63
53 164 163 4 9 7.0 6.47 0.98
54 59 53 4 9 7.0 6.64 0.96
55 7 7 3 8 6.0 5.57 1.99
56 59 57 3 9 7.0 6.77 1.34
57 156 153 2 9 6.0 5.76 1.31
58 17 15 4 7 5.0 5.40 0.99
59 53 47 2 7 4.0 4.11 1.01
60 53 53 3 10 7.0 7.36 1.47
61 59 59 4 8 6.0 5.80 0.92
62 6 6 5 8 6.0 6.33 1.03
63 154 152 2 9 6.0 5.88 1.50
64 17 0 NA NA NA NA NA
65 174 103 2 8 4.0 4.47 1.10
66 165 164 2 9 6.0 5.71 1.76
67 6 6 4 7 6.5 6.00 1.26
68 6 6 2 7 4.5 4.67 2.07
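As mentioned above, here is a minimal sketch of how this per-rater table can be computed with dplyr. It assumes a long-format data frame scores_long with hypothetical columns rater_id and score (one row per assigned abstract, NA where no score was submitted).

  library(dplyr)

  rater_stats <- scores_long %>%
    group_by(rater_id) %>%
    summarise(assigned  = n(),
              completed = sum(!is.na(score)),
              lowest    = min(score, na.rm = TRUE),     # raters with no completed scores
              highest   = max(score, na.rm = TRUE),     # produce Inf/-Inf/NaN here;
              median    = median(score, na.rm = TRUE),  # they are shown as NA in the
              mean      = mean(score, na.rm = TRUE),    # table above
              sd        = sd(score, na.rm = TRUE))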

Descriptive Statistics of the Program Committee Dataset

Total Records Total Scores No Score Submitted Std Deviation Median Minimum Maximum Skewness Kurtosis Std Error
5219 5103 116 1.68 6 1 10 -0.31 -0.2 0.02

Plot of the Dataset (Histogram with Normal Distribution Overlay)

Inter-Rater Reliability (Intraclass Correlation, ICC)

Our dataset contains multiple abstracts and multiple raters. It is unbalanced and not fully crossed (the number of ratings per submission varies), and it contains missing data (not every rater scored every abstract).

The ICC therefore provides the best way to assess inter-rater reliability for this dataset. Because a different set of raters is “randomly” selected from a larger population of raters for each submission, a one-way model is needed to calculate the ICC1,2.

I ran an analysis using the iccNA function from the irrNA R package3, which calculates the ICCs across all abstracts and all raters without removing missing data.
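A minimal sketch of that call, assuming the scores are arranged in a wide matrix with one row per abstract and one column per rater (the object name scores_wide is hypothetical; NA marks cells where a rater did not score an abstract):

  library(irrNA)

  # one-way and two-way ICCs on the full, unbalanced matrix; missing cells are tolerated
  icc_result <- iccNA(scores_wide)
  icc_result    # printing the result shows ICC(1), ICC(k), ICC(A,1), ICC(A,k), ICC(C,1), ICC(C,k)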

Here are the results:
Coefficient ICC p-value lower CI limit upper CI limit
ICC(1) 0.1749588 0 0.1497055 0.2039065
ICC(k) 0.7520025 0 0.7157122 0.7855233
ICC(A,1) 0.2106269 0 0.1831294 0.2418616
ICC(A,k) 0.7923343 0 0.7620897 0.8202889
ICC(C,1) 0.2222910 0 0.1938148 0.2545296
ICC(C,k) 0.8034248 0 0.7746565 0.8299969

ICC(1) represents the single-rater reliability, ICC(k) the average-rater reliability. As you can see, the single-rater reliability is 17.5%, which is poor; high-reliability exams (e.g. the certifying exam of the American Board of Surgery) require inter-rater reliability above 80%. However, the average-rater reliability is much better at 75.2%. This might be the more relevant number in our case (see below).

I then used the ICC function from the psych package4, which can also handle missing data, using a slightly different algorithm.
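A sketch of that call on the same hypothetical wide matrix; the argument choice is an assumption (lmer = TRUE has psych fit a linear mixed model, which tolerates the missing cells of an unbalanced design):

  library(psych)

  psych_icc <- ICC(scores_wide, lmer = TRUE)
  psych_icc$results   # ICC1/ICC2/ICC3 plus the average-rater versions ICC1k/ICC2k/ICC3k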

Here are the results…

type ICC F df1 df2 p lower bound upper bound
Single_raters_absolute ICC1 0.1664209 14.57594 417 28006 0 0.1504504 0.1847605
Single_random_raters ICC2 0.1691943 20.03060 417 27939 0 0.1501878 0.1906060
Single_fixed_raters ICC3 0.2186656 20.03060 417 27939 0 0.1992761 0.2406691
Average_raters_absolute ICC1k 0.9313938 14.57594 417 28006 0 0.9233271 0.9390655
Average_random_raters ICC2k 0.9326520 20.03060 417 27939 0 0.9231814 0.9412230
Average_fixed_raters ICC3k 0.9500764 20.03060 417 27939 0 0.9442062 0.9556591

Based on our design conditions, ICC2 and ICC2k apply. The single-rater reliability is close to the previous estimate, at about 16.9%, while the average-rater reliability looks quite a bit better at 93.3%.

Our decisions whether to accept or reject a submission are based mostly on the average score. Therefore, in our case the average-measures ICC most likely applies. Here is a direct quote in support of this:

In studies where all subjects are coded by multiple raters and the average of their ratings is used for hypothesis testing, average-measures ICCs are appropriate. However, in studies where a subset of subjects is coded by multiple raters and the reliability of their ratings is meant to generalize to the subjects rated by one coder, a single-measures ICC must be used. Just as the average of multiple measurements tends to be more reliable than a single measurement, average-measures ICCs tend to be higher than single-measures ICCs. In cases where single-measures ICCs are low but average-measures ICCs are high, the researcher may report both ICCs to demonstrate this discrepancy.1

We can therefore argue that, despite the low single-rater reliability, the use of a large number of randomly assigned raters makes the process fairly reliable. It should still be possible to increase the single-rater reliability by developing a standardized scoring framework that helps reviewers apply scores more consistently across all evaluations.

Hawks and Doves

To analyze the raters and look for possible hawks and doves, I followed the protocol described by Bartman6 in 2011, which outlines three steps:

Step 1: Find Rating Outliers

In step one, ratings that are significantly lower or higher than the average are flagged. The paper uses a cutoff of 3 standard deviations above or below the mean; I used 2 standard deviations, which the reference considers reasonable and which gave me more data to work with in our dataset. A code sketch of this flagging step follows the tables. Here are the initial results for hawks and doves:

Hawk Raters and Number of Outlier Ratings
Rater.ID count id
Rater_4 7 4
Rater_7 2 7
Rater_8 1 8
Rater_11 17 11
Rater_12 2 12
Rater_14 1 14
Rater_15 16 15
Rater_16 5 16
Rater_17 7 17
Rater_19 17 19
Rater_20 7 20
Rater_21 2 21
Rater_24 3 24
Rater_27 1 27
Rater_28 3 28
Rater_32 6 32
Rater_34 3 34
Rater_37 8 37
Rater_38 1 38
Rater_43 1 43
Rater_46 10 46
Rater_47 1 47
Rater_48 2 48
Rater_57 1 57
Rater_59 2 59
Rater_60 4 60
Rater_68 1 68
Dove Raters and Number of Outlier Ratings
Rater.ID count id
Rater_4 7 4
Rater_6 5 6
Rater_7 1 7
Rater_11 3 11
Rater_12 2 12
Rater_15 1 15
Rater_17 2 17
Rater_19 1 19
Rater_21 1 21
Rater_23 2 23
Rater_26 4 26
Rater_27 4 27
Rater_34 1 34
Rater_36 1 36
Rater_38 3 38
Rater_39 12 39
Rater_44 1 44
Rater_45 1 45
Rater_47 3 47
Rater_52 5 52
Rater_59 2 59
Rater_61 2 61
Rater_66 4 66

You can see we had more hawk ratings than dove ratings. Surgeons are tough.
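As noted above, here is a minimal sketch of the step-1 flagging. It assumes each rating is compared with the mean and standard deviation of all ratings the same abstract received, and that the long-format data frame scores_long has the hypothetical columns abstract_id, rater_id, and score.

  library(dplyr)

  flagged <- scores_long %>%
    group_by(abstract_id) %>%
    mutate(abs_mean = mean(score, na.rm = TRUE),
           abs_sd   = sd(score, na.rm = TRUE)) %>%
    ungroup() %>%
    mutate(hawk_rating = score < abs_mean - 2 * abs_sd,   # unusually harsh (low) score
           dove_rating = score > abs_mean + 2 * abs_sd)   # unusually lenient (high) score

  # number of outlier ratings per rater
  hawk_counts <- flagged %>% filter(hawk_rating) %>% count(rater_id, name = "count")
  dove_counts <- flagged %>% filter(dove_rating) %>% count(rater_id, name = "count")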

Step 2: Compare Distribution of Hawks and Doves with Other Raters

The second step compares the overall score distribution of the identified hawks and doves with the distribution of all raters (page 15 of the reference6). Here are the descriptive statistics for the three groups:

All Raters:

Total Scores Mean Std Deviation Median Minimum Maximum Skewness Kurtosis Std Error
5103 5.53 1.68 6 1 10 -0.31 -0.2 0.02

Hawk Raters:

Total Scores Mean Std Deviation Median Minimum Maximum Skewness Kurtosis Std Error
3249 5.46 1.7 6 1 10 -0.35 -0.19 0.03

Dove Raters:

Total Scores Mean Std Deviation Median Minimum Maximum Skewness Kurtosis Std Error
2065 5.55 1.71 6 1 10 -0.33 -0.3 0.04

Here are the plots (score distributions for all raters, hawk raters, and dove raters):
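A sketch of one way to draw these comparison plots in base R, reusing the hypothetical flagged data frame from step 1; treating every rater with at least one flagged rating as part of the hawk or dove group is an assumption:

  hawk_ids <- unique(flagged$rater_id[flagged$hawk_rating %in% TRUE])
  dove_ids <- unique(flagged$rater_id[flagged$dove_rating %in% TRUE])

  # overlay the score distributions of the three groups
  plot(density(flagged$score, na.rm = TRUE), lwd = 2,
       main = "Score Distributions", xlab = "Score")
  lines(density(flagged$score[flagged$rater_id %in% hawk_ids], na.rm = TRUE), lty = 2)
  lines(density(flagged$score[flagged$rater_id %in% dove_ids], na.rm = TRUE), lty = 3)
  legend("topright", legend = c("All raters", "Hawk raters", "Dove raters"),
         lty = c(1, 2, 3), lwd = c(2, 1, 1))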

Conclusion and Final Thoughts

There are plenty of other analyses that could be done with the collected data; I am more than happy to run more queries if you have other ideas about what to look at.

A few ideas:

  • ICC per committee membership

  • ICC per topic

  • ICC for abstracts accepted for podium or poster presentations or rejected

  • Dove and hawk calculations between raters (in progress)

  • Score changes based on timestamp of rating (would need the timestamp of review submission)

  • Cross-reference between membership interest in abstracts and abstract ratings (using clicks from the virtual meeting as a proxy; will have to wait until the virtual APSA meeting is completed)

Pretty interesting stuff!

Andreas

References

  1. Hallgren KA, Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial. Tutor Quant Methods Psychol. 2012; 8(1): 23–34., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032, accessed 5/11/2020

  2. Shrout PE, Fleiss JL, Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979 Mar;86(2):420-8, http://rokwa.x-y.net/Shrout-Fleiss-ICC.pdf, accessed 5/12/2020

  3. Brueckl M, Heuer F, Package ‘irrNA’, https://cran.r-project.org/web/packages/irrNA/irrNA.pdf, accessed 5/11/2020

  4. Revelle W, psych: Procedures for Psychological, Psychometric, and Personality Research, https://cran.r-project.org/web/packages/psych, accessed 5/12/2020

  5. Bartman I, Smee S, Roy M, A method for identifying extreme OSCE examiners. Clin Teach. 2013 Feb;10(1):27-31

  6. Bartman I, Catching the Hawks and Doves: A Method for Identifying Extreme Examiners on Objective Structured Clinical Examinations, https://mcc.ca/media/Technical-Reports-Bartman-2011.pdf, accessed 5/15/2020