This document contains statistical data regarding abstract evaluation for the APSA program. The current dataset is limited to the 418 abstracts submitted for the 2020 meeting. Future iterations could be expanded to include data from previous years.
This dataset is stratified by individual rater (one column per rater), which makes the ‘hawks’ and ‘doves’ analysis possible [5,6]; an initial version of that analysis is included below.
The project was launched in an effort to evaluate, and possibly increase, overall inter-rater reliability for the APSA abstract review process.
| Total Records | Total Scores | No Score Submitted | Std Deviation | Median | Minimum | Maximum | Skewness | Kurtosis | Std Error |
|---|---|---|---|---|---|---|---|---|---|
| 6282 | 5978 | 304 | 1.72 | 6 | 1 | 10 | -0.26 | -0.12 | 0.02 |
| Rater ID | Assigned Abstracts (n) | Completed Abstracts (n) | Lowest Score | Highest Score | Median | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|
| 1 | 7 | 7 | 1 | 7 | 4.0 | 4.57 | 1.99 |
| 2 | 6 | 0 | NA | NA | NA | NA | NA |
| 3 | 12 | 12 | 2 | 8 | 4.5 | 5.08 | 1.93 |
| 4 | 53 | 52 | 1 | 10 | 7.0 | 6.46 | 3.02 |
| 5 | 167 | 166 | 1 | 7 | 6.0 | 5.69 | 1.02 |
| 6 | 59 | 59 | 5 | 10 | 7.0 | 7.46 | 1.51 |
| 7 | 157 | 157 | 1 | 10 | 6.0 | 5.62 | 1.81 |
| 8 | 159 | 158 | 3 | 8 | 6.0 | 5.64 | 1.14 |
| 9 | 13 | 13 | 4 | 8 | 6.0 | 6.00 | 1.35 |
| 10 | 154 | 154 | 4 | 8 | 7.0 | 6.44 | 0.89 |
| 11 | 157 | 154 | 1 | 10 | 4.0 | 4.52 | 2.09 |
| 12 | 418 | 417 | 1 | 9 | 6.0 | 5.52 | 1.50 |
| 13 | 53 | 53 | 4 | 7 | 5.0 | 5.47 | 0.91 |
| 14 | 114 | 0 | NA | NA | NA | NA | NA |
| 15 | 32 | 32 | 2 | 9 | 5.0 | 5.06 | 1.68 |
| 16 | 161 | 160 | 1 | 10 | 5.0 | 4.69 | 2.06 |
| 17 | 53 | 52 | 2 | 9 | 7.0 | 6.37 | 1.76 |
| 18 | 159 | 155 | 1 | 9 | 5.0 | 4.89 | 1.53 |
| 19 | 151 | 150 | 1 | 10 | 5.0 | 5.36 | 2.07 |
| 20 | 152 | 148 | 1 | 9 | 4.0 | 4.57 | 2.30 |
| 21 | 148 | 142 | 1 | 8 | 5.0 | 4.86 | 1.57 |
| 22 | 1 | 1 | 2 | 2 | 2.0 | 2.00 | NA |
| 23 | 6 | 6 | 4 | 9 | 7.5 | 7.00 | 1.79 |
| 24 | 159 | 159 | 3 | 9 | 5.0 | 5.55 | 1.33 |
| 25 | 159 | 158 | 1 | 10 | 6.0 | 5.96 | 1.99 |
| 26 | 7 | 7 | 4 | 8 | 7.0 | 6.43 | 1.27 |
| 27 | 23 | 23 | 4 | 10 | 8.0 | 7.78 | 1.78 |
| 28 | 148 | 147 | 1 | 9 | 6.0 | 6.14 | 1.62 |
| 29 | 169 | 169 | 1 | 7 | 4.0 | 4.11 | 1.01 |
| 30 | 23 | 23 | 3 | 8 | 6.0 | 5.70 | 1.66 |
| 31 | 158 | 155 | 3 | 8 | 6.0 | 5.94 | 1.15 |
| 32 | 142 | 138 | 1 | 10 | 5.0 | 4.91 | 1.90 |
| 33 | 32 | 0 | NA | NA | NA | NA | NA |
| 34 | 154 | 153 | 3 | 9 | 6.0 | 5.69 | 1.48 |
| 35 | 13 | 13 | 1 | 8 | 5.0 | 4.77 | 2.20 |
| 36 | 3 | 3 | 4 | 9 | 9.0 | 7.33 | 2.89 |
| 37 | 148 | 147 | 1 | 8 | 4.0 | 4.46 | 1.43 |
| 38 | 418 | 418 | 2 | 9 | 6.0 | 6.22 | 1.15 |
| 39 | 7 | 7 | 4 | 8 | 5.0 | 5.71 | 1.38 |
| 40 | 157 | 157 | 4 | 10 | 7.0 | 7.30 | 1.11 |
| 41 | 12 | 12 | 3 | 8 | 5.5 | 5.75 | 1.54 |
| 42 | 3 | 3 | 4 | 7 | 6.0 | 5.67 | 1.53 |
| 43 | 6 | 6 | 2 | 7 | 3.5 | 3.83 | 1.94 |
| 44 | 23 | 23 | 3 | 10 | 7.0 | 6.87 | 2.01 |
| 45 | 160 | 158 | 1 | 10 | 5.0 | 4.85 | 1.89 |
| 46 | 143 | 140 | 1 | 9 | 7.0 | 6.24 | 1.41 |
| 47 | 152 | 152 | 2 | 10 | 6.0 | 6.02 | 1.72 |
| 48 | 156 | 156 | 1 | 8 | 5.0 | 4.97 | 1.25 |
| 49 | 53 | 53 | 4 | 9 | 6.0 | 5.92 | 1.17 |
| 50 | 1 | 1 | 3 | 3 | 3.0 | 3.00 | NA |
| 51 | 53 | 53 | 6 | 8 | 7.0 | 6.70 | 0.50 |
| 52 | 53 | 52 | 3 | 10 | 7.5 | 7.48 | 1.63 |
| 53 | 164 | 163 | 4 | 9 | 7.0 | 6.47 | 0.98 |
| 54 | 59 | 53 | 4 | 9 | 7.0 | 6.64 | 0.96 |
| 55 | 7 | 7 | 3 | 8 | 6.0 | 5.57 | 1.99 |
| 56 | 59 | 57 | 3 | 9 | 7.0 | 6.77 | 1.34 |
| 57 | 156 | 153 | 2 | 9 | 6.0 | 5.76 | 1.31 |
| 58 | 17 | 15 | 4 | 7 | 5.0 | 5.40 | 0.99 |
| 59 | 53 | 47 | 2 | 7 | 4.0 | 4.11 | 1.01 |
| 60 | 53 | 53 | 3 | 10 | 7.0 | 7.36 | 1.47 |
| 61 | 59 | 59 | 4 | 8 | 6.0 | 5.80 | 0.92 |
| 62 | 6 | 6 | 5 | 8 | 6.0 | 6.33 | 1.03 |
| 63 | 154 | 152 | 2 | 9 | 6.0 | 5.88 | 1.50 |
| 64 | 17 | 0 | NA | NA | NA | NA | NA |
| 65 | 174 | 103 | 2 | 8 | 4.0 | 4.47 | 1.10 |
| 66 | 165 | 164 | 2 | 9 | 6.0 | 5.71 | 1.76 |
| 67 | 6 | 6 | 4 | 7 | 6.5 | 6.00 | 1.26 |
| 68 | 6 | 6 | 2 | 7 | 4.5 | 4.67 | 2.07 |
| Total Records | Total Scores | No Score Submitted | Std Deviation | Median | Minimum | Maximum | Skewness | Kurtosis | Std Error |
|---|---|---|---|---|---|---|---|---|---|
| 5219 | 5103 | 116 | 1.68 | 6 | 1 | 10 | -0.31 | -0.2 | 0.02 |
Our dataset contains multiple abstracts and multiple raters. It is unbalanced and not fully crossed (the number of ratings per submission varies), and it contains missing data (not every rater scored every abstract).
Therefore, the intraclass correlation coefficient (ICC) provides the best way to assess inter-rater reliability for this dataset. Because a different set of raters is “randomly” selected from a larger population of raters for each submission, we need to use a one-way model to calculate the ICC [1,2].
I ran the analysis with the iccNA function from the irrNA R package [3], which calculates ICCs across all abstracts and all raters without removing missing data.
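For reference, here is a minimal sketch of that call, assuming the scores sit in an abstracts-by-raters matrix; the file name and layout are assumptions, not the actual export.

```r
# Minimal sketch, not the exact analysis script. Assumes a CSV (hypothetical name)
# with one row per abstract, one column per rater, and empty cells read in as NA
# wherever no score was submitted.
library(irrNA)

scores <- read.csv("apsa_scores_2020.csv", row.names = 1)

# iccNA() tolerates missing ratings, so no abstracts or raters have to be dropped.
res <- iccNA(scores)
res   # printed output includes the ICC estimates, p-values and confidence limits
```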
Here are the results:

| | ICC | p-value | lower CI limit | upper CI limit |
|---|---|---|---|---|
| ICC(1) | 0.1749588 | 0 | 0.1497055 | 0.2039065 |
| ICC(k) | 0.7520025 | 0 | 0.7157122 | 0.7855233 |
| ICC(A,1) | 0.2106269 | 0 | 0.1831294 | 0.2418616 |
| ICC(A,k) | 0.7923343 | 0 | 0.7620897 | 0.8202889 |
| ICC(C,1) | 0.2222910 | 0 | 0.1938148 | 0.2545296 |
| ICC(C,k) | 0.8034248 | 0 | 0.7746565 | 0.8299969 |
ICC(1) represents the single-rater reliability and ICC(k) the average-rater reliability. As you can see, the single-rater reliability is 17.5%, which is not good; high-reliability exams (e.g., the certifying exam of the American Board of Surgery) require inter-rater reliability above 80%. However, the average-rater reliability is much better at 75.2%, and this is probably the more relevant number in our case (see below).
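The gap between the two numbers is what the Spearman-Brown formula predicts: averaging k ratings raises reliability from ICC(1) to k·ICC(1) / (1 + (k-1)·ICC(1)). A quick check, assuming the effective k is simply the mean number of submitted scores per abstract:

```r
# Quick arithmetic check (not part of the formal analysis): with ~14.3 scores per
# abstract, the Spearman-Brown formula takes the single-rater ICC(1) up to roughly
# the ICC(k) reported above.
k    <- 5978 / 418                # mean number of submitted scores per abstract
icc1 <- 0.1749588                 # ICC(1) from the table above
k * icc1 / (1 + (k - 1) * icc1)   # ~0.752, in line with ICC(k)
```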
I then used the ICC function from the psych package [4], which can also perform this analysis with missing data, using a slightly different algorithm.
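A minimal sketch of that call, assuming the same abstracts-by-raters matrix as above; how the missing cells were actually handled (here via the lmer option) is an assumption.

```r
# Minimal sketch, assuming `scores` from the irrNA sketch above.
library(psych)

# lmer = TRUE fits a mixed-effects model, which lets ICC() deal with missing cells;
# missing = FALSE keeps the incomplete rows in the data.
icc_out <- ICC(scores, missing = FALSE, lmer = TRUE)
icc_out$results   # the six Shrout-Fleiss ICCs tabulated below
```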
Here are the results…
| | type | ICC | F | df1 | df2 | p | lower bound | upper bound |
|---|---|---|---|---|---|---|---|---|
| Single_raters_absolute | ICC1 | 0.1664209 | 14.57594 | 417 | 28006 | 0 | 0.1504504 | 0.1847605 |
| Single_random_raters | ICC2 | 0.1691943 | 20.03060 | 417 | 27939 | 0 | 0.1501878 | 0.1906060 |
| Single_fixed_raters | ICC3 | 0.2186656 | 20.03060 | 417 | 27939 | 0 | 0.1992761 | 0.2406691 |
| Average_raters_absolute | ICC1k | 0.9313938 | 14.57594 | 417 | 28006 | 0 | 0.9233271 | 0.9390655 |
| Average_random_raters | ICC2k | 0.9326520 | 20.03060 | 417 | 27939 | 0 | 0.9231814 | 0.9412230 |
| Average_fixed_raters | ICC3k | 0.9500764 | 20.03060 | 417 | 27939 | 0 | 0.9442062 | 0.9556591 |
Based on our design conditions, ICC2 and ICC2k apply. The single-rater reliability is close to the result from the previous method, at around 16.9%, while the average-rating reliability looks quite a bit better at 93.3%.
Our decisions on whether to accept or reject a submission are mostly based on the score average. Therefore, in our case the average-measures ICC most likely applies. Here is a direct quote in support of this:
> In studies where all subjects are coded by multiple raters and the average of their ratings is used for hypothesis testing, average-measures ICCs are appropriate. However, in studies where a subset of subjects is coded by multiple raters and the reliability of their ratings is meant to generalize to the subjects rated by one coder, a single-measures ICC must be used. Just as the average of multiple measurements tends to be more reliable than a single measurement, average-measures ICCs tend to be higher than single-measures ICCs. In cases where single-measures ICCs are low but average-measures ICCs are high, the researcher may report both ICCs to demonstrate this discrepancy. [1]
We can therefore argue that, despite the low single-rater reliability, the use of a large number of randomly assigned raters makes the process fairly reliable. It should still be possible to increase the single-rater reliability if we can develop a standardized framework that helps reviewers apply scores more consistently across all evaluations.
To analyse the raters and look for possible hawks and doves, I followed the protocol described by Bartman et al. [6]. There are three steps outlined:
In step one, you look for ratings that are significantly lower or higher than the average. The paper uses a threshold of 3 standard deviations above or below the mean; I used 2 standard deviations, which the reference still considers reasonable and which gave me more data to work with in our dataset. Here are the initial results for hawks and doves (a minimal sketch of this flagging step appears after the two tables):
Hawks (ratings flagged as significantly lower than average):

| Rater.ID | count | id |
|---|---|---|
| Rater_4 | 7 | 4 |
| Rater_7 | 2 | 7 |
| Rater_8 | 1 | 8 |
| Rater_11 | 17 | 11 |
| Rater_12 | 2 | 12 |
| Rater_14 | 1 | 14 |
| Rater_15 | 16 | 15 |
| Rater_16 | 5 | 16 |
| Rater_17 | 7 | 17 |
| Rater_19 | 17 | 19 |
| Rater_20 | 7 | 20 |
| Rater_21 | 2 | 21 |
| Rater_24 | 3 | 24 |
| Rater_27 | 1 | 27 |
| Rater_28 | 3 | 28 |
| Rater_32 | 6 | 32 |
| Rater_34 | 3 | 34 |
| Rater_37 | 8 | 37 |
| Rater_38 | 1 | 38 |
| Rater_43 | 1 | 43 |
| Rater_46 | 10 | 46 |
| Rater_47 | 1 | 47 |
| Rater_48 | 2 | 48 |
| Rater_57 | 1 | 57 |
| Rater_59 | 2 | 59 |
| Rater_60 | 4 | 60 |
| Rater_68 | 1 | 68 |
Doves (ratings flagged as significantly higher than average):

| Rater.ID | count | id |
|---|---|---|
| Rater_4 | 7 | 4 |
| Rater_6 | 5 | 6 |
| Rater_7 | 1 | 7 |
| Rater_11 | 3 | 11 |
| Rater_12 | 2 | 12 |
| Rater_15 | 1 | 15 |
| Rater_17 | 2 | 17 |
| Rater_19 | 1 | 19 |
| Rater_21 | 1 | 21 |
| Rater_23 | 2 | 23 |
| Rater_26 | 4 | 26 |
| Rater_27 | 4 | 27 |
| Rater_34 | 1 | 34 |
| Rater_36 | 1 | 36 |
| Rater_38 | 3 | 38 |
| Rater_39 | 12 | 39 |
| Rater_44 | 1 | 44 |
| Rater_45 | 1 | 45 |
| Rater_47 | 3 | 47 |
| Rater_52 | 5 | 52 |
| Rater_59 | 2 | 59 |
| Rater_61 | 2 | 61 |
| Rater_66 | 4 | 66 |
You can see we had more hawk ratings than dove ratings. Surgeons are tough.
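For transparency, here is a minimal sketch of how such a flagging step can be coded, assuming the `scores` matrix from the ICC sketches above and my reading that a rating is flagged when it lies more than 2 standard deviations from that abstract's mean score; the exact reference point used in the actual analysis may differ.

```r
# Minimal sketch of the 2-SD flagging step; the per-abstract reference point and the
# assignment of hawks/doves by the sign of the deviation are assumptions based on
# the description above.
library(dplyr)
library(tidyr)
library(tibble)

long <- scores %>%
  rownames_to_column("Abstract.ID") %>%
  pivot_longer(-Abstract.ID, names_to = "Rater.ID", values_to = "score",
               values_drop_na = TRUE)

flagged <- long %>%
  group_by(Abstract.ID) %>%
  mutate(z = (score - mean(score)) / sd(score)) %>%
  ungroup() %>%
  filter(abs(z) > 2)

# Hawks give unusually low scores, doves unusually high ones.
hawk_counts <- flagged %>% filter(z < 0) %>% count(Rater.ID, name = "count")
dove_counts <- flagged %>% filter(z > 0) %>% count(Rater.ID, name = "count")
```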
The second step compares the overall score distribution of the identified hawks and doves with the distribution of all raters (page 15 of the reference). Here are the descriptive stats for the three groups (a sketch of this comparison follows the tables):
All raters:

| n | Mean | SD | Median | Min | Max | Skewness | Kurtosis | SE |
|---|---|---|---|---|---|---|---|---|
| 5103 | 5.53 | 1.68 | 6 | 1 | 10 | -0.31 | -0.2 | 0.02 |
Identified hawks:

| n | Mean | SD | Median | Min | Max | Skewness | Kurtosis | SE |
|---|---|---|---|---|---|---|---|---|
| 3249 | 5.46 | 1.7 | 6 | 1 | 10 | -0.35 | -0.19 | 0.03 |
Identified doves:

| n | Mean | SD | Median | Min | Max | Skewness | Kurtosis | SE |
|---|---|---|---|---|---|---|---|---|
| 2065 | 5.55 | 1.71 | 6 | 1 | 10 | -0.33 | -0.3 | 0.04 |
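A sketch of this comparison, assuming `long`, `hawk_counts` and `dove_counts` from the flagging sketch above; `describe()` from the psych package produces summaries of the kind shown in the tables, and the density plot is one way of drawing the comparison.

```r
# Minimal sketch of step two: compare the score distributions of the flagged rater
# groups with the distribution over all raters. Group membership here simply means
# "appears in the flag counts above", which is an assumption about how the groups
# were actually defined.
library(psych)
library(ggplot2)

hawk_scores <- long %>% filter(Rater.ID %in% hawk_counts$Rater.ID)
dove_scores <- long %>% filter(Rater.ID %in% dove_counts$Rater.ID)

describe(long$score)         # all raters
describe(hawk_scores$score)  # identified hawks
describe(dove_scores$score)  # identified doves

ggplot() +
  geom_density(aes(x = score, colour = "All raters"), data = long) +
  geom_density(aes(x = score, colour = "Hawks"), data = hawk_scores) +
  geom_density(aes(x = score, colour = "Doves"), data = dove_scores) +
  labs(x = "Score", colour = "Group")
```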
Here are the plots:
There are plenty of other analyses that could be done with the collected data, and I am more than happy to run more queries if you have any other ideas of what to look at.
A few ideas:
- ICC per committee membership
- ICC per topic
- ICC for abstracts accepted for podium or poster presentation versus those rejected
- Dove and hawk calculations between raters (in progress)
- Score changes based on the timestamp of the rating (would need the timestamp of review submission)
- Cross-reference between membership interests of abstracts and abstract ratings (using clicks of the virtual meeting as a proxy; this will have to wait until the virtual APSA meeting is completed)
Pretty interesting stuff!
Andreas
Hallgren KA. Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial. Tutor Quant Methods Psychol. 2012;8(1):23-34. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032, accessed 5/11/2020.
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979 Mar;86(2):420-8. http://rokwa.x-y.net/Shrout-Fleiss-ICC.pdf, accessed 5/12/2020.
Brueckl M, Heuer F. Package ‘irrNA’. https://cran.r-project.org/web/packages/irrNA/irrNA.pdf, accessed 5/11/2020.
Revelle W. psych: Procedures for Psychological, Psychometric, and Personality Research. https://cran.r-project.org/web/packages/psych, accessed 5/12/2020.
Bartman I, Smee S, Roy M. A method for identifying extreme OSCE examiners. Clin Teach. 2013 Feb;10(1):27-31.