Discussion
This study describes an OSCE rater training curriculum and presents an evaluation of that curriculum, reporting levels of rater agreement for HBB, ECEB and BAB training courses in an LMIC. Quality rater training and subsequent reliability analysis are especially important in the LMIC context because quality assurance systems and resources for monitoring patient safety are limited.28–30 Our results suggest that moderate levels of rater agreement, coupled with notable challenges in discriminating ‘acceptable’ performance, create the potential for either overestimating or underestimating competence. This has consequences for the individual, the training programme and the system. The challenge of discriminating borderline performance is not isolated to the LMIC context but is reported universally.31–33 With overestimation of competence, training programmes may pass clinicians who need more training to provide safe care on the frontline. Problems in accurately discriminating competency also affect resource utilisation: with underestimation of competence, training programmes may direct limited resources to clinicians who do not need extra training. Further, frontline staff frequently work short-staffed when a colleague is away at training, so unnecessary remediation training may exacerbate staff overload.28–30 34
In the majority of HBB, ECEB and BAB training programme reports, validation of improved caregiver competency is determined by comparing pretraining and post-training OSCE scores. Our results suggest that existing reports describing moderate inter-rater reliability (IRR) may be misleading without further validation of the accuracy of rater discernment of acceptable proficiency.10 15 16 Our raters achieved moderate agreement, yet discernment of acceptable proficiency, which is the pass criterion in these training programmes, was approximately 50%. Based on our findings, we would suggest including both measures of validation. In contexts with limited resources, it may be helpful to implement a further strategy such as a global rating scale, which is common practice in HICs,17–19 22–25 to provide another method of validating participant competence.26 27 A global rating scale allows the rater to evaluate how well a learner performs on a scale of 1–5, with 5 reflecting the highest level of competence.27 More than one method of validation creates greater certainty that results accurately reflect participant competence and/or training programme efficacy.26 With continued reports of high maternal and neonatal mortality, it is important to be confident that these training programmes accurately identify and support clinicians who may not be providing safe care on the frontline.
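The distinction between agreement among raters and accuracy against the intended standard can be made concrete with a small worked example. The sketch below is purely illustrative and is not drawn from the study data; it assumes Python with scikit-learn and uses hypothetical pass/fail ratings to show how two raters can reach moderate agreement (Cohen’s kappa) while both misclassify a substantial share of performances relative to a scripted reference standard.

```python
# Illustrative only: hypothetical pass/fail ratings, not data from this study.
from sklearn.metrics import cohen_kappa_score, accuracy_score

# Two raters score the same ten mock OSCE performances as pass/fail.
rater_a = ["pass", "pass", "fail", "fail", "fail",
           "pass", "pass", "pass", "pass", "pass"]
rater_b = ["pass", "pass", "fail", "fail", "fail",
           "pass", "fail", "pass", "pass", "fail"]

# The scripted (intended) proficiency level for each mock performance.
reference = ["pass", "pass", "fail", "fail", "pass",
             "fail", "pass", "fail", "pass", "fail"]

# Inter-rater reliability: how well do the raters agree with each other?
kappa = cohen_kappa_score(rater_a, rater_b)

# Discernment of acceptable proficiency: how often does each rater
# match the scripted standard?
accuracy_a = accuracy_score(reference, rater_a)
accuracy_b = accuracy_score(reference, rater_b)

print(f"Cohen's kappa between raters: {kappa:.2f}")                # ~0.55 (moderate)
print(f"Rater A accuracy vs scripted standard: {accuracy_a:.2f}")  # 0.60
print(f"Rater B accuracy vs scripted standard: {accuracy_b:.2f}")  # 0.60
# Moderate agreement between raters does not guarantee accurate
# discernment of the intended 'acceptable' standard.
```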
The guidelines for OSCE rater training used in this study were based on recommendations from HIC rater training experiences; these are challenging to implement in an LMIC context. Globally, good practice is for OSCE raters to have relevant content expertise, be well orientated to the OSCE checklist and use a validated rating scale.22–25 Although we strove for this, we had a limited pool of potential raters; this may have contributed to the challenges we noted in rater perceptions of the expected practice standard. Raters were recruited by clinician researchers based on recollections of which previous participants from recent HBB, ECEB and BAB trainings had performed well; no objective strategy was employed in their selection. This was the reason in-country faculty inserted a third categorical level of proficiency, ‘excellent’: they wanted an objective strategy to identify content experts as future raters for such training programmes. A quality rater training curriculum includes standardised mock scenarios in which a variety of expected learner proficiency levels are demonstrated and practice scored. In our study, this was one of the greatest challenges. Research clinicians role-playing scenarios on day 1 struggled to demonstrate poor proficiency; in discussion, they shared that they did not want participants to think they were not experts in the field. The inclusion of scripted scenarios and video capture of proficiency levels may lessen this tension and inconsistency in role-play. Despite this, the level of rater agreement improved over the 3 training days for both HBB and ECEB. The fall-off in rater agreement for BAB on day 3 was unexpected but may be partly related to the timing of these scenarios: they were the last role-plays of the day, and rater fatigue may have played a role. Additionally, the greater number of differing perceptions of the practice standard (table 4) may have contributed to this finding.
A solid rater curriculum incorporates a framework such as Zabar’s (figure 2) to guide rater feedback; this is especially important in a setting where the concept of rater training is novel. In our study, Zabar’s framework was simple and easy to use, as evidenced by the decreasing level of external coaching required each day. A study strength was the achievement of a level of rater agreement similar to the few published training course reports for ECEB and HBB. In our participant group, the ‘moderate to good’ kappa for the ECEB OSCE was similar to that reported by Kassick et al in Ghana, the only other reported ECEB study to include in-country evaluators (a regional and a national evaluator).10 For the HBB OSCE, our findings demonstrated a ‘fair to moderate’ kappa value, similar to the ‘fair to good’ kappa reported by Reisman et al in Tanzania,15 whose raters included two external evaluators and one country-based evaluator. Comparable kappa results for raters scoring the BAB OSCE module have not been reported. The achievement of IRR comparable to studies using in-country and external partners provides support for the rater training curriculum, yet the inability to accurately discern acceptable proficiency (the pass criterion) is concerning. To gain further insight into the relationship between faculty role-play and the inability to discern acceptable proficiency, we plan to script the acceptable proficiency level for each OSCE, coach faculty in the role-play, and repeat the curriculum and analysis.
Rater trainees were challenged by OSCE items whose scores required the completion of multiple steps; this was consistent with the experience described by Seto et al, who also identified lower rater agreement for HBB OSCE multistep items.16 For example, in our study, one HBB OSCE ‘item’ requires the learner to ‘prepare the area for delivery’. To achieve a point and ‘pass’ this item, the learner must complete all four of the following: (1) place towels at bedside; (2) place suction at bedside; (3) place a bag and mask at bedside; and (4) place oxytocin at bedside. This ‘item’ created confusion among rater trainees; during mock session review, several participants had ‘passed’ the mock scenario learner on this item after observing at least one, but not all, of the steps. To address this gap, we added subitem tracking boxes when this challenge was identified on day 1; the use of this strategy warrants further study.
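To illustrate the all-or-none nature of such multistep items and the subitem tracking idea, the sketch below models the scoring rule in Python. It is a hypothetical reconstruction, not the study’s actual scoring tool: the class, method names and substep labels are ours, but the rule mirrors the ‘prepare the area for delivery’ item, which scores a point only when every substep is observed.

```python
# Hypothetical sketch of an all-or-none multistep OSCE item with
# subitem tracking boxes; names and structure are illustrative only.
from dataclasses import dataclass, field


@dataclass
class MultistepItem:
    name: str
    substeps: list[str]
    observed: set[str] = field(default_factory=set)

    def mark_observed(self, substep: str) -> None:
        """Tick the tracking box for a substep the rater actually saw."""
        if substep not in self.substeps:
            raise ValueError(f"Unknown substep: {substep}")
        self.observed.add(substep)

    def passed(self) -> bool:
        """The item scores a point only if every substep was observed."""
        return set(self.substeps) <= self.observed


prepare_area = MultistepItem(
    name="Prepare the area for delivery",
    substeps=[
        "towels at bedside",
        "suction at bedside",
        "bag and mask at bedside",
        "oxytocin at bedside",
    ],
)

# The rater observes only two of the four substeps during the scenario.
prepare_area.mark_observed("towels at bedside")
prepare_area.mark_observed("suction at bedside")

print(prepare_area.passed())  # False: seeing some steps is not enough to pass
```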
Our study was limited by the lack of formal training and experience in role-play among our simulated learners. Our ‘actors’ were research clinicians rather than professionally trained simulated participants, and the scenarios and proficiency levels were developed de novo; ideally, with more resources and time, mock scenarios would be formally scripted and/or video-captured to optimise standardisation. Additionally, time constraints necessitated working 3 long days, and rater fatigue was likely. This was especially true for one pregnant rater trainee who participated for the first 2 days and then arrived with her newborn in hand on day 3. Our results may be limited in generalisability but provide context and lessons for others interested in developing a rater training curriculum in a low-resource setting.