Participants
Twenty-five students (mean age [SD]: 21.6 [3.2]; 12 [48%] women), currently enrolled in four medical (MD) programs across Canada, participated in this study (Table 1). The exclusion criteria included participation in previous trials involving the NeuroVR (CAE Healthcare) neurosurgery simulator.
Performance assessment using the ICEMS score (Fig. 1a)
Performance was assessed using the ICEMS’s composite score, which was also evaluated for alignment with the OSATS rating (Objective Structured Assessment of Technical Skills). An average score was calculated for each task, and the improvement throughout the practice tasks and the transfer of skills to the test task was assessed.
There was a statistically significant interaction between the groups and trials on the ICEMS composite score for the second training session, F(2.72, 62.47) = 15.34, p < 0.001, partial η² = 0.40. However, there was no significant interaction for the first training session, F(4, 92) = 1.81, p = 0.134, partial η² = 0.07. AI2EXPERT improved significantly on the ICEMS score at the fifth repetition of the task compared to the baseline, 0.39, 95%CI [0.09–0.69], p = 0.008 (reported as mean difference, 95%CI [lower bound–upper bound], p-value). AI-assisted learning demonstrated a significantly greater improvement compared to human-mediated learning in the first training session. Specifically, AI2EXPERT achieved a significantly higher score at the end of this session (0.23, 95%CI [0.002–0.45], p = 0.049) when compared to EXPERT2AI. Unexpectedly, students' performance declined in the second session when they received human expert-mediated training after the AI session. In particular, the ICEMS score for AI2EXPERT significantly declined at the fifth repetition of the task compared to the baseline (−0.42, 95%CI [−0.67 to −0.18], p < 0.001). Real-time AI-mediated training significantly improved students' performance in the second session, after their exposure to human-mediated training in the first session, which yielded no significant improvement. EXPERT2AI showed a significant increase in their score at the end of the second session (0.51, 95%CI [0.15–0.86], p = 0.003). During the second training session, EXPERT2AI had significantly higher scores compared to AI2EXPERT in both the fourth and fifth repetitions of the task, with mean differences of 0.44 (95%CI [0.11–0.76], p = 0.011) and 0.67 (95%CI [0.43–0.91], p < 0.001), respectively. These results demonstrate superior learning outcomes with AI-mediated training compared to human-mediated training.
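As a quick arithmetic check on the effect sizes above, partial η² can be recovered from an F statistic and its degrees of freedom using the standard identity η²p = (F × df_effect) / (F × df_effect + df_error). The sketch below applies that identity to the two reported interaction tests and reproduces the reported values of 0.40 and 0.07. The function name is hypothetical, and this is not the analysis code used in the study. The fractional degrees of freedom in the second-session test are consistent with a sphericity correction, although the text does not name one; scaling both degrees of freedom by the same factor leaves the result of this identity unchanged.

```python
def partial_eta_squared(f_value: float, df_effect: float, df_error: float) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom.

    Uses the identity eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Group-by-trial interaction values as reported in the text.
second_session = partial_eta_squared(15.34, 2.72, 62.47)
first_session = partial_eta_squared(1.81, 4, 92)

print(round(second_session, 2), round(first_session, 2))
```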
Learning transfer to the test task (Fig. 1b)
EXPERT2AI achieved a higher ICEMS score in the test subpial resection task compared to AI2EXPERT at the end of the second training session, 0.06, 95%CI [0.01–0.11], p = 0.013. Both groups showed significant improvement in their scores in the second training session compared to the first, regardless of the intervention order. AI2EXPERT and EXPERT2AI had significantly higher scores in the test task at the end of the second session than at the end of the first session (0.31, 95%CI [0.14–0.48], p = 0.002, and 0.41, 95%CI [0.21–0.61], p < 0.001, respectively).
Skill retention
No statistically significant changes were observed in the ICEMS score from the end of the first session to the beginning of the second session for AI2EXPERT (0.01, 95%CI [−0.26 to 0.28], p = 0.96) or EXPERT2AI (0.04, 95%CI [−0.11 to 0.18], p = 0.60). These results may indicate that students were able to retain the skills they had acquired when moving on to the second session, regardless of the instruction they received.
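The link between a confidence interval that spans zero and a non-significant p-value can be illustrated with a rough back-calculation. The sketch below assumes a symmetric interval and a normal approximation, which is an assumption made here for illustration: the study's own intervals may be based on a t distribution or adjusted for multiple comparisons, so the approximate p-values will not match the reported ones exactly. The function name is hypothetical.

```python
import math


def approx_p_from_ci(mean_diff: float, lower: float, upper: float,
                     z_crit: float = 1.96) -> float:
    """Approximate a two-sided p-value from a mean difference and its 95% CI.

    Assumes a symmetric interval built as mean_diff +/- z_crit * SE under a
    normal approximation (an illustrative assumption, not the study's method).
    """
    standard_error = (upper - lower) / (2 * z_crit)
    z = mean_diff / standard_error
    # Two-sided tail probability of the standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


# Retention contrasts as reported in the text.
p_ai2expert = approx_p_from_ci(0.01, -0.26, 0.28)
p_expert2ai = approx_p_from_ci(0.04, -0.11, 0.18)

# Both intervals include zero, so both approximate p-values exceed 0.05.
print(p_ai2expert > 0.05, p_expert2ai > 0.05)
```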
OSATS outcomes (Fig. 2a)
The OSATS ratings on the test task at the end of each training session were analyzed and compared between groups and within groups between sessions. An average score of the two raters was calculated for each task. Superior learning outcomes were observed with the AI feedback in terms of Respect for Tissue, Hemostasis, Economy of Movement, Flow, and the Overall Score in the first session compared to the human-mediated feedback. These significant differences disappeared in the second session, when students who initially received human-mediated feedback also began receiving AI feedback. In the Overall Score, students who received real-time AI feedback in the first session (median, 5.25) achieved significantly higher OSATS ratings compared to those who received in-person expert instruction (4.5), with a median difference of 0.75, z = −2.56, p = 0.01. This significant difference disappeared at the end of the second training session when the groups switched to the next feedback intervention (AI2EXPERT: 4.5 vs 5), median difference 0.5, z = 1.71, p = 0.09. AI2EXPERT achieved significantly higher scores compared to EXPERT2AI at the end of the first training session in Respect for Tissue (5.5 vs 4.5), median difference 1, z = −2.02, p = 0.04; Hemostasis (6 vs 4), 2, z = −2.45, p = 0.01; Economy of Movement (4.5 vs 3.63), 0.87, z = −3.05, p = 0.002; and Flow (5.25 vs 4.5), 0.75, z = −2.69, p = 0.006. Students in EXPERT2AI achieved significant improvement in Instrument Handling (first session: 4.5 vs 5), z = 2.35, p = 0.019, and Economy of Movement (3.63 vs 4.25), z = 2.27, p = 0.023, when they switched to real-time AI instruction. Students in AI2EXPERT experienced a significant drop in Flow when they switched to in-person human instruction (5.5 vs 5), z = −2.2, p = 0.028. There was poor agreement between the two raters, with an intraclass correlation coefficient (ICC) value of 0.07.
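To make the inter-rater agreement figure more concrete, the following is a minimal sketch of one common ICC form: the two-way random-effects, absolute-agreement, single-rater coefficient, often written ICC(2,1). The text does not state which ICC form was used, so this choice is an assumption, and the function name and the example ratings are hypothetical rather than study data.

```python
def icc_2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` has one row per rated performance and one column per rater.
    """
    n = len(ratings)        # number of rated performances
    k = len(ratings[0])     # number of raters
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical ratings from two raters (illustrative only, not study data).
example = [[5.0, 4.5], [4.0, 5.5], [6.0, 4.0], [3.5, 5.0], [5.5, 5.5]]
print(round(icc_2_1(example), 2))
```

Values near 1 indicate that raters assign nearly identical scores to the same performance, while values near 0, such as the 0.07 reported above, indicate that little of the score variance is attributable to differences between performances.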
Groups are color-coded. Horizontal lines represent statistically significant differences (p < 0.05). Black lines represent significant differences between groups. Vertical bars represent standard error. The green area represents the ideal. For risk assessment, this means lower risk. For instrument utilization, this means no (zero) difference from the expert level. Y-axis represents the standard deviation from the mean.
Specific learning outcomes (Fig. 3)
Specific learning outcomes were assessed across five performance metrics: tissue injury risk, bleeding risk, aspirator force applied, bipolar force applied, and instrument tip separation distance, to outline potential reasons behind the differences between groups in the overall score and OSATS ratings. These performance metrics were assessed by the ICEMS continuously and an average metric score was calculated for each metric for statistical comparison.
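The averaging step described above can be sketched as follows. This is an illustrative sketch only: the metric names, the sample values, and the function name are hypothetical, and the ICEMS's actual sampling rate and output format are not described in this section.

```python
from statistics import fmean


def average_metric_scores(samples: dict[str, list[float]]) -> dict[str, float]:
    """Collapse continuously sampled metric values into one mean score per metric."""
    return {metric: fmean(values) for metric, values in samples.items()}


# Hypothetical continuous output for one repetition of the task.
one_repetition = {
    "tissue_injury_risk": [0.10, 0.20, 0.30],
    "bleeding_risk": [0.40, 0.40, 0.10],
    "aspirator_force": [-0.5, 0.0, 0.5],
    "bipolar_force": [0.2, 0.4, 0.6],
    "instrument_tip_separation": [1.0, 0.5, 0.0],
}

for metric, score in average_metric_scores(one_repetition).items():
    print(f"{metric}: {score:.2f}")
```

One mean per metric and repetition gives a single value per participant for each cell of the group-by-repetition comparison.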
Injury risk
In all tumor resection surgeries, avoidance of injury to adjacent structures is of paramount importance. In brain tumor resection, reducing injury risk is a critical determinant of patient neurologic outcome. Overall, AI feedback resulted in a decreased injury risk score, whereas human-mediated feedback led to an increase in this score. In the first training session, AI2EXPERT with real-time AI intervention achieved significantly lower injury risk scores in the fourth and fifth repetitions of the task compared to EXPERT2AI, −0.04, 95%CI [−0.07 to −0.02], p = 0.003 and −0.05, 95%CI [−0.07 to −0.03], p < 0.001, respectively. Without real-time AI intervention, there was a significant increase in the injury risk score for EXPERT2AI by the third repetition of the task, which reached a mean difference of 0.07, 95%CI [0.03–0.11], p = 0.001, in the fifth repetition of the task when compared to the baseline performance. In the second training session, there was a significant decline in the injury risk score for EXPERT2AI, which reached a mean difference of −0.042, 95%CI [−0.09 to 0.00], p = 0.043, at the fifth repetition of the task. Without AI assistance, AI2EXPERT experienced a significant increase in injury risk score from the second to the fifth repetition of the task in the second training session, 0.06, 95%CI [0.00–0.12], p = 0.032.
Bleeding risk
Bleeding avoidance is a critical skill for improving patient outcomes. There were no significant changes observed within groups in the first training session. EXPERT2AI had significantly higher bleeding risk in the fifth repetition of the task compared to AI2EXPERT, 0.10, 95%CI [0.0–0.20], p = 0.049. In the second training session, a significant decline was seen in the bleeding risk score for EXPERT2AI, which reached a mean difference of 0.12, 95%CI [0.02–0.23], p = 0.03, at the fifth repetition of the task from baseline. There was a significant difference in the bleeding risk score between the two groups in the fourth repetition of the task in the second training session, 0.12, 95%CI [0.05–0.20], p = 0.002.
Aspirator force
The use of the surgical aspirator to remove brain tissue requires a balance: insufficient force prevents effective removal of tissue, while excessive force can lead to removal of tissue beyond the target depth. Both feedback interventions resulted in values close to the expert level, which was defined as a score of zero (zero difference from the expected expert value in the ICEMS output). There was a significant decline in the aspirator force for AI2EXPERT, which reached a mean difference of −0.55, 95%CI [−1.08 to −0.02], p = 0.04, in the fifth repetition of the task from baseline in the first training session. There was a significant increase in the aspirator force for EXPERT2AI in the second repetition from baseline in the second training session, 0.28, 95%CI [0.01–0.56], p = 0.04. Neither group had a significant change in aspirator force with expert feedback.
Bipolar force
In this paradigm, the bipolar forceps is used either to provide visualization by retracting the tissue to be aspirated or for cauterization. Insufficient force application prevents adequate tissue retraction, while excessive force can injure the adjacent brain. There were no significant differences within or between groups in the first training session. Expert feedback resulted in increased bipolar force utilization in the second training session, while AI feedback had the opposite effect. There was a significant decline in bipolar force score in EXPERT2AI from the third repetition of the task to the fifth repetition, −0.25, 95%CI [−0.44 to −0.05], p = 0.009. There was a significant difference in the bipolar force score between the two groups at the fourth and fifth repetitions of the task in the second training session, 0.38, 95%CI [0.14–0.62], p = 0.005 and 0.53, 95%CI [0.27–0.80], p < 0.001, respectively.
Instrument coordination
The separation between instruments held in each hand is a key metric of surgical skill⁷: expert surgeons typically work with both instruments close together in a highly coordinated fashion. AI feedback resulted in instrument coordination significantly closer to the expert level when compared to human-mediated feedback. There was a significant decline in instrument tip separation in AI2EXPERT from baseline in the first training session that reached a mean difference of −0.76, 95%CI [−1.4 to −0.06], p = 0.03, in the fifth repetition of the task. EXPERT2AI had no significant changes in the first training session across the five repetitions of the task. EXPERT2AI had a significantly higher instrument tip separation score in the fifth repetition of the task at the end of the first training session, 0.16, 95%CI [0.08–0.24], p < 0.001. EXPERT2AI had a significantly lower instrument tip separation distance in the fifth repetition of the task at the end of the second training session.
Cognitive load (Fig. 2b)
It is important to optimize cognitive load to maximize learning without overloading the trainees with redundancy and distractions¹³. Trainees' cognitive load was measured through self-reporting questionnaires (Supplementary Data). Intrinsic load refers to the natural complexity of the task, while extraneous load is linked to unnecessary difficulty in the way the feedback information is delivered. Germane load measures the effort used to integrate new information into knowledge¹⁴. There were no significant cognitive load differences between groups in the first session. Within-group analysis during the second session revealed that students in EXPERT2AI perceived a significantly higher extraneous load (1 vs 1.67), z = 3.01, p = 0.003, and a significantly lower germane load (4.25 vs 4), z = −1.97, p = 0.048, after switching from in-person expert instruction to real-time AI feedback. In the between-group comparison, students in EXPERT2AI reported significantly higher extraneous load compared to those in AI2EXPERT in the second training session (1.67 vs 1), z = −3.74, p < 0.001, indicating that the feedback provided by the AI system caused significantly higher mental stress.
