Frame-by-Frame Analysis of a Commercially Available Artificial Intelligence Polyp Detection System in Full-Length Colonoscopies

Introduction: Computer-aided detection (CADe) helps increase colonoscopic polyp detection. However, little is known about other performance metrics like the number and duration of false-positive (FP) activations or how stable the detection of a polyp is. Methods: 111 colonoscopy videos with a total of 1,793,371 frames were analyzed on a frame-by-frame basis using a commercially available CADe system (GI-Genius, Medtronic Inc.). The primary endpoint was the number and duration of FP activations per colonoscopy. Additionally, we analyzed other CADe performance parameters, including per-polyp sensitivity, per-frame sensitivity, and first detection time of a polyp. We also investigated whether a threshold for withholding CADe activations can be set to suppress short FP activations and how this threshold alters the CADe performance parameters. Results: A mean of 101 ± 88 FPs per colonoscopy was found. Most of the FPs consisted of fewer than three frames, with a maximum duration of 66 ms. The CADe system detected all 118 polyps and achieved a mean per-frame sensitivity of 46.6 ± 26.6%, with the lowest value for flat polyps (37.6 ± 24.8%). Withholding CADe detections of up to 6 frames in length would reduce the number of FPs by 87.97% (p < 0.001) without a significant impact on CADe performance metrics. Conclusions: The CADe system works reliably but generates many FPs as a side effect. Since most FPs are very short, withholding short-term CADe activations could substantially reduce the number of FPs without impact on other performance metrics. Clinical practice would benefit from the implementation of customizable CADe thresholds.

However, there are still many unanswered questions regarding CADe systems. For example, false-positive (FP) activations occur in up to 8% of all frames during examination with CADe systems [7]. The number and duration of FP activations play an important role in the examiner's comfort in using those systems, as these activations can affect the examiner's attention, leading to misinterpretation of normal mucosa [8]. Therefore, an international consensus conference has identified the analysis of FP activations as an important research focus [9]. Current studies on this topic include only small numbers of cases with about 40 colonoscopy examinations and mainly investigate the cause and the clinical impact of FP activations [10,11]. However, specific data on the duration and pattern of FP activations are not available, although such information is necessary to better understand the operation of CADe systems in order to improve them. One possible improvement might be the reduction of FPs through customizable activation thresholds. In addition, previous randomized controlled trials (RCTs) only provide data on per-polyp sensitivity (PPS), i.e., whether a polyp was detected, resulting in a yes or no answer. How stable the detection signal is over time, termed per-frame sensitivity, was not assessed, as no frame-by-frame analysis of real full-length videos has been performed so far.
Therefore, the objective of this study was to analyze the FP pattern of a commercial CADe system. This was done using a frame-by-frame analysis of full-length real-life videos to determine the effects of different CADe activation thresholds on FPs. Additionally, in a patient-based analysis, we examined performance parameters such as PPS or the mean number of polyps per colonoscopy (PPC).

Study Design
Videos from 244 routine colonoscopies performed in two tertiary centers (University Hospital Ulm and Würzburg) were retrospectively analyzed. Recording took place between March 2019 and April 2020. Those colonoscopies (raw signals) were recorded using the high-definition video signal of the endoscopy processor (Olympus CV-190). For the performance analysis of a commercially available CADe system (GI Genius, Medtronic Inc., Ireland, software version of March 2020), this raw video signal was introduced into the AI system, and the output signal (with visible CADe detections) was recorded. Accordingly, a video pair consisting of raw signal and CADe signal was assembled for video analysis of each colonoscopy.

Colonoscopies
Colonoscopies were performed using the colonoscopes CF-HQ190AL and CF-H180AI/AL (Olympus Co., Tokyo, Japan). All patients were prepared for the colonoscopy using a standard split-dose regimen with 2 L polyethylene glycol with ascorbic acid (Moviprep, Norgine Pharma; Harefield, England). Endoscopies were performed using nurse-assisted propofol sedation [12]. Polyps were removed upon detection by cold or hot snare technique if no contraindication for resection was present. The examiners were classified as junior or senior according to their experience in colonoscopy, with 2,000 performed colonoscopies as the threshold.

Video Analysis
All videos were screened by a board-certified gastroenterologist and experienced endoscopist (MB) with over 4,000 performed colonoscopies. Examinations performed for screening reasons or post-polypectomy surveillance were included in the analysis. For further analysis, the following exclusion criteria were defined: inflammatory bowel disease, active gastrointestinal bleeding, poor bowel preparation defined by a Boston Bowel Preparation Scale (BBPS) score lower than 6, incomplete colonoscopies, advanced neoplasia, altered gut anatomy, endoscopy performed only for an extended resection, and polyposis syndrome. Included colonoscopies were analyzed in a deep frame-by-frame manner using a custom-made annotation tool as previously described [13].
Analysis of Non-CADe Signal (Raw Videos)
The start and the end of withdrawal and polypectomies were annotated. Each polyp was counted for the analysis. Additionally, polyps were characterized using the Paris classification and size (<5 mm, 5-10 mm, 11-20 mm, >20 mm). In a frame-by-frame analysis, each frame with a partially or completely visible polyp was annotated as a polyp frame. Even frames in which only a small part of a polyp was visible were regarded as polyp-containing frames. Polyp annotation stopped at the beginning of the resection (first frame with a visible instrument in the image).
Analysis of CADe Signal (AI Videos)
All frames with visible bounding boxes representing CADe detections were automatically identified by a custom-made application. Subsequently, each bounding box was classified by an experienced endoscopist (MB) as a true-positive (TP) or FP detection. A detection was considered TP if the bounding box had contact with the visible polyp, irrespective of how much area of the lesion was covered. Small hyperplastic polyps of the rectosigmoid were excluded from the analysis. The absence of a bounding box in a frame with a visible polyp was regarded as a false negative. The absence of a box in a frame without a polyp was considered a true negative. An FP detection was defined as a detected area that was not in contact with a polyp. In case of an FP detection in a frame with a visible polyp, the term distraction was used.
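The frame-level definitions above can be summarized in a short Python sketch. This is an illustration only, not the annotation tool used in the study: the function name `label_frame` and its parameters are hypothetical, and the handling of a polyp frame whose only box does not touch the polyp (labeled here as both a false negative and a distraction) is an assumption, since the text does not spell out that case.

```python
from typing import Set


def label_frame(polyp_visible: bool, boxes_on_polyp: int, boxes_elsewhere: int) -> Set[str]:
    """Label a single frame using the definitions given in the text.

    boxes_on_polyp:  number of bounding boxes in contact with the visible polyp
    boxes_elsewhere: number of bounding boxes not in contact with a polyp
    (hypothetical inputs; in the study these judgments were made by an endoscopist)
    """
    if not polyp_visible and boxes_on_polyp:
        raise ValueError("a box cannot be in contact with a polyp that is not visible")

    labels = set()
    if polyp_visible:
        # Any contact counts as TP, irrespective of how much of the lesion is covered.
        labels.add("TP" if boxes_on_polyp else "FN")
        if boxes_elsewhere:
            # FP detection in a frame that also shows a polyp.
            # Assumption: such a frame can also carry FN if no box touches the polyp.
            labels.add("distraction")
    else:
        labels.add("FP" if boxes_elsewhere else "TN")
    return labels


print(label_frame(True, 1, 1))   # polyp detected, plus a distracting box
print(label_frame(False, 0, 0))  # empty frame without a polyp
```

Returning a set rather than a single label reflects that one polyp frame can carry a TP box and a distracting FP box at the same time.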

Endpoints
The primary endpoint of the study was the number of FP activations per colonoscopy and the duration of FP activations. For the secondary endpoints, we analyzed further CADe performance parameters, including mean number of PPC of the CADe System, PPS, per-frame sensitivity, and first detection time (FDT) of a polyp. In addition, we investigated whether a threshold for withholding short CADe activations can be set to suppress FP activations and how this threshold alters CADe performance parameters such as PPC, PPS, or per-frame sensitivity.

Data Analysis and Statistics
The number of FP activations was counted, with each contiguous sequence of FP frames counted as one activation. In addition, the duration of FP activations was measured in frames. Each frame had a duration of 33 ms. The mean number of PPC was calculated by dividing the number of detected polyps by the number of performed colonoscopies. PPS was defined as the number of polyps detected by the CADe system in at least one frame divided by the number of polyps annotated in the raw video data. The per-frame sensitivity, previously published as temporal coherence, was calculated by dividing the number of TP frames by the total number of frames where the polyp was visible in the raw signal (TP + false negative), as previously described by Zhou et al. [14]. Additionally, the per-lesion sensitivity, defined as the number of polyps in which more than half of the polyp's frames were detected by the CADe system, divided by the total number of polyps, was analyzed as previously described by Misawa et al. [15]. FDT of a polyp was defined as the time interval between the first appearance of a polyp in the raw video and the first frame containing a TP-CADe activation. If the polyp was not permanently visible during this time span, frames without a visible polyp were excluded. By this method, FDT included only frames with a visible polyp. The mean withdrawal time was determined using the recorded videos and defined as the time between the cecum and the anal canal, excluding time spent performing biopsies or snare resection [16].
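As a minimal sketch of how these definitions could be computed from per-frame annotations, consider the following Python example. It is not the authors' code: the function names and the toy boolean lists are hypothetical, and the convention of counting only the visible polyp frames that precede the first TP frame for FDT is an assumption (the text does not state whether the detecting frame itself is included).

```python
from itertools import groupby

FRAME_MS = 33  # frame duration stated in the text


def run_lengths(flags):
    """Length in frames of each contiguous run of True values (one run = one activation)."""
    return [sum(1 for _ in group) for value, group in groupby(flags) if value]


def per_frame_sensitivity(polyp_visible, tp):
    """TP frames divided by all frames in which the polyp is visible (TP + FN)."""
    visible = sum(1 for v in polyp_visible if v)
    if visible == 0:
        raise ValueError("polyp is never visible")
    hits = sum(1 for v, t in zip(polyp_visible, tp) if v and t)
    return hits / visible


def first_detection_time_ms(polyp_visible, tp):
    """Visible-polyp frames before the first TP frame, in ms; None if never detected.

    Assumption: the detecting frame itself is not counted.
    """
    frames_before = 0
    for v, t in zip(polyp_visible, tp):
        if v and t:
            return frames_before * FRAME_MS
        if v:
            frames_before += 1
    return None


def per_lesion_sensitivity(per_frame_values):
    """Share of polyps detected in more than half of their visible frames."""
    return sum(1 for s in per_frame_values if s > 0.5) / len(per_frame_values)


# Toy data (hypothetical): one flag per video frame.
fp_flags = [False, True, True, False, False, True, False, True, True, True]
polyp_visible = [True, True, True, True, False, False, True, True]
tp_flags = [False, False, True, True, False, False, True, False]

fp_runs = run_lengths(fp_flags)
fp_durations_ms = [n * FRAME_MS for n in fp_runs]
sensitivity = per_frame_sensitivity(polyp_visible, tp_flags)
fdt_ms = first_detection_time_ms(polyp_visible, tp_flags)
lesion_sens = per_lesion_sensitivity([0.5, 0.8, 0.2])
print(fp_runs, fp_durations_ms, sensitivity, fdt_ms, lesion_sens)
```

Frames in which the polyp is temporarily out of view are skipped when counting towards FDT, which mirrors the exclusion described above.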
Statistical analysis was performed using Python version 3.8. The χ² and Fisher's exact tests were used to test for significant differences between categorical variables. Student's t test and Mann-Whitney U test were applied for continuous variables depending on their distribution pattern. A p value of <0.05 indicated statistical significance.

Baseline Characteristics
Of the 244 routine colonoscopies, 133 met the exclusion criteria. Thus, a total of 111 pairs of colonoscopy videos including the raw video signal and the CADe signal were analyzed (Fig. 1).

Primary Endpoint
Rate of FP Detections and Distracting Detections
A total of 11,188 FP activations were detected in the 111 colonoscopies (101 ± 88 FPs per colonoscopy). The mean duration of an FP activation was 135 ms. In relation to the withdrawal time, the FPs accounted for a mean of 2.48%, corresponding to 13.61 s. Most of the FP detections consisted of one to two frames, corresponding to a period of at most 66 ms (Fig. 2). Only a minority were continuous detections of 10 frames or more, corresponding to 330 ms or more. In the subgroup of colonoscopies with at least one polyp, we examined the frames with an FP CADe detection in an image with a visible polyp, termed distracting detection. Here we found that 1.6 ± 2.1% of the frames with polyps contained such a distraction.
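The reported figures can be cross-checked with simple arithmetic. The Python sketch below uses only numbers stated in the text; the variable names are illustrative, and reading the 13.61 s as the mean total FP time per colonoscopy is an interpretation rather than something the text states explicitly.

```python
FRAME_MS = 33  # frame duration stated in the text

total_fp_activations = 11_188
n_colonoscopies = 111
mean_fp_time_s = 13.61  # read here as mean total FP time per colonoscopy (interpretation)


def frames_to_ms(n_frames):
    """Convert a number of frames into milliseconds."""
    return n_frames * FRAME_MS


# Mean number of FP activations per colonoscopy: about 100.8, reported as 101.
mean_fp_per_colonoscopy = total_fp_activations / n_colonoscopies

# Mean duration of one activation: about 135 ms, in line with the reported value.
mean_activation_ms = mean_fp_time_s * 1000 / mean_fp_per_colonoscopy

print(round(mean_fp_per_colonoscopy, 1), round(mean_activation_ms))
print(frames_to_ms(2), frames_to_ms(10))  # 66 ms and 330 ms
```

The two-frame and ten-frame conversions reproduce the 66 ms and 330 ms limits used throughout the results.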

Secondary Endpoints
PPC and PPS
The CADe system detected all 118 polyps that were visible in the videos, resulting in a PPS of 100%. The mean number of PPC was 1.06.

Per-Frame Sensitivity and Per-Lesion Sensitivity
The mean per-frame sensitivity of the CADe system for all 118 polyps was 47.73 ± 26.5% (Table 2). The mean per-lesion sensitivity of the CADe system was 47.46%.

FDT of a Polyp
FDT was available for each of the 118 polyps. The mean FDT was 1,692 ± 2,052 ms with a wide range from 33.3 to 12,033 ms. In a subgroup analysis, we found the highest FDT in the polyp size group 11-20 mm with a mean of 2,179 ± 3,174 ms (Table 3). However, this was not significant when compared to size groups 1-5 mm and 6-10 mm. In contrast, we found a significantly higher FDT in Paris 0-IIa polyps in comparison to 0-Ip or 0-Is polyps (2,068 ± 2,413 ms vs. 522 ± 216 ms or 1,233 ± 1,247 ms, p = 0.023 and p = 0.046).

Impact of Different CADe Activation Thresholds on FPs and CADe Performance Parameters
To estimate the effect of withholding CADe activations of a defined frame length on the FPs and the CADe performance, a subgroup analysis was performed using only activations of a defined frame length or longer. Figure 3 shows graphically how withholding short activations of 1-10 frames significantly reduces the rate of FPs while having little effect on the per-frame sensitivity of the CADe system. For example, withholding activations up to a length of 10 frames (330 ms) reduced FP activations by 92.79% (p < 0.001), while the per-frame sensitivity decreased by only 6.07% (p = 0.07). In addition, we examined whether withholding short activations influenced PPC or PPS. Up to a threshold of 3 frames (100 ms), no polyps were missed. In contrast, a threshold of 10 frames (330 ms) resulted in 7 missed polyps. In this case, all missed polyps were of flat shape (Paris 0-IIa) and had previously low per-frame sensitivity values of <28%. PPC was not significantly affected by withholding CADe activations up to a threshold of 10 frames (p = 0.71), whereas the first significant changes in PPS occurred at a threshold of 7 frames (p = 0.02). Detailed information is provided in Table 4. In addition, online supplementary Video 1 (for all online suppl. material, see www.karger.com/doi/10.1159/000525345) shows an example of how a threshold of 6 frames (no significant changes in PPC, PPS, or per-frame sensitivity) affects FP activations in the endoscopic view.
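The threshold analysis can be illustrated with a small retrospective sketch in Python. It assumes that a threshold of N frames means every activation of N frames or fewer is dropped entirely; the function names and the toy data are hypothetical and do not come from the study. A live system cannot know in advance how long an activation will last, so it would presumably have to delay showing the box instead; that case is not modeled here.

```python
from itertools import groupby


def suppress_short_activations(flags, threshold):
    """Return a copy of flags with every run of True of length <= threshold set to False."""
    out = []
    for value, group in groupby(flags):
        n = sum(1 for _ in group)
        out.extend([value and n > threshold] * n)
    return out


def count_activations(flags):
    """Number of contiguous runs of True values."""
    return sum(1 for value, _ in groupby(flags) if value)


# Toy data (hypothetical): FP flags over frames without a polyp,
# and CADe box flags over 11 frames in which one polyp is visible.
fp_flags = [True, False, True, True, False] + [True] * 7 + [False]
polyp_box_flags = [True, True, False] + [True] * 8

threshold = 6  # frames, i.e. about 200 ms at 33 ms per frame

fp_after = suppress_short_activations(fp_flags, threshold)
fp_reduction = 1 - count_activations(fp_after) / count_activations(fp_flags)

box_after = suppress_short_activations(polyp_box_flags, threshold)
sens_before = sum(polyp_box_flags) / len(polyp_box_flags)
sens_after = sum(box_after) / len(box_after)
still_detected = any(box_after)  # per-polyp detection survives if any TP frame remains

print(fp_reduction, sens_before, sens_after, still_detected)
```

In this toy case two of three FP activations disappear while the polyp keeps its longest TP run; raising the threshold to 10 frames would remove that run as well and the polyp would count as missed, which is the trade-off described above.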

Discussion
The development of an AI system for polyp detection using deep learning techniques applied on a larger dataset was first described by Wang et al. [17]. Subsequently, several commercially available CADe systems have been developed for colonoscopy. In prospective RCTs, CADe systems showed a significantly higher adenoma detection rate (ADR) compared to expert colonoscopists [1-4, 7, 18-21]. Moreover, a recently published meta-analysis found a significant increase in ADR [6]. While prospective studies have extensively evaluated the ADR of various CADe systems, little is known about the detailed performance of CADe systems, e.g., FP rate, FP duration, or per-frame sensitivity, especially in a real-life scenario. Only a few studies about CADe systems include a single-frame analysis. However, these studies used single polyp frames, short video sequences, or videos consisting of fewer than 160,000 frames [14,15,17]. Thus, we present the largest frame-by-frame dataset, to our knowledge, with 111 full-length videos consisting of over 170,000 polyp frames and a total of over 1,700,000 frames. Additionally, to the best of our knowledge, our study is the first to evaluate CADe performance in a frame-by-frame analysis of real-life videos.
The PPS of 100% highlights the effectiveness of CADe systems in clinical practice; however, the number of FP activations is not negligible and is higher than the previously published values [10,11]. While previous studies analyzed the cause and clinical relevance of FPs, the use of frame-by-frame analysis allowed us to determine the exact duration and distribution pattern of FPs. As shown, most FPs were shorter than 330 ms; hence, they are perceived by the endoscopist only as a brief flashing of the bounding box. However, it is not yet clear whether the short activations affect the normal mucosa visualization pattern of endoscopists. Some retrospective studies suggest that FP activations result in a negligible increase of the total withdrawal time, as most of them are immediately discarded by the endoscopists [10,11]. Other studies using, for example, eye-tracking glasses suggest that CADe and FP activations might have an impact on the visualization pattern of the endoscopists [8,22]. Therefore, further prospective studies using eye-tracking technology during endoscopic examinations should be performed in order to analyze the influence of short FP activations on the examiner and the withdrawal time. Nevertheless, many short FPs may impair the endoscopist's concentration in the long run; certainly, they reduce the comfort of the CADe application. An option to reduce the FP rate, especially for short FPs, could be withholding of short CADe activations. As shown, withholding short detections of up to 10 frames in length reduced the number of FPs by up to 92.79% without having a significant effect on per-frame sensitivity. However, above a threshold of 3 frames (100 ms), this is at the expense of a few missed polyps, especially those with a flat shape. Another effect to consider is the impact that withholding short CADe activations could have on FDT. Unfortunately, there are no studies that demonstrate the effect of different FDTs on the detection of polyps. However, considering the large reduction in FP activations and since there was no significant change in PPC or PPS up to a threshold of 6 frames (200 ms), an appropriate threshold for optimization of the CADe system could be in this range.
Besides PPC, PPS, and FP rate, per-frame sensitivity is another important performance parameter of CADe systems, particularly since the temporal stability of polyp detection indicates how well CADe detection works for different polyp types. The per-frame sensitivity determined in our study is lower than in previous publications [14,15,17]. However, in previous studies, only several single images of polyps or selected video sequences were used to evaluate the self-developed systems. For example, the study by Misawa et al. [15] analyzed video clips with a total of 152,560 frames. In our study, full-length real-life videos containing 1,793,371 frames were used, so the conditions for CADe detection may have been more challenging, yet more realistic. Another important reason is that small hyperplastic polyps in the rectosigmoid were excluded in our study due to clinical irrelevance, whereas these polyps, which can often be reliably identified, were included in the evaluation in previous studies.
Since flat polyps (Paris 0-IIa) and sessile serrated adenomas have higher miss rates, the effect of CADe systems on polyp detection could be substantial if these lesions were reliably detected [23]. However, our data show that in clinical practice, per-frame sensitivity and FDT tend to be worse in these polyps. These findings are consistent with previous data reporting lower per-frame sensitivity for laterally spreading tumors and sessile serrated adenomas, showing that there is an urgent need for improvement in this respect [14].
There are several limitations to our study. Since this is a retrospective analysis of previously stored videos, histologic differentiation of colonic polyps was not possible. In order to increase the relevance of the detected polyps, we excluded hyperplastic polyps in the rectosigmoid. Due to the exclusion of examinations with a BBPS score of <6 points, the mean BBPS score is 7.5 points, which is relatively high [24]. However, recently published papers on CADe performance metrics describe similarly high BBPS values [10,11]. To shorten the time-consuming deep frame analysis, we dispensed with a detailed analysis of the FPs with respect to their cause. However, Hassan et al. [10] performed such an analysis using the same CADe system; they found bubbles, stool, and colonic folds to be the main reasons for FP activation. We also did not manually annotate each polyp-containing frame with bounding boxes. Thus, subsequent analysis of, for example, intersection over union of the CADe boxes with ground truth was not performed.

Conclusion
This commercially available CADe system is a powerful tool to facilitate polyp detection even under daily clinical conditions, but at the expense of many FP activations. Through a frame-by-frame video analysis, we were able to show that many of these FPs are of very short duration. Withholding short-term CADe detections could substantially reduce the number of FP activations, but at higher thresholds at the expense of a few missed polyps. This applies in particular to flat polyps, which generally have poorer per-frame sensitivity values. Since we could not detect any significant change in the mean number of PPC and PPS up to a threshold of 6 frames, an appropriate threshold for optimization of the CADe system could be in this range. Nevertheless, further detailed analysis of CADe systems is needed to better understand the strengths and weaknesses of this promising technology and to further optimize the systems. A customizable CADe detection threshold that can be adjusted to the needs of the examiner would be useful in clinical practice.

Statement of Ethics
This study protocol involving retrospective analysis of data was reviewed and approved by the Ethics Committee of the University Hospital Würzburg, approval number 2021032901. According to the Ethics Committee of the University Hospital Würzburg, patients were not required to give informed consent for this retrospective analysis.