| 论文题名(中文): | 深度学习在前列腺癌数字病理诊断的临床应用研究 |
| 姓名: | |
| 论文语种: | chi |
| 学位: | 博士 |
| 学位类型: | 专业学位 |
| 学校: | 北京协和医学院 |
| 院系: | |
| 专业: | |
| 指导教师姓名: | |
| 校内导师组成员姓名(逗号分隔): | |
| 论文完成日期: | 2025-04-30 |
| 论文题名(外文): | Research on the Clinical Application of Deep Learning in Digital Pathological Diagnosis of Prostate Cancer |
| 关键词(中文): | |
| 关键词(外文): | Prostate cancer; Artificial intelligence; Digital pathology; Diagnosis; Clinical application |
| 论文文摘(中文): |
研究背景 近年来,人工智能技术的快速发展,特别是深度学习的应用,为前列腺癌的早期诊断和精准治疗带来了前所未有的机遇。目前,深度学习在前列腺癌数字病理诊断中的研究多集中于前列腺穿刺活检切片,但前列腺根治性切除术(radical prostatectomy, RP)后病理切片的临床诊断场景与前列腺穿刺活检切片有所不同。首先,两者基于病人水平的切片总量和每张切片的组织量差异较大。其次,部分接受前列腺根治性切除术的患者在手术前可能接受新辅助治疗,新辅助治疗会导致切片的肿瘤组织发生病理变化,从而增加了病理诊断的复杂性。再者,前列腺穿刺活检切片和前列腺根治性切除术后切片的诊断任务有所不同:前列腺穿刺活检的诊断任务往往集中于如何高效确诊前列腺癌及评估肿瘤组织的长度;而在手术前已经确诊前列腺癌的前提下,前列腺根治性切除术后病理切片的诊断则更侧重于对病人主要肿瘤的诊断和量化。基于上述区别,探索适用于前列腺根治性切除术后的病理诊断模型并验证其实际临床应用具有重要的现实意义。目前深度学习在前列腺根治性切除术后病理切片中的应用仍处于临床前探索阶段,相关研究缺乏实际临床环境下的验证与应用。基于图像级标签的弱监督学习技术能够显著减少人工标注的时间与成本,非常适合用于大规模数据集的模型构建与推广应用。因此,将弱监督学习技术引入前列腺根治性切除术后病理切片的分析领域,并探索其潜在的应用前景和技术瓶颈,已成为亟待解决的重要课题。此外,Gleason分级分组(Gleason Grade Group, GG)是前列腺癌病理诊断中的核心环节,但由于不同病理科医生在诊断中的主观性较强,结果往往存在较大差异,这对利用人工智能技术改进诊断一致性提出了迫切需求。根据模型训练标签类型的不同,深度学习方法通常分为弱监督学习、半监督学习和全监督学习等范式。在平衡标注成本与提升前列腺癌Gleason分级分组预测性能之间寻找高效的深度学习解决方案,是当前人工智能在前列腺癌数字病理诊断领域的重要研究方向之一。
研究目的 评估所构建的弱监督学习模型在前列腺根治性切除术后病理诊断任务中的性能水平,并验证其在实际临床应用中的稳健性和适用性。比较弱监督学习、不同标注比例的半监督学习以及全监督学习模型在前列腺癌病理切片数据集上的性能表现,探讨不同监督类型的深度学习方法在前列腺癌Gleason分级分组诊断中的效能水平。
研究方法 纳入国内四家医院共304名患者的13,245张前列腺根治性切除术后病理切片作为实验对象。基于多示例学习(multiple instance learning, MIL)网络构建弱监督学习模型多尺度傅里叶-Transformer MIL(Multiscale Fourier-Transformer MIL, MFMIL),验证该模型在以病人为单位的前列腺根治性切除术后病理切片实际诊断任务中的性能水平,包括切片级诊断、癌灶级诊断和可视化辅助下肿瘤定位的预测准确性。切片级诊断包括切片的良恶性二分类和Gleason分级分组;癌灶级诊断包括对主要肿瘤、次要肿瘤和有临床意义肿瘤的诊断。以前列腺穿刺活检切片公开数据集(Prostate cANcer graDe Assessment (PANDA) Challenge dataset, PANDA挑战数据集)和多中心(multicenter, MC)前列腺根治性切除术(RP)后切片数据集(MCRP数据集)为实验对象,比较不同深度学习模型(AB-MIL、CLAM、DS-MIL、TransMIL和MFMIL)在弱监督、不同标注比例的半监督以及全监督学习条件下对两种前列腺癌数据集Gleason分级分组的诊断效能水平。采用二次加权Kappa(Quadratic Weighted Kappa, QWK)系数作为主要指标,衡量模型预测结果的准确性。
研究结果 在切片的良恶性二分类诊断中,MFMIL模型对两个内部验证集和两个外部验证集的曲线下面积(area under the curve, AUC)分别为0.921(95% CI: 0.912-0.928)、0.893(95% CI: 0.886-0.900)、0.910(95% CI: 0.902-0.919)和0.866(95% CI: 0.857-0.872)。在Gleason分级分组诊断中,模型的QWK系数在内部验证集和外部验证集分别为0.743和0.725。四个验证集中,主要肿瘤的中位面积为92.6 mm²(IQR: 54.0-131.8 mm²),而非主要肿瘤的中位面积为3.5 mm²(IQR: 1.4-11.0 mm²)。模型对患者主要肿瘤的诊断准确性为99%(100/101),对主要肿瘤最大癌灶面积所在切片的诊断准确性为96%(97/101)。对于有临床意义的次要肿瘤,诊断准确性为74.6%(94/126)。在随机选取的200张模型预测正确的阳性切片中,经过可视化处理后,模型可准确关注到97.5%(195/200)的主要肿瘤及89.5%(77/86)的次要肿瘤。从弱监督学习到不同标注比例(10%、20%、30%、40%、50%)的半监督学习,再到全监督学习,同一模型在Gleason分级分组诊断中的QWK系数呈逐渐增加趋势。对于PANDA挑战数据集,MFMIL模型在50%标注比例的半监督学习中的QWK系数为0.882,全监督学习的QWK系数为0.914。对于MCRP数据集,MFMIL模型在50%标注比例的半监督学习中的QWK系数为0.924,全监督学习的QWK系数为0.936。在GG 1至GG 5的具体分级诊断中,模型对GG 1和GG 5的诊断准确性较高,而对GG 3的诊断准确性较低。
结论 弱监督学习模型在前列腺根治性切除术后病理切片的良恶性诊断中表现出较高的准确性,但在Gleason分级分组诊断中的表现仍有提升空间。在以患者为单位的诊断任务中,模型对主要肿瘤的诊断与定位具有较高的准确性。从弱监督学习、逐渐递增标注比例的半监督学习到全监督学习,模型对前列腺癌Gleason分级分组诊断的准确性逐渐增加。使用50%标注的半监督学习模型的诊断准确性接近全监督学习模型。然而,目前不同监督类型的深度学习方法在Gleason分级分组的具体诊断上仍有一定的局限性。 |
| 论文文摘(外文): |
Background In recent years, the rapid advancement of artificial intelligence, particularly deep learning, has created unprecedented opportunities for the early diagnosis and precise treatment of prostate cancer. While current research predominantly focuses on deep learning applications for prostate biopsy slides, the clinical diagnostic workflow for radical prostatectomy (RP) specimens presents distinct challenges and considerations. First, there are substantial differences in slide volume and tissue quantity between biopsy and RP specimens. A single RP specimen typically yields a significantly larger number of slides with greater tissue coverage per slide compared to biopsy samples. Additionally, patients undergoing RP may have received neoadjuvant therapy prior to surgery, which can introduce histopathological alterations in tumor tissue, further complicating diagnostic interpretation. Moreover, the diagnostic objectives differ between biopsy and post-RP pathology. Prostate biopsy analysis primarily aims at efficient cancer detection and tumor extent assessment. In contrast, post-surgical pathology shifts focus to characterizing the index lesion—quantifying tumor burden and grading, given the confirmed presence of malignancy. Given these critical distinctions, there is an urgent need to develop and validate a dedicated deep learning model for pathological diagnosis of RP specimens—one that addresses the unique challenges of post-surgical pathology while demonstrating tangible clinical utility. Despite its potential, the application of deep learning to RP-derived pathological slides remains largely confined to preclinical exploration, with limited validation or real-world clinical implementation. Weakly supervised learning, based on image-level labels, offers a significant advantage by reducing the time and cost associated with manual annotation, making it particularly suitable for developing and deploying models on large-scale datasets. 
As such, leveraging weakly supervised learning techniques to analyze pathological slides obtained from radical prostatectomy specimens, and investigating their potential applications and technical challenges, has become a pressing issue in the field. Moreover, the determination of Gleason grade groups (GGs) for prostate cancer is a critical component of pathological diagnosis. However, due to the inherent subjectivity among pathologists, there are often significant discrepancies in diagnostic results, underscoring the urgent need for artificial intelligence to improve diagnostic consistency. Deep learning methods are generally categorized into weakly supervised, semi-supervised, and fully supervised learning, depending on the type of training labels used. Identifying effective deep learning approaches that balance annotation costs against the need to enhance the accuracy of GG predictions represents a key research direction in the application of artificial intelligence to digital pathology.
Objectives To evaluate the performance of the established weakly supervised learning model for the pathological diagnosis of slides from radical prostatectomy specimens and to validate its robustness and applicability in real-world clinical practice. To compare the performance of weakly supervised, semi-supervised (across various annotation ratios), and fully supervised deep learning models on different prostate cancer datasets, and to investigate the accuracy of these deep learning approaches in the GG diagnosis of prostate cancer.
Materials and Methods A total of 13,245 pathological slides from radical prostatectomy specimens of 304 patients across four hospitals in China were selected as experimental data. Leveraging the multiple instance learning (MIL) framework, a weakly supervised learning model, Multiscale Fourier-Transformer MIL (MFMIL), was developed to assess its accuracy in predicting pathology outcomes. The evaluation included slide-level diagnosis, lesion-level diagnosis, and tumor localization with visualization support. Slide-level diagnosis involved identifying prostate cancer and GGs, while lesion-level diagnosis focused on detecting index lesions, non-index lesions, and clinically significant lesions. The Prostate cANcer graDe Assessment (PANDA) Challenge dataset and the multicenter radical prostatectomy (MCRP) slide dataset were used as experimental data. This study compared the accuracy of various deep learning models (AB-MIL, CLAM, DS-MIL, TransMIL, and MFMIL) under weakly supervised, semi-supervised (across various annotation ratios), and fully supervised learning conditions for GG diagnosis across these datasets. The Quadratic Weighted Kappa (QWK) coefficient was used as the primary metric to evaluate the predictive accuracy of the models.
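As a concrete illustration of the primary metric named above, the Quadratic Weighted Kappa can be computed from a confusion matrix as in the minimal NumPy sketch below. The function name and signature are illustrative assumptions for exposition, not the thesis's actual implementation.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic Weighted Kappa between two integer ratings in [0, n_classes)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Observed agreement matrix (confusion matrix of reference vs. model)
    obs = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        obs[t, p] += 1
    # Quadratic disagreement weights: 0 on the diagonal, 1 at maximal disagreement
    idx = np.arange(n_classes)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Expected matrix under chance agreement, from the marginal histograms
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    return 1.0 - (w * obs).sum() / (w * exp).sum()
```

Because the weights grow quadratically with the distance between grades, QWK penalizes a GG 1 case predicted as GG 5 far more heavily than one predicted as GG 2, which is why it is the standard agreement metric for ordinal Gleason grading tasks.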
Results In the diagnosis of prostate cancer from pathological slides, the MFMIL model achieved area under the curve (AUC) values of 0.921 (95% CI: 0.912–0.928), 0.893 (95% CI: 0.886–0.900), 0.910 (95% CI: 0.902–0.919), and 0.866 (95% CI: 0.857–0.872) across two internal validation sets and two external validation sets, respectively. For GG diagnosis of prostate cancer, the model attained QWK coefficients of 0.743 and 0.725 on the internal and external validation sets, respectively. Across the four validation sets, the median area of index lesions was 92.6 mm² (IQR: 54.0–131.8 mm²), while the median area of non-index lesions was 3.5 mm² (IQR: 1.4–11.0 mm²). The model demonstrated a diagnostic accuracy of 99% (100/101) for index lesions and 96% (97/101) for slides with the largest cancer area of the index lesion. For clinically significant non-index lesions, the prediction accuracy was 74.6% (94/126). Visualization analysis revealed that the model accurately identified 97.5% (195/200) of index lesions and 89.5% (77/86) of non-index lesions in 200 randomly selected positive slides predicted by the model. Transitioning from weakly supervised learning to semi-supervised learning with various annotation ratios (10%, 20%, 30%, 40%, and 50%), and eventually to fully supervised learning, the QWK coefficient of the same model for GG diagnosis demonstrated a steadily increasing trend. On the PANDA Challenge dataset, the MFMIL model achieved a QWK coefficient of 0.882 in semi-supervised learning with a 50% annotation ratio, closely approximating the QWK coefficient of 0.914 in fully supervised learning. Similarly, on the MCRP dataset, the model attained a QWK coefficient of 0.924 in semi-supervised learning with a 50% annotation ratio, comparable to the 0.936 achieved in fully supervised learning. For the specific GG diagnosis task, the model exhibited higher diagnostic accuracy for GG 1 and GG 5, while its accuracy for GG 3 was relatively lower.
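AUC point estimates with 95% confidence intervals, as reported above, are commonly obtained by percentile bootstrap over the validation set. The sketch below shows one way to do this, assuming a rank-based (Mann-Whitney) AUC with no tied scores; the function names and the resampling scheme are illustrative, since the abstract does not specify how its intervals were derived.

```python
import numpy as np

def auc_score(labels, scores):
    # Rank-based (Mann-Whitney) AUC; assumes no tied scores for simplicity
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Point AUC plus a percentile bootstrap (1 - alpha) confidence interval."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample cases with replacement
        if labels[idx].min() == labels[idx].max():
            continue  # a resample must contain both classes for AUC to be defined
        stats.append(auc_score(labels[idx], scores[idx]))
    low, high = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return auc_score(labels, scores), (low, high)
```

In a whole-slide setting, resampling at the patient level rather than the slide level would avoid treating correlated slides from one patient as independent observations.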
Conclusions The weakly supervised learning model demonstrated high accuracy in distinguishing prostate cancer on slides from radical prostatectomy specimens. However, its performance in GG diagnosis remains suboptimal and warrants further improvement. For patient-level diagnostic tasks, the model achieves high accuracy in identifying and localizing index lesions. The accuracy of the model for GG diagnosis gradually increases from weakly supervised learning to semi-supervised learning with various annotation ratios, and eventually to fully supervised learning. The diagnostic accuracy of the semi-supervised learning model with a 50% annotation ratio is close to that of the fully supervised learning model. Nevertheless, current deep learning methodologies across different supervision paradigms exhibit limitations in the precise diagnosis of specific GGs. |
| 开放日期: | 2025-05-30 |