View thesis information

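Thesis title (Chinese):	Constructing an AI model to predict multi-gene mutations in lung adenocarcinoma, combined with mass spectrometry to identify tumor histological features and EGFR status
Name:	赵玲玉
Thesis language:	chi
Degree:	Doctorate
Degree type:	Professional degree
School:	Peking Union Medical College
Department:	China-Japan Friendship Hospital
Major:	Clinical Medicine - Clinical Pathology
Advisor:	钟定荣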
On-campus advisory group members (comma-separated):	钟定荣, 王蓓, 崔力方, 陈皇
Thesis completion date:	2025-05-31
Thesis title (foreign language):	Constructing an AI model to predict multi-gene mutations in lung adenocarcinoma combined with PESI-MS for characterization of tumor histological features and EGFR status
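Keywords (Chinese):	lung adenocarcinoma; artificial intelligence; gene mutation; progression-free survival; intraoperative frozen diagnosis; probe electrospray ionization mass spectrometry
Keywords (foreign language):	lung adenocarcinoma; artificial intelligence; gene mutation; progression-free survival; intraoperative frozen diagnosis; PESI-MS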
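Abstract (Chinese):	︿ Part 1. Construction of an AI model for predicting multi-gene mutations in LUAD based on histopathological features, and analysis of prognostic factors. Objective: This study collected data from a large number of patients with lung adenocarcinoma, built a database, and applied artificial intelligence to build models that support decision-making by pathologists and clinicians and provide a scientific basis for individualized treatment. Specifically: 1) integrate the digitized pathology slides, molecular pathology information, pathological information, and clinical information of 2,221 lung adenocarcinoma samples into a database, and examine by stratum how pathological and clinical information differs among patients with different pathological subtypes, prognostic risk factors, and gene mutations; 2) build an AI model for tumor region identification; 3) build AI models that predict the status of 9 gene mutations, and further evaluate them through three-fold cross-validation, comparison with other models and with pathologists, and generalization tests on external data; 4) explore independent predictors of progression-free survival in lung adenocarcinoma patients who chose molecular testing. Materials and Methods: Clinical and pathological data, molecular pathology data on 9 genes (EGFR, KRAS, ALK, HER2, ROS1, RET, BRAF, PIK3CA, and NRAS status), and digitized pathology slide data were retrospectively collected for 2,221 lung adenocarcinoma samples from 1,999 patients with primary lung adenocarcinoma seen at China-Japan Friendship Hospital between September 2015 and April 2023, and the patient data were analyzed with SPSS and statistical methods. Tumor regions were then annotated on the slides of 297 samples, and a ResNet model was built to identify tumor regions and risk factors. Data from 2,219 samples were used to train and test the self-supervised model DINO for image feature extraction and the two-stage multiple-instance model GAMIL for determining the mutation status of each gene. The dataset was split into training and test sets at a ratio of 8:2, and images were cut into 256×256 patches for training and testing. Model comparison and evaluation covered comparison with other models (UNI, CLAM, Inception V3), external test datasets (data from 256 patients at the Cancer Hospital, Chinese Academy of Medical Sciences, and the TCGA database), comparison with 6 pathologists of different professional ranks, and high-weight heatmaps generated for gene mutation identification. Finally, the log-rank method was used for univariate analysis of prognostic factors, the Kaplan-Meier method for plotting survival curves, and Cox proportional hazards regression for multivariate analysis, in order to find independent predictors of progression-free survival. Results: Among the 2,221 patients, women and patients aged 60-69 years were more numerous, and onset was slightly later in men. Lung adenocarcinoma subtypes differed in age and sex, and high-risk factors were related to maximum tumor diameter, sex, and predominant subtype. Gene analysis showed that different gene mutations were associated with sex, age, maximum tumor diameter, and high-risk pathological prognostic factors. The ResNet models built on this dataset and on the WSSS4LUAD dataset identified tumor regions with AUC values of 0.995 and 0.992, respectively, and the tumor region heatmaps generated on the remaining unannotated slides showed little noise in the predicted regions. With DINO as the feature-extraction foundation model and GAMIL as the classification model, the AUC values for gene mutation prediction were 0.825 (EGFR), 0.911 (KRAS), 0.987 (ALK), 0.934 (HER2), and 0.900 (rare gene group comprising ROS1, RET, BRAF, PIK3CA, and NRAS), with sensitivity of 0.786-0.972, specificity of 0.749-0.989, and accuracy of 0.797-0.981. For comparison, the GAMIL classification model with the UNI foundation model reached an AUC of 0.799 for EGFR, while the DINO foundation model with the CLAM and Inception v3 classification models reached AUC values of 0.679 and 0.665 for EGFR, respectively. In external testing of generalization, the 256 samples from the Cancer Hospital, Chinese Academy of Medical Sciences gave AUC values of 0.800 (EGFR) and 0.843 (ALK) on the GAMIL model, and AUC values on the TCGA database were 0.508-0.716. In the comparison of EGFR mutation prediction between pathologists and the GAMIL model, the AUC values of senior, intermediate, and junior pathologists and of the GAMIL model were 0.510, 0.500, 0.515, and 0.810, respectively. High-weight regions in the generated heatmaps showed that EGFR, KRAS, and ALK were associated, respectively, with the lepidic pattern; with invasive mucinous adenocarcinoma and the solid pattern; and with the solid pattern and mucinous cells. In the univariate analysis of survival status, 9 factors affected survival time, whereas in the multivariate analysis with the Cox proportional hazards regression model only pathological T stage (P=0.013) was an independent predictor of progression-free survival. Conclusion: The ResNet model can accurately identify tumor regions and high-risk prognostic factors. The DINO+GAMIL model showed high AUC values for all 9 gene mutations. Three-fold cross-validation showed stable model performance, the model also performed better in the comparison experiments against other models and pathologists, and the external validation datasets further confirmed its generalization ability. In addition, based on the multivariate analysis, pathological T stage is an independent predictor of progression-free survival, and predominant tumor subtype, KRAS mutation status, and whether targeted therapy was given (possibly subject to selection bias) are independent predictors of progression-free survival in T1 patients. Part 2. Combining mass spectrometry and AI to study intraoperative histological features and EGFR status of lung adenocarcinoma. Objective: This study combines probe electrospray ionization mass spectrometry with artificial intelligence algorithms, using data on metabolite changes in tumor tissue to train AI models for data analysis, in order to mine fresh tumor tissue for the intraoperative frozen-section indicators that affect the surgical approach for lung adenocarcinoma and for postoperative targeted therapy indicators. Specifically, this includes identification of high-risk subtypes (micropapillary/solid), detection of tumor spread through air spaces, and determination of EGFR mutation status. The study can overcome the technical bottleneck of conventional intraoperative frozen pathology for lung adenocarcinoma and support rapid intraoperative decisions on the surgical plan and postoperative individualized targeted therapy strategies. Materials and Methods: 131 patients with pathologically confirmed lung adenocarcinoma seen at China-Japan Friendship Hospital from July 2023 to October 2024 were collected; after quality-control screening, 99 patients were enrolled and divided into training and test sets at a ratio of 7:3 or 8:2. From each patient, one focus of fresh intraoperative frozen-section lung adenocarcinoma tumor tissue and its matched adjacent tissue (≥5 cm from the tumor) were collected. Samples underwent simple pretreatment: 10 mg each of tumor and adjacent tissue was weighed, then ground in an ethanol-water mixture, centrifuged, and diluted to obtain the test solution. Metabolites were analyzed on a Shimadzu (Japan) DPiMS-2020 rapid in situ ionization mass spectrometer. Data preprocessing was done in LabSolutions software, selecting the chromatographic peak integration values from 0.0 to 0.3 min; after feature filtering, partial least squares regression was used for dimensionality reduction, and the VIP method in MetaboAnalyst® software was used for feature selection. Support vector machine, random forest, multilayer perceptron, and gradient boosting classifier algorithms were used to build 4 AI models: distinguishing tumor from adjacent tissue, identifying high-risk subtypes (micropapillary/solid) in lung adenocarcinoma tumor tissue, determining spread through air spaces, and predicting EGFR mutation status in tumor tissue. The model for determining spread through air spaces was optimized with a sampling method for imbalanced datasets. Finally, model performance was evaluated comprehensively by accuracy, precision, recall, and F1-Score. Results: More than 10,000 features were extracted from the metabolite data of tumor and/or adjacent tissue from the 99 lung adenocarcinoma patients; after 77-461 highly expressed features were selected, dimensionality reduction produced 20 key principal components, and the 4 AI models were trained and tested. The models reached an accuracy of 95% for distinguishing tumor from adjacent tissue, 90% for identifying high-risk subtypes (micropapillary/solid) of lung adenocarcinoma, and 100% for detecting tumor spread through air spaces; after optimization, the EGFR mutation status prediction model reached an AUC as high as 100%. Conclusion: By integrating probe electrospray ionization mass spectrometry with multiple AI algorithms, metabolite data can be obtained within minutes from fresh lung adenocarcinoma and adjacent normal tissue samples, lung adenocarcinoma can be distinguished from adjacent normal tissue, and 3 intraoperative frozen-section and postoperative targeted therapy indicators that affect lung adenocarcinoma diagnosis and treatment can be determined. The models are highly accurate, can overcome the limits of conventional morphological diagnosis, and offer further guidance for the choice of surgical procedure and postoperative targeted therapy in lung adenocarcinoma. ﹀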
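The AUC, sensitivity, and specificity figures reported above can be illustrated with a minimal Python sketch. It is not the thesis's evaluation code: the function names `roc_auc` and `sensitivity_specificity`, the 0.5 threshold, and the toy labels and scores are all assumptions made for illustration. AUC is computed here as the fraction of (mutated, wild-type) pairs in which the mutated case receives the higher score, which is equivalent to the area under the ROC curve.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity (true-positive rate) and specificity (true-negative rate)
    when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 1 = mutated, 0 = wild type; scores are model outputs.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]

print(roc_auc(labels, scores))                  # 0.9375 (15 of 16 pairs ranked correctly)
print(sensitivity_specificity(labels, scores))  # (0.75, 0.75)
```

AUC does not depend on a threshold, while sensitivity and specificity do, which is one reason a report may give a single AUC per gene alongside ranges for the other two.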
Abstract (foreign language):	︿ Part 1. Construction of an AI model for multi-gene mutation prediction in lung adenocarcinoma based on histopathological features, and analysis of prognostic factors. Objective: This study collected data from a large cohort of patients with lung adenocarcinoma to construct a database, and applied artificial intelligence techniques to develop models that assist pathologists and clinicians in decision-making and provide a scientific basis for personalized treatment. Specifically, the study includes: 1) the integration of digital pathology slides, molecular pathology information, pathological data, and clinical information from 2,221 lung adenocarcinoma samples to establish a database, which supports a stratified analysis of the similarities and differences in pathological and clinical information among patients with different pathological subtypes, prognostic risk factors, and gene mutations; 2) the development of an artificial intelligence model for tumor region identification; 3) the establishment of an artificial intelligence model to predict the status of nine gene mutations, with model performance further evaluated through three-fold cross-validation, comparisons with other models and with pathologists, and tests of generalization on external data; 4) the exploration of independent predictors of progression-free survival in lung adenocarcinoma patients who chose molecular testing. Materials and Methods: Clinical and pathological data, molecular pathology information on EGFR, KRAS, ALK, HER2, ROS1, RET, BRAF, PIK3CA, and NRAS status, and digital pathology slide data were retrospectively collected from 2,221 lung adenocarcinoma samples of 1,999 patients with primary lung adenocarcinoma treated at the China-Japan Friendship Hospital from September 2015 to April 2023. The patient data were analyzed using SPSS and statistical methods. 
Subsequently, tumor regions were annotated on the slides of 297 samples, and a ResNet network was trained and tested to identify tumor regions and risk factors. Data from a total of 2,119 samples were used to train and test the self-supervised model DINO for image feature extraction and the two-stage multiple-instance model GAMIL for determining the mutation status of each gene. The datasets for all models were divided into training and testing sets in an 8:2 ratio, with images cut into 256×256 patches for training and testing. Comparative performance evaluations included comparisons with different models (UNI, CLAM, Inception V3), comparisons on external testing datasets (256 patients from the Cancer Hospital, Chinese Academy of Medical Sciences, and the TCGA database), comparisons between the model and 6 pathologists of varying seniority, and evaluation based on high-weight heatmaps generated for gene mutation identification. Finally, the log-rank test was used for univariate analysis of prognostic factors, Kaplan-Meier curves were plotted for survival analysis, and a Cox proportional hazards regression model was used for multivariate analysis to identify independent predictors of progression-free survival. Results: Among the 2,221 patients, there was a higher proportion of females and of patients aged 60-69 years, with onset slightly later in males. Age and sex varied among the subtypes of lung adenocarcinoma, and high-risk factors were associated with maximum tumor diameter, sex, and predominant subtype. Genetic analysis revealed correlations between different gene mutations and sex, age, maximum tumor diameter, and high-risk pathological prognostic factors. The ResNet models established using this dataset and the WSSS4LUAD dataset achieved AUC values of 0.995 and 0.992, respectively, for tumor region identification, and tumor region heatmaps generated on unannotated slides showed minimal noise in the predicted areas. 
The DINO feature extraction model and the GAMIL classification model predicted gene mutations with AUC values of 0.825 (EGFR), 0.911 (KRAS), 0.987 (ALK), 0.882 (HER2), and 0.900 (for the rare gene group comprising ROS1, RET, BRAF, PIK3CA, and NRAS), with sensitivities ranging from 0.786 to 0.972, specificities from 0.749 to 0.989, and accuracies from 0.797 to 0.981. For comparison, the GAMIL classification model using the UNI base model achieved an AUC value of 0.799 for EGFR, while the DINO base model with the CLAM and Inception v3 classification models achieved AUC values of 0.679 and 0.665 for EGFR, respectively. External validation of generalization capability yielded AUC values of 0.800 (EGFR) and 0.843 (ALK) for the 256 samples from the Cancer Hospital, Chinese Academy of Medical Sciences, and AUC values ranging from 0.508 to 0.716 for the TCGA database. In the comparison of EGFR gene mutation prediction between pathologists and the GAMIL model, the AUC values for senior, mid-level, and junior pathologists and for the GAMIL model were 0.510, 0.500, 0.515, and 0.810, respectively. High-weight regions in the generated heatmaps indicated associations of the EGFR, KRAS, and ALK genes with lepidic, invasive mucinous adenocarcinoma, and acinar with solid and acinar with mucinous subtypes. Univariate analysis of survival status showed that nine factors influenced survival time, whereas in the multivariate analysis using the Cox proportional hazards regression model, only pathological T stage (P=0.013) emerged as an independent predictor of progression-free survival. Conclusion: The ResNet network model accurately identifies tumor regions and regions with high-risk prognostic factors, while the DINO+GAMIL model achieves high AUC values for the different gene mutations. Three-fold cross-validation shows stable model performance, and the model outperforms other models and pathologists in the comparative experiments. 
External validation datasets further confirm the model's generalizability. Additionally, based on the results of multivariate analysis, pathological T stage emerges as an independent predictor of progression-free survival. Furthermore, predominant tumor subtype, KRAS gene mutation status, and the administration of targeted therapy (potentially subject to selection bias) are identified as independent predictors of progression-free survival in T1 stage patients. Part 2. Integrating mass spectrometry and AI for intraoperative analysis of histological features and EGFR status in lung adenocarcinoma. Objective: This study combines probe electrospray ionization mass spectrometry with artificial intelligence algorithms, training artificial intelligence models on metabolite variation data from tumor tissues. The goal is to extract from fresh tumor tissue the intraoperative frozen section indicators that influence the surgical approach for lung adenocarcinoma, as well as postoperative targeted therapy indicators. Specifically, this includes the identification of high-risk subtypes (micropapillary/solid types), detection of tumor spread through air spaces, and determination of EGFR gene mutation status. This research can overcome the technical bottlenecks of traditional intraoperative frozen pathology for lung adenocarcinoma, providing support for rapid intraoperative decisions on the surgical plan and for personalized targeted therapy strategies after surgery. Materials and Methods: A total of 131 patients with pathologically confirmed lung adenocarcinoma seen at the China-Japan Friendship Hospital between July 2023 and October 2024 were collected. Following quality control screening, 99 patients were included in the study and divided into training and testing sets in a 7:3 or 8:2 ratio. For each patient, 10 mg each of tumor tissue from intraoperative frozen specimens and corresponding adjacent non-tumor tissue (≥5 cm from the tumor) were collected. 
The samples underwent a simple preprocessing protocol: tumor and adjacent tissues were separately ground in an ethanol-water mixture, centrifuged, and diluted to obtain the solution for analysis. Metabolite analysis was conducted using the Shimadzu DPiMS-2020 rapid in situ ionization mass spectrometer. Data preprocessing was performed using LabSolutions software, where chromatographic peak integration values from 0.0 to 0.3 minutes were selected. After filtering data features, partial least squares regression was applied for dimensionality reduction, and the VIP method in MetaboAnalyst® software was used for feature selection. Four artificial intelligence models were constructed using support vector machine, random forest, multilayer perceptron, and gradient boosting classifier algorithms to: distinguish between tumor and adjacent tissues, identify high-risk subtypes (micropapillary/solid types) of lung adenocarcinoma, determine tumor spread through air spaces, and predict the EGFR gene mutation status in tumor tissues. The model for determining tumor spread through air spaces was optimized using a sampling method for imbalanced datasets. Finally, the performance of the models was comprehensively evaluated using accuracy, precision, recall, and F1-Score. Results: Over 10,000 features were extracted from the metabolite data of tumor and/or adjacent tissues of 99 lung adenocarcinoma patients. After 77-461 highly expressed features were selected, 20 key principal components were generated through dimensionality reduction. Subsequently, four artificial intelligence models were trained and tested. The models achieved an accuracy of 95% in distinguishing between tumor and adjacent tissues, 90% in identifying high-risk subtypes of lung adenocarcinoma (micropapillary/solid types), and 100% in detecting tumor spread through air spaces; after optimization, the EGFR gene mutation status prediction model achieved an AUC as high as 1. 
Conclusion: This study integrates probe electrospray ionization mass spectrometry with multiple artificial intelligence algorithms to acquire metabolite data from fresh lung adenocarcinoma and adjacent normal tissue samples within minutes. It enables the differentiation of lung adenocarcinoma from adjacent normal tissue, as well as the assessment of three intraoperative frozen section and postoperative targeted therapy indicators that affect lung adenocarcinoma diagnosis and treatment. The models show high accuracy, can overcome the limitations of traditional morphological diagnosis, and provide further guidance for the selection of surgical procedures and postoperative targeted therapy in lung adenocarcinoma. ﹀
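Part 2 evaluates its classifiers by accuracy, precision, recall, and F1-Score. The following is a minimal Python sketch of how these four metrics relate for a binary task. It is not the thesis's code: the function name `classification_metrics` and the toy label lists are hypothetical, and undefined ratios are returned as 0.0 by assumption.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class.

    Ratios with a zero denominator are reported as 0.0 (an assumption of this sketch).
    """
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced test set: 2 positive cases among 20.
y_true = [1, 1] + [0] * 18
# A degenerate classifier that always predicts the majority (negative) class.
y_pred = [0] * 20

metrics = classification_metrics(y_true, y_pred)
print(metrics["accuracy"])  # 0.9, despite missing every positive case
print(metrics["recall"])    # 0.0
print(metrics["f1"])        # 0.0
```

On an imbalanced set like this toy one, accuracy alone can look high while recall and F1 are zero, which is why the four metrics are usually read together.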
Public release date:	2025-06-05