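Thesis information

Thesis title (Chinese):

 Establishing a lung cancer risk prediction model based on pulmonary nodule characteristics: a study based on a multicenter prospective cohort

Name:

 Wu Zheng (吴峥)

Thesis language:

 chi

Degree:

 Master's

Degree type:

 Academic degree

University:

 Peking Union Medical College

Department:

 Cancer Hospital, Peking Union Medical College

Major:

 Public Health and Preventive Medicine - Epidemiology and Health Statistics

Supervisor:

 Li Ni (李霓)

Thesis completion date: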

 2023-05-01    

Thesis title (foreign language):

 Prediction model of malignant probability of pulmonary nodules: a multicenter prospective cohort study    

Keywords (Chinese):

 lung cancer; pulmonary nodules; prediction model; screening; early diagnosis and early treatment

Keywords (foreign language):

 lung cancer; pulmonary nodules; prediction model; screening; early diagnosis and early treatment

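Abstract (Chinese):

Background:

Lung cancer imposes a heavy disease burden in China, where its incidence and mortality rank among the highest of all malignant tumors, making it a major public health problem that urgently needs to be addressed. Survival is high when lung cancer is detected early but approaches zero at advanced stages, so early diagnosis and early treatment are essential. Low-dose computed tomography (LDCT) screening can detect cancerous lung nodules at an early stage of the disease and is an effective means of reducing the lung cancer burden. However, LDCT screening has a high false-positive rate: many patients who have only benign lesions must still bear the complications of further invasive examinations, radiation exposure, and financial and psychological burden, which adds unnecessary medical burden. International guidelines recommend using lung cancer risk prediction models to lower the false-positive rate and identify lung cancer patients precisely, but China still lacks a reliable model that is developed from high-quality evidence in a large-population, multicenter prospective cohort and that can be rolled out in routine screening practice.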

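Objective:

To reduce the high false-positive rate in lung cancer screening, this study planned to use data from a large-sample, multicenter, prospective Chinese screening cohort. Based on participants in the National Lung Cancer Screening Program who underwent LDCT screening and had suspicious pulmonary nodules detected, it aimed to build a nodule-based lung cancer risk prediction model for precisely identifying lung cancer patients, lowering the screening false-positive rate, and improving screening efficiency.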

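Methods:

The training set came from participants enrolled in the National Lung Cancer Screening Program between 2013 and 2018. Participants' risk was assessed with a risk assessment questionnaire, and those assessed as high risk were invited to LDCT screening. Participants who were assessed as high risk, underwent LDCT screening, and had suspicious nodules detected formed the training set. The validation set came from a hospital-based health examination database (a hospital-based medical examination cohort), which comprises data from participants who underwent LDCT screening at the Cancer Hospital, Chinese Academy of Medical Sciences during 2017 and at Liaocheng No.2 Hospital, Shandong Province, between February 2007 and January 2022. Any patient with at least one suspicious pulmonary nodule was included. During the risk assessment phase, a standardized questionnaire was used to collect epidemiological and clinical information such as sex, age, family history of cancer, and history of chronic respiratory disease. During the LDCT screening phase, imaging information on the pulmonary nodules was obtained, including long and short diameters, mean diameter, density, margin, location, shape, pleural involvement, and calcification.

The study outcome was defined as a confirmed diagnosis of lung cancer within 6 months after screening. Outcome information was obtained and cross-validated through the cancer registry system, diagnoses by specialist physicians, review of the original images, and pathology results.

Development of the nodule-based lung cancer risk prediction model: the single most representative pulmonary nodule of each participant was used for training. A multivariable logistic regression model based on epidemiological and imaging factors was used to estimate the risk of lung cancer. Odds ratios (ORs) and 95% confidence intervals (CIs) were calculated to assess the association between each risk factor and lung cancer. Model variables were selected using the Akaike information criterion (AIC). One model was built in all participants with suspicious pulmonary nodules, and separate models were built in smokers and non-smokers to provide evidence for lung cancer prevention and control among non-smokers in China. The resulting models were externally validated in the independent validation set.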
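The two quantities named in this step can be illustrated with a minimal Python sketch. It assumes a fitted logistic regression coefficient and its standard error are already available; the function names and all numbers below are hypothetical and are not taken from the thesis.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) for a logistic regression coefficient.

    OR = exp(beta); the Wald 95% CI is exp(beta - z*se) to exp(beta + z*se).
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def aic(log_likelihood, n_params):
    """Akaike information criterion, 2k - 2*ln(L); lower values are preferred."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical coefficient and standard error, for illustration only.
or_, lower, upper = odds_ratio_ci(beta=0.693, se=0.20)
print(f"OR = {or_:.2f} (95% CI: {lower:.2f}, {upper:.2f})")

# Hypothetical candidate models: the reduced model drops four variables
# at a small cost in log-likelihood, so its AIC is lower.
full = aic(log_likelihood=-520.0, n_params=10)
reduced = aic(log_likelihood=-521.5, n_params=6)
print("prefer reduced model" if reduced < full else "prefer full model")
```

In AIC-based selection, a variable is kept only if it improves the log-likelihood by more than the penalty of 2 per added parameter; the exact stepwise procedure used in the thesis is not stated in this abstract.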

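Models were also built with different machine learning algorithms to examine how each selects risk factors, compare their predictive performance in the Chinese population, and explore modeling approaches that may suit this population.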

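Results:

Baseline characteristics: as of June 2021, a total of 1,016,740 participants had been enrolled in the National Lung Cancer Screening Program cohort. Of these, 75,981 underwent LDCT screening, and 5,165 were found to have suspicious nodules and formed the training set. The validation set included 1,815 participants. In the training set, 149 participants (2.9%) were diagnosed with lung cancer within six months after enrollment, corresponding to a false-positive rate of 97.1%. In the validation set, 800 participants (44.1%) were diagnosed with lung cancer within six months after enrollment, corresponding to a false-positive rate of 55.9%. The mean age at baseline screening was 58.26 ± 7.66 years in the training set and 52.21 ± 11.09 years in the validation set.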
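The reported percentages follow directly from the counts above. The short Python check below assumes that "false-positive rate" here means the share of participants with suspicious nodules who were not diagnosed with lung cancer within six months; the helper name `pct` is illustrative only.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Counts as reported in the abstract.
train_n, train_cancer = 5165, 149
valid_n, valid_cancer = 1815, 800

train_cancer_pct = pct(train_cancer, train_n)
train_fp_pct = pct(train_n - train_cancer, train_n)
valid_cancer_pct = pct(valid_cancer, valid_n)
valid_fp_pct = pct(valid_n - valid_cancer, valid_n)

print(train_cancer_pct, train_fp_pct)   # 2.9 97.1
print(valid_cancer_pct, valid_fp_pct)   # 44.1 55.9
```

The large gap in cancer prevalence between the two sets (2.9% versus 44.1%) is worth keeping in mind when comparing training and validation performance.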

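Association between baseline characteristics and lung cancer in the training set: between-group comparisons suggested that older and thinner participants had a higher probability of malignancy. Nodules with larger diameter, lower solid component, less calcification, spiculated margins, upper-lobe location, and pleural involvement were generally associated with lung cancer risk. No statistically significant differences were found between participants with and without lung cancer in sex, education level, physical activity, intensity of smoking and passive smoking, family history of lung cancer, history of chronic respiratory disease, emphysema, or nodule shape.

Model development and validation: participant age and five imaging parameters (nodule calcification, density, mean diameter, margin, and pleural involvement) were included in the final model. The AUC of the model was 0.868 (95% CI: 0.839, 0.894) in the training set and 0.751 (95% CI: 0.727, 0.774) in the validation set, indicating good discrimination. The calibration curve showed good agreement between predicted and observed outcomes. The sensitivity and specificity of the model were 70.5% and 70.9%, respectively, and the model was estimated to reduce the false-positive rate of lung cancer screening by 68.8%. The models for smokers (AUC = 0.732, 95% CI: 0.686, 0.778) and non-smokers (AUC = 0.740, 95% CI: 0.712, 0.767) also achieved good discrimination and calibration, and their risk factors and effect sizes were similar.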
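For readers less familiar with these metrics, the following Python sketch shows how AUC, sensitivity, and specificity are computed from predicted risks and observed outcomes. The scores, labels, and the 0.5 threshold are toy values chosen for illustration; they are not the thesis data, and the thesis's own threshold is not given in this abstract.

```python
def auc(scores, labels):
    """AUC as the probability that a random case scores higher than a
    random non-case; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Toy predicted risks and outcomes (1 = lung cancer, 0 = no lung cancer).
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

toy_auc = auc(scores, labels)
sensitivity, specificity = sens_spec(scores, labels, threshold=0.5)
print(f"AUC = {toy_auc:.3f}, sensitivity = {sensitivity:.3f}, "
      f"specificity = {specificity:.3f}")
```

Raising the threshold trades sensitivity for specificity, which is why a single AUC value is usually reported alongside the sensitivity and specificity at the chosen operating point.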

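Model development and comparison across algorithms: machine learning models were built with decision tree, random forest, XGBoost, SVM, and KNN algorithms. The algorithms differed in which risk factors they selected. XGBoost, KNN, and SVM showed a marked advantage in internal validation, with AUCs of 0.930 to 0.980.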

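Conclusions:

The model developed in this study showed good discrimination and calibration, confirming that imaging characteristics of pulmonary nodules can be used for precise lung cancer prediction. The model can help further classify and diagnose suspicious nodules, improving screening efficiency and lowering the false-positive rate.

This study is the first to use data from a large-sample, multicenter, prospective dynamic cohort in China to build a nodule-based lung cancer risk prediction model, to develop corresponding models for specific population subgroups, and to carry out thorough external validation in an independent dataset. In future work, the findings will be externally validated in additional lung cancer screening cohorts, and different algorithms will be used to refine the model and improve its predictive performance, providing theoretical methods and high-quality evidence for lung cancer prevention and control in China.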

Abstract (foreign language):

Background:

Lung cancer is a major public health problem that requires urgent attention. It imposes a heavy disease burden in China, where it ranks among the leading malignant tumors in both incidence and mortality. Early diagnosis and treatment are crucial because survival is high when lung cancer is detected early but close to zero at stage IV. LDCT screening can identify cancerous lung nodules at an early stage and is an effective way to reduce the lung cancer burden. However, LDCT screening has a high false-positive rate: many patients with only benign nodules undergo further invasive examinations and bear the associated complications, radiation exposure, and financial and psychological burden, which adds unnecessary medical burden. International guidelines recommend lung cancer risk prediction models to lower the false-positive rate. However, China still lacks reliable models that are built on high-quality evidence from large multicenter prospective cohorts and that can be applied in routine screening practice.

Objective:

To reduce the high false-positive rate in lung cancer screening, we aimed to build a lung cancer risk prediction model based on pulmonary nodules, using data from a large-sample, multicenter, prospective Chinese screening cohort of participants in the National Lung Cancer Screening Program who underwent LDCT screening and had suspicious pulmonary nodules detected. The model is intended to identify lung cancer patients precisely, reduce overdiagnosis and overtreatment, and improve screening efficiency.

Methods:

1. The training set was drawn from participants enrolled in the National Lung Cancer Screening Program from 2013 to 2018. Participants' risk was assessed with a risk assessment questionnaire, and those assessed as high risk were invited to LDCT screening. Participants who were assessed as high risk, underwent LDCT screening, and had suspicious nodules detected formed the training set. The validation set was derived from a hospital-based physical examination database. It consisted of data from participants who underwent LDCT screening at the Cancer Hospital, Chinese Academy of Medical Sciences during 2017 and at Liaocheng No.2 Hospital in Shandong Province between February 2007 and January 2022. Any patient with at least one suspicious pulmonary nodule was included. During the risk assessment phase, epidemiological and clinical data, including sex, age, family history of cancer, and history of chronic respiratory disease, were collected with a standardized questionnaire. During the LDCT screening phase, imaging data on the pulmonary nodules were obtained, including long and short diameters, mean diameter, density, margin, location, shape, pleural involvement, and calcification.

2. The study outcome was defined as a diagnosis of lung cancer within 6 months after screening. Outcome data were obtained and cross-validated through the cancer registry system, diagnoses by specialist physicians, review of the original images, and pathology results.

3. Development of the lung cancer risk prediction model based on pulmonary nodules: the single most representative pulmonary nodule of each participant was used for training. A multivariable logistic regression model based on epidemiological and imaging factors was used to estimate the risk of lung cancer. Odds ratios (ORs) and 95% confidence intervals (CIs) were calculated to assess the association between each risk factor and lung cancer. Model variables were selected using the Akaike information criterion (AIC). One model was built in all participants with suspicious pulmonary nodules, and separate models were built in smokers and non-smokers to provide evidence for lung cancer prevention and control among non-smokers in China. The resulting models were externally validated in the independent validation set.

4. We built models with different machine learning algorithms, examined how each selects risk factors, compared their predictive performance in the Chinese population, and explored modeling approaches that may suit this population.

 

Results:

1. Baseline characteristics: as of June 2021, a total of 1,016,740 participants had been enrolled in the National Lung Cancer Screening Program cohort. Of these, 75,981 underwent LDCT screening, and 5,165 were found to have suspicious nodules and formed the training set. The validation set included 1,815 participants. In the training set, 149 participants (2.9%) were diagnosed with lung cancer within 6 months after enrollment, corresponding to a false-positive rate of 97.1%. In the validation set, 800 participants (44.1%) were diagnosed with lung cancer within 6 months after enrollment, corresponding to a false-positive rate of 55.9%. The mean age at baseline screening was 58.26 ± 7.66 years in the training set and 52.21 ± 11.09 years in the validation set.

2. Association between baseline characteristics and lung cancer in the training set: between-group comparisons suggested that older and thinner participants had a higher probability of malignancy. Nodules with larger diameter, lower solid component, less calcification, spiculated margins, upper-lobe location, and pleural involvement were generally associated with lung cancer risk. No statistically significant differences were found between participants with and without lung cancer in sex, education level, physical activity, intensity of smoking and passive smoking, family history of lung cancer, history of chronic respiratory disease, emphysema, or nodule shape.

3. Model development and validation: participant age and five imaging parameters (nodule calcification, density, mean diameter, margin, and pleural involvement) were included in the final model. The AUC of the model was 0.868 (95% CI: 0.839, 0.894) in the training set and 0.751 (95% CI: 0.727, 0.774) in the validation set, indicating good discrimination. The calibration curve showed good agreement between predicted and observed outcomes. The sensitivity and specificity of the model were 70.5% and 70.9%, respectively. In simulation, the model was estimated to reduce the false-positive rate of lung cancer screening by 68.8%. The models for smokers (AUC = 0.732, 95% CI: 0.686, 0.778) and non-smokers (AUC = 0.740, 95% CI: 0.712, 0.767) also achieved good discrimination and calibration, and their risk factors and effect sizes were similar.

4. Model development and comparison across algorithms: machine learning models were built with decision tree, random forest, XGBoost, SVM, and KNN algorithms. The algorithms differed in which risk factors they selected. XGBoost, KNN, and SVM showed a marked advantage in internal validation, with AUCs of 0.930 to 0.980.

Conclusions:

The model developed in this study showed good discrimination and calibration, confirming that imaging characteristics of pulmonary nodules can be used for precise lung cancer prediction. The model can help further classify and diagnose suspicious nodules, improving screening efficiency and lowering the false-positive rate.

This study is the first to use data from a large-sample, multicenter, prospective dynamic cohort in China to build lung cancer risk prediction models based on pulmonary nodules, to develop corresponding models for specific population subgroups, and to carry out external validation in an independent dataset. In future work, the findings will be externally validated in additional lung cancer screening cohorts, and different algorithms will be used to refine the model and improve its predictive performance, in order to provide theoretical methods and high-quality evidence for the prevention and control of lung cancer in China.

Open access date:

 2023-05-26    
