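Thesis title (Chinese): | Application of the Stacking algorithm in prospective risk assessment of type 2 diabetes |
Name: | |
Thesis language: | chi |
Degree: | Master's |
Degree type: | Academic degree |
Institution: | Peking Union Medical College |
Department: | |
Major: | |
Supervisor name: | |
On-campus supervisory committee members (comma-separated): | |
Thesis completion date: | 2023-05-17 |
Thesis title (foreign language): | Application of the Stacking algorithm in the risk assessment of Type 2 diabetes mellitus: a prospective study |
Keywords (Chinese): | |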
Keywords (foreign language): | Type 2 diabetes mellitus; Risk assessment model; Ensemble learning; Explainable machine learning |
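Abstract (Chinese): |
Objective: This study explores the potential of a Stacking-based ensemble model for building a prospective risk assessment model for type 2 diabetes among Chinese adults aged 45 years and older, and uses interpretability methods to explain the decision process of complex machine learning algorithms. It is intended to provide a theoretical basis and technical support for applying the Stacking method and interpretability analysis to the prevention and control of type 2 diabetes in middle-aged and elderly people.
Methods: Data came from the 2011 and 2015 surveys of the China Health and Retirement Longitudinal Study, from which 8 063 adults aged 45 years and older were selected as study subjects. Sociodemographic characteristics, anthropometric measurements, biochemical indicators and other factors were collected by questionnaire, physical examination and laboratory testing and used as predictors. Logistic regression was used to explore the association between the candidate predictors and incident type 2 diabetes in middle-aged and elderly people. The predictors entered the Logistic regression as predictor combination 1 (continuous variables: age, systolic blood pressure, diastolic blood pressure, waist circumference, high-density lipoprotein cholesterol, triglycerides, glycosylated hemoglobin, fasting blood glucose; categorical variables: gender, BMI, self-reported hypertension) and as predictor combination 2 (categorical variables: age, gender, BMI, hypertension status, waist circumference, high-density lipoprotein cholesterol, triglycerides, glycosylated hemoglobin, fasting blood glucose). Machine learning models were trained and internally validated with Python 3.7.6 and R 4.2.1. Ten-fold cross-validation was used to split the data into training and test sets. Within the training set, class imbalance was handled and the Logistic regression, random forest, LightGBM and Stacking ensemble models were trained. The random forest and LightGBM models were tuned by random search combined with 5-fold cross-validation, and the tuned models were used in the Stacking method. All models were internally validated on the test set. The area under the receiver operating characteristic curve (AUC) and the Brier score were calculated, together with sensitivity, specificity and accuracy at the default cut-off value (0.5). Built-in feature importance, the SHAP (SHapley Additive exPlanations) method and the LIME (local interpretable model-agnostic explanations) method were used for global and local interpretability analysis of the random forest, LightGBM and Stacking ensemble models; for the Stacking ensemble model, only LIME was used, for local interpretability analysis. Finally, the risk assessment model was presented and applied through a web page.
Results: A total of 8 063 subjects were included, of whom 1 088 (about 13.5%) developed type 2 diabetes in 2015. Logistic regression based on predictor combination 1 showed that age (HR=1.018, 95% CI: 1.009~1.026, P<0.001), BMI≥28.0 kg/m² (HR=1.702, 95% CI: 1.342~2.160, P<0.001), waist circumference (HR=1.012, 95% CI: 1.005~1.020, P=0.001), triglycerides (HR=1.001, 95% CI: 1.000~1.002, P=0.004), glycosylated hemoglobin (HR=2.221, 95% CI: 1.876~2.630, P<0.001) and fasting blood glucose (HR=1.013, 95% CI: 1.007~1.019, P<0.001) were risk factors for type 2 diabetes in middle-aged and elderly people. Self-reported hypertension (HR=0.780, 95% CI: 0.667~0.912, P=0.002) was a protective factor. Logistic regression based on predictor combination 2 showed that age 50~64 years (HR=1.385, 95% CI: 1.150~1.667, P=0.001) and ≥65 years (HR=1.747, 95% CI: 1.418~2.152, P<0.001), hypertension (HR=1.330, 95% CI: 1.144~1.546, P<0.001), BMI≥28.0 kg/m² (HR=1.954, 95% CI: 1.587~2.407, P<0.001), central obesity (HR=1.464, 95% CI: 1.232~1.739, P<0.001), a glycosylated hemoglobin level of 5.7~6.4% (HR=1.366, 95% CI: 1.195~1.562, P<0.001) and a fasting blood glucose level of 100~125 mg/dl (HR=2.257, 95% CI: 1.794~2.841, P<0.001) were risk factors for type 2 diabetes in middle-aged and elderly people. Machine learning models were built separately on predictor combination 1 and predictor combination 2. With predictor combination 1, the Stacking-based ensemble model achieved on the test sets a mean AUC of 0.662±0.046, a mean sensitivity of 0.593±0.063, a mean specificity of 0.642±0.019, a mean accuracy of 0.635±0.021 and a mean Brier score of 0.232±0.005; its AUC was higher than that of Logistic regression, random forest and LightGBM, and its sensitivity was higher than that of random forest and LightGBM. With predictor combination 2, the AUC (0.634±0.018) and sensitivity (0.592±0.058) of the Stacking method were again higher than those of the other machine learning models, but all metrics were lower than those of the Stacking ensemble model built on predictor combination 1. The Stacking ensemble model built on predictor combination 1 was therefore selected as the optimal model for subsequent analysis. Built-in variable importance and the SHAP method were used for global interpretability analysis of the random forest, LightGBM and Stacking ensemble models. Built-in variable importance showed that waist circumference, glycosylated hemoglobin, fasting blood glucose and triglycerides had high mean relative importance in both the random forest and LightGBM models and ranked in the top 5 variables of both. SHAP-based global interpretability analysis showed that, in random forest and LightGBM, waist circumference, glycosylated hemoglobin, fasting blood glucose, mean systolic blood pressure, mean diastolic blood pressure, BMI, age and triglycerides were positively associated with being predicted as a type 2 diabetes case. The feature dependence plots of random forest and LightGBM were similar: waist circumference, glycosylated hemoglobin, fasting blood glucose, mean systolic blood pressure and BMI showed a non-linear association with SHAP values. One true positive instance and one true negative instance were analyzed for local interpretability. Summarizing the SHAP and LIME local interpretability results for random forest, LightGBM and the Stacking method, for the true positive instance, higher fasting blood glucose, glycosylated hemoglobin and triglycerides may increase the sample's risk of type 2 diabetes; for the true negative instance, lower waist circumference, glycosylated hemoglobin, fasting blood glucose and BMI were unlikely to increase the sample's risk of type 2 diabetes. These results were consistent with those of the Logistic regression and the global interpretability analysis.
Conclusion: Taking the Logistic regression and interpretability results together, waist circumference, BMI, fasting blood glucose and glycosylated hemoglobin are important risk factors and predictors of future type 2 diabetes in middle-aged and elderly people. Although all models built in this study generalized poorly, the results still indicate that a Stacking-based ensemble model has an advantage on data in which the variable-to-variable and variable-to-outcome associations are complex. An ensemble learning model that combines different machine learning algorithms can draw on the strengths of the individual models and compensate for their shortcomings to obtain better results. The model was presented through a web page with the local interpretability results embedded. This lets users obtain predictions directly without dealing with the computation, and also flags an individual's risk factors so that preventive interventions can be started early to reduce disease risk. |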
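The methods above handle class imbalance inside each training fold of a 10-fold cross-validation and leave the test fold untouched. A minimal sketch of that idea follows, using only the Python standard library. The abstract does not name the resampling technique, so random oversampling of the minority class is an assumption here, as are the stratified split, the helper names `stratified_folds` and `oversample_minority`, and the synthetic labels, which only reuse the reported class counts (1 088 cases among 8 063 subjects).

```python
import random
from collections import Counter


def stratified_folds(labels, n_splits=10, seed=0):
    """Assign every sample index to one of n_splits folds with similar class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds


def oversample_minority(train_idx, labels, seed=0):
    """Randomly duplicate minority-class indices until the classes are equal in size.

    Assumption: the source does not say which imbalance method was used.
    """
    rng = random.Random(seed)
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    return balanced


# Synthetic labels (not study data): 1 = incident type 2 diabetes, 0 = no diabetes.
labels = [1] * 1088 + [0] * (8063 - 1088)
folds = stratified_folds(labels, n_splits=10, seed=42)

for k, test_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    # Only the training fold is rebalanced; the test fold keeps its natural prevalence.
    train_bal = oversample_minority(train_idx, labels, seed=k)
    counts = Counter(labels[i] for i in train_bal)
    test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
    print(k, dict(counts), round(test_rate, 3))
```

Resampling before the split would let copies of the same subject appear in both training and test folds, which is why the sketch rebalances only after the test fold has been set aside.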
Abstract (foreign language): |
Objective To evaluate the performance of a Stacking model in building a prospective risk assessment model for type 2 diabetes in middle-aged and elderly people in China; to explain the decision process of complex machine learning algorithms using multiple explainability methods; and to provide a theoretical basis and technical support for applying the Stacking method and explainability analysis to the prevention and control of type 2 diabetes in middle-aged and elderly people.
Methods Data were drawn from the 2011 and 2015 surveys of the China Health and Retirement Longitudinal Study, from which 8 063 middle-aged and elderly people were selected as subjects. Sociodemographic characteristics, body measurements, biochemical indicators and other factors were collected by questionnaire, physical examination and laboratory examination and used as predictors. Logistic regression was used to explore the association between the predictors and type 2 diabetes in middle-aged and elderly people. The predictors entered the Logistic regression analysis as predictor combination I (continuous variables: age, systolic blood pressure, diastolic blood pressure, waist circumference, high-density lipoprotein cholesterol, triglycerides, glycosylated hemoglobin, fasting blood glucose; categorical variables: gender, BMI and self-reported hypertension) and as predictor combination II (categorical variables: age, gender, BMI, hypertension status, waist circumference, high-density lipoprotein cholesterol, triglycerides, glycosylated hemoglobin, fasting blood glucose). Machine learning models were trained and internally validated with Python 3.7.6 and R 4.2.1. Ten-fold cross-validation was used to split the data into training and test sets. In the training set, imbalanced data were processed and the Logistic regression, random forest, LightGBM and Stacking models were trained. Random forest and LightGBM were tuned by random search combined with five-fold cross-validation, and the tuned models were used in the Stacking method. All models were internally validated on the test sets. The area under the receiver operating characteristic curve (AUC) and the Brier score were calculated, together with sensitivity, specificity and accuracy at the default cut-off value (0.5). The models' built-in feature importance, the SHapley Additive exPlanations (SHAP) method and the local interpretable model-agnostic explanations (LIME) method were used for global and local interpretability analysis of the random forest, LightGBM and Stacking models. For the Stacking model, only LIME was used, for local interpretability analysis. Finally, a web page was used to present and apply the risk assessment model.
Results A total of 8 063 subjects were included in the study, of whom 1 088 (about 13.5%) developed type 2 diabetes in 2015. Logistic regression based on predictor combination I showed that age (HR=1.018, 95% CI: 1.009~1.026, P<0.001), BMI≥28.0 kg/m² (HR=1.702, 95% CI: 1.342~2.160, P<0.001), waist circumference (HR=1.012, 95% CI: 1.005~1.020, P=0.001), triglycerides (HR=1.001, 95% CI: 1.000~1.002, P=0.004), glycosylated hemoglobin (HR=2.221, 95% CI: 1.876~2.630, P<0.001) and fasting blood glucose (HR=1.013, 95% CI: 1.007~1.019, P<0.001) were risk factors for type 2 diabetes in middle-aged and older adults. Self-reported hypertension (HR=0.780, 95% CI: 0.667~0.912, P=0.002) was a protective factor. Logistic regression based on predictor combination II showed that age 50~64 years (HR=1.385, 95% CI: 1.150~1.667, P=0.001) and ≥65 years (HR=1.747, 95% CI: 1.418~2.152, P<0.001), hypertension (HR=1.330, 95% CI: 1.144~1.546, P<0.001), BMI≥28.0 kg/m² (HR=1.954, 95% CI: 1.587~2.407, P<0.001), central obesity (HR=1.464, 95% CI: 1.232~1.739, P<0.001), a glycosylated hemoglobin level of 5.7~6.4% (HR=1.366, 95% CI: 1.195~1.562, P<0.001) and a fasting blood glucose level of 100~125 mg/dl (HR=2.257, 95% CI: 1.794~2.841, P<0.001) were risk factors for type 2 diabetes in middle-aged and older adults. Machine learning models were built separately on predictor combination I and predictor combination II. The Stacking model based on predictor combination I had, on the test sets, a mean AUC of 0.662±0.046, a mean sensitivity of 0.593±0.063, a mean specificity of 0.642±0.019, a mean accuracy of 0.635±0.021 and a mean Brier score of 0.232±0.005. Its AUC was higher than that of Logistic regression, random forest and LightGBM, and its sensitivity was higher than that of random forest and LightGBM. With predictor combination II, the Stacking model again had a higher mean AUC (0.634±0.018) and mean sensitivity (0.592±0.058) than the other machine learning models, but all of its metrics were lower than those of the Stacking model based on predictor combination I. The Stacking model based on predictor combination I was therefore chosen as the optimal model for subsequent analysis. Global interpretability analysis of the random forest, LightGBM and Stacking models was performed using built-in feature importance and the SHAP method. Built-in feature importance showed that waist circumference, glycosylated hemoglobin, fasting blood glucose and triglycerides had high mean relative importance in both the random forest and LightGBM models and ranked in the top 5 variables of both. SHAP-based global interpretability analysis showed that waist circumference, glycosylated hemoglobin, fasting blood glucose, mean systolic blood pressure, mean diastolic blood pressure, BMI, age and triglycerides were positively associated with a positive prediction of type 2 diabetes in random forest and LightGBM. The feature dependence plots of random forest and LightGBM were similar: waist circumference, glycosylated hemoglobin, fasting blood glucose, mean systolic blood pressure and BMI showed a non-linear association with SHAP values. One true positive instance and one true negative instance were analyzed for local interpretability. Summarizing the SHAP and LIME local interpretability results for the random forest, LightGBM and Stacking models, for the true positive instance, higher fasting blood glucose, glycosylated hemoglobin and triglycerides contributed more to the positive prediction. For the true negative instance, lower waist circumference, glycosylated hemoglobin, fasting blood glucose and BMI were unlikely to increase the sample's risk of type 2 diabetes. These results were consistent with those of the Logistic regression and the global interpretability analysis.
Conclusion Based on the Logistic regression and interpretability results, waist circumference, BMI, fasting blood glucose and glycosylated hemoglobin are important risk factors and predictors of future type 2 diabetes in middle-aged and elderly people. Although all models built in this study generalized poorly, the results still indicate that the Stacking model has an advantage on data in which the associations among variables and between variables and the outcome are complex. An ensemble learning model that integrates different machine learning algorithms can combine the strengths of the individual models and compensate for their shortcomings to obtain better results. The model is presented through a web page with the local interpretability results embedded. This lets users obtain predictions directly without dealing with the computation, and also reminds individuals of their risk factors so that preventive interventions can be started early to reduce disease risk. |
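The evaluation metrics named in the abstract can be written out in a few lines. The sketch below is illustrative only: the function names and the toy labels and probabilities are hypothetical and are not the study's data, and treating a probability exactly equal to the cut-off as a positive prediction is an assumption.

```python
def auc_score(y_true, y_prob):
    """AUC as the chance that a random case is ranked above a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(y_true, y_prob, cutoff=0.5):
    """Sensitivity, specificity and accuracy after thresholding at `cutoff`."""
    # Assumption: a probability equal to the cut-off is classified as positive.
    y_pred = [1 if p >= cutoff else 0 for p in y_prob]
    tp = sum(1 for y, h in zip(y_true, y_pred) if y == 1 and h == 1)
    tn = sum(1 for y, h in zip(y_true, y_pred) if y == 0 and h == 0)
    fp = sum(1 for y, h in zip(y_true, y_pred) if y == 0 and h == 1)
    fn = sum(1 for y, h in zip(y_true, y_pred) if y == 1 and h == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy example (hypothetical values): 3 cases and 5 non-cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]

print("AUC:", round(auc_score(y_true, y_prob), 3))
print(threshold_metrics(y_true, y_prob, cutoff=0.5))
print("Brier:", round(brier_score(y_true, y_prob), 3))
```

AUC and the Brier score are computed from the predicted probabilities and do not depend on the cut-off, whereas sensitivity, specificity and accuracy change with it. With an outcome prevalence of about 13.5%, that choice of cut-off matters for the reported sensitivity.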
Public release date: | 2023-06-30 |