Thesis information

Title (Chinese):

 结直肠肿瘤相关DNA甲基化生物标志物的筛选、 验证及风险预测模型研究    

Author:

 李娜    

Thesis language:

 chi    

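Degree:

 Doctorate

Degree type:

 Academic degree

Institution:

 Peking Union Medical College (北京协和医学院)

Department:

 Cancer Hospital, Peking Union Medical College (北京协和医学院肿瘤医院)

Major:

 Public Health and Preventive Medicine - Epidemiology and Health Statistics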

Supervisor:

 代敏    

On-campus advisory committee members (comma-separated):

 魏文强, 汪红英

Thesis completion date:

 2025-04-01    

Title (foreign language):

 Identification, validation, and risk prediction modeling of DNA methylation biomarkers for colorectal neoplasms    

Keywords (Chinese):

 结直肠癌 筛查 风险预测 DNA甲基化    

Keywords (foreign language):

 Colorectal cancer; Screening; Risk prediction; DNA methylation

Abstract (Chinese):

研究背景

结直肠癌是常见高发的恶性肿瘤,严重威胁居民健康。开展有效的筛查与早诊早治是降低结直肠癌发病率和死亡率的重要手段,而精准识别高危人群则是实现精准筛查的关键环节。现有的结直肠癌风险预测模型主要基于个体特征和生活方式因素构建,预测效能有限,难以满足人群精准筛查需求。近年来,白细胞DNA甲基化因其在反映宿主免疫状态、炎症反应及其与肿瘤发生机制间的关联中表现出独特优势,逐渐成为癌症风险预测生物标志物研究的重要方向。从结直肠癌发生发展自然史角度,大多数结直肠癌具有较为明确的癌前病变,如绒毛状腺瘤、高级别上皮内瘤变等。因此,有效识别结直肠肿瘤(包括结直肠癌和癌前病变)对推动其筛查与早诊早治具有重要意义。基于此,本研究拟筛选并验证与结直肠肿瘤相关的新型白细胞DNA甲基化生物标志物,并据此构建甲基化风险预测模型,以提高高危个体的识别效率,进而推动结直肠癌的精准化筛查和个体化防控策略的实施。

研究方法

本研究采用两阶段研究设计,对候选生物标志物进行筛选和验证,基于前瞻性结直肠癌人群筛查队列和临床机会性筛查队列,采用病例对照研究方法开展分析。共纳入346例符合研究条件的参与者,包括结直肠癌患者(80例)、进展期腺瘤患者(136例)及健康对照(130例)。研究在参与者结肠镜检查前采集其血液样本,并用于后续的白细胞DNA甲基化水平检测。经实验室质检后,剔除21例不合格样本,最终纳入325例合格样本用于后续分析,包括结直肠癌患者(70例)、进展期腺瘤患者(126例)及健康对照(129例)。在生物标志物筛选阶段,首先基于前瞻性结直肠癌人群筛查队列(发现集Ⅰ),采用全基因组Illumina 935K微阵列芯片检测白细胞DNA甲基化水平,分析56例进展期腺瘤和50例健康对照,以筛选进展期腺瘤特异性差异甲基化位点(Differentially Methylated Positions,DMPs)和差异甲基化区域(Differentially Methylated Regions,DMRs);其次,基于临床机会性筛查队列(发现集Ⅱ),采用简化代表性亚硫酸氢盐测序(Reduced Representation Bisulfite Sequencing,RRBS)技术检测白细胞DNA甲基化水平,对22例结直肠癌、20例进展期腺瘤及30例健康对照的合格样本进行分析,以识别结直肠癌和进展期腺瘤特异性DMRs;此外,通过系统性文献回顾,筛选已有研究报道的结直肠肿瘤相关血液DNA甲基化生物标志物。综合以上三部分结果,确定候选标志物名单。在标志物验证阶段,本研究遵循“先预实验摸索条件,再正式实验”的原则,在前瞻性结直肠癌人群筛查队列和临床机会性筛查队列的独立样本集中,采用靶向亚硫酸氢盐测序(Targeted Bisulfite Sequencing,TBS)检测白细胞DNA甲基化水平,分析48例结直肠癌、50例进展期腺瘤及49例健康对照的合格样本,最终确认可用于结直肠肿瘤风险预测的DNA甲基化生物标志物。

针对验证后的标志物,基于验证集进行模型开发。为确保变量筛选结果的稳定性和可解释性,本研究利用弹性网络正则化(Elastic Net regularization,Elastic-Net)和最小绝对收缩和选择算子(Least Absolute Shrinkage and Selection Operator,LASSO)等机器学习算法,并结合传统的Logistic逐步回归统计方法,分别进行1000次Bootstrap自助抽样以筛选预测变量特征,通过交叉验证确定性能最优的标志物组合。随后,采用Logistic回归方法构建白细胞DNA甲基化风险预测模型,并通过最佳截断值计算其灵敏度和特异度,绘制受试者工作特征曲线(Receiver Operating Characteristic Curve,ROC)并计算曲线下面积(Area Under the Curve,AUC)及其95%置信区间(Confidence Interval,CI),以评估模型的区分能力。基于此,进一步利用Logistic回归模型构建生活方式风险预测模型,并整合甲基化生物标志物与生活方式危险因素,构建联合风险预测模型。通过计算净重新分类指数(Net Reclassification Index,NRI)和综合判别改善指数(Integrated Discrimination Improvement,IDI),对各类模型的预测性能进行比较和评估。此外,采用1000次Bootstrap自助抽样对模型进行校准,计算校准后AUC值,以进一步评价模型的稳定性。同时,为深入解析白细胞DNA甲基化特征与结直肠肿瘤的关联机制,本研究对各阶段获得的DMPs和DMRs,利用美国国家生物技术信息中心(National Center for Biotechnology Information,NCBI)和加州大学圣克鲁兹分校(University of California Santa Cruz Genome Browser,UCSC)基因组数据库在线工具进行基因功能注释,并开展基因本体(Gene Ontology,GO)分析、京都基因与基因组百科全书(Kyoto Encyclopedia of Genes and Genomes,KEGG)通路分析及Enrichr富集分析,以揭示候选甲基化生物标志物的潜在生物学功能和意义。

研究结果

在生物标志物发现阶段,本研究基于全基因组Illumina 935K微阵列芯片检测,共识别出70个进展期腺瘤特异性DMPs和20个特异性DMRs;同时,通过RRBS技术进一步识别出36个结直肠癌特异性DMRs和41个进展期腺瘤特异性DMRs。此外,结合文献检索补充引入基于全基因组Illumina 850K微阵列芯片检测获得的10个结直肠癌特异性DMPs。对发现集Ⅰ和Ⅱ中所有差异甲基化区域和位点进行基因注释后发现,约65%的DMPs覆盖基因分布于OpenSea区域,约23%分布于CpG岛Shore区域,而DMRs主要分布于基因体、内含子及启动子区域。相应的富集分析显示,结直肠肿瘤相关的差异甲基化生物标志物主要涉及细胞信号传导、DNA结合与转录调控等生物学过程,显著富集于PI3K-Akt、cGMP-PKG及cAMP等癌症相关信号通路,其中覆盖启动子区域的DMRs在基因表达调控中作用尤为显著。

基于发现阶段筛选出的差异甲基化生物标志物,本研究设计了200对特异性引物,覆盖79个位点和104个区域片段,共涉及959个CpG位点。通过独立验证集的靶向甲基化测序(TBS)技术的检测,最终筛选出3个结直肠肿瘤特异性DMPs和11个特异性DMRs,作为候选白细胞DNA甲基化生物标志物。GO富集分析显示,这些候选DMPs与DMRs主要参与发育及形态发生相关的生物过程,在分子功能层面与DNA结合及转录激活(尤其是RNA聚合酶II活性)密切相关。KEGG富集分析则进一步揭示其在癌症信号通路、炎性肠病通路及骨骼胚胎发育通路中具有重要生物学意义。

在模型构建阶段,基于1000次Bootstrap自助抽样,并结合多重机器学习方法,从14个候选标志物中筛选出“纳入模型频次>700”的5个标志物,包括DMR_935K33(chr4:4859985-4860551)、DMR_RRBS2(chr3:3170109-3170139)、DMR_RRBS35(chr7:99818709-99818869)、DMR_RRBS39(chr9:139582466-139582662)和DMR_RRBS53(chr16:55866678-55866757),最终构建白细胞DNA甲基化风险预测模型。模型中,单个标志物识别进展期腺瘤的灵敏度为64.00%–90.00%,特异度为53.10%–79.60%;识别结直肠癌的灵敏度为45.80%–77.10%,特异度为61.20%–91.80%。整体模型对于结直肠肿瘤的分类效能AUC值在0.88以上,灵敏度为83.88%–88.00%,特异度为78.60%–87.76%。经Bootstrap校准后,模型效能依然稳定,AUC为0.85(95%CI:0.74–0.94),显著优于传统生活方式预测模型(AUC=0.55,95%CI:0.46–0.68)。进一步将甲基化生物标志物与生活方式因素联合建模后,模型分类效能提升至AUC=0.89(95% CI:0.83–0.94),但经校准后AUC略微下降至0.82(95% CI:0.72–0.92)。最后的基因注释分析显示,纳入模型的甲基化生物标志物覆盖多个功能基因,其中PVRIG、STAG3及CES1的高甲基化表达可使结直肠肿瘤风险增加1.67–1.94倍,而TRNT1的低甲基化表达则使结直肠肿瘤风险增加2.56倍,MSX1基因的启动子区域的低甲基化可使进展期腺瘤的发病风险增加2.13倍,提示上述标志物在肿瘤发生过程中可能具有潜在的调控作用和临床应用价值。

研究结论

本研究成功筛选并验证了3个结直肠肿瘤相关白细胞DNA甲基化差异位点和11个差异区域作为生物标志物。并在此基础上构建了由5个核心标志物组成的风险预测模型,其对于结直肠肿瘤的预测AUC达到了0.85以上,在识别结直肠癌高危人群方面具有较好的分类效能和临床适应性。未来若与成熟的风险预测工具相结合,有望进一步提升对高危个体的识别准确度。此外,模型中核心标志物的基因注释显示,STAG3、PVRIG、CES1基因的高甲基化,以及MSX1基因启动子区域和TRNT1基因的低甲基化,可能与结直肠肿瘤的发生风险增加相关。本研究为结直肠癌的精准筛查和个体化预防提供了新型生物标志物和有效风险预测手段,具有重要的推广应用前景。

Abstract (foreign language):

Background

Colorectal cancer (CRC) is a common malignancy that poses a serious threat to public health. Effective screening with early diagnosis and treatment is an important means of reducing CRC incidence and mortality, and accurately identifying high-risk individuals is a critical step toward precision screening. Current CRC risk prediction models are based primarily on individual characteristics and lifestyle factors; their predictive performance is limited and does not meet the needs of population-based precision screening. In recent years, leukocyte DNA methylation has become an important direction in research on cancer risk prediction biomarkers, because it reflects host immune status, inflammatory responses, and their links to tumorigenesis. In the natural history of CRC, most cases arise from well-defined precancerous lesions, such as villous adenomas and high-grade intraepithelial neoplasia. Identifying colorectal neoplasms, including both CRC and its precancerous lesions, is therefore important for promoting screening, early diagnosis, and early treatment. Accordingly, this study aims to identify and validate novel leukocyte DNA methylation biomarkers associated with colorectal neoplasms and to build a methylation-based risk prediction model, with the goal of identifying high-risk individuals more efficiently and supporting precision screening and individualized prevention of CRC.

Methods

A two-phase study design was adopted to screen and validate candidate biomarkers, using a case-control approach within a prospective population-based CRC screening cohort and a clinical opportunistic screening cohort. A total of 346 eligible participants were included: 80 CRC cases, 136 advanced adenoma cases, and 130 healthy controls. Blood samples were collected before colonoscopy and used to measure leukocyte DNA methylation levels. After laboratory quality control, 21 substandard samples were excluded, leaving 325 qualified samples for analysis: 70 CRC cases, 126 advanced adenoma cases, and 129 healthy controls.

In the biomarker discovery phase, three sources were used: (1) In the prospective population-based CRC screening cohort (Discovery Set I), genome-wide leukocyte DNA methylation profiling was performed using the Illumina 935K microarray in 56 advanced adenoma patients and 50 healthy controls to identify differentially methylated positions (DMPs) and differentially methylated regions (DMRs) specific to advanced adenomas. (2) In the clinical opportunistic screening cohort (Discovery Set II), reduced representation bisulfite sequencing (RRBS) was used to analyze leukocyte DNA methylation in qualified samples from 22 CRC cases, 20 advanced adenoma cases, and 30 healthy controls to identify CRC-specific and advanced adenoma-specific DMRs. (3) Previously reported blood-based DNA methylation biomarkers associated with colorectal neoplasms were identified through a systematic literature review. The findings from all three sources were integrated to determine the list of candidate biomarkers.

In the biomarker validation phase, pilot experiments were run first to optimize conditions, followed by formal validation in independent sample sets from the prospective population-based CRC screening cohort and the clinical opportunistic screening cohort. Targeted bisulfite sequencing (TBS) was used to measure leukocyte DNA methylation levels in qualified samples from 48 CRC cases, 50 advanced adenoma cases, and 49 healthy controls, in order to confirm the DNA methylation biomarkers suitable for colorectal neoplasm risk prediction.
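The sample counts reported above can be cross-checked with a short tally. The sketch below only restates the numbers given in the Methods; the dictionary layout and variable names are illustrative assumptions, not part of the thesis.

```python
# Qualified samples per analysis set, as reported in the Methods.
# Discovery Set I contained no CRC cases (advanced adenomas vs. controls only).
sets = {
    "Discovery Set I (935K microarray)": {"CRC": 0, "advanced adenoma": 56, "control": 50},
    "Discovery Set II (RRBS)": {"CRC": 22, "advanced adenoma": 20, "control": 30},
    "Validation set (TBS)": {"CRC": 48, "advanced adenoma": 50, "control": 49},
}

groups = ("CRC", "advanced adenoma", "control")

# Sum each group across the three sets.
totals = {g: sum(counts[g] for counts in sets.values()) for g in groups}

enrolled = 346   # eligible participants before quality control
excluded = 21    # samples failing laboratory quality control

print(totals)                # {'CRC': 70, 'advanced adenoma': 126, 'control': 129}
print(sum(totals.values()))  # 325
print(enrolled - excluded)   # 325
```

The per-group totals (70, 126, 129) and the overall total of 325 match the qualified sample counts stated in the Methods.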

For the validated biomarkers, model development was carried out in the validation set. To make variable selection stable and interpretable, Elastic Net regularization, the Least Absolute Shrinkage and Selection Operator (LASSO), and stepwise logistic regression were each run on 1,000 bootstrap resamples to select predictive features, and cross-validation was used to determine the best-performing biomarker combination. Logistic regression was then used to build the leukocyte DNA methylation risk prediction model. Sensitivity and specificity were calculated at the optimal cutoff value, and discrimination was evaluated with the receiver operating characteristic (ROC) curve, the area under the curve (AUC), and its 95% confidence interval (CI). A lifestyle-based risk prediction model was also built with logistic regression, and a combined model integrating methylation biomarkers with lifestyle risk factors was developed. The models were compared using the net reclassification index (NRI) and the integrated discrimination improvement (IDI). Finally, the models were calibrated with 1,000 rounds of bootstrap resampling, and the calibrated AUC values were used to evaluate model stability.
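The evaluation metrics named above can be illustrated with a minimal, dependency-free sketch. Several choices here are assumptions, since the abstract does not specify them: the optimal cutoff is taken as the one maximizing the Youden index, the NRI is the category-free (continuous) version, and the predicted risks are made-up toy values rather than study data.

```python
def auc(scores, labels):
    """AUC as the probability that a case scores higher than a control (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) maximizing sens + spec - 1.

    A sample is called positive when score >= cutoff; ties keep the lowest cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= c)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < c)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec > best[1] + best[2]:
            best = (c, sens, spec)
    return best


def idi(p_new, p_old, labels):
    """IDI: change in (mean risk of cases - mean risk of controls) from old to new model."""
    def slope(p):
        pos = [x for x, y in zip(p, labels) if y == 1]
        neg = [x for x, y in zip(p, labels) if y == 0]
        return sum(pos) / len(pos) - sum(neg) / len(neg)
    return slope(p_new) - slope(p_old)


def continuous_nri(p_new, p_old, labels):
    """Category-free NRI: net upward movement in cases plus net downward movement in controls."""
    def net_up(target):
        pairs = [(a, b) for a, b, y in zip(p_new, p_old, labels) if y == target]
        up = sum(1 for a, b in pairs if a > b)
        down = sum(1 for a, b in pairs if a < b)
        return (up - down) / len(pairs)
    return net_up(1) - net_up(0)


# Toy predicted risks (hypothetical): 4 cases followed by 4 controls.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
p_old = [0.6, 0.4, 0.7, 0.5, 0.5, 0.3, 0.6, 0.4]   # e.g. a lifestyle-only model
p_new = [0.9, 0.6, 0.8, 0.7, 0.3, 0.2, 0.65, 0.4]  # e.g. a model with added markers

print(auc(p_old, labels), auc(p_new, labels))
print(youden_cutoff(p_new, labels))
print(idi(p_new, p_old, labels), continuous_nri(p_new, p_old, labels))
```

In practice these quantities would usually come from an established statistics package; the plain-Python version is only meant to make the definitions concrete.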

To further investigate the biological mechanisms linking leukocyte DNA methylation to colorectal neoplasms, the DMPs and DMRs obtained at each stage were functionally annotated using online tools from the National Center for Biotechnology Information (NCBI) and the University of California, Santa Cruz (UCSC) Genome Browser. Gene Ontology (GO) analysis, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, and Enrichr enrichment analysis were then conducted to elucidate the potential biological functions and significance of the candidate methylation biomarkers.

Results

In the biomarker discovery phase, genome-wide analysis using the Illumina 935K microarray identified 70 advanced adenoma-specific DMPs and 20 advanced adenoma-specific DMRs. RRBS technology further identified 36 CRC-specific DMRs and 41 advanced adenoma-specific DMRs. Additionally, a literature review identified 10 CRC-specific DMPs detected using the Illumina 850K microarray. Gene annotation analysis of the DMPs and DMRs identified in Discovery Sets I and II revealed that approximately 65% of DMPs were located in OpenSea regions, while around 23% were in CpG island shore regions. The DMRs were predominantly located in gene bodies, introns, and promoter regions. Enrichment analysis indicated that the differentially methylated biomarkers associated with colorectal neoplasms were primarily involved in cell signaling, DNA binding, and transcriptional regulation, with significant enrichment in cancer-related pathways such as PI3K-Akt, cGMP-PKG, and cAMP. Notably, DMRs located in promoter regions exhibited stronger regulatory effects on gene expression.

Based on the differentially methylated biomarkers identified in the discovery phase, 200 specific primer pairs were designed, covering 79 individual loci and 104 regional fragments and spanning a total of 959 CpG sites. TBS in the independent validation set identified 3 colorectal neoplasm-specific DMPs and 11 colorectal neoplasm-specific DMRs, which were selected as candidate leukocyte DNA methylation biomarkers. GO enrichment analysis showed that these DMPs and DMRs were mainly involved in biological processes related to development and morphogenesis, and in molecular functions related to DNA binding and transcriptional activation, particularly RNA polymerase II activity. KEGG enrichment analysis further indicated their involvement in cancer-related pathways, inflammatory bowel disease pathways, and skeletal embryonic development pathways.

In the model development phase, 1,000 rounds of bootstrap resampling were combined with multiple machine learning methods, and the 5 biomarkers included in the model more than 700 times were selected from the 14 candidates: DMR_935K33 (chr4: 4859985-4860551), DMR_RRBS2 (chr3: 3170109-3170139), DMR_RRBS35 (chr7: 99818709-99818869), DMR_RRBS39 (chr9: 139582466-139582662), and DMR_RRBS53 (chr16: 55866678-55866757). These were used to build the leukocyte DNA methylation risk prediction model. The sensitivity of individual biomarkers for identifying advanced adenomas ranged from 64.00% to 90.00%, with specificity between 53.10% and 79.60%. For CRC identification, sensitivity ranged from 45.80% to 77.10% and specificity from 61.20% to 91.80%. The overall model achieved an AUC of over 0.88 for classifying colorectal neoplasms, with sensitivity ranging from 83.88% to 88.00% and specificity from 78.60% to 87.76%. Model performance remained stable after bootstrap calibration, with a post-calibration AUC of 0.85 (95% CI: 0.74–0.94), significantly outperforming the traditional lifestyle-based model (AUC = 0.55, 95% CI: 0.46–0.68). Combining methylation biomarkers with lifestyle factors raised classification performance to an AUC of 0.89 (95% CI: 0.83–0.94), although the post-calibration AUC decreased slightly to 0.82 (95% CI: 0.72–0.92). Gene annotation showed that the methylation markers included in the model cover multiple functional genes. Hypermethylation of PVRIG, STAG3, and CES1 was associated with a 1.67- to 1.94-fold increase in the risk of colorectal neoplasms, while hypomethylation of TRNT1 was associated with a 2.56-fold increase. In addition, hypomethylation in the promoter region of MSX1 was associated with a 2.13-fold increase in the risk of advanced adenoma. These findings suggest that these methylation biomarkers may play regulatory roles in tumorigenesis and may have value in clinical applications.
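The selection rule described above (keep markers included in more than 700 of 1,000 bootstrap models) amounts to a frequency tally. The sketch below shows only that tallying step. The marker names and the deterministic selection pattern are hypothetical placeholders standing in for the per-resample output of LASSO, Elastic Net, or stepwise regression, which is not reproduced here.

```python
from collections import Counter


def stable_markers(selections, min_count):
    """Return (kept, counts): markers chosen in MORE than `min_count` resamples."""
    # set() guards against a marker being listed twice within one resample.
    counts = Counter(m for chosen in selections for m in set(chosen))
    kept = sorted(m for m, c in counts.items() if c > min_count)
    return kept, counts


# Hypothetical pattern: each marker is "selected" in k out of every 10 resamples,
# so its count over 1,000 resamples is exactly 100 * k.
rates = {"marker_A": 9, "marker_B": 8, "marker_C": 7, "marker_D": 3}
selections = [
    [m for m, k in rates.items() if b % 10 < k]  # markers chosen in resample b
    for b in range(1000)
]

kept, counts = stable_markers(selections, min_count=700)
print(kept)                # ['marker_A', 'marker_B']
print(counts["marker_C"])  # 700 (not kept: the rule is strictly greater than 700)
```

The strict inequality matters at the boundary: a marker selected exactly 700 times would not pass a ">700" rule.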

Conclusion

This study identified and validated 3 DMPs and 11 DMRs in leukocyte DNA as biomarkers associated with colorectal neoplasms, and on that basis developed a risk prediction model comprising 5 core methylation markers. The model achieved an AUC of over 0.85 for predicting colorectal neoplasms, showing good classification performance and clinical applicability for identifying individuals at high risk of colorectal cancer. If combined with established risk prediction tools in the future, it may identify high-risk individuals more accurately. Gene annotation of the core markers showed that hypermethylation of STAG3, PVRIG, and CES1, as well as hypomethylation in the promoter region of MSX1 and within TRNT1, may be associated with an increased risk of colorectal neoplasms. Overall, this study provides novel biomarkers and an effective risk prediction approach for the precision screening and individualized prevention of colorectal cancer, with promising prospects for broader application.

Release date:

 2025-05-28    
