Thesis information

Title (Chinese):

第一部分:基于多组学测序数据构建检测肿瘤体细胞突变的机器学习算法并探究 TCR-T 疗法的潜在治疗靶点;第二部分:基于转录组测序数据构建脂代谢相关基因预后模型并探究肿瘤的潜在治疗靶

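Author:

洪宇桁

Thesis language:

chi

Degree:

Doctorate

Degree type:

Academic degree

Institution:

Peking Union Medical College

Department:

Cancer Hospital, Peking Union Medical College

Discipline:

Clinical Medicine - Oncology

Supervisor:

高亦博

Members of the on-campus supervisory group (comma-separated):

赫捷, 李宁

Thesis completion date:

2025-05-01

Title (English):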

Part 1: Construction of a machine learning algorithm to detect tumor somatic mutations based on multi-omics sequencing data and identification of potential targets for TCR-T therapy; Part 2: Construction of a lipid metabolism-associated prognostic stratification based on transcriptome sequencing data and identification of tumor therapeutic targets

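Keywords (translated from Chinese):

Part 1: single-cell sequencing; machine learning; tumor antigen; non-small cell lung cancer; T cell receptor. Part 2: transcriptome sequencing; tumor molecular classification; lipid metabolism; SQLE; sarcoma

Keywords (English):

Part 1: single-cell sequencing; machine learning; tumor antigen; non-small cell lung cancer; T cell receptor. Part 2: transcriptome sequencing; lipid metabolism; tumor molecular classification; SQLE; sarcoma

Abstract (translated from Chinese):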

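Part 1: Construction of a machine learning algorithm to detect tumor somatic mutations based on multi-omics sequencing data and exploration of potential therapeutic targets for TCR-T therapy

Aims:

Studying somatic mutations at single-cell resolution is essential for investigating genetic heterogeneity and cellular plasticity in tumors and for identifying mutations in malignant cells. However, genomic dropout and artifacts introduced by whole-genome amplification during single-cell DNA sequencing (scDNA-seq) have limited progress. We therefore aimed to develop a machine learning algorithm that detects somatic mutations from single-cell transcriptome sequencing (scRNA-seq) data, to use it to screen tumor mutations and identify the corresponding tumor-specific T cells and T cell receptors (TCRs) in the microenvironment, and to validate the findings in vitro, providing a theoretical basis for T cell receptor-engineered T cell (TCR-T) therapy.

Methods:

1. After applying for access to GSA datasets, we integrated scRNA-seq data, whole-exome sequencing (WES) data, and clinical information from 122 tumor patients, processed the raw sequencing data uniformly to generate BAM files, and performed cell type annotation, copy number variation (CNV) analysis, and mutation calling.

2. Based on the human reference genome sequence (GRCh38) and WES data from non-malignant stromal cell subsets, a personal reference genome library was built for each patient. Data features were mined and raw sequence information, site content, and statistical features were integrated to complete machine learning feature engineering. On this basis the Mut-X machine learning algorithm was trained to detect somatic mutations from scRNA-seq data, and it was compared with existing mutation calling algorithms in terms of recall, precision, and F1 score.

3. Using the 10X Genomics platform combined with V(D)J single-cell sequencing, scRNA-seq and scTCR-seq were performed on tumor tissue and peripheral blood from 12 NSCLC cases in the Cancer Hospital, Chinese Academy of Medical Sciences (CHCAMS) cohort, and UMAP plots were used for dimensionality reduction and clustering of the cells. Single-cell barcodes were used to assign TCR α and β chains to the same T cell, enabling accurate chain pairing. Tumor-reactive T cells were identified by comparing tissue of origin, the patients' TCR repertoire information was extracted, and the high-frequency tumor-specific TCRs of the patients were calculated.

4. WES was performed on tumor tissue and peripheral blood from the 12 NSCLC cases in the CHCAMS cohort. Mut-X was used to detect somatic mutations from the scRNA-seq data, and the calls were compared with the WES data to determine the patients' mutation sites. NetMHC-4.0 was used to predict the tumor mutation antigen peptides that the patients might present, and arcasHLA was used to determine the patients' HLA class I types.

5. The candidate tumor mutation antigen peptides, HLA class I types, and tumor-specific TCRs of the NSCLC patients in the CHCAMS cohort were obtained, and a TCR library was constructed. CRISPR/Cas9 was used to build a TCR-CD4-CD8+ Jurkat engineered-cell fluorescent reporter system and an HLAKO-HEK-293T antigen peptide presentation system. TCR libraries from different patients were transferred into the TCR-CD4-CD8+ Jurkat engineered cells, and patient-specific tumor mutation peptide sequences were expressed in HLAKO-HEK-293T cells carrying specific HLA class I types; co-culture of the two cell types was then used to validate the screened tumor mutation antigen peptides and specific TCRs.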
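
The barcode-based chain pairing described in Method 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`barcode`, `chain`, `cdr3`, `umis`), the made-up CDR3 strings, and the rule of keeping the highest-UMI chain of each type per cell are hypothetical and are not taken from the thesis pipeline.

```python
from collections import Counter, defaultdict


def pair_chains(contigs):
    """Pair one TRA and one TRB contig per cell barcode.

    `contigs` is a list of dicts with hypothetical keys:
    barcode, chain ("TRA" or "TRB"), cdr3, umis.
    Cells missing either chain are skipped.
    """
    by_cell = defaultdict(lambda: {"TRA": [], "TRB": []})
    for rec in contigs:
        if rec["chain"] in ("TRA", "TRB"):
            by_cell[rec["barcode"]][rec["chain"]].append(rec)

    pairs = {}
    for barcode, chains in by_cell.items():
        if chains["TRA"] and chains["TRB"]:
            # Assumption: keep the best-supported chain of each type.
            best_a = max(chains["TRA"], key=lambda r: r["umis"])
            best_b = max(chains["TRB"], key=lambda r: r["umis"])
            pairs[barcode] = (best_a["cdr3"], best_b["cdr3"])
    return pairs


# Toy input with placeholder barcodes and CDR3 sequences.
contigs = [
    {"barcode": "cell1", "chain": "TRA", "cdr3": "CAVAAA", "umis": 5},
    {"barcode": "cell1", "chain": "TRB", "cdr3": "CASSBBB", "umis": 9},
    {"barcode": "cell2", "chain": "TRA", "cdr3": "CAVAAA", "umis": 3},
    {"barcode": "cell2", "chain": "TRB", "cdr3": "CASSBBB", "umis": 4},
    {"barcode": "cell3", "chain": "TRB", "cdr3": "CASSCCC", "umis": 7},
]

pairs = pair_chains(contigs)
# Clonotype frequency: how many cells share each alpha/beta pair.
clonotype_counts = Counter(pairs.values())
```

Counting identical α/β pairs across cells, as in `clonotype_counts`, is one simple way to rank clonotypes by frequency; how the thesis defines "high-frequency" is not stated in the abstract.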

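Results:

1. After cell type annotation and inferCNV analysis of the 122 samples in the integrated cohort, all sequenced cells were divided into malignant epithelial cells and non-malignant stromal cell subsets. The average sequencing depth of the samples ranged from 8,305 to 404,817 reads/cell, with a median of 24,012 reads/cell. The median sequencing depth of tumor cells (26,571 reads/cell) was higher than that of normal cells (22,129 reads/cell).

2. The proportion of the second-ranked base at each site was analyzed at different sequencing depths, and the natural breaks method was used to set the threshold for identifying alleles. Sites detected more than 25 times in the non-malignant stromal cell subset, and at which the second-ranked base accounted for more than 40%, were selected, and that base was taken as a SNP of the patient. A patient-specific reference genome library was built on this basis and used to screen all potential mutation sequences in malignant epithelial cells.

3. A series of machine learning features was constructed, including raw sequence features, site content features, and site statistical features, and after feature engineering the Mut-X machine learning algorithm was built. The two most important features were RBV_N (the proportion of cells in normal tissue containing only reference genes) and OCC_F (the proportion of cells in tumor tissue involved in non-variant alleles), with weights of 0.23009 and 0.102157, respectively.

4. The performance of Mut-X was compared with several existing mutation calling algorithms: VarScan2, SAMtools, SComatic, and Monovar. The recall of Mut-X was 0.683, slightly lower than that of SAMtools (0.782) but higher than those of VarScan2, SComatic, and Monovar (0.58, 0.337, and 0.153). The precision of Mut-X was 0.62, far higher than those of SAMtools, VarScan2, SComatic, and Monovar (0.0126, 0.018, 0.32, and 0.0061). The F1 score of Mut-X was 0.647, clearly higher than those of SComatic, VarScan2, SAMtools, and Monovar (0.327, 0.0349, 0.0248, and 0.0116). Taking recall, precision, and F1 score together, Mut-X outperformed all the other algorithms.

5. Mut-X was used to detect somatic mutations in the scRNA-seq data of NSCLC patients in the CHCAMS cohort, and the calls were compared with the mutations in the WES data. TP53 mutations were found in 82% of patients, including hotspot mutations such as R174L, R175H, R213L, and R282W. Six patients carried EGFR mutations, with nonsynonymous mutations including V786M, L858P, A871V, and G874D. Four patients carried KRAS mutations, with the nonsynonymous mutations A11V, V14I, and Q61R.

6. NetMHC-4.0 predicted 1614 tumor mutation antigen peptides that the patients might present, 51 of which were tumor hotspot mutation antigen peptides. arcasHLA identified the patients' HLA class I types, comprising 6 HLA-A, 18 HLA-B, and 10 HLA-C types, and 20 of the 34 (58.8%) common high-frequency HLA class I molecules were synthesized in vitro: 5 HLA-A, 7 HLA-B, and 8 HLA-C molecules. Analysis of the patients' scTCR-seq data yielded 492 high-frequency tumor-specific TCRs, which were synthesized in vitro and assembled into a library.

7. TCR-CD4-CD8+ Jurkat engineered cells expressing TCR libraries from different patients were co-cultured with HLAKO-HEK-293T cells carrying specific HLA class I types and expressing patient-specific tumor mutation peptide sequences. A patient-derived TCR library recognized the TP53 R273G mutation antigen peptide presented by HLA-A*24:02 HEK-293T engineered cells; its amino acid sequence is SGNLLGRNSFEVGVCACPGRDRRTE. By flow cytometry, GFP-positive cells made up 4.72% of the CD8-positive subpopulation in the co-culture.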
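
The SNP rule in Result 2 (a site seen more than 25 times in non-malignant cells whose second-ranked base exceeds 40%) can be written as a small filter. The sketch below is a minimal illustration, not the thesis code: it assumes that "detected more than 25 times" means total read depth at the site, and the per-base count dictionary is a hypothetical input format.

```python
def call_patient_snp(base_counts, min_depth=25, min_second_fraction=0.40):
    """Return the second-ranked base if the site passes both thresholds.

    base_counts: dict mapping base -> read count observed at one site in
    non-malignant cells (hypothetical input format).
    Returns None when the site does not qualify.
    """
    depth = sum(base_counts.values())
    if depth <= min_depth:  # must be detected more than 25 times
        return None
    ranked = sorted(base_counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return None
    second_base, second_count = ranked[1]
    if second_count / depth > min_second_fraction:  # more than 40%
        return second_base
    return None


heterozygous_like = call_patient_snp({"A": 30, "G": 25})  # 25/55 is above 40%
noise_like = call_patient_snp({"A": 50, "G": 5})          # 5/55 is below 40%
shallow = call_patient_snp({"A": 10, "G": 10})            # depth 20 is too low
```

Sites that pass would then be treated as germline variants of that patient rather than as candidate somatic mutations, which is the purpose of the patient-specific reference library described above.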

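Conclusions:

1. A series of novel machine learning features was constructed and, after optimization through feature engineering, used to train the Mut-X machine learning algorithm, which detects somatic mutations from scRNA-seq data without matched DNA sequencing data.

2. Taking recall, precision, and F1 score together, Mut-X outperformed the existing mutation calling algorithms SAMtools and VarScan2 as well as SComatic and Monovar, which were developed specifically for scRNA-seq data.

3. Mut-X successfully detected somatic mutations in NSCLC patients, from which candidate tumor mutation antigen peptides were predicted, and the patients' high-frequency tumor-specific TCRs were identified.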

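Part 2: Construction of a lipid metabolism-related gene prognostic model based on transcriptome sequencing data and exploration of potential tumor therapeutic targets

Aims:

Sarcomas are highly heterogeneous malignant tumors of connective tissue that are aggressive and prone to systemic metastasis. Histologically they are divided into soft tissue sarcoma and osteosarcoma. Despite multimodal treatment, many sarcoma subtypes remain refractory. Patients with advanced or metastatic sarcoma are currently treated mainly with combinations of surgical resection, chemotherapy, immunotherapy, and targeted therapy, yet many still do not benefit from immunotherapy or targeted therapy. Molecular subtyping is an important strategy in precision oncology whose core value lies in improving diagnostic accuracy and the prediction of sensitivity to targeted therapy. Lipids play key roles in cellular activity, and lipid metabolism is usually altered in many ways in cells that have acquired a malignant phenotype; multiple signaling pathways, proteins, and related molecular mechanisms affect tumor recurrence, metastasis, and drug resistance by regulating how cells use specific molecules and metabolites. Patients therefore need to be stratified, and precision treatment strategies targeting lipid metabolism in sarcoma need to be explored.

Methods:

1. Transcriptome sequencing datasets from public databases (TCGA-SARC, GSE63157, GSE17674, and the TARGET cohort) were integrated with transcriptome sequencing data from 45 patients in the Cancer Hospital, Chinese Academy of Medical Sciences (CHCAMS) cohort, all with complete clinical and follow-up information.

2. Lipid metabolism-associated genes (LMAGs) were used for a comprehensive molecular subtyping analysis, and univariate Cox regression was used to identify LMAGs associated with patient survival.

3. To reduce the risk of model overfitting, a novel prognostic model containing key LMAGs was built by multivariate Cox regression.

4. By integrating gene expression profiles with survival outcomes, TCGA sarcoma patients were divided into high-risk and low-risk subgroups, and the prognostic value of the metabolic features and the differences in immune infiltration between the two groups were systematically characterized.

5. The risk model was validated in data from SARC patients in the CHCAMS cohort by survival analysis and immunohistochemical staining.

6. Finally, in vitro validation was performed: SQLE was knocked out in the A-673 cell line and cell proliferation, colony formation, and apoptosis were measured, and the effect of the SQLE inhibitor terbinafine on tumor cell proliferation and apoptosis was tested.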

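Results:

1. A prognostic model containing two key lipid metabolism-related genes, SQLE and TNF, was constructed, and TCGA-SARC patients were first divided into high-risk and low-risk groups according to the model.

2. In TCGA-SARC, low-risk sarcoma patients survived longer than high-risk patients and showed a higher proportion of infiltrating immune cells and upregulated immune checkpoint gene expression.

3. The prognostic ability of the lipid metabolism signature was validated in four independent external datasets: GSE63157, GSE17674, TARGET, and the CHCAMS cohort.

4. SQLE, the rate-limiting enzyme of cholesterol biosynthesis, was identified as a potential therapeutic target in sarcoma.

5. In vitro, knocking down SQLE expression significantly inhibited sarcoma cell proliferation and colony formation and promoted apoptosis.

6. The SQLE inhibitor terbinafine showed the same tumor-suppressive effect in vitro, inhibiting cell proliferation and inducing tumor cell apoptosis.

Conclusions:

1. This study constructed a novel prognostic model based on lipid metabolism-related genes that showed good prognostic ability in the internal validation set and in multiple external validation cohorts.

2. The prognostic model was also validated in the in-house sequenced patient cohort.

3. The in vitro experiments provide a theoretical basis for SQLE as a potential tumor therapeutic target.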

 

Abstract (English):

Part 1:

Aims:

Studying somatic mutations at single-cell resolution is crucial for investigating genetic heterogeneity and cellular plasticity in tumors, as well as for identifying mutations in malignant cells. However, genomic dropout and artifacts from whole-genome amplification in single-cell DNA sequencing (scDNA-seq) have restricted progress. We therefore aimed to develop a machine learning algorithm that detects somatic mutations from single-cell transcriptome sequencing (scRNA-seq) data, to use it to screen for tumor mutations and identify the corresponding tumor-specific T cells and T cell receptors (TCRs) in the microenvironment, and to validate the findings through in vitro experiments, providing a theoretical basis for T cell receptor-engineered T cell (TCR-T) therapy.

Methods:

1. After applying for access to the GSA datasets, we integrated the scRNA-seq data, whole-exome sequencing (WES) data, and clinical information of 122 tumor patients. The raw sequencing data were processed uniformly to generate BAM files, followed by cell type annotation, copy number variation (CNV) analysis, and mutation calling.

2. Based on the human reference genome sequence (GRCh38) and WES data from non-malignant stromal cell subsets, personalized reference genome libraries for patients were constructed. By mining data features and integrating raw sequence information, locus content, and statistical characteristics, machine learning feature engineering was completed. Subsequently, the Mut-X machine learning algorithm was developed and trained, enabling the detection of somatic mutations from scRNA-seq data. The performance of the Mut-X algorithm was compared with other existing mutation calling algorithms in terms of recall rate, precision, and F1-score.

3. Using the 10X Genomics platform combined with V(D)J single-cell sequencing, scRNA-seq and scTCR-seq were performed on 12 NSCLC tumor tissues and peripheral blood samples from the Cancer Hospital of the Chinese Academy of Medical Sciences (CHCAMS) cohort. UMAP plots were used for dimensionality reduction and clustering analysis of the cells. Single-cell barcodes were used to assign TCR alpha and beta chains to the same T cell, achieving accurate chain pairing. Tumor-reactive T cells were identified by comparing tissue of origin, the patients' TCR repertoire information was extracted, and the high-frequency tumor-specific TCRs of the patients were calculated.

4. WES was performed on the tumor tissues and peripheral blood samples from the 12 NSCLC patients in the CHCAMS cohort. The Mut-X machine learning algorithm was used to detect somatic mutations from the scRNA-seq data, and the calls were compared with the WES data to identify the patients' mutation sites. NetMHC-4.0 was used to predict the tumor mutation antigen peptides that the patients might present, and arcasHLA was used to determine the patients' HLA class I types.

5. The tumor mutation antigen peptides that NSCLC patients in the CHCAMS cohort might present, their HLA class I types, and their tumor-specific TCRs were obtained, and a TCR library was constructed. A TCR-CD4-CD8+ Jurkat engineered-cell fluorescence reporter system and an HLAKO-HEK-293T antigen peptide presentation system were built using CRISPR/Cas9. Different patient-derived TCR libraries were transferred into the TCR-CD4-CD8+ Jurkat engineered cells, and the patient-specific tumor mutation peptide sequences were expressed in HLAKO-HEK-293T cells with specific HLA class I types. Co-culture experiments with these two cell types were then used to verify the screened tumor mutation antigen peptides and specific TCRs of the patients.
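
Before binding prediction with a tool such as NetMHC, candidate peptides are commonly enumerated as short windows that cover the mutated residue. The sketch below illustrates that enumeration on the 25-residue TP53 R273G sequence given in Result 7. It rests on assumptions that the abstract does not state: that the substituted residue is the central (13th) residue of that sequence, and that window lengths of 8 to 11 residues are of interest. The function name is hypothetical.

```python
def mutant_windows(sequence, mut_index, lengths=(8, 9, 10, 11)):
    """List every window of the given lengths that contains the mutated residue.

    sequence:  amino acid string surrounding the mutation
    mut_index: 0-based position of the substituted residue in `sequence`
    """
    peptides = []
    for k in lengths:
        first = max(0, mut_index - k + 1)         # leftmost start still covering the site
        last = min(mut_index, len(sequence) - k)  # rightmost start that fits in the sequence
        for start in range(first, last + 1):
            peptides.append(sequence[start:start + k])
    return peptides


# Sequence as reported in the abstract; index 12 is the assumed mutated residue.
TP53_R273G_25MER = "SGNLLGRNSFEVGVCACPGRDRRTE"
nine_mers = mutant_windows(TP53_R273G_25MER, 12, lengths=(9,))
all_windows = mutant_windows(TP53_R273G_25MER, 12)
```

Each window would then be scored against the patient's HLA class I alleles by the prediction tool; that step is not reproduced here.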

Results:

1. After cell type annotation and inferCNV analysis of the 122 samples in the integrated cohort, all sequenced cells were divided into malignant epithelial cells and non-malignant stromal cell subgroups. The average sequencing depth of the samples ranged from 8,305 to 404,817 reads/cell, with a median of 24,012 reads/cell. The median sequencing depth of tumor cells (26,571 reads/cell) was slightly higher than that of normal cells (22,129 reads/cell).

2. We analyzed the proportion of the second-ranked base at each site under different sequencing depths and set a threshold for identifying alleles using the natural breaks method. Loci that were detected more than 25 times in the non-malignant stromal cell subgroup, and at which the proportion of the second-ranked base was greater than 40%, were selected, and the corresponding bases were considered SNPs of the patients. On this basis, patient-specific reference genome libraries were constructed and used to screen for all potential mutation sequences in malignant epithelial cells.

3. A series of machine learning features was constructed, including raw sequence features, site content features, and site statistical features, and feature engineering was completed to develop the Mut-X machine learning algorithm. The two most important features were RBV_N (the proportion of cells containing only reference genes in normal tissues) and OCC_S (the proportion of cells involved in non-variant alleles in tumor tissues), with weights of 0.23009 and 0.102157, respectively.

4. The performance of Mut-X was compared with other existing mutation calling algorithms, including VarScan2, SAMtools, SComatic, and Monovar. The recall of Mut-X was 0.683, slightly lower than that of SAMtools (0.782) but higher than those of VarScan2, SComatic, and Monovar (0.58, 0.337, and 0.153). The precision of Mut-X was 0.62, much higher than those of SAMtools, VarScan2, SComatic, and Monovar (0.0126, 0.018, 0.32, and 0.0061). The F1 score of Mut-X was 0.647, higher than those of SComatic, VarScan2, SAMtools, and Monovar (0.327, 0.0349, 0.0248, and 0.0116). Taking recall, precision, and F1 score together, Mut-X outperformed all the other algorithms.

5. Mut-X was used to detect somatic mutations in the scRNA-seq data of NSCLC patients in the CHCAMS cohort, and the calls were compared with mutations in the WES data. TP53 mutations were found in 82% of patients, including hotspot mutations such as R174L, R175H, R213L, and R282W. Six patients carried EGFR mutations, among which the nonsynonymous mutations included V786M, L858P, A871V, and G874D. Four patients carried KRAS mutations, among which the nonsynonymous mutations were A11V, V14I, and Q61R.

6. NetMHC-4.0 predicted 1614 tumor mutation antigen peptides that the patients may present, including 51 tumor hotspot mutation antigen peptides. The HLA class I types of the patients were determined using arcasHLA and comprised 6 HLA-A, 18 HLA-B, and 10 HLA-C types. Of these 34 common high-frequency HLA class I molecules, 20 (58.8%) were synthesized in vitro: 5 HLA-A, 7 HLA-B, and 8 HLA-C molecules. Analysis of the patients' scTCR-seq data yielded 492 high-frequency tumor-specific TCRs, which were synthesized in vitro and used to construct a library.

7. TCR-CD4-CD8+ Jurkat engineered cells expressing different patient-derived TCR libraries were co-cultured with HLAKO-HEK-293T cells of specific HLA class I types expressing patient-specific tumor mutation peptide sequences. A patient-derived TCR library recognized the TP53 R273G mutation antigen peptide presented by HLA-A*24:02 HEK-293T engineered cells, with the amino acid sequence SGNLLGRNSFEVGVCACPGRDRRTE. Flow cytometry showed that GFP-positive cells made up 4.72% of the CD8-positive subpopulation of the co-cultured cells.
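
Because the F1 score is the harmonic mean of precision and recall, the figures in Result 4 can be cross-checked against one another. The short Python sketch below recomputes F1 from the reported recall and precision values; the dictionary layout and variable names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F1) as listed in Result 4.
reported = {
    "Mut-X":    (0.683, 0.62,   0.647),
    "SAMtools": (0.782, 0.0126, 0.0248),
    "VarScan2": (0.58,  0.018,  0.0349),
    "SComatic": (0.337, 0.32,   0.327),
    "Monovar":  (0.153, 0.0061, 0.0116),
}

# Absolute difference between the recomputed and the reported F1 per caller.
differences = {
    name: abs(f1_score(precision, recall) - f1_reported)
    for name, (recall, precision, f1_reported) in reported.items()
}

for name, diff in differences.items():
    print(f"{name}: |recomputed - reported| = {diff:.4f}")
```

The recomputed values agree with the reported F1 scores to within 0.005 for every caller; small gaps are expected because the published recall and precision are themselves rounded.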

Conclusions:

1. A series of novel machine learning features was constructed and optimized through feature engineering to train the Mut-X machine learning algorithm, which can detect somatic mutations from scRNA-seq data without the need for matched DNA sequencing data.

2. Based on a comprehensive evaluation of recall, precision, and F1 score, Mut-X outperformed the existing mutation calling algorithms SAMtools and VarScan2, as well as SComatic and Monovar, which were developed specifically for scRNA-seq data.

3. Mut-X successfully detected somatic mutations in NSCLC patients, from which candidate tumor mutation antigen peptides were predicted, and the patients' high-frequency tumor-specific TCRs were identified.

Part 2:

Aims:

Sarcomas are highly heterogeneous connective tissue malignancies that are highly invasive and prone to systemic metastasis. Histologically they are divided into soft tissue sarcoma and osteosarcoma. Despite multimodal treatment, many sarcoma subtypes remain refractory. Patients with advanced or metastatic sarcoma are currently treated mainly with combinations of surgical resection, chemotherapy, immunotherapy, and targeted therapy, yet many still do not benefit from immunotherapy or targeted therapy. Molecular typing is an important strategy for precision diagnosis and treatment of tumors, and its core value lies in improving diagnostic accuracy and the prediction of sensitivity to targeted therapy. Lipids play a crucial role in cellular activity, and lipid metabolism often changes in many ways in cells that have acquired a malignant phenotype. Multiple signaling pathways, proteins, and related molecular mechanisms regulate how cells use specific molecules and metabolites, affecting tumor recurrence, metastasis, and drug resistance. It is therefore necessary to stratify patients and to explore precision treatment strategies that target lipid metabolism in sarcoma.

Methods:

1. Transcriptome sequencing datasets from public databases, including the TCGA-SARC, GSE63157, GSE17674, and TARGET cohorts, were integrated with transcriptome sequencing data from 45 patients in the Cancer Hospital of the Chinese Academy of Medical Sciences (CHCAMS) cohort, all of which have complete clinical information and follow-up data.

2. Lipid metabolism-associated genes (LMAGs) were used for comprehensive molecular typing analysis, and LMAGs associated with patient survival were identified through univariate Cox regression analysis.

3. To reduce the risk of model overfitting, a novel prognostic model containing key LMAGs was constructed through multivariate Cox regression analysis.

4. By integrating gene expression profiles and survival outcome information, TCGA sarcoma patients were divided into high-risk and low-risk subgroups, and the metabolic characteristics, prognostic value, and immune infiltration differences between the two groups were systematically revealed.

5. The risk model was validated in SARC patient data from the CHCAMS cohort through survival analysis and immunohistochemical staining.

6. Finally, in vitro validation was performed: SQLE was knocked out in the A-673 cell line and cell proliferation, colony-forming ability, and apoptosis were measured, and the effect of the SQLE inhibitor terbinafine on tumor cell proliferation and apoptosis was investigated.

Results:

1. A prognostic model was constructed that includes two key lipid metabolism-associated genes, SQLE and TNF. Firstly, TCGA-SARC patients were divided into high- and low-risk groups based on the prognostic model.

2. In TCGA-SARC, low-risk sarcoma patients had better survival than the high-risk group, together with a higher proportion of immune cell infiltration and upregulated immune checkpoint gene expression.

3. The prognostic capability of the lipid metabolism feature model has been validated in four independent external datasets, including GSE63157, GSE17674, TARGET, and CHCAMS cohorts.

4. SQLE, the rate-limiting enzyme of cholesterol biosynthesis, was identified as a potential therapeutic target for sarcoma.

5. In vitro experiments have shown that knocking down SQLE expression can significantly inhibit the proliferation and colony forming ability of sarcoma cells, while promoting cell apoptosis.

6. The SQLE inhibitor terbinafine showed the same tumor-suppressive effect in vitro, inhibiting cell proliferation and inducing tumor cell apoptosis.
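
A two-gene Cox-based prognostic model of the kind described in Result 1 is usually applied by computing a linear risk score per patient and splitting the cohort at a cut-off. The sketch below shows that mechanic only. The coefficient values and signs, the median cut-off, and the toy expression values are hypothetical placeholders; the abstract reports none of them.

```python
from statistics import median

# Hypothetical placeholder coefficients; the fitted values are not given in the abstract.
COEFS = {"SQLE": 0.5, "TNF": -0.3}


def risk_score(expression, coefs=COEFS):
    """Linear predictor: sum of coefficient * expression over the model genes."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


def stratify(patients, coefs=COEFS):
    """Label each patient 'high' or 'low' relative to the cohort median score.

    The median split is an assumption made for illustration.
    """
    scores = {pid: risk_score(expr, coefs) for pid, expr in patients.items()}
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Toy expression values, invented for the example.
patients = {
    "P1": {"SQLE": 4.0, "TNF": 1.0},
    "P2": {"SQLE": 1.0, "TNF": 3.0},
    "P3": {"SQLE": 3.0, "TNF": 2.0},
    "P4": {"SQLE": 2.0, "TNF": 2.0},
}
groups = stratify(patients)
```

The resulting groups would then be compared by survival analysis, as in Result 2; fitting the coefficients themselves requires the multivariate Cox regression described in the Methods and is not shown here.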

Conclusions:

1. This study constructed a novel prognostic model based on lipid metabolism-related genes, which demonstrated good prognostic ability in both the internal validation set and multiple external validation cohorts.

2. The prognostic model was also validated in the in-house sequenced patient cohort.

3. In vitro experiments provide theoretical support for SQLE as a potential therapeutic target for tumors.

Release date:

 2025-05-30    
