Thesis information

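Thesis title (Chinese):	Study of bioinformatics analysis reference materials for high-throughput sequencing detection of somatic gene mutations (体细胞基因突变高通量测序检测生物信息学分析参考物质的研究)
Name:	Li Ziyang (李子阳)
Thesis language:	chi
Degree:	Doctorate
Degree type:	Academic degree
Institution:	Peking Union Medical College (北京协和医学院)
Department:	National Center for Clinical Laboratories, Ministry of Health (卫生部临床检验中心)
Major:	Clinical Medicine - Clinical Laboratory Diagnostics
Supervisor:	Li Jinming (李金明)
Supervisory committee members (comma-separated):	Zhang Rui (张瑞), Lin Guigao (林贵高)
Thesis completion date:	2019-04-10
Thesis title (foreign language):	Research on bioinformatics reference standards for somatic mutation detection
Keywords (Chinese):	cancer; somatic mutation; high-throughput sequencing; reference material; external quality assessment
Keywords (foreign language):	cancer; somatic variant; next-generation sequencing; reference materials; external quality assessment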
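Abstract (Chinese):	︿Cancer has become one of the leading causes of death among residents of China and is a major public health problem that seriously endangers their health. In recent years, with the continued development of personalized medicine, the "precision medicine" model, in which an individualized treatment plan is drawn up for each patient according to the tumor's gene mutation profile, has played an increasingly important role in the clinical treatment of cancer patients. The clinical value of a large number of tumor gene mutations in the diagnosis, treatment, and prognosis of cancer patients has been confirmed. As more and more tumor mutation sites are discovered, traditional single-site genetic testing can no longer meet clinical needs. The emergence of high-throughput sequencing has made it possible to test multiple sites in multiple genes simultaneously.
High-throughput sequencing is far more complex than traditional molecular testing. It includes both a multi-step "wet lab" process (nucleic acid extraction, targeted sequence enrichment, library preparation, and sequencing) and a bioinformatics analysis pipeline (the "dry lab" process) that covers post-sequencing data quality analysis, alignment to a reference sequence, variant calling, annotation, and interpretation of the reported results. The bioinformatics pipeline is as decisive for the accuracy of high-throughput sequencing results as the wet lab process. To obtain accurate and reliable bioinformatics results in clinical high-throughput sequencing, suitable reference materials (RM), also called reference datasets, are needed to optimize the pipeline, validate its performance, carry out internal quality control (IQC), and conduct regular external quality assessment (EQA). Reference datasets prepared from clinical samples or tumor cell line DNA can serve these purposes, but they are laborious and costly to prepare and cannot cover all mutation types. Reference datasets prepared by in silico simulation based on editing sequencing data are simple, fast, and inexpensive to prepare and are not restricted by mutation type. However, BAMSurgeon, the existing simulation software based on sequencing data editing, simulates only single-nucleotide variants and short insertion/deletion variants well. It cannot simulate complex variants such as copy number variants and multi-nucleotide variants, cannot simulate large structural variants in targeted sequencing data, and cannot simulate data from the Ion Torrent sequencing platform. Suitable reference datasets for comprehensively evaluating the bioinformatics pipelines of different clinical laboratories are therefore lacking.
In this study, we developed VarBen, a simulation tool based on sequencing data editing for generating bioinformatics analysis reference datasets. To verify whether somatic mutation reference datasets prepared with VarBen can mimic somatic mutations in real tumor samples, we compared sequencing data from tumor samples containing real somatic mutations with reference datasets prepared by VarBen and BAMSurgeon. Compared with BAMSurgeon, detection performance for VarBen-simulated somatic mutations was closer to that for the real somatic mutations in the tumor sequencing data (the MB gold set), which shows that VarBen reference datasets can reproduce somatic mutations close to those in real tumor sequencing data. To verify the reliability and stability of VarBen, we also assessed whether the genomic background of the original sequencing data, the choice of aligner, and the splitting of sequencing reads affect VarBen. None of these factors affected VarBen's simulation of somatic mutations. Taken together, the validation experiments demonstrated the reliability and stability of VarBen and showed that its simulated sequencing data can serve as reference datasets for the bioinformatics analysis of clinical somatic mutation detection.
To comprehensively evaluate the ability of clinical laboratories to analyze tumor somatic mutations, we used reference datasets prepared with VarBen to conduct an EQA survey of bioinformatics analysis for high-throughput sequencing detection of tumor somatic gene mutations. We received 113 valid analysis results from participating laboratories. Statistical analysis of the submissions showed that, relative to single-nucleotide variants, the ability of clinical laboratories to analyze short insertion/deletion variants still needs improvement, particularly for complex insertion-deletion variants and FLT internal tandem duplication (ITD). When establishing a bioinformatics pipeline for high-throughput sequencing mutation detection, laboratories need to give full attention to validating the pipeline's performance to ensure accurate results. This EQA also demonstrated the practical utility of reference datasets prepared with VarBen.
In summary, this study developed VarBen, a simulation tool based on sequencing data editing for generating bioinformatics analysis reference datasets. Compared with existing simulation software, VarBen overcomes the inability to simulate complex variants such as copy number variants, multi-nucleotide variants, and complex insertion-deletion variants, as well as large structural variants in targeted sequencing data, and it supports the Illumina, BGI, and Ion Torrent sequencing platforms. The sequencing-data-editing approach preserves the background error profile produced during library preparation and sequencing in the wet lab process, so the simulated data more closely resemble real clinical sequencing data. It can also simulate mutation sites of any type and is inexpensive, fast, and reliable. Customized reference datasets prepared with VarBen can help clinical laboratories find problems in their bioinformatics pipelines and thereby improve the accuracy of gene mutation detection.﹀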
Abstract (foreign language):	︿ Cancer is one of the major causes of death and a major public health problem in China. With the rapid development of personalized medicine, targeted drug therapy guided by each patient's tumor mutation profile plays an increasingly important role in cancer treatment. Because a large number of gene mutations associated with cancer diagnosis and treatment have been found, traditional single-mutation detection methods can no longer meet clinical needs. In recent years, next-generation sequencing (NGS) technology has made it possible to detect multiple loci in multiple genes at the same time. However, the analytic phase of NGS differs from traditional molecular diagnostic methods chiefly in that it involves multiple experimental steps and complex bioinformatics analysis. As an integral component of NGS, bioinformatics pipelines for detecting genomic mutations have a significant impact on genetic test results. For clinical NGS testing, obtaining accurate and reliable results requires a proper reference dataset for developing and validating bioinformatics pipelines and for conducting external quality assessment (EQA) programs. Although sequencing data from real-world clinical or synthetic DNA samples can be used as reference datasets for clinical bioinformatics analysis, such data are expensive to produce and cannot cover the full range of variant types and variant allele frequencies (VAFs) encountered in clinical scenarios. In silico sequence files created by editing raw sequencing reads provide a valuable resource for evaluating bioinformatics pipelines, because introducing a range of sequence variants, in various combinations and at various VAFs, is straightforward, quick, and inexpensive. However, the existing variant simulation software, BAMSurgeon, has some limitations. 
First, BAMSurgeon cannot simulate some important subtypes of cancer driver structural variants (SVs), such as inter-chromosomal rearrangements, and it cannot add SVs to targeted sequencing data, which are routinely used in clinical practice. Second, BAMSurgeon does not support the simulation of copy number variations (CNVs) or complex deletion-insertion variants. Third, BAMSurgeon cannot simulate flow signal information in sequencing data from the Ion Torrent system. In this study, we developed VarBen, a variant simulation tool that generates user-specific reference datasets from real sequencing data and thereby preserves the real-world characteristics of the wet laboratory process. To evaluate the reliability and robustness of VarBen, we performed a series of proof-of-principle validation studies. First, we compared SNV and indel calling performance on simulated datasets generated by VarBen and BAMSurgeon with that on the curated MB gold set. SNV and indel calling performance on the simulated data was highly comparable to that on the MB gold set, indicating that the simulated data were not biased relative to the real-world data. To exclude the influence of genomic background, aligner, and random read splitting, we compared variant calling performance across different sequencing datasets, aligners, and random read splits. The results show that the simulated variants were independent of genomic background, aligner, and random read splitting. We further evaluated the suitability of VarBen for targeted sequencing data. All simulated variants were correctly detected in two targeted sequencing datasets generated on the Illumina and Ion Torrent platforms. Overall, these validation studies demonstrated the reliability and robustness of VarBen as an unbiased and powerful calibration tool for somatic variant simulation. 
To evaluate the somatic variant calling proficiency of laboratories that use NGS to detect somatic mutations, we implemented an EQA for NGS bioinformatics. In total, we received 113 submissions. This EQA study shows that indel detection is particularly challenging, with performance lagging behind that of SNV detection, especially for complex deletion-insertion variants and the FLT internal tandem duplication (ITD) variant. In summary, we developed VarBen to generate synthetic reference datasets for benchmarking somatic variant calling pipelines. VarBen has a number of benefits compared with existing variant simulation methods, including the ability to simulate complex deletion-insertion variants, large structural variants (SVs), and CNVs in both whole-genome and targeted sequencing data, and the ability to handle sequencing data from a broad range of sequencing platforms, e.g., Illumina, BGI, and Ion Torrent. VarBen retains the characteristics intrinsic to raw sequencing data from physical specimens, such as the distribution of quality scores and depth of coverage, and therefore better emulates real-world sequencing data. Recognizing the defects in an analysis pipeline is a prerequisite for optimizing it. Thus, to assure reliable test results, a customized user-specific reference dataset is essential for developing and validating bioinformatics pipelines in clinical NGS testing. ﹀
Release date:	2019-06-11
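
The read-editing idea described in the abstracts can be illustrated with a toy sketch. The Python below is not the implementation used by VarBen or BAMSurgeon. The names `Read`, `spike_in_snv`, and `observed_vaf` are hypothetical, reads are modeled as gap-free strings instead of BAM records, and realignment, base qualities, and platform-specific signals are ignored. It only shows how substituting one base in a chosen fraction of the reads covering a position yields a target VAF while depth of coverage and all other reads stay as sequenced.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Read:
    """Toy aligned read (hypothetical model, not a BAM record)."""
    name: str
    start: int  # 0-based reference position of the first base
    seq: str    # read bases, assumed to align without gaps


def covers(read: Read, pos: int) -> bool:
    """True if the read spans reference position `pos`."""
    return read.start <= pos < read.start + len(read.seq)


def spike_in_snv(reads, pos, alt, vaf, seed=0):
    """Return a copy of `reads` with `alt` written at `pos` in about
    `vaf` of the covering reads; every other read is left unchanged."""
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("vaf must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the edit is reproducible
    covering = [i for i, r in enumerate(reads) if covers(r, pos)]
    n_edit = round(vaf * len(covering))
    chosen = set(rng.sample(covering, n_edit))
    edited = []
    for i, r in enumerate(reads):
        if i in chosen:
            offset = pos - r.start
            r = replace(r, seq=r.seq[:offset] + alt + r.seq[offset + 1:])
        edited.append(r)
    return edited


def observed_vaf(reads, pos, alt):
    """Fraction of covering reads that carry `alt` at `pos`."""
    bases = [r.seq[pos - r.start] for r in reads if covers(r, pos)]
    return bases.count(alt) / len(bases) if bases else 0.0


# Eight reads cover position 5 (reference base C); one read lies elsewhere.
reads = [Read(f"r{i}", 0, "ACGTACGTAC") for i in range(8)]
reads.append(Read("far", 20, "ACGTACGTAC"))
edited = spike_in_snv(reads, pos=5, alt="T", vaf=0.25)
print(observed_vaf(edited, 5, "T"))  # 0.25 (2 of 8 covering reads edited)
```

Because only the selected bases change, the number of reads, their positions, and their lengths are identical before and after editing, which is the property the abstracts attribute to simulation by sequencing data editing.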