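| Thesis title (Chinese): | 循环肿瘤DNA中基因组结构变异与病毒整合的检测算法 |
| Name: | |
| Thesis language: | chi |
| Degree: | Doctorate |
| Degree type: | Academic degree |
| Institution: | Peking Union Medical College (北京协和医学院) |
| Department: | |
| Major: | |
| Supervisor name: | |
| Thesis completion date: | 2022-05-20 |
| Thesis title (foreign language): | Detection of structural variations and viral integrations in circulating tumor DNA |
| Keywords (Chinese): | |
| Keywords (foreign language): | Structural variation; Viral integration; Circulating tumor DNA |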
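| Abstract (translated from the Chinese): |
Circulating tumor DNA (ctDNA) is highly fragmented DNA shed from tumor cells and released into the circulation. Its fragments are short, typically between 50 and 166 base pairs (bp). Although the presence of ctDNA in the circulation was confirmed as early as 50 years ago, base-level sequence analysis of ctDNA molecules became possible only once next-generation sequencing (NGS) matured. NGS can detect not only single-nucleotide variants (SNVs) and insertions and deletions (InDels) in ctDNA but also genomic structural variations (SVs). However, because ctDNA fragments are short and their plasma concentration is extremely low, accurately detecting SVs in ctDNA, especially translocations and insertions/deletions larger than 50 bp, remains challenging.
In this study, we propose a new algorithm, named Aperture, for detecting SVs and viral integrations in ctDNA data. It has the following features: (1) Aperture maps sequencing reads with a distinctive k-mer-based search strategy that uses k-mers of two different lengths plus one spaced seed to improve mapping in repetitive sequences, making genome mapping both efficient and accurate. (2) Aperture uses a binary-label-based method that identifies breakpoints carrying novo-k-mers as well as breakpoints carrying k-mers with multiple genomic occurrences, which gives fast and accurate breakpoint detection and improves detection of SVs that involve repetitive sequences. (3) Aperture is tailored to the ctDNA library-preparation strategy and includes a filtering strategy based on ctDNA clusters that greatly reduces the false-positive rate. (4) The algorithm is implemented in Java, and its key stages are multithreaded for efficient detection.
The Aperture SV detection workflow consists of the following steps: (1) building a set of k-mer databases from the reference genome and the dbSNP database; (2) ctDNA barcode extraction, read quality filtering, paired-read merging, and read alignment by k-mer search; (3) SV breakpoint detection by bitwise operations on binary labels; (4) merging and filtering of candidate breakpoints and output of results.
We evaluated Aperture systematically on several types of data. On simulated ctDNA data, Aperture showed higher sensitivity and specificity than other software at dilutions from 0.1% to 10%. We also evaluated Aperture on three real patient ctDNA datasets, in which it detected fusion genes in lung cancer patients and HBV-TERT integration in liver cancer patients. Aperture can also detect complex SVs: in one test it detected a translocation involving repetitive sequences that most other algorithms missed.
Because fusion genes and viral integrations are closely associated with certain tumor types, this work is expected to further increase the value of ctDNA in guiding individualized targeted therapy and in monitoring treatment response. In addition, the algorithm is multithreaded and runs efficiently. For these reasons, we believe that the proposed method can serve not only scientific research but also as a reliable tool in clinical practice, further advancing precision medicine.
Aperture is implemented in Java, and its source code is publicly available on GitHub. |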
| Abstract (foreign language): |
Circulating tumor DNA (ctDNA) is fragmented tumor-derived DNA found in the bloodstream, with short fragments typically between 50 and 166 bp. Mutations in ctDNA have been identified as excellent biomarkers for early detection and treatment monitoring of cancer. Although ctDNA was first described 50 years ago, base-resolution analysis of ctDNA became possible only recently with next-generation sequencing (NGS) technology. Targeted deep sequencing has allowed the detection of multiple types of cancer-specific mutations in ctDNA, such as single-nucleotide variants (SNVs), small insertions and deletions (InDels), and structural variations (SVs). However, because of the short fragment size and extremely low fraction of ctDNA, accurate detection of SVs in ctDNA, especially translocations and insertions/deletions larger than 50 bp, remains challenging.
Here, we describe a new SV detection tool called Aperture, developed for sensitive detection of the breakpoints introduced by SVs in ctDNA datasets. It is based on (i) a unique k-mer-based search strategy, which uses two different k lengths and spaced seeds to optimize the coverage of repetitive sequences at breakpoints, (ii) a rapid approximation of set intersection to identify breakpoint junctions containing either novo-k-mers or repetitive sequences, and (iii) a barcode-based filtering strategy designed for ctDNA datasets with molecular barcoding.
Starting from raw sequencing data in FASTQ format, Aperture performs a k-mer-based database search involving three different libraries and detects SV breakpoints by rapid approximation of set intersection using binary labels. Aperture then gathers candidate reads with identical junctions or similar genomic positions to achieve fault-tolerant evidence clustering for SV detection. The final output includes the predicted SVs, the number of supporting molecules, the mapping quality of both breakends, and the sequences of identical microhomology at breakpoints.
In a performance test on simulated ctDNA data, Aperture achieved higher sensitivity and specificity than existing SV callers at dilutions ranging from 0.1% to 10%. We also applied Aperture to three real patient ctDNA datasets from different clinical settings. Aperture successfully detected druggable translocations in lung cancer patients and HBV integration in the TERT promoter in liver cancer patients, including a complex rearrangement involving repetitive sequences that was missed by most SV callers.
Since some fusions and viral integrations are closely related to certain tumor types, our work may enhance the diagnostic potential of ctDNA in early cancer detection and help in treatment monitoring. In addition, Aperture runs fast and uses computational resources efficiently. For these reasons, we believe that the method proposed in this study will not only be helpful to the bioinformatics community but also offer a reliable tool for research and clinical practice and help advance precision medicine. Implemented in Java, Aperture is available as an open-source tool on GitHub. |
| Release date: | 2022-05-28 |
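
The abstract describes breakpoint detection as a rapid approximation of set intersection using binary labels. The record gives no implementation details, so the following is only a minimal Python sketch of that general idea, not Aperture's actual code (which is written in Java). The names `make_label` and `split_points`, the 64-bit label width, and the integer locus identifiers are all hypothetical choices made for illustration. Each k-mer's candidate loci are folded into one bitmask, and a bitwise AND across consecutive k-mers stands in for intersecting their loci sets; when the AND becomes zero, the read may cross a junction.

```python
from typing import Iterable, List

LABEL_BITS = 64  # label width; an assumption made for this sketch


def make_label(locus_ids: Iterable[int]) -> int:
    """Fold every candidate locus of a k-mer into one fixed-width bitmask."""
    label = 0
    for locus in locus_ids:
        label |= 1 << (locus % LABEL_BITS)
    return label


def split_points(kmer_labels: List[int]) -> List[int]:
    """Return indices of k-mers at which the running intersection becomes empty.

    A bitwise AND over labels approximates intersecting the loci sets. It is
    approximate because distinct loci can share a bit. An empty result means
    the k-mers seen so far no longer share a locus: a candidate junction.
    """
    points = []
    running = (1 << LABEL_BITS) - 1  # start with every locus still possible
    for i, label in enumerate(kmer_labels):
        if label == 0:  # k-mer absent from the index (unseen k-mer): skip it
            continue
        if running & label == 0:
            points.append(i)
            running = label  # restart the segment from the new loci set
        else:
            running &= label
    return points


# Toy read (hypothetical data): k-mers 0-2 hit locus 3, and k-mer 1 is
# repetitive (it also hits loci 10 and 40); k-mer 3 spans the junction and is
# not in the index; k-mers 4-5 hit locus 40.
read_labels = [
    make_label([3]),
    make_label([3, 10, 40]),
    make_label([3]),
    make_label([]),
    make_label([40]),
    make_label([40, 10]),
]
print(split_points(read_labels))
```

In this toy read the repetitive k-mer does not trigger a split, because its label still shares a bit with its neighbours; the split is reported only where the loci sets stop overlapping. The price of the fixed-width label is that two different loci can map to the same bit and hide a junction, which is why such a result would be treated as a candidate to be confirmed by later clustering and filtering.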