Thesis Information

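Thesis title (Chinese):	一种用于差异基因选择的新工具的开发 (Development of a new tool for differential gene selection)
Author:	Zhang Yinan (章轶男)
Thesis language:	chi
Degree:	Master's
Degree type:	Academic degree
Institution:	Peking Union Medical College (北京协和医学院)
Department:	Institute of Hematology, Peking Union Medical College (北京协和医学院血液学研究所)
Major:	Clinical Medicine-★Stem Cells and Regenerative Medicine
Supervisor:	Liu Hanzhi (刘汉芝)
Thesis completion date:	2021-05-10
Thesis title (foreign language):	Development of a new tool for highly variable gene selection
Keywords (Chinese; translated):	Hematopoietic cells; gene selection; single-cell transcriptome
Keywords (foreign language):	Hematopoietic stem and progenitor cells; HVG selection; scRNA-seq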
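Abstract (Chinese; translated):	︿Purpose: In recent years, advances in next-generation sequencing (NGS) have driven the development of NGS-based genomics, transcriptomics, and epigenomics. Most notably, the development and application of single-cell omics methods have brought breakthroughs across research fields. At the same time, high-throughput sequencing produces massive amounts of data, and how to process those data so that they reflect biological meaning has become a central concern. The analysis workflow for single-cell transcriptome sequencing data mainly comprises quality control, mapping, normalization, selection of highly variable genes (HVGs), and subsequent analysis, and every processing step strongly affects downstream analysis. Among these steps, HVG selection plays a decisive role in downstream analyses, including dimensionality reduction and clustering. In a single cell, the expression of thousands to tens of thousands of genes can often be detected, but only a small fraction of these genes are specific, that is, usable for distinguishing cell types or directly related to cell growth and differentiation. Selecting the most biologically meaningful HVGs is therefore crucial. Although many tools are available for HVG selection, it remains a difficult problem, because these tools have not been systematically evaluated and no gene selection method is generally applicable across different data conditions. The hematopoietic system helps maintain the normal physiological activity of the body and is closely linked to aging, disease, and even the onset and progression of tumors. Hematopoietic stem cells (HSCs) are adult stem cells of the blood system with long-term self-renewal and differentiation potential, and they can differentiate into many kinds of hematopoietic progenitor cells. Research on hematopoietic stem and progenitor cells offers important guidance for research on other stem cells and related diseases. Beyond stem and progenitor cells, hematopoietic development involves multiple lineages, each with a rich cellular composition, in which precursor cells and mature cells each perform their own roles and together build a complete hematopoietic environment. The transcriptomes of hematopoietic stem and progenitor cells are similar, whereas those of mature cells from different lineages differ greatly, and this heterogeneity makes the hematopoietic system well suited to analysis by single-cell transcriptomics. A good algorithm should adapt to all kinds of data and, whether the cell composition is uniform or diverse, accurately select the most biologically meaningful HVGs. This study aims to use transcriptome data from various cells of the hematopoietic system to evaluate the HVG selection methods in common use and, on that basis, to design and develop a new tool that removes noise and selects the most biologically meaningful HVGs.
Content: The performance of nine commonly used HVG selection methods was systematically compared. Using multiple rounds of random sampling, we propose a new strategy for selecting HVGs, SIEVE. Further evaluation of SIEVE-processed versus unprocessed groups shows that the strategy effectively reduces random noise and that the resulting HVG set provides a more stable basis for subsequent advanced analysis.
Methods: The transcriptome data were taken from hematopoietic cell expression profiles published by our group: stem and progenitor cells comprising 8 cell types and 508 cells, and mature cells comprising 16 cell types and 4411 cells. The data were preprocessed and normalized with Seurat (V4). The HVG selection methods chosen were M3Drop, scmap, scran, singleCellHaystack, Seurat, and ROGUE. The methods were evaluated comprehensively along three dimensions: purity, reproducibility, and accuracy. The performance of SIEVE was tested in terms of accuracy, the proportion of marker genes, and other aspects.
Results: Different metrics were compared across the gene selection methods. Among the methods chosen here, singleCellHaystack performed best: the purity of each cluster after clustering was very high, the selected HVGs gave high accuracy when used for cell classification, and the HVG sets selected in different replicates overlapped strongly, which indicates the precision and stability of this method. We also developed an R package, SIEVE, which can be layered on top of other methods during gene selection to better remove noise and extract HVGs for subsequent analysis. In our data, for data sets with similar transcriptomes such as stem and progenitor cells, and for all the listed methods except singleCellHaystack (SCHS), SIEVE did minimize random noise through repetition and identify the core genes. In addition, SIEVE retains information on genes with low expression levels, and for methods with low reproducibility it significantly improves the accuracy of single-cell classification.﹀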
Abstract (foreign language):	︿Purpose: In recent years, next-generation sequencing (NGS) technology has developed rapidly, and NGS-based genomics, transcriptomics, and epigenomics are being used to study individual cells. This has advanced single-cell RNA sequencing (scRNA-seq), which allows the transcriptomes of thousands of single cells in complex multicellular organisms to be studied. In addition, increasingly sensitive and automated methods are being developed, aiming to provide better data while reducing time and operating costs. High-throughput sequencing yields a large amount of data, generating millions of reads or more per run. How to process these data so that they reflect biological significance has therefore become a focus of attention, because processing has a great impact on downstream analysis. The data analysis workflow mainly includes quality control, mapping, normalization, selection of highly variable genes (HVGs), and subsequent analysis. Among these steps, HVG selection has a great impact on downstream analysis, including dimensionality reduction and clustering. In a gene expression data set, the number of measurable genes in each sample generally reaches thousands or even tens of thousands, but only a small fraction of these genes is biologically meaningful and can be used to distinguish cell types. Choosing how to extract and select HVGs is therefore very important. Although researchers have developed more and more tools for selecting HVGs, systematic evaluation of their performance is still lacking, as is a generally applicable method. The hematopoietic system participates in maintaining the normal physiological activities of the body and is closely related to aging, disease, and even the occurrence and development of tumors. Hematopoietic stem cells (HSCs) are adult stem cells of the blood system with long-term self-renewal and differentiation potential, and they can differentiate into a variety of hematopoietic progenitor cells. Research on stem and progenitor cells of the hematopoietic system offers important guidance for research on other types of stem cells and on various diseases. In addition to stem and progenitor cells, the development of the hematopoietic system involves multiple lineages, each with multiple cell types, in which precursor cells and mature cells each perform their own roles and jointly build a complete hematopoietic environment. The transcriptomes of hematopoietic stem and progenitor cells are similar, but the transcriptomes of mature cells of different lineages are quite different. A good gene selection algorithm should be applicable to various types of data: whether the cell composition of the data is similar or diverse, it should accurately select the most biologically significant HVGs to support subsequent dimensionality reduction and clustering. The purpose of this study is to use transcriptome data from various cells of the hematopoietic system to evaluate the HVG selection methods commonly used today and, on this basis, to design and develop a new tool that selects the most biologically significant HVGs and eliminates as much noise as possible.
Content: Different metrics were compared across gene selection methods. An R package, SIEVE, was developed; it can be combined with other methods during gene selection to better remove noise and extract HVGs for subsequent analysis. The performance of SIEVE was evaluated.
Methods: The transcriptome data were selected from hematopoietic cell expression profiles previously published by our group. The data were preprocessed using Seurat V4. The HVG selection methods chosen were M3Drop, scmap, scran, singleCellHaystack, Seurat, and ROGUE. The methods were evaluated in terms of purity, reproducibility, and accuracy, and the performance of SIEVE was tested in terms of accuracy and the proportion of marker genes.
Results: The performance of nine commonly used HVG selection methods was compared. singleCellHaystack was found to be superior to the other methods in reproducibility and accuracy, although it tends to select genes with high expression levels. We also proposed a new strategy, SIEVE, which uses multiple rounds of random sampling to minimize random noise and determine a reliable set of HVGs. In addition, SIEVE can retain information on genes with low expression levels, and for methods with lower reproducibility it can significantly improve the accuracy of single-cell classification.﹀
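The abstract describes SIEVE only at a high level: run a gene selection method over multiple rounds of random sampling and keep the genes that are selected reliably. The following is a minimal Python sketch of that general idea, not the SIEVE R package itself. All function names, parameters, default values, and the toy data are hypothetical, and the simple variance ranking is only a stand-in for a real base method such as those listed in the Methods.

```python
import random
from collections import Counter
from statistics import pvariance


def top_variance_genes(expr, cells, n_top):
    """Stand-in base selector (assumption): rank genes by variance over `cells`."""
    scores = {gene: pvariance([values[c] for c in cells])
              for gene, values in expr.items()}
    # Sort by descending variance; break ties by gene name for determinism.
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return set(ranked[:n_top])


def resampled_selection(expr, n_top, rounds=50, fraction=0.7,
                        min_freq=0.9, seed=0):
    """Keep genes chosen by the base selector in at least `min_freq` of rounds.

    `expr` maps gene name -> list of expression values, one per cell.
    `rounds`, `fraction`, and `min_freq` are illustrative defaults only.
    """
    n_cells = len(next(iter(expr.values())))
    sample_size = max(2, int(n_cells * fraction))
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(rounds):
        # Draw a random subset of cells without replacement.
        cells = rng.sample(range(n_cells), sample_size)
        counts.update(top_variance_genes(expr, cells, n_top))
    return sorted(g for g, c in counts.items() if c / rounds >= min_freq)


def jaccard(a, b):
    """Overlap ratio of two gene sets, one possible reproducibility measure."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Toy data (hypothetical): genes A and B vary across many cells,
# C is constant, and D is driven by a single outlier cell.
expr = {
    "A": [0, 0, 0, 0, 0, 10, 10, 10, 10, 10],
    "B": [0, 8, 0, 8, 0, 8, 0, 8, 0, 8],
    "C": [5] * 10,
    "D": [0, 0, 0, 0, 0, 0, 0, 0, 0, 6],
}
selected = resampled_selection(expr, n_top=3)
print(selected)
```

In this toy example, genes A and B rank in the top three for every subsample, so they always pass the frequency filter. Gene D ranks highly only in rounds where the single outlier cell happens to be sampled, so a strict frequency threshold tends to filter it out. The `jaccard` helper illustrates one way the overlap between gene sets from different replicates could be quantified; the source does not state which overlap measure the thesis uses.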
Open-access date:	2021-06-09
