Thesis information

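Thesis title (Chinese):

 基于公共数据库和PubMed的单基因遗传性疾病基因变异数据库的建立方法和应用 (Construction Method and Application of Gene Variant Databases for Monogenic Genetic Diseases Based on Public Databases and PubMed)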

Name:

 Cao Zongfu (曹宗富)

Thesis language:

 chi    

Degree:

 Doctorate

Degree type:

 Academic degree

Institution:

 Peking Union Medical College

Department:

 National Research Institute for Family Planning (国家人口计生委科学技术研究所)

Major:

 Public Health and Preventive Medicine - Epidemiology and Health Statistics

Supervisor:

 Ma Xu (马旭)

On-campus advisory committee members (comma-separated):

 Cheng Yimin (程怡民)

Thesis completion date:

 2016-05-28    

Thesis title (foreign language):

 The Construction Method and Application of the Phenotype-Gene-Variant Databases for Monogenic Disorders Based on the Public Databases and PubMed    

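Keywords (Chinese):

 公共数据库 (public databases); 表型基因变异数据库 (phenotype-gene-variant database); 遗传检测 (genetic testing); 文本挖掘 (text mining); 精准医学 (precision medicine)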

Keywords (foreign language):

 public databases; phenotype-gene-variant database; genetic testing; text mining; precision medicine

Abstract (Chinese):

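Background: Medicine is entering the era of precision medicine, and genetic diseases, especially monogenic disorders, are one of its major directions. Precision medicine requires genetic testing of targeted regions of one or more genes associated with one or more diseases in an individual, and genetic testing requires the development of corresponding testing products. Building a phenotype-gene-variant relationship database is essential for genetic testing. Having geneticists collect this information manually from large numbers of research reports and publications is time-consuming and highly error-prone. Obtaining this information quickly and accurately has therefore become a bottleneck that urgently needs to be resolved. The aim of this study is to develop an automated or semi-automated pipeline or tool that retrieves the genes and variants associated with a specific phenotype from public databases such as HPO, Orphanet, OMIM, ClinVar and UniProt and from the PubMed literature knowledge base, and performs a preliminary integration, to facilitate the construction of phenotype-gene-variant databases.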

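Methods: Through a database survey, the content and quality of public databases related to phenotype-gene-variant relationships were evaluated, the relationships among these databases were clarified, and the public databases suitable for use were preliminarily identified. For the selected databases, the internal structure, data sources, data types and formats, and links to other databases were determined. An information-retrieval framework and information flow were then designed and implemented in R, and phenotype-gene-variant relationships were extracted and integrated according to relevant international standards. Text mining was used to extract disease-related genes and variants from the PubMed literature knowledge base. The R package VarfromPDB was developed and compiled, and an automated or semi-automated pipeline for information extraction and integration was established. VarfromPDB was tested and evaluated using hereditary nonsyndromic hearing loss as an example, and its application to genetic testing in precision medicine was explored using hereditary congenital cataract as an example.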

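Results: The VarfromPDB package was developed in R and released. It provides multiple functions that extract disease-related genes and variants from public databases such as HPO, Orphanet, OMIM, ClinVar and UniProt and from the PubMed literature knowledge base. An automated information-extraction pipeline was built on the R VarfromPDB package and achieves a preliminary integration of the extracted gene and variant information. Using this pipeline, genes and variants related to hearing loss were extracted, and those related to nonsyndromic hearing loss were selected by manual review. Comparison with an existing foreign hereditary hearing loss database showed that these genes included not only the 92 nonsyndromic hearing loss genes in that database but also 37 additional nonsyndromic hearing loss genes, suggesting that obtaining the genes associated with monogenic disorders by this method is feasible. Furthermore, a congenital cataract gene-variant database was built with the same method, and on this basis 26 exons in 18 frequently mutated genes were selected as mutation hotspot regions. In 42 congenital cataract probands and other family members, 267 individuals in total, the 26 exons were screened by direct sequencing. Possibly pathogenic mutations were found in 59.52% of probands, comprising 10 known and 15 novel mutations; the detection rate in the 30 autosomal dominant families was 76.67%. The 15 novel mutations are: five heterozygous mutations in HSF4 (c.314G>C, p.S105T; c.218G>T, p.R73L; c.233A>G, p.Y78C; IVS5 c.233-1G>A; c.187T>C, p.F63L), three heterozygous mutations in GJA8 (c.569A>G, p.N190S; c.914G>A, p.S305N; c.9C>G, p.D3E), two heterozygous mutations in CRYBB2 (c.463C>A, p.Q155K; c.355G>A, p.G119R), c.35G>T (p.R12L) in CRYAA, c.349G>C (p.A117P) in MIP, IVS1 c.10-1G>A in CRYGC, c.346delT (p.F116Sfsx29) in CRYGD, and c.2870G>C (p.R957P) in EPHA2.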

R VarfromPDB: https://cran.r-project.org/web/packages/VarfromPDB/index.html

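Conclusions: R VarfromPDB makes it possible to retrieve genes and variants associated with genetic diseases rapidly from public databases and the PubMed literature knowledge base by automated or semi-automated means. Its exploratory application in congenital cataract probands suggests that the approach is feasible and efficient. Its significance lies in constructing phenotype-gene-variant databases that provide a reference for the development of genetic testing products and the interpretation of results, promoting the translation of existing research evidence into clinical application and supporting the development of precision medicine.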

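Innovation: An automated or semi-automated pipeline was built in R to retrieve genes and variants associated with genetic diseases from public databases and literature knowledge bases, in order to build phenotype-gene-variant databases for monogenic disorders.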

Abstract (foreign language):

Background: Medicine is entering the era of precision medicine. Genetic diseases, especially monogenic disorders, are one of the main fields of precision medicine. Targeted sequencing and genotyping is a feasible way to balance requirements against cost, and many genetic testing products based on targeted sequencing and genotyping will be developed in the next few years. A gene-variant-phenotype database for a specific genetic disease or phenotype needs to be compiled during product development or genetic research. However, with the rapid development of genomic technology, information on these relationships is growing explosively every day. Capturing it manually from the literature is time-consuming and error-prone. How to capture the information rapidly and accurately is a bottleneck to be addressed. The objective here is to develop an automated or semi-automated tool that extracts the genes and variants related to a specific genetic disease from public databases and PubMed abstracts, and integrates the information to construct a phenotype-gene-variant database.

Methods: First, public databases were surveyed: their content and quality were evaluated, the relationships between them were clarified, and the usable databases were preliminarily selected. For the selected databases, the internal structure, data sources, data types and formats were determined. An information-extraction framework and data flow were then designed and implemented in R. Information was extracted from the databases one by one and finally integrated. Text mining was employed to capture information from PubMed abstracts: genes were extracted with a dictionary-based method using the HUGO Gene Nomenclature Committee (HGNC) database, and variants were captured with regular expressions following the nomenclature recommendations of the HGVS (Human Genome Variation Society). The tool was tested and evaluated by comparison with the Hereditary Hearing Loss Homepage (DeafnessDB). Finally, genes and variants causing congenital cataract were compiled with the tool, and candidate exons were directly sequenced in 42 Chinese probands with congenital cataract and related individuals.
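
The text-mining step described above (dictionary lookup for gene symbols, regular expressions for variants) can be sketched as follows. This is a minimal Python illustration, not the VarfromPDB code, which is written in R. The small gene list, the simplified patterns, and the function names `extract_genes` and `extract_variants` are assumptions made for the example; the patterns cover only a small part of the HGVS nomenclature.

```python
import re

# Hypothetical mini-dictionary for illustration; a real run would load the
# full list of approved symbols from an HGNC download.
GENE_SYMBOLS = {"HSF4", "GJA8", "CRYBB2", "CRYAA", "MIP", "CRYGC", "CRYGD", "EPHA2"}

# Simplified HGVS-style patterns (assumptions, not the full HGVS grammar):
#   coding substitutions, e.g. c.314G>C or the splice-site form c.233-1G>A
#   coding deletions, e.g. c.346delT
#   one-letter protein substitutions, e.g. p.S105T
CDNA_SUB = re.compile(r"\bc\.\d+(?:[+-]\d+)?[ACGT]>[ACGT]")
CDNA_DEL = re.compile(r"\bc\.\d+(?:_\d+)?del[ACGT]*")
PROTEIN = re.compile(r"\bp\.[A-Z]\d+[A-Z]\b")
TOKEN = re.compile(r"[A-Za-z0-9]+")


def extract_genes(text, symbols=GENE_SYMBOLS):
    """Dictionary-based lookup: keep tokens that are exact, case-sensitive symbol matches."""
    tokens = set(TOKEN.findall(text))
    return sorted(tokens & symbols)


def extract_variants(text):
    """Collect variant mentions, pattern by pattern, in order of appearance per pattern."""
    found = []
    for pattern in (CDNA_SUB, CDNA_DEL, PROTEIN):
        found.extend(pattern.findall(text))
    return found


if __name__ == "__main__":
    abstract = (
        "A novel heterozygous mutation c.314G>C (p.S105T) in HSF4 and a splice "
        "change c.233-1G>A were found; c.346delT was seen in CRYGD."
    )
    print(extract_genes(abstract))     # gene symbols found in the sentence
    print(extract_variants(abstract))  # cDNA substitutions, then deletions, then protein changes
```

Exact token matching keeps the lookup simple but would miss gene aliases and would misread ordinary words that happen to coincide with a symbol, which is one reason manual review of the extracted list remains useful.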

Results: We developed a new R package, VarfromPDB, to capture the genes and variants related to a genetic disorder from multiple public databases, including HPO (Human Phenotype Ontology), Orphanet, OMIM (Online Mendelian Inheritance in Man), ClinVar and UniProt (Universal Protein Resource). Gene-variant-phenotype relationships can be integrated through HPO phenotypes, HGNC (HUGO Gene Nomenclature Committee) approved symbols and the HGVS (Human Genome Variation Society) variant nomenclature recommendations. Additional information can be captured by text mining of PubMed abstracts. VarfromPDB captured all 92 nonsyndromic genes in DeafnessDB plus 37 additional genes. Furthermore, a congenital cataract database was compiled, and 26 frequently mutated exons in 18 genes were selected for genetic testing. Possibly pathogenic mutations were identified in 59.52% of probands with congenital cataract, comprising 10 known and 15 novel mutations. The detection rate was 76.67% in autosomal dominant families. The novel changes include five heterozygous mutations (c.314G>C, p.S105T; c.218G>T, p.R73L; c.233A>G, p.Y78C; IVS5 c.233-1G>A; c.187T>C, p.F63L) in HSF4, three heterozygous mutations (c.569A>G, p.N190S; c.914G>A, p.S305N; c.9C>G, p.D3E) in GJA8, two heterozygous mutations (c.463C>A, p.Q155K; c.355G>A, p.G119R) in CRYBB2, c.35G>T (p.R12L) in CRYAA, c.349G>C (p.A117P) in MIP, IVS1 c.10-1G>A in CRYGC, c.346delT (p.F116Sfsx29) in CRYGD, and c.2870G>C (p.R957P) in EPHA2.
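
The reported detection rates can be checked with simple arithmetic. The short Python sketch below does so under two labeled assumptions: that each positive proband carries exactly one of the 25 reported mutations (10 known plus 15 novel), and that 23 of the 30 autosomal dominant families were positive. The count of 30 families comes from the Chinese abstract; the count of 23 is inferred from the 76.67% rate and is not stated in the record.

```python
def detection_rate(positive, total):
    """Percentage of screened units with a candidate mutation, rounded to 2 decimals."""
    return round(100.0 * positive / total, 2)


known, novel = 10, 15               # mutation counts stated in the abstract
probands = 42                       # probands stated in the abstract
positive_probands = known + novel   # assumption: one mutation per positive proband

ad_families = 30                    # stated in the Chinese abstract
positive_ad_families = 23           # assumption: inferred from the reported 76.67%

print(detection_rate(positive_probands, probands))        # 59.52
print(detection_rate(positive_ad_families, ad_families))  # 76.67
```

Both values agree with the percentages reported above, which is consistent with one candidate mutation per positive proband.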

VarfromPDB can be reached at https://cran.r-project.org/web/packages/VarfromPDB/index.html.

Conclusions: R VarfromPDB can rapidly extract the genes and variants related to a genetic disease in an automated or semi-automated mode, and its application was explored in congenital cataract. The approach is feasible and efficient. It helps to compile phenotype-gene-variant databases, informs the development and interpretation of genetic testing products, and accelerates the development of precision medicine.

Open access date:

 2016-05-28    
