Thesis information

Thesis title (Chinese):

 生物医学本体支持的元数据异质性研究与标准化应用    

Author:

 张璐璐    

Thesis language:

 chi    

Degree:

 Master's

Degree type:

 Academic degree

University:

 Peking Union Medical College

Department:

 Institute of Basic Medical Sciences, Peking Union Medical College

Major:

 Biomedical Engineering (Engineering) - Biomedical Engineering

Supervisor:

 杨啸林    

On-campus advisory committee members (comma-separated):

 彭屹, 王志刚, 张正国

Thesis completion date:

 2019-05-01    

Thesis title (foreign language):

 Study on Metadata Heterogeneity and its Standardized Application Supported by Biomedical Ontology    

Keywords (Chinese):

 元数据管理 通用数据元素 本体 语义网 机器学习    

Keywords (foreign language):

 metadata management; common data element; ontology; semantic web; machine learning

Abstract (Chinese):

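Abstract
Background: Data have become an important driving force in biomedical development, and a key step in turning data into knowledge is making data more machine-understandable. Using common data elements (CDEs) is an important means of improving machines' understanding of metadata. As more and more data in the biomedical field become shareable, the number of data elements included in CDE repositories is also growing rapidly, so exploring how to improve the quality of CDEs is important for promoting data integration and sharing.
Methods: On the one hand, this study established a semantically supported CDE representation model based on the ISO/IEC 11179 standard and, on the basis of that model, built a shareable, reusable, and semantically supported CDE repository. In this part of the study, the data items in the repository were first determined preliminarily from the National Physique and Health Database, and the CDE data set was formed by reusing CDEs from caDSR and creating new ones. The CDEs were then represented in OWL according to the model, and their quality was checked with Semantic Web tools. Finally, a graph database was used to store the files and to provide complex SPARQL queries.
On the other hand, this study examined the heterogeneity among metadata in the biomedical field and built a model that automatically predicts the compatibility between metadata. In this part of the study, data elements from epidemiological surveys that are closely related to clinical trials were first selected from NCI caDSR, a public database widely used internationally. The essential components of the data elements were extracted according to the CDE representation model, and, with the support of NCIT (National Cancer Institute Thesaurus), an ontology-based semantic similarity method was used to compute the similarity between the corresponding essential components of every pair of related data elements. Finally, based on the similarity values between CDE components, a support vector machine (SVM) was used to predict the compatibility between related data elements.
Results: This study built a general representation model for data elements. The model is based on the core components of the ISO/IEC 11179 metadata standard, specifies how ontology terms are used to achieve semantic standardization, defines the relationships among these core components, assigns a unique identifier to each data element, and is expressed in OWL. Using this model, the data elements of the National Physique and Health Database were stored in and retrieved from a graph database. In the study of metadata heterogeneity in the caDSR database, the results show considerable heterogeneity at the conceptual level of the metadata. Even among data elements that manual review judged could be unified, the conceptual-level definitions differ markedly. An SVM was used to judge whether data elements can be integrated; the model's overall accuracy across three groups (directly integrable, integrable after manual intervention, and not integrable) was 81.67%.
Conclusions: This study established a general data element representation model that conforms to the FAIR principles and, on that basis, built a reference CDE repository around the data elements of the National Physique and Health Database, providing a preliminary, feasible solution to the data integration and sharing problems caused by data heterogeneity. In response to the serious heterogeneity of data elements in current CDE databases, this study built a CDE compatibility prediction model that gives users tool support for working with existing CDEs. This work provides technical and tool support for improving metadata quality and, in turn, data quality.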

Abstract (foreign language):

Background: Data have become an important driving force in biomedical development. A key step in turning data into knowledge is making data more machine-understandable. Using common data elements (CDEs) is an important means of improving machines' understanding of metadata. As the volume of shared data in the biomedical field grows, the number of data elements stored in open databases is also increasing rapidly. It is therefore important to study how to use common data elements effectively to promote data integration and sharing.

Methods: On the one hand, we established a semantically supported CDE representation model based on the ISO/IEC 11179 standard and used it to construct a reusable common data element database. In this part of the study, the data items in the database were preliminarily determined from the National Physique and Health Database, and the CDE set was constructed by reusing CDEs from caDSR and creating new ones. All CDEs were then represented in OWL format, and their quality was checked with Semantic Web tools. Finally, a graph database was used to store the files and to provide complex SPARQL queries.
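The representation step can be illustrated with a minimal sketch. It assumes a data element is described by three ISO/IEC 11179 components (object class, property, value domain) and is flattened into subject-predicate-object triples. The class name `DataElement`, the `ex:` identifiers, and the `match` helper are hypothetical, and a plain Python list stands in for the OWL file and graph database described in the thesis; this is not the author's implementation.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class DataElement:
    """Hypothetical CDE record with three ISO/IEC 11179-style components."""
    identifier: str      # unique identifier of the data element
    object_class: str    # what the element is about
    property_term: str   # the characteristic being recorded
    value_domain: str    # the permitted values or unit

    def to_triples(self) -> Iterator[Triple]:
        # Predicate names are illustrative, not taken from the thesis.
        yield (self.identifier, "rdf:type", "ex:DataElement")
        yield (self.identifier, "ex:hasObjectClass", self.object_class)
        yield (self.identifier, "ex:hasProperty", self.property_term)
        yield (self.identifier, "ex:hasValueDomain", self.value_domain)


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


elements = [
    DataElement("ex:CDE_0001", "ex:Person", "ex:BodyWeight", "ex:Kilogram"),
    DataElement("ex:CDE_0002", "ex:Person", "ex:BodyHeight", "ex:Centimeter"),
]
graph = [t for e in elements for t in e.to_triples()]

# Roughly comparable to the SPARQL pattern
#   SELECT ?cde WHERE { ?cde ex:hasProperty ex:BodyWeight }
hits = [s for s, _, _ in match(graph, p="ex:hasProperty", o="ex:BodyWeight")]
print(hits)  # ['ex:CDE_0001']
```

In a real setting the triples would be loaded into a triple store and queried with SPARQL; the single-pattern `match` here only mimics the simplest kind of query.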

On the other hand, we studied the heterogeneity of metadata in the biomedical field and established a model that predicts the compatibility between two related data elements. First, we selected data elements from epidemiological investigations in the public NCI caDSR database and extracted their essential components according to the CDE representation model. Second, with the support of NCIT (National Cancer Institute Thesaurus), we calculated the similarity between the corresponding components of each pair of data elements using an ontology-based semantic similarity method. Finally, we built a support vector machine (SVM) model that predicts the compatibility between data elements from the semantic similarity of their CDE components.
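The similarity step can be sketched as follows. The abstract does not say which ontology-based measure was used, so this example assumes Wu-Palmer similarity, 2·depth(lcs) / (depth(a) + depth(b)), computed over a toy is-a hierarchy. The terms in `PARENT` are hypothetical placeholders rather than real NCIT concepts, and the component names are illustrative.

```python
from typing import Dict, List, Optional

# Toy is-a hierarchy (child -> parent). Hypothetical terms, not NCIT content.
PARENT: Dict[str, Optional[str]] = {
    "Thing": None,
    "Person": "Thing",
    "Patient": "Person",
    "Measurement": "Thing",
    "BodyWeight": "Measurement",
    "BodyHeight": "Measurement",
    "Unit": "Thing",
    "Kilogram": "Unit",
    "Pound": "Unit",
}


def path_to_root(term: str) -> List[str]:
    """Return the term followed by its ancestors, ending at the root."""
    path: List[str] = []
    node: Optional[str] = term
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(term: str) -> int:
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(term))


def wu_palmer(a: str, b: str) -> float:
    """Wu-Palmer similarity based on the deepest common ancestor."""
    ancestors_a = set(path_to_root(a))
    # Walking upward from b, the first shared node is the deepest one.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


COMPONENTS = ("object_class", "property", "value_domain")


def similarity_vector(cde_a: Dict[str, str], cde_b: Dict[str, str]) -> List[float]:
    """One similarity value per component; a feature vector for a classifier."""
    return [wu_palmer(cde_a[c], cde_b[c]) for c in COMPONENTS]


cde_a = {"object_class": "Patient", "property": "BodyWeight", "value_domain": "Kilogram"}
cde_b = {"object_class": "Person", "property": "BodyWeight", "value_domain": "Pound"}
features = similarity_vector(cde_a, cde_b)
print(features)
```

Vectors like `features`, paired with compatibility labels, are the kind of input an SVM classifier (for example `sklearn.svm.SVC`) could be trained on; the thesis abstract does not specify the toolkit or kernel that was used.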

Results: In this study, we first built a representation model for data elements based on the ISO/IEC 11179 metadata standard. The model specifies how ontologies are used for semantic standardization and defines the relationships between the essential components of a CDE. Each CDE is assigned a unique identifier, and the model is expressed in OWL format. Data elements from the National Physique and Health Database were stored in a graph database and can be retrieved on the basis of the representation model. The results show that heterogeneity between common data elements is apparent in the metadata definitions in the caDSR database, especially in the conceptual domain. We therefore built an SVM model to predict the interoperability between CDEs. After parameter optimization, the overall accuracy reached 81.67% in three-category classification.

Conclusions: In this study, we established a common representation model for data elements that meets the FAIR principles. Based on this model, we built a reference common data element database from the data items in the National Physique and Health Database, which provides a feasible solution to the data integration difficulties caused by data heterogeneity. In view of the serious heterogeneity of data elements in current CDE databases, we also constructed a prediction model of CDE compatibility, which gives users technical support for working with existing CDEs. Our study provides technical support and tools for improving both metadata quality and data quality.

Release date:

 2019-05-31    
