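Thesis title (Chinese): | 融合医学知识组织体系的主题模型优化研究 |
Name: | |
Thesis language: | chi |
Degree: | Master's |
Degree type: | Academic degree |
Institution: | Peking Union Medical College |
Department: | |
Major: | |
Supervisor name: | |
On-campus supervisory team member names (comma-separated): | |
Thesis completion date: | 2024-05-03 |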
Thesis title (foreign language): | Research on Topic Model Optimization by Integrating Medical Knowledge Organization System |
Keywords (Chinese): | |
Keywords (foreign language): | Topic Model; Knowledge Organization System; Topic Hierarchy; Natural Language Processing |
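Abstract (Chinese, translated into English): |
Topic models are the main method used in library and information science to identify the topics of a subject area automatically. However, as research in every field deepens, develops rapidly, and becomes increasingly interdisciplinary, existing topic models struggle to meet the demands of intelligent analysis of scientific and technical intelligence in the big-data era. How to use scientific literature to identify a field's research topics efficiently, accurately, and in depth, and to present its knowledge structure in a multi-dimensional way, is therefore of considerable practical significance.
This study addresses the poor interpretability and low distinguishability of topics produced by current literature-based topic identification methods. Because current models yield only flat topic distributions that lack semantic associations, the study optimizes topic models by incorporating the semantic and hierarchical information of a medical knowledge organization system. Classifying topics hierarchically sorts out the semantic associations among the mixed words under each topic and establishes a hierarchical structure, which improves topic interpretability and distinguishability.
For the contextual-embedding clustering topic model, topic identification first incorporates semantic information: the model uses MeSH entry terms to merge synonyms and variant forms. Topics are then classified hierarchically using the semantic hierarchy. For words covered by MeSH, the categories of the main headings and subheadings, the semantic dependencies of their corresponding concepts, and the hierarchical structure are used to classify topics hierarchically. For out-of-vocabulary words, a relative distance is introduced on top of cosine similarity and the words are re-clustered in spherical space, so that topics are classified hierarchically and automatically. Finally, topics are merged according to the semantic information of MeSH main heading and subheading combinations and the document-level co-occurrence constraints on topics.
For the neural topic model, topic identification likewise first incorporates semantic information: the model uses MeSH entry terms to assign words accurately to their categories. Topics are then classified hierarchically by combining the MeSH hierarchy with the hierarchical structure of the SawETM model to build a clear and complete semantic hierarchy; the semantic concepts and hypernym-hyponym relations in the new hierarchy guide the hierarchical modeling of topics.
Tumor immunotherapy was chosen as the field for the empirical study, and the effect of the optimization was evaluated both quantitatively and qualitatively. For the contextual-embedding clustering model, the optimized model based on the PubMedBERT+SK framework achieved the best overall performance. Several topic coherence and topic diversity metrics improved over the model without the medical knowledge organization system, most notably Cv, PUW, and WE-IRBO, which rose by 0.118, 0.253, and 0.099 respectively. The model also outperformed three existing hierarchical topic models, most notably on topic coherence: UMass, NPMI, and Cv exceeded the best values among the three models by 1.517, 0.013, and 0.109 respectively. The model identified the knowledge structure of the field fairly clearly; compared with the model without the medical knowledge organization system, its topics carried more diverse semantic information, had tighter semantic associations, and formed a clearer and more independent hierarchy. The model is suited to helping interdisciplinary researchers quickly grasp the overall progress and knowledge structure of a field.
For the optimized neural topic model, several topic coherence and topic diversity metrics improved over the model without the medical knowledge organization system, most notably UMass, Cv, PUW, and WE-PD, which rose by 0.481, 0.111, 0.203, and 0.12 respectively. The model also improved on the variant that incorporates the general-purpose WordNet vocabulary, most notably on UMass and PUW, which rose by 0.481 and 0.263 respectively. It outperformed three existing hierarchical topic models as well, most notably on topic coherence: NPMI and Cv exceeded the best values among the three models by 0.161 and 0.161 respectively. The model identified the knowledge structure of the field fairly clearly; compared with the model without the medical knowledge organization system, it organized the mixed words under the original model's topics clearly and thus produced finer-grained topics with overlapping memberships that better match domain understanding. The model is suited to helping researchers identify topics in complex, multidisciplinary fields and to clarifying a discipline's complex knowledge structure and internal relationships. |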
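The entry-term merging step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a lookup table from MeSH entry terms to their preferred headings has already been built; the table `ENTRY_TERM_TO_HEADING` below is a tiny hand-written illustration, and the helper names `normalize_tokens` and `term_counts` are hypothetical, not taken from the thesis. The thesis's actual matching procedure (tokenization, multi-word terms, case handling) is not described in this record.

```python
from collections import Counter

# Illustrative toy mapping (assumption): entry term -> preferred heading, lowercased.
# A real table would be derived from the MeSH vocabulary files.
ENTRY_TERM_TO_HEADING = {
    "tumor": "neoplasms",
    "tumors": "neoplasms",
    "cancer": "neoplasms",
    "immunotherapies": "immunotherapy",
}


def normalize_tokens(tokens, entry_map):
    """Replace each token that is a known entry term with its preferred heading.

    Tokens are lowercased first; tokens not found in the map are kept
    (in lowercase) so that out-of-vocabulary words are preserved.
    """
    return [entry_map.get(tok.lower(), tok.lower()) for tok in tokens]


def term_counts(tokens, entry_map):
    """Count terms after merging synonyms and variant forms."""
    return Counter(normalize_tokens(tokens, entry_map))


if __name__ == "__main__":
    doc = ["Tumor", "cancer", "PD-1", "immunotherapies"]
    print(normalize_tokens(doc, ENTRY_TERM_TO_HEADING))
    print(term_counts(doc, ENTRY_TERM_TO_HEADING))
```

Merging before counting means that variant spellings contribute to a single vocabulary entry, which is one plausible reason such a step would reduce redundant words within topics.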
Abstract (foreign language): |
Topic models are the main method for automatically identifying the topics of a subject area in library and information science. However, with the continuous deepening, rapid development, and cross-fertilization of research in various fields, existing topic models can hardly meet the demand for intelligent analysis of scientific and technological intelligence in the context of big data. Using scientific and technological literature to identify research topics efficiently, accurately, and in depth, and to present the knowledge structure of a field in a multi-dimensional way, is therefore of great practical significance.
This study mainly addresses the poor interpretability and low distinguishability of topics in current literature-based topic identification methods. Current topic identification models can only produce flat topic distributions that lack semantic associations. To overcome this defect, we optimize the topic model by integrating the semantic and hierarchical structure information of a medical knowledge organization system. Through hierarchical classification of topics, the semantic associations among the mixed words under each topic are sorted out and a hierarchical structure is established, which improves the interpretability and distinguishability of topics.
For the context-embedded clustering topic model, topic identification incorporating semantic information was carried out first, with the model using MeSH entry terms to merge synonyms and variant forms. Second, hierarchical topic classification incorporating the semantic hierarchy was carried out. For words contained in MeSH, the categories to which the main headings and subheadings belong, the semantic dependencies of the corresponding concepts, and the hierarchical structure were used to classify topics hierarchically. For out-of-vocabulary words, a relative distance was introduced on the basis of cosine similarity, and the words were re-clustered in spherical space to classify topics hierarchically and automatically. Finally, topics were merged according to the semantic information of main heading and subheading pairings in MeSH and the document co-occurrence constraints on topics.
For the neural topic model, topic identification incorporating semantic information was likewise carried out first, with the model using MeSH entry terms to assign words accurately to their categories. Second, hierarchical topic classification incorporating the semantic hierarchy was carried out by combining the hierarchical structure of MeSH with that of the SawETM model to construct a clear and complete semantic hierarchy; the semantic concepts and hypernym-hyponym relationships in the new hierarchy were used to guide the hierarchical modeling of topics.
The field of tumor immunotherapy was chosen for the empirical study, and the effect of model optimization was assessed from both quantitative and qualitative perspectives. Among the optimized context-embedded clustering models, the model based on the PubMedBERT+SK framework had the best overall performance. Multiple topic coherence and topic diversity metrics improved compared with the model before integration of the medical knowledge organization system; Cv, PUW, and WE-IRBO improved most notably, by 0.118, 0.253, and 0.099 respectively, and the model could identify five major research directions and their subdivisions more clearly. The model also outperformed three existing hierarchical topic models, with the most notable improvement in topic coherence: UMass, NPMI, and Cv improved by 1.517, 0.013, and 0.109 respectively over the best values among the three models. The model identified the domain knowledge structure more clearly; compared with the model before integration of the medical knowledge organization system, the identified topics had more diverse semantic information, closer semantic associations, and a clearer and more independent hierarchical structure. The model is suitable for interdisciplinary researchers who need to grasp the overall progress and knowledge structure of a field quickly.
After optimization of the neural topic model, multiple topic coherence and topic diversity metrics improved compared with the model before integration of the medical knowledge organization system; UMass, Cv, PUW, and WE-PD improved most notably, by 0.481, 0.111, 0.203, and 0.12 respectively. The model also improved on the model integrating the general-purpose WordNet vocabulary, most notably on UMass and PUW, by 0.481 and 0.263 respectively. Its performance also exceeded that of three existing hierarchical topic models, most notably in topic coherence: NPMI and Cv improved by 0.161 and 0.161 respectively over the best values among the three models. The model identified the domain knowledge structure more clearly. Compared with the model before integration of the medical knowledge organization system, it organized the mixed words under each topic clearly, sorted out the semantic relationships between words, and presented the formerly flat topic structure in a multi-dimensional way, yielding finer-grained topics with overlapping memberships that are more consistent with domain understanding. The model is suitable for helping researchers identify topics in complex, multidisciplinary domains and helps clarify the complex knowledge structure and intrinsic relationships of a discipline. |
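The abstract reports topic diversity gains on PUW without expanding the abbreviation. The sketch below assumes PUW denotes the proportion of unique words among the top-k words of all topics, which is how the abbreviation is commonly used in topic-model evaluation; the function name `proportion_unique_words` and the toy topics are illustrative assumptions and do not reproduce the thesis's evaluation code or data.

```python
def proportion_unique_words(topics, top_k=10):
    """Topic diversity as the share of distinct words among all topics' top_k words.

    `topics` is a list of topics, each a list of words ordered by weight.
    A value of 1.0 means no word is shared between topics; lower values
    mean more overlap between topics.
    """
    # Flatten the top_k words of every topic into one list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


if __name__ == "__main__":
    # Toy topics (assumption), not results from the thesis.
    toy_topics = [
        ["t cell", "pd-1", "tumor"],
        ["tumor", "car-t", "cytokine"],
    ]
    # Six words in total, five of them distinct ("tumor" is shared).
    print(proportion_unique_words(toy_topics, top_k=3))
```

Under this definition, a rise of 0.253 in PUW would mean that the share of non-repeated top words across topics grew by about 25 percentage points.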
Release date: | 2024-06-06 |