Thesis title (Chinese): | Research on term normalization methods based on large-scale medical ontology |
Name: | |
Thesis language: | chi |
Degree: | Doctoral |
Degree type: | Academic degree |
Institution: | Peking Union Medical College |
Department: | |
Major: | |
Supervisor: | |
Thesis completion date: | 2025-05-01 |
Thesis title (English): | Research on term normalization methods based on large-scale medical ontology |
Keywords (Chinese): | |
Keywords (English): | Medical Concept Normalization; Medical Entity Linking; Medical Entity Disambiguation; Pre-trained Language Model |
Abstract (Chinese): |
The healthcare domain is undergoing a digital transformation, and medical text data are accumulating on a massive scale. These data contain rich medical knowledge, but to fully realize their value and achieve mutual recognition, interoperability, and sharing of medical information, a sound standardized medical terminology system must be established. However, the standardized Chinese medical terminology system remains relatively underdeveloped, and the normalization of Chinese medical terms is an urgent problem. In response to national plans for building a standardized Chinese medical terminology system, this thesis explores feasible paths for normalizing Chinese medical terms, develops normalization algorithms for large-scale medical terminology, and applies them in practice.

This thesis first explores an implementation path for Chinese medical term normalization. A comparison of existing medical ontologies shows that the Unified Medical Language System (UMLS) has clear advantages in the number of concepts and the range of semantic types, and can fairly completely cover the concepts commonly used in the Chinese medical domain. This thesis therefore selects UMLS as the target ontology for Chinese medical term normalization and constructs a normalization evaluation dataset containing more than 30,000 terms across multiple semantic types. Since UMLS is primarily in English, the cross-lingual mapping problem from Chinese terms to English concepts must be solved. We designed two technical schemes, one based on multi-source translation and one based on mapping with cross-lingual pre-trained language models, and systematically evaluated existing medical term normalization algorithms, including string-based methods, semantic-representation-based methods, and hybrid methods that fuse string and semantic information. The results show that, combined with multi-source translation, a linear combination of the scores of a TF-IDF bag-of-words model and the SAPBERT model performs best and can effectively map Chinese medical terms to UMLS, demonstrating that this normalization path is feasible.

To bypass the limitations of translation tools and reduce the information loss introduced by translation, this thesis further develops TeaBERT, a translation-enhanced cross-lingual entity linking model. Using contrastive learning, the model performs self-alignment pre-training on concepts with English synonymous medical terms from UMLS and their Chinese translations, achieving direct mapping from Chinese medical terms to English concepts without relying on external translation tools. Experiments show that TeaBERT achieves ACC@5 accuracies of 92.54%, 87.14%, and 84.77% on the CHPO, ICD-10, and RealWorld-v2 evaluation datasets, respectively, the best performance to date on cross-lingual medical term normalization. To improve the model's practicality, we optimized its memory footprint and inference speed with PCA whitening, making it better suited to real-world deployment.

Building on TeaBERT, this thesis then designs a human-in-the-loop intelligent annotation workflow for terminology coding and develops the "TeaBERT Terminology Normalization Search" annotation platform, putting the Chinese medical term normalization method into practice. Using Chinese respiratory disease phenotype terms as test data, we compared purely manual annotation with algorithm-assisted annotation, while also verifying the effectiveness of the proposed normalization method. The results show that the UMLS-based cross-lingual normalization method can effectively encode 86.7% of the test terms. Compared with purely manual annotation, adding the TeaBERT model increased annotation speed by a factor of 3.2 and improved annotation consistency by 26%.

To address the semantic ambiguity encountered in medical term normalization, this thesis develops LinCBERT, a context-aware term-concept bi-encoder. The model uses contextual information from the PMC corpus and synonym knowledge from UMLS to cross-align concepts with terms and their contexts. In experiments on four normalization benchmarks, NCBI-disease, BC5CDR-disease, BC5CDR-chemical, and MedMentions, LinCBERT achieves ACC@5 scores of 95.52%, 94.01%, 97.79%, and 78.13%, respectively, the best results on all four. To assess the model's disambiguation ability in depth, we constructed the UMLS-PMC Homonyms test set, on which LinCBERT's ACC@1 accuracy exceeds that of other baseline models by more than 50%, demonstrating excellent semantic disambiguation ability.

In summary, this thesis proposes a feasible scheme and concrete algorithms for term normalization based on a large-scale medical ontology, solves the cross-lingual mapping and semantic disambiguation problems encountered during normalization, and verifies the practical value of the proposed algorithms for building a standardized Chinese medical terminology system. |
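As a toy illustration of the score-combination idea described in the abstract, the sketch below mixes a character-level bag-of-words similarity with a dense semantic score via a weighted sum. The helper names, the bigram features, the mock semantic scores, and the weight `alpha` are illustrative assumptions, not the thesis implementation (which uses TF-IDF weighting and SAPBERT embeddings).

```python
import math
from collections import Counter

def char_ngrams(term, n=2):
    # Character bigrams stand in for the "string-based" view of a term.
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def string_score(query, candidate):
    # Simplified stand-in for the TF-IDF BoW similarity
    # (raw term frequencies, no IDF weighting).
    return cosine(Counter(char_ngrams(query)), Counter(char_ngrams(candidate)))

def combined_score(query, candidate, semantic_score, alpha=0.5):
    # Linear combination of string and semantic scores; alpha is a
    # hypothetical weight, not a value taken from the thesis.
    return alpha * string_score(query, candidate) + (1 - alpha) * semantic_score

# Rank candidate concept names for a query term, given precomputed semantic
# similarities (e.g., cosine scores from an encoder such as SAPBERT).
semantic = {"lung cancer": 0.91, "lung abscess": 0.40}
ranked = sorted(
    semantic,
    key=lambda c: combined_score("lung carcinoma", c, semantic[c]),
    reverse=True,
)
```

In a real pipeline the semantic scores would come from encoding both the query and all candidate names, and the weight would be tuned on a development set.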
Abstract (English): |
The healthcare domain is undergoing a significant digital transformation, resulting in the rapid accumulation of biomedical text data. These data contain abundant medical knowledge; however, to fully harness their value and achieve interoperability, mutual recognition, and sharing of medical information, it is essential to establish a comprehensive standardized medical terminology system. The development of the Chinese medical terminology system remains relatively insufficient, making the normalization of Chinese medical terms an urgent challenge. Following national policies advocating the construction of a standardized Chinese medical terminology system, this study explores feasible approaches to Chinese medical concept normalization, develops normalization algorithms for large-scale medical terminology, and applies them in practice.

This study first investigates feasible approaches to normalizing Chinese medical terms. Through an analysis of existing medical ontologies, we identified the Unified Medical Language System (UMLS) as a comprehensive biomedical ontology that adequately covers the concepts commonly used in Chinese medical research and clinical practice. Based on this insight, we selected UMLS as the target ontology for Chinese medical term standardization and constructed evaluation datasets containing over 30,000 terms across multiple semantic types. To address the challenge of cross-lingual mapping from Chinese terms to English concepts, we designed two approaches: one based on multi-source translation techniques and another leveraging cross-lingual pre-trained language models (PLMs). We then evaluated existing medical term normalization algorithms, including string-based methods, semantic-based methods, and hybrid methods integrating both string and semantic information. Experimental results indicated that, combined with multi-source translation, a linear combination of TF-IDF bag-of-words and SAPBERT scores achieved accurate mapping from Chinese medical terms to UMLS concepts, demonstrating the feasibility of this normalization approach.

Building on this exploration, this study proposes TeaBERT, a translation-enhanced cross-lingual entity linking model. Employing contrastive learning, TeaBERT uses English synonyms from UMLS and their Chinese translations for self-alignment pre-training, enabling direct mapping from Chinese medical terms to English concepts without relying on external translation tools. Evaluations on three benchmark datasets (CHPO, ICD-10, and RealWorld-v2) showed that TeaBERT achieved state-of-the-art results, with ACC@5 scores of 92.54%, 87.14%, and 84.77%, respectively. To enhance the model's practical utility, we further optimized its memory usage and inference efficiency through PCA whitening, making it more suitable for real-world deployment.

To put the proposed normalization approach into practice, we designed an interactive human-in-the-loop annotation workflow for cross-lingual entity linking based on TeaBERT and developed a web-based platform named "TeaBERT Terminology Standardization Search". Using Chinese respiratory disease phenotype terms as test data, we compared purely manual annotation with algorithm-assisted annotation and evaluated the effectiveness of the proposed approach. The results showed that the UMLS-based cross-lingual normalization approach effectively encoded 86.7% of the tested Chinese medical terms. Integrating TeaBERT into the annotation workflow improved annotation speed by a factor of 3.2 and annotation consistency by 26%.

Finally, to address the semantic disambiguation challenge in medical entity linking, this research introduces LinCBERT, a context-aware term-concept bi-encoder. Leveraging contextual information from the PMC corpus and synonym knowledge from UMLS, LinCBERT cross-aligns concepts with terms and their contexts. Experiments on four standard benchmark datasets (NCBI-disease, BC5CDR-disease, BC5CDR-chemical, and MedMentions) showed that LinCBERT consistently achieved the best performance, with ACC@5 scores of 95.52%, 94.01%, 97.79%, and 78.13%, respectively. To further evaluate semantic disambiguation capability, we constructed the UMLS-PMC Homonyms dataset, on which LinCBERT outperformed baseline models by more than 50% in ACC@1 accuracy, validating its strong disambiguation ability.

In summary, this study addresses crucial issues in Chinese medical term normalization: it establishes a feasible approach for normalizing Chinese medical terms, validates the practical value of entity linking algorithms in building a standardized Chinese medical terminology system, and resolves the semantic disambiguation challenge in entity linking research. |
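The PCA-whitening compression used in the abstract to shrink the embedding index can be sketched roughly as follows; the dimensions, the random data, and the function name are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def pca_whiten(embeddings, k):
    # Project centered embeddings onto the top-k principal components and
    # rescale each component, shrinking the index and speeding up search.
    mu = embeddings.mean(axis=0, keepdims=True)
    X = embeddings - mu
    # SVD of the centered matrix yields the principal directions (rows of Vt).
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:k].T / S[:k]  # whitening projection, shape (d, k)
    return X @ W, mu, W

# Toy example: 100 term embeddings of dimension 32 compressed to 8.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 32))
reduced, mu, W = pca_whiten(emb, k=8)
```

At query time, a new embedding would be mapped with `(x - mu) @ W` before nearest-neighbour search, so only the k-dimensional vectors need to be stored.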
Open-access date: | 2025-06-11 |