Thesis Information

Thesis Title (Chinese):

 Research on a Knowledge Unit Recognition Method for Clinical Data Mining Literature Based on Large Language Models

Author:

 段一凡 (Duan Yifan)

Language:

 Chinese (chi)

Degree:

 Master's

Degree Type:

 Academic Degree

Institution:

 北京协和医学院 (Peking Union Medical College)

Department:

 北京协和医学院医学信息研究所 (Institute of Medical Information, Peking Union Medical College)

Major:

 Library, Information and Archives Management - Medical Informatics

Supervisor:

 钱庆 (Qian Qing)

Internal Supervisory Group Members (comma-separated):

 吴思竹 (Wu Sizhu)

Thesis Completion Date:

 2025-04-11    

Thesis Title (English):

 A Large Language Model–Based Approach for Knowledge Unit Recognition in Clinical Data Mining Literature    

Keywords (Chinese):

 Clinical data mining; large language models; information extraction; knowledge unit recognition; literature knowledge units

Keywords (English):

 Clinical Data Mining; Large Language Models; Information Extraction; Knowledge Unit Identification; Literature Knowledge Units

Abstract (Chinese):

With the rapid development of medical artificial intelligence and the growing complexity of clinical research, clinical data mining has gradually become an important means of driving innovation in medical research and optimizing clinical decision-making. Leading international journals such as Nature, The Lancet, and Cell have published a large body of research related to clinical data mining, and these papers contain a wealth of clinical data mining knowledge and practical experience. Quickly and accurately identifying and distilling the core knowledge units of clinical data mining research from the literature not only promotes the effective dissemination of innovative medical experience but also accelerates the translation of clinical research results into practical application, giving the task significant academic value and practical relevance.

However, the volume of literature on clinical data mining is growing rapidly and, compared with other fields, it contains many different types of knowledge content, which makes it difficult for researchers to efficiently read and understand the fine-grained knowledge in these papers. At the same time, existing knowledge unit recognition methods focus mainly on knowledge units in specific sections of scientific papers or on generic recognition across the literature, and are limited in both granularity and methodology; an accurate, automatic knowledge unit recognition method dedicated to clinical data mining literature is still lacking. A refined, fine-grained automatic recognition method tailored to the clinical data mining domain is therefore urgently needed to meet the practical demands of clinical research.

To address these problems, this study proposes a full-text knowledge unit recognition method for clinical data mining literature based on large language models, combining COT prompt engineering, QLoRA supervised fine-tuning, and data augmentation. The main work covers the following four aspects:

First, based on a systematic analysis and synthesis of authoritative international guidelines on how clinical data mining tasks are constructed and how the literature on this topic is organized, a fine-grained knowledge unit organization scheme suited to this research scenario was constructed. The scheme covers the complete clinical data mining lifecycle of research goal formulation, data collection, data preprocessing, feature engineering, model training, and model evaluation and validation, and further abstracts four primary knowledge units (population, data, method, and tool) and fourteen secondary knowledge units, thereby defining the core knowledge unit organization for clinical data mining literature.
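As a concrete illustration of how this organization scheme could be represented, the following is a minimal Python sketch, assuming one flat record per recognized unit; the field names and example values are hypothetical and are not taken from the thesis.

from dataclasses import dataclass

# The four primary knowledge unit categories named in the abstract; the fourteen
# secondary categories are not enumerated here, so they stay free-form strings.
PRIMARY_CATEGORIES = ("population", "data", "method", "tool")

@dataclass
class KnowledgeUnit:
    primary: str    # one of PRIMARY_CATEGORIES
    secondary: str  # one of the 14 secondary unit types defined in the thesis
    text: str       # literal span extracted from the full text
    stage: str      # lifecycle stage, e.g. "data collection" or "model evaluation"

# Hypothetical example of a recognized unit
unit = KnowledgeUnit(
    primary="data",
    secondary="data source",  # illustrative secondary label, not from the thesis
    text="electronic health records of ischemic heart disease admissions",
    stage="data collection",
)
print(unit)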

Second, to address the low-resource situation in which annotated corpora for this scenario are scarce, a model optimization approach combining prompt engineering, supervised fine-tuning, and data augmentation was designed on top of large language models to improve their ability to recognize fine-grained knowledge units. Comparative experiments with three prompting strategies, In-Context Learning (ICL), Chain-of-Thought (COT), and Self-Consistency COT (COT-SC), were first used to optimize the reasoning ability of the base models. Building on the best-performing COT prompting strategy, the recognition performance obtained after supervised fine-tuning with three parameter-efficient methods, LoRA, QLoRA, and Freeze, was then compared to further improve task adaptability, and QLoRA was identified as the best fine-tuning algorithm. Finally, on top of the best-performing COT prompting and QLoRA fine-tuning strategies, data augmentation was implemented by using the ChatGPT-3.5 model to generate semantically rich training samples and expand the dataset.
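To illustrate what a chain-of-thought prompt for this extraction task might look like, here is a minimal, hypothetical template; the wording, the build_cot_prompt helper, and the JSON output format are assumptions rather than the prompts used in the study.

# Hypothetical helper that assembles a chain-of-thought (COT) extraction prompt
# for one passage of a paper; the wording and JSON format are illustrative only.
def build_cot_prompt(passage: str) -> str:
    return (
        "You are reading a clinical data mining paper.\n"
        "Step 1: Decide which stage of the data mining lifecycle the passage describes "
        "(goal formulation, data collection, preprocessing, feature engineering, "
        "model training, or evaluation).\n"
        "Step 2: For that stage, list every knowledge unit of type population, data, "
        "method, or tool that the passage mentions.\n"
        "Step 3: Return the units as JSON objects with keys "
        '"primary", "secondary", and "text".\n'
        "Reason step by step before giving the final JSON.\n\n"
        f"Passage:\n{passage}\n"
    )

print(build_cot_prompt(
    "We trained a gradient boosting model on 12,000 admissions with ischemic heart disease..."
))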

Third, the method was implemented and validated mainly on clinical data mining literature related to ischemic heart disease. Five large language models, Llama3-8B, Qwen2-7B, Baichuan2-7B, DeepSeek-R1-8B, and Granite3.0-8B, were compared, and recognition performance was evaluated chiefly with the ROUGE-1, ROUGE-2, and ROUGE-L metrics. The study ultimately proposes a knowledge unit recognition method that combines the COT prompting strategy, the QLoRA supervised fine-tuning algorithm, and data augmentation.

Fourth, to further verify the robustness, applicability, and interpretability of the method, four evaluation experiments were designed on Granite3.0-8B, the model on which the proposed method performed best: benchmark testing of the test set against DeepSeek-R1, cross-disease recognition testing on literature about chronic obstructive pulmonary disease, accuracy testing of full-text recognition on single papers, and an assessment of the interpretability of the recognition results. The experimental results show that the proposed method achieves good recognition performance and generalization ability in fine-grained knowledge unit recognition and can effectively help researchers quickly understand and apply clinical data mining literature.

The novelty of this study lies in the fact that, unlike traditional knowledge unit recognition that covers only fragments such as abstracts or only individual types of knowledge units, it proposes a fine-grained knowledge unit recognition method that covers the full text of clinical data mining literature. The study also constructs a fine-grained knowledge unit organization scheme for this literature and builds a corresponding fine-grained knowledge unit dataset, providing data and theoretical support for follow-up research on related topics. In addition, the study explores in depth the applicability of prompt engineering, data augmentation, and supervised fine-tuning to information extraction with large language models, proposes an automatic knowledge unit recognition method for clinical data mining literature based on the COT prompting strategy, the QLoRA supervised fine-tuning algorithm, and data augmentation, and verifies the effectiveness of large language models for entity recognition and information extraction, offering a methodological reference and empirical support for their practical application.

Abstract (English):

With the rapid advancement of artificial intelligence technologies in medicine and the growing complexity of clinical research, clinical data mining has emerged as a pivotal approach for driving medical innovation and optimizing clinical decision-making. Leading journals, including Nature, Lancet, and Cell, have extensively published significant findings in the field of clinical data mining, offering abundant insights and practical experiences. Efficiently extracting and summarizing core knowledge units from these publications not only facilitates the dissemination of innovative medical practices but also accelerates the translation of research findings into clinical applications, underscoring its substantial academic value and practical relevance.

However, the exponential growth of literature in clinical data mining has introduced considerable challenges due to the complexity and diversity of detailed knowledge content specific to this domain, complicating researchers' efforts to efficiently comprehend and utilize such fine-grained information. Current methods for knowledge unit extraction primarily focus on identifying content from specific sections of scientific literature or employ generalized extraction approaches, both of which exhibit limitations in terms of precision and granularity. Thus, there is an urgent need for sophisticated, domain-specific, fine-grained automated identification methods tailored explicitly for clinical data mining literature.

To address these challenges, this study introduced a novel approach to fine-grained knowledge unit extraction from clinical data mining literature utilizing large language models (LLMs). The primary contributions of this research encompass four distinct aspects:

First, through systematic analysis and synthesis of internationally recognized guidelines related to clinical data mining task construction and literature content organization, this study formulated a fine-grained knowledge unit organizational framework tailored to clinical data mining contexts. This framework comprehensively covered the complete lifecycle of clinical data mining, including research goal formulation, data collection, data preprocessing, feature engineering, model training, and model evaluation and validation. It further abstracted four primary categories—population, data, methods, and tools—and fourteen secondary categories, effectively structuring the core knowledge units prevalent in clinical data mining literature.

Second, addressing the challenge of limited annotated corpora, the study designed a model optimization workflow integrating prompt engineering, supervised fine-tuning, and data augmentation techniques to enhance LLM performance in fine-grained knowledge unit identification. Specifically, comparative experiments with mainstream prompt engineering strategies, including In-Context Learning (ICL), Chain-of-Thought (COT), and Self-Consistency COT (COT-SC), were conducted to optimize the reasoning capabilities of base models. Subsequently, building on the most effective prompting strategy, supervised fine-tuning methods (LoRA, QLoRA, and Freeze) were comparatively analyzed to further improve task-specific adaptability. Finally, data augmentation was performed by using ChatGPT-3.5 to generate semantically rich synthetic training samples, expanding the dataset and enhancing model robustness.
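For orientation, the snippet below sketches a typical QLoRA setup with the Hugging Face transformers, peft, and bitsandbytes libraries: the base model is loaded with 4-bit NF4 quantization and only low-rank adapters are trained. The model identifier and hyperparameters (rank, alpha, target modules) are illustrative assumptions, not the configuration reported in this work.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "ibm-granite/granite-3.0-8b-instruct"  # assumed checkpoint id; substitute the model actually used

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Only the low-rank adapter matrices are trained; rank and target modules are illustrative
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the 8B parameters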

Third, the proposed method was empirically validated on ischemic heart disease-related clinical data mining literature, utilizing five prominent LLMs: Llama3-8B, Qwen2-7B, Baichuan2-7B, DeepSeek-R1-8B, and Granite3.0-8B. Evaluation was performed using the ROUGE-1, ROUGE-2, and ROUGE-L metrics, culminating in the development of an optimized extraction method combining the COT prompting strategy, QLoRA supervised fine-tuning, and data augmentation.
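ROUGE-1, ROUGE-2, and ROUGE-L measure unigram, bigram, and longest-common-subsequence overlap between a model output and a reference. A minimal sketch using the open-source rouge_score package is shown below; the reference and prediction strings are invented examples.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

# Invented reference/prediction pair for a single extracted knowledge unit
reference = "patients aged 40 to 75 hospitalized with ischemic heart disease"
prediction = "hospitalized ischemic heart disease patients aged 40 to 75"

for name, score in scorer.score(reference, prediction).items():
    print(f"{name}: precision={score.precision:.3f} "
          f"recall={score.recall:.3f} f1={score.fmeasure:.3f}")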

Fourth, to robustly assess model performance, applicability, and interpretability, four-dimensional evaluation experiments were conducted using the Granite3.0-8B model, identified as optimal. These experiments included benchmark testing against DeepSeek-R1, cross-disease applicability assessment on literature from chronic obstructive pulmonary disease (COPD), accuracy validation for single-document extraction, and interpretability assessment of extraction outcomes. Results demonstrated the method's strong performance and generalizability in fine-grained knowledge unit identification tasks, substantially enhancing researchers' ability to rapidly comprehend and utilize clinical data mining literature.

This study innovatively departed from traditional approaches limited to extracting knowledge units from abstract sections or general research outlines, instead proposing a comprehensive, full-text-based fine-grained knowledge extraction method specific to clinical data mining literature. Additionally, it established a specialized fine-grained knowledge unit organizational framework and corresponding dataset, providing critical theoretical and data-driven support for future research. Moreover, the extensive exploration and validation of prompt engineering, data augmentation, and supervised fine-tuning underscored the applicability of the COT prompting strategy, QLoRA supervised fine-tuning algorithm, and data augmentation within LLM-based information extraction tasks. These findings confirmed the efficacy of employing large language models in entity recognition and information extraction, offering a robust methodological reference and empirical validation for practical LLM applications.

Open Access Date:

 2025-06-12    
