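Thesis title (Chinese): | Research on Deep Learning-Based Methods for Predicting Virulence Factors of Pathogenic Bacteria |
Name: | |
Thesis language: | chi |
Degree: | Doctoral |
Degree type: | Academic degree |
Institution: | Peking Union Medical College |
Department: | |
Major: | |
Supervisor name: | |
On-campus advisory committee member names (comma-separated): | |
Thesis completion date: | 2020-05-01 |
Thesis title (foreign language): | A deep learning model for the prediction of bacterial virulence factors |
Keywords (Chinese): | |
Keywords (foreign language): | Bacterial Infectious Disease; Deep Learning; Convolutional Neural Network; Virulence Factors |
Abstract (Chinese): |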
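Infectious diseases remain a serious threat to public health security and one of the leading causes of human death worldwide. In recent years, although viral infectious diseases have caused enormous harm, bacterial infectious diseases cannot be ignored either. The pathogenic mechanisms of pathogenic bacteria have always been a focus of microbiological research. The pathogenicity of a bacterial pathogen is determined by the virulence factors it encodes, so studying virulence factors is key to elucidating pathogenic mechanisms. Virulence factors are also closely tied to the variation and evolution of pathogens: emerging and re-emerging bacterial infectious diseases are usually caused by the variation or horizontal transfer of virulence factors. The study of virulence factors is therefore also the foundation for preventing and controlling bacterial infectious diseases.
In recent years, with advances in sequencing technology and the development of microbial genomics, a large number of bacterial genomes have been sequenced. How to identify and predict potential virulence factors quickly and accurately from this mass of genomic data has become an important topic in bioinformatics research related to pathogen biology. The main current approach uses similarity-based alignment to find virulence factors whose sequences are relatively conserved. Traditional machine learning algorithms can also predict specific categories of virulence factors. However, no prediction method yet exists that covers all categories of virulence factors without depending on sequence similarity. Moreover, traditional machine learning algorithms rely on prior knowledge for manual feature extraction, which imposes certain limitations. Deep learning, an emerging branch of machine learning, can learn key features from raw data on its own and thereby make accurate predictions, and in recent years it has been applied successfully in many areas of biomedicine.
This project therefore builds on the database of bacterial virulence factors (VFDB), the largest of its kind in the world, which our laboratory has long maintained. Targeting key scientific problems in current identification and prediction methods, namely dependence on sequence similarity and the need for manual feature extraction, we use emerging deep learning algorithms as the basis for exploring and developing new deep learning-based methods for identifying and predicting virulence factors of pathogenic bacteria.
First, considering the effect of data volume on the training of machine learning models, we not only collected 24,739 virulence factor sequences from VFDB but also expanded them with complete genome sequence data from NCBI, producing a comprehensive virulence factor dataset of 160,495 sequences in 3,446 categories. We then built VFNet, a convolutional neural network model based on deep learning and designed specifically for classifying virulence factor sequences, and verified the soundness of the VFNet architecture through visualization of results and the necessity of dataset expansion through comparison of performance metrics. Next, we carried out a comprehensive performance comparison of VFNet against two recently published deep learning models and four traditional machine learning algorithms. On the dataset with sufficient samples (more than 10 sequences per virulence factor category), VFNet was both the most accurate and the fastest of the three deep learning models; compared with the traditional machine learning algorithms, VFNet combined automatically learned features with manually extracted features and also achieved the best classification performance, with an accuracy of 0.9831 and an F1 score of 0.9803. On the dataset with insufficient samples (10 or fewer sequences per category), VFNet still achieved the best results by making effective use of transfer learning in combination with manually extracted features, improving on the four traditional machine learning algorithms by 1%-13% in accuracy, precision and other performance metrics and by 1%-16% in F1 score. Finally, through comparative analysis with known protein family domains, we verified that VFNet can learn key sequence domain features, which provides a degree of biological explanation for its accurate classification of virulence factor data. In addition, we explored how highly similar sequences and the genomic origin of sequences affect the classification performance of VFNet, further verifying that the model generalizes well.
In summary, on the one hand, this project constructed a dataset of more than 160,000 sequences, currently the largest dataset of bacterial virulence factors, which provides a solid data foundation for research on virulence factor prediction. On the other hand, this project is the first to apply deep learning to the classification of all categories of bacterial virulence factors; it successfully created the convolutional neural network model VFNet and demonstrated the model's significant advantages over other machine learning methods in virulence factor classification. These results lay the groundwork for developing practical software that identifies and predicts all categories of bacterial virulence factors without depending on sequence similarity. |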
Abstract (foreign language): |
Bacterial infectious diseases pose a significant threat to public health worldwide. Despite recent advances in the prevention, diagnosis and treatment of bacterial infection, deciphering the molecular basis of bacterial pathogenicity remains a major focus of current microbiology. The pathogenicity of pathogenic bacteria depends on the virulence factors encoded in their genomes. Virulence factors (VFs) are the genetic elements that enable microbes to establish infection and cause disease in their hosts. Because emerging bacterial infectious diseases are usually caused by variant clones that have acquired additional VFs via horizontal gene transfer, a better understanding of bacterial VFs is also critical for more effective prevention and control of these diseases. With the recent development of next-generation sequencing technology and microbial genomics, a large number of bacterial genomes have been sequenced. Identifying and predicting potential VFs efficiently and accurately from this volume of genomic data has become a challenging task in bioinformatics. Sequence similarity-based alignment is the most popular approach for detecting potential VFs among closely related sequences. Traditional machine learning methods are also used to predict some categories of bacterial VFs, such as effectors of secretion systems. However, no alignment-independent method is yet available for identifying and predicting the full range of bacterial VFs. Traditional machine learning methods rely heavily on prior knowledge to extract predefined features for model training, whereas deep learning, a newer branch of machine learning, can learn expressive features from raw data automatically. Indeed, deep learning methods have been applied successfully in many areas of biomedicine in recent years.
In this study, we first extracted the bacterial VF dataset from the virulence factors database (VFDB), which covers 24,739 bacterial VF-related genes from 32 genera of bacterial pathogens. To collect more training data, we expanded the VFDB dataset with all complete genomes of the 32 bacterial genera available from NCBI, building a comprehensive dataset of 160,495 sequences from 3,446 VF categories. We then constructed a convolutional neural network model, VFNet, to classify bacterial VFs, and verified the soundness of the model structure and the necessity of the data expansion. Finally, we compared VFNet with two recently published deep learning models and four traditional machine learning algorithms. On the dataset with sufficient samples (more than 10 samples in each class), VFNet achieved the highest accuracy and the fastest training speed of the three deep learning models. Further, by incorporating predefined features, VFNet achieved the highest accuracy (0.9831) and F1-score (0.9803) in comparison with the traditional machine learning algorithms. On the dataset with insufficient samples (no more than 10 samples in each class), VFNet also achieved the best classification performance by using transfer learning, improving accuracy by 1%-13% and F1-score by 1%-16% over the four traditional machine learning algorithms. In addition, we showed that VFNet can recognize conserved protein domains, which provides a degree of biological interpretation for its accurate classification. We also explored the impact of high sequence similarity and the genome origin of sequences on the classification performance of VFNet, further verifying the good generalization ability of the model. In summary, we constructed the largest bacterial VF dataset to date, comprising more than 160,000 sequences, which provides a good resource for future research on VF prediction.
Furthermore, as the first attempt to apply deep learning algorithms to classify all categories of bacterial VFs, our convolutional neural network model (VFNet) showed significant advantages over other machine learning methods. The results presented here form a solid basis for the further development of alignment-free applications for identifying and predicting various bacterial VFs. |
Public release date: | 2020-06-09 |