Biotechnology Bulletin ›› 2025, Vol. 41 ›› Issue (9): 345-356. doi: 10.13560/j.cnki.biotech.bull.1985.2025-0082

• Research Report •

Computational Literature-based Knowledge Discovery for Soybean Coupling Traits

GUAN Zhi-hao1, SHAN Zhi-yi2,3, XIONG He1, ZHAO Rui-xue1,4

  1. Agricultural Information Institute of Chinese Academy of Agricultural Sciences, Beijing 100081
  2. National Science Library, Chinese Academy of Sciences, Beijing 100190
  3. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
  4. Key Laboratory of Knowledge Mining and Knowledge Services in Agricultural Converging Publishing, National Press and Publication Administration, Beijing 100081
  • Received: 2025-01-19 Published: 2025-09-26 Online: 2025-09-24
  • Corresponding author: ZHAO Rui-xue, female, Ph.D., research fellow, doctoral supervisor; research interests: agricultural information technology applications, knowledge organization and knowledge services; E-mail: zhaoruixue@caas.cn
  • About the author: GUAN Zhi-hao, female, Ph.D. candidate; research interest: intelligent agricultural knowledge services; E-mail: gzhzjk445@outlook.com
  • Funding:
    Science and Technology Innovation 2030: New Generation Artificial Intelligence Major Project (2021ZD0113700)

Abstract:

Objective To develop an automated model, based on computational analysis of the literature, for discovering soybean coupling traits, so that coupling traits can be identified accurately before field trials and their genetic networks analyzed. The approach addresses the limitations of traditional laboratory methods and offers a more efficient route to knowledge discovery for soybean breeding research. Method First, an annotation strategy for the soybean corpus was built on an authoritative domain ontology, and knowledge about soybean coupling traits was represented as Subject-Predicate-Object (SPO) semantic triples. Second, a semantic triple extraction model was constructed on the basis of a domain dictionary: the adapted positive-unlabeled learning (AdaPU) algorithm and the R-BERT algorithm (a pre-trained Transformer encoder for relation extraction) were used to automatically extract knowledge entities and their regulatory relationships from the soybean breeding literature, yielding a genetic regulatory network of soybean traits. Finally, connected subnetworks of coupling traits and reachable paths between trait nodes were mined from the network, verified by tracing them back to the source literature, and analyzed for their genetic mechanisms. Result The knowledge extraction model achieved a precision of 79.41%, a recall of 88.52%, and an F1 score of 83.72%. A total of 776 unique soybean trait knowledge triples were obtained, covering 33 gene concepts, 119 protein concepts, and 96 trait concepts. Of these, 478 triples expressed “associated with” relationships, 264 “up-regulation” relationships, and 34 “down-regulation” relationships. Six connected subnetworks of coupling traits were found in the soybean trait knowledge network, and 139 trait coupling paths were mined within the largest connected subnetwork, revealing complex associations among soybean traits and their potential molecular mechanisms. Conclusion This study confirms the feasibility of trait knowledge discovery based on large-scale literature. Using an automated model, it mined and verified soybean coupling traits and their genetic regulatory networks, uncovering the associated molecular mechanisms and providing breeding researchers with potential pleiotropic genes and trait associations that support experimental design and hypothesis generation. The structured trait knowledge generated by this study contributes to the development of knowledge graphs in soybean breeding and can serve as a reliable knowledge foundation for domain-specific large language models, facilitating the development and application of AI agents.
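The network-mining step described in the abstract (connected subnetworks and reachable paths between trait nodes) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that triples are stored as (subject, predicate, object) tuples, that edge direction is ignored when testing connectivity, and that a "coupling path" is the shortest path between two trait nodes. All entity names (`GeneA`, `ProteinX`, `trait_1`, and so on) are hypothetical placeholders, not results from the study.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical toy triples (subject, predicate, object); placeholder names only.
TRIPLES = [
    ("GeneA", "up-regulation", "ProteinX"),
    ("ProteinX", "associated with", "trait_1"),
    ("ProteinX", "down-regulation", "trait_2"),
    ("GeneB", "associated with", "trait_3"),
]
TRAITS = {"trait_1", "trait_2", "trait_3"}


def build_adjacency(triples):
    """Undirected adjacency map (assumption: direction is ignored for connectivity)."""
    adj = defaultdict(set)
    for subj, _pred, obj in triples:
        adj[subj].add(obj)
        adj[obj].add(subj)
    return adj


def connected_components(adj):
    """Return the connected components as a list of node sets, found by BFS."""
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        queue, comp = deque([start]), {start}
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in comp:
                    comp.add(nxt)
                    queue.append(nxt)
        seen |= comp
        components.append(comp)
    return components


def shortest_path(adj, src, dst):
    """Shortest path from src to dst by BFS, or None if dst is unreachable."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(adj[node]):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def coupling_paths(triples, traits):
    """Map each reachable pair of trait nodes to the shortest path linking them."""
    adj = build_adjacency(triples)
    paths = {}
    for a, b in combinations(sorted(traits), 2):
        if a not in adj or b not in adj:
            continue
        path = shortest_path(adj, a, b)
        if path is not None:
            paths[(a, b)] = path
    return paths


if __name__ == "__main__":
    print(connected_components(build_adjacency(TRIPLES)))
    print(coupling_paths(TRIPLES, TRAITS))
```

In this toy network, `trait_1` and `trait_2` are linked through the shared node `ProteinX`, which is the kind of intermediate gene or protein that a literature back-check would then examine. `trait_3` sits in a separate component and yields no path.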

Key words: knowledge extraction, knowledge discovery, semantic triples (SPO), soybean breeding, coupling traits
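The figures reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that the first reported metric (79.41%) is precision in the usual sense, so that F1 is the harmonic mean of that value and the recall. The function name `f1_score` is an illustrative choice, not part of the study's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract, in percent.
precision, recall = 79.41, 88.52
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2f}%")

# Relation counts as reported in the abstract; they should sum to the triple total.
relation_counts = {
    "associated with": 478,
    "up-regulation": 264,
    "down-regulation": 34,
}
total_triples = sum(relation_counts.values())
print(f"total triples = {total_triples}")
```

Under that assumption the computed F1 rounds to the reported 83.72%, and the three relation counts add up to the reported 776 triples.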