Biotechnology Bulletin ›› 2025, Vol. 41 ›› Issue (9): 345-356. doi: 10.13560/j.cnki.biotech.bull.1985.2025-0082

Computational Literature-based Knowledge Discovery for Soybean Coupling Traits

GUAN Zhi-hao1, SHAN Zhi-yi2,3, XIONG He1, ZHAO Rui-xue1,4

  1. Agricultural Information Institute of Chinese Academy of Agricultural Sciences, Beijing 100081
  2. National Science Library, Chinese Academy of Sciences, Beijing 100190
  3. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
  4. Key Laboratory of Knowledge Mining and Knowledge Services in Agricultural Converging Publishing, National Press and Publication Administration, Beijing 100081
  • Received: 2025-01-19 Online: 2025-09-26 Published: 2025-09-24
  • Contact: ZHAO Rui-xue E-mail: gzhzjk445@outlook.com; zhaoruixue@caas.cn

Abstract:

Objective This study aims to develop an automated model for discovering soybean traits through computational literature-based discovery. The goal is to identify soybean coupling traits accurately before field trials and to analyze their genetic networks. This approach addresses the limitations of traditional laboratory methods and offers a more efficient route to knowledge discovery for soybean breeding research.

Method First, an annotation strategy for the soybean corpus was constructed according to an authoritative domain ontology, and knowledge of soybean coupling traits was represented as Subject-Predicate-Object (SPO) semantic triples. Second, a semantic triple extraction model was built on a domain dictionary: the adapted positive-unlabeled learning (AdaPU) algorithm and the R-BERT algorithm (a pre-trained Transformer encoder for relation extraction) were used to automatically extract knowledge entities and their regulatory relationships from the soybean breeding literature, yielding the genetic regulatory networks of soybean traits. Finally, the connected subnetworks of coupling traits and the reachable paths between trait nodes in the network were mined, and a literature review was used to verify the results and analyze the genetic mechanisms.

Result The knowledge extraction model achieved a precision of 79.41%, a recall of 88.52%, and an F1 score of 83.72%. A total of 776 unique soybean trait knowledge triples were identified, encompassing 33 gene concepts, 119 protein concepts, and 96 trait concepts. Of these, 478 triples represented “associated with” relationships, 264 “up-regulation” relationships, and 34 “down-regulation” relationships. Within the soybean trait knowledge network, six connected subgraphs of coupling traits were discovered, and 139 trait coupling paths were found within the largest connected subnetwork.

Conclusion This study demonstrates the feasibility of trait knowledge discovery based on large-scale literature. By deeply mining knowledge units, it uncovers underlying coupling traits and their associated molecular mechanisms, providing plant breeding researchers with potential pleiotropic genes and coupling traits for experimental design and thereby improving the efficiency of hypothesis generation. The structured trait knowledge generated by this study contributes to the development of knowledge graphs in soybean breeding and can serve as a reliable knowledge foundation for domain-specific large language models, facilitating the development and application of AI agents.
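
The network-mining step described in the abstract (finding connected subnetworks of SPO triples and the reachable paths between trait nodes) might be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the triples, the entity names (`GeneA`, `ProteinX`, `TraitP`, and so on), and all function names are hypothetical placeholders, and treating the network as undirected is an assumption made here for simplicity.

```python
from collections import defaultdict

# Hypothetical SPO triples; entity names are placeholders, not results from the study.
TRIPLES = [
    ("GeneA", "up-regulation", "ProteinX"),
    ("ProteinX", "associated with", "TraitP"),
    ("ProteinX", "down-regulation", "TraitQ"),
    ("GeneB", "associated with", "TraitR"),
]
# Hypothetical set of nodes annotated as traits.
TRAITS = {"TraitP", "TraitQ", "TraitR"}


def build_graph(triples):
    """Build an undirected adjacency map; the predicate is kept as the edge label."""
    adj = defaultdict(dict)
    for subj, pred, obj in triples:
        adj[subj][obj] = pred
        adj[obj][subj] = pred
    return dict(adj)


def connected_components(adj):
    """Return the node set of each connected subnetwork (iterative depth-first search)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(n for n in adj[node] if n not in comp)
        seen |= comp
        components.append(comp)
    return components


def simple_paths(adj, source, target, path=None):
    """Yield every simple path (no repeated node) from source to target."""
    path = (path or []) + [source]
    if source == target:
        yield path
        return
    for nxt in adj[source]:
        if nxt not in path:
            yield from simple_paths(adj, nxt, target, path)


def coupling_paths(adj, traits):
    """Collect paths between each unordered pair of trait nodes in the same component."""
    found = []
    for comp in connected_components(adj):
        comp_traits = sorted(comp & traits)
        for i, a in enumerate(comp_traits):
            for b in comp_traits[i + 1:]:
                found.extend(simple_paths(adj, a, b))
    return found


graph = build_graph(TRIPLES)
components = connected_components(graph)
paths = coupling_paths(graph, TRAITS)
```

In this toy network, two traits count as coupled when some chain of gene or protein nodes links them within one connected subnetwork. Enumerating all simple paths can grow quickly on a dense graph, so a real network of several hundred triples would likely need a path-length cutoff.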

Key words: knowledge extraction, knowledge discovery, SPO, soybean breeding, coupling traits