CN-121996779-A - Method and system for searching approximate questions of K12 subject based on large language model


Abstract

The invention discloses a method and a system for retrieving approximate questions of K12 subjects based on a large language model, and relates to the field of approximate question retrieval. A question sequence is obtained and a knowledge tree is constructed. The large language model is modified based on the knowledge tree. The question sequence is classified and cut to obtain a plurality of subtopics and corresponding topic class values. A knowledge point sequence is obtained from the plurality of subtopics and the corresponding topic class values through the modified large language model. Approximate question retrieval is then performed based on the knowledge point sequence to obtain retrieved questions. The technical effect is a knowledge-point-oriented structured model with weighted and vectorized retrieval capabilities, which enables controllable approximate question retrieval.

Inventors

LI WENFENG
ZHOU JIN
SONG CHUNWEN
ZHANG LI

Assignees

北京简单科技有限公司

Dates

Publication Date: 20260508
Application Date: 20260203

Claims (10)

1. A method for retrieving approximate questions of K12 subjects based on a large language model, comprising: acquiring a question sequence, wherein the question sequence represents the sequence obtained by encoding the sentences of a question; constructing a knowledge tree; modifying a large language model based on the knowledge tree; classifying and cutting based on the question sequence to obtain a plurality of subtopics and corresponding topic class values, wherein each topic class value represents the level of the corresponding subtopic on the knowledge tree; obtaining a knowledge point sequence based on the plurality of subtopics and the corresponding topic class values through the modified large language model, wherein the knowledge point sequence comprises a plurality of ordered knowledge points; and performing approximate question retrieval based on the knowledge point sequence to obtain retrieved questions.
2. The method for retrieving approximate questions of K12 subjects based on a large language model according to claim 1, wherein the modifying a large language model based on the knowledge tree comprises: the large language model comprises a normalization structure, a feedforward neural network and a multi-head attention mechanism; constructing n i-th feedforward neural networks and n-1 i-th knowledge tree control structures according to the number of layers of the knowledge tree; wherein the input of the (i+1)-th feedforward neural network is the output of the i-th feedforward neural network and the output of the (i+1)-th knowledge tree control structure; and replacing the feedforward neural network in the large language model with the connected n i-th feedforward neural networks and the n-1 i-th knowledge tree control structures to obtain the modified large language model.
3. The method for retrieving approximate questions of K12 subjects based on a large language model according to claim 2, wherein the i-th knowledge tree control structure is used for storing the subtopics corresponding to the nodes of the i-th level of the knowledge tree, i being a positive integer greater than or equal to 1 and less than or equal to n; and the output of the first knowledge tree control structure serves as an input to the first feedforward neural network.
4. The method for retrieving approximate questions of K12 subjects based on a large language model according to claim 1, wherein the classifying and cutting based on the question sequence to obtain a plurality of subtopics and corresponding topic class values comprises: traversing the topic vector and searching for keywords to obtain keyword positions; extracting features from the words in the topic vector to obtain a plurality of topic feature vectors, wherein the topic feature vectors represent the features of the words in the topic vector; detecting, based on the plurality of topic feature vectors, the features of the words related to the keywords to obtain a topic keyword feature matrix, wherein the topic keyword feature matrix represents the overall features of the words related to the keywords in the topic vector; inputting the topic keyword feature matrix into a one-dimensional convolutional neural network and detecting the level of the corresponding knowledge point on the knowledge tree to obtain a topic class value; and arranging the words corresponding to the topic keyword feature matrix in order to obtain a subtopic.
5. The method for retrieving approximate questions of K12 subjects based on a large language model according to claim 1, wherein the method for training the modified large language model comprises: acquiring a plurality of training subtopics, corresponding training topic class values and a labeled knowledge point sequence, wherein the labeled knowledge point sequence represents the labeled knowledge points corresponding to the training subtopics, and one output value corresponds to the number of segmented subtopics; inputting the plurality of training subtopics into the corresponding i-th feedforward neural network in the large language model according to the training topic class values to obtain a training knowledge point sequence, wherein the labeled knowledge point sequence and the training knowledge point sequence are of fixed length; calculating the loss between the training knowledge point sequence and the labeled knowledge point sequence to obtain a first loss value; calculating the loss over the values at the indices corresponding to the plurality of training subtopics in the training knowledge point sequence and the labeled knowledge point sequence to obtain a second loss value; adding the first loss value and the second loss value to obtain a loss value; and training the modified large language model according to the loss value.
6. The method for retrieving approximate questions of K12 subjects based on a large language model according to claim 5, wherein the obtaining a knowledge point sequence based on the plurality of subtopics and the corresponding topic class values through the modified large language model comprises: inputting the plurality of subtopics into the corresponding i-th feedforward neural network in the large language model according to the topic class values to obtain the knowledge point sequence.
7. The method for retrieving approximate questions of K12 subjects based on a large language model according to claim 1, wherein the performing approximate question retrieval based on the knowledge point sequence to obtain retrieved questions comprises: detecting similar knowledge points through the knowledge tree to obtain similar knowledge point sets, wherein each similar knowledge point set comprises a plurality of similar knowledge points, the similar knowledge points being knowledge points similar to one knowledge point in the knowledge point sequence, and each knowledge point in the knowledge point sequence corresponds to one similar knowledge point set; acquiring stored questions and corresponding stored knowledge point sequences, wherein a stored question is a question stored in a database and a stored knowledge point sequence is the knowledge point sequence corresponding to that stored question; constructing a similar knowledge point matrix from the knowledge point sequence and the corresponding similar knowledge point sets, wherein the columns of the similar knowledge point matrix correspond to the indices of the knowledge point sequence, and the rows hold the knowledge points of the knowledge point sequence and their similar knowledge points; judging whether the knowledge points at corresponding indices of the similar knowledge point matrix and a stored knowledge point sequence are the same; and if they are the same, using the question corresponding to that stored knowledge point sequence as a retrieved question.
8. The method for retrieving approximate questions of K12 subjects based on a large language model according to claim 4, wherein the detecting, based on the plurality of topic feature vectors, the features of the words related to the keywords to obtain a topic keyword feature matrix comprises: acquiring an adjacent first keyword, second keyword and third keyword, wherein the second keyword is the keyword between the first keyword and the third keyword; and stacking the topic feature vectors corresponding to the words between the first keyword and the third keyword to form the topic keyword feature matrix.
9. The method for retrieving approximate questions of K12 subjects based on a large language model according to claim 1, wherein the method for obtaining the knowledge tree comprises: acquiring a plurality of stored knowledge points, wherein the stored knowledge points are stored knowledge points of K12 subjects; and constructing the knowledge tree according to the plurality of stored knowledge points and the association relations among them, wherein the intermediate nodes of the knowledge tree represent composite knowledge points and the leaf nodes represent atomic knowledge points.
10. A system for retrieving approximate questions of K12 subjects based on a large language model, comprising: an acquisition module for acquiring a question sequence, wherein the question sequence represents the sequence obtained by encoding the sentences of a question; a knowledge tree module for constructing a knowledge tree; a large language model module for modifying a large language model based on the knowledge tree; a segmentation module for classifying and cutting based on the question sequence to obtain a plurality of subtopics and corresponding topic class values, wherein each topic class value represents the level of the corresponding subtopic on the knowledge tree; a knowledge point sequence module for obtaining a knowledge point sequence based on the plurality of subtopics and the corresponding topic class values through the modified large language model, wherein the knowledge point sequence comprises a plurality of ordered knowledge points; and an approximate question detection module for performing approximate question retrieval based on the knowledge point sequence to obtain retrieved questions.
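
For illustration only, the retrieval step recited in claim 7 can be sketched in Python as follows. The sketch is not part of the claims. The tree contents, the stored questions, and the names `PARENT`, `similar_points`, `build_similarity_columns` and `retrieve` are hypothetical, and treating knowledge points that share a parent node as "similar" is an assumption, since the claims do not state how similarity is detected on the knowledge tree.

```python
# Hypothetical knowledge tree stored as child -> parent.
# Leaves are atomic knowledge points; inner nodes are composite ones.
PARENT = {
    "linear equations": "algebra",
    "quadratic equations": "algebra",
    "triangles": "geometry",
    "circles": "geometry",
    "algebra": "math",
    "geometry": "math",
}


def similar_points(point):
    """Assumed similarity rule: knowledge points sharing the same parent node."""
    parent = PARENT.get(point)
    if parent is None:
        return set()
    return {p for p, par in PARENT.items() if par == parent and p != point}


def build_similarity_columns(sequence):
    """Column j holds knowledge point j together with its similar knowledge points."""
    return [{kp} | similar_points(kp) for kp in sequence]


def retrieve(sequence, stored):
    """Return stored questions whose knowledge points match index by index."""
    columns = build_similarity_columns(sequence)
    hits = []
    for question, stored_seq in stored.items():
        # Sequences are assumed to be of the same fixed length.
        if len(stored_seq) == len(columns) and all(
            kp in col for kp, col in zip(stored_seq, columns)
        ):
            hits.append(question)
    return hits


# Hypothetical stored questions with their stored knowledge point sequences.
STORED = {
    "Q1": ["quadratic equations", "circles"],
    "Q2": ["triangles", "linear equations"],
    "Q3": ["linear equations", "triangles"],
}

matches = retrieve(["linear equations", "triangles"], STORED)
print(matches)
```

Under these assumptions, Q2 is rejected because its knowledge points appear at the wrong indices, which shows that the comparison is positional rather than a plain set overlap.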

Description

Method and system for searching approximate questions of K12 subject based on large language model

Technical Field

The invention relates to the field of approximate question retrieval, and in particular to a method and a system for retrieving approximate questions of K12 subjects based on a large language model.

Background

In K12 online education, intelligent homework systems and personalized learning platforms, quickly and accurately finding, for a given question, approximate questions that examine the same knowledge points, use similar solution methods, or match in difficulty is a core and challenging task. In K12 educational scenarios, teachers and intelligent question-setting systems often need to quickly find approximate questions for a target test question, in order to support learning by analogy, personalized practice, and reuse in teaching and research.
Traditional approaches to approximate question retrieval rely primarily on keyword matching or on text retrieval methods based on word frequency and inverted indexes (e.g., TF-IDF or BM25). These methods measure similarity only from word-frequency statistics, whereas the same knowledge point can appear under different wording, different question stems, or interdisciplinary phrasing. When the keywords do not overlap, existing methods easily miss relevant questions, so recall is clearly insufficient.
In addition, a question is typically associated with multiple knowledge points, and these knowledge points stand in a primary-secondary relationship. Keyword methods cannot weight core and secondary knowledge points, so the ranking of the search results is unreasonable and the knowledge structure and weights are difficult to reflect. As a result, the search results are inaccurate.

Disclosure of Invention

The invention aims to provide a method and a system for retrieving approximate questions of K12 subjects based on a large language model, in order to solve the above problems in the prior art.
In a first aspect, an embodiment of the present invention provides a method for retrieving approximate questions of K12 subjects based on a large language model, including: acquiring a question sequence, wherein the question sequence represents the sequence obtained by encoding the sentences of a question; constructing a knowledge tree; modifying a large language model based on the knowledge tree; classifying and cutting based on the question sequence to obtain a plurality of subtopics and corresponding topic class values, wherein each topic class value represents the level of the corresponding subtopic on the knowledge tree; obtaining a knowledge point sequence based on the plurality of subtopics and the corresponding topic class values through the modified large language model, wherein the knowledge point sequence comprises a plurality of ordered knowledge points; and performing approximate question retrieval based on the knowledge point sequence to obtain retrieved questions.
Optionally, the modifying the large language model based on the knowledge tree includes: the large language model comprises a normalization structure, a feedforward neural network and a multi-head attention mechanism; constructing n i-th feedforward neural networks and n-1 i-th knowledge tree control structures according to the number of layers of the knowledge tree; wherein the input of the (i+1)-th feedforward neural network is the output of the i-th feedforward neural network and the output of the (i+1)-th knowledge tree control structure; and replacing the feedforward neural network in the large language model with the connected n i-th feedforward neural networks and the n-1 i-th knowledge tree control structures to obtain the modified large language model.
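The chained wiring described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the layer sizes, the random weights, the use of element-wise addition to combine the two inputs of each network, and the names `make_ffn`, `chained_ffn` and `controls` are all hypothetical, because the text does not specify how the output of the previous network and the output of the control structure are merged.

```python
import random


def make_ffn(dim, hidden, rng):
    """Build a tiny feed-forward network: linear -> ReLU -> linear."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def ffn(x):
        h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
        return [sum(w * v for w, v in zip(row, h)) for row in w2]

    return ffn


def chained_ffn(x, ffns, control_outputs):
    """Run n chained networks with n-1 control-structure outputs.

    Network i+1 receives the output of network i combined with the output
    of control structure i+1. Combining by addition is an assumption.
    """
    if len(control_outputs) != len(ffns) - 1:
        raise ValueError("expected n-1 control outputs for n networks")
    out = ffns[0](x)
    for ffn, ctrl in zip(ffns[1:], control_outputs):
        merged = [a + b for a, b in zip(out, ctrl)]
        out = ffn(merged)
    return out


rng = random.Random(0)
n_levels, dim = 3, 4  # hypothetical: a knowledge tree with 3 layers
ffns = [make_ffn(dim, 8, rng) for _ in range(n_levels)]
# Placeholder control-structure outputs for levels 2..n.
controls = [[0.1] * dim for _ in range(n_levels - 1)]
result = chained_ffn([1.0, 0.0, -1.0, 0.5], ffns, controls)
print(len(result))
```

In this sketch the chain stands in for the single feedforward block of a transformer layer, so its input and output keep the same dimension and the surrounding normalization and attention components would not need to change.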
Optionally, the i-th knowledge tree control structure is used for storing the subtopics corresponding to the nodes of the i-th level of the knowledge tree, i being a positive integer greater than or equal to 1 and less than or equal to n; and the output of the first knowledge tree control structure serves as an input to the first feedforward neural network.
Optionally, the classifying and cutting based on the question sequence to obtain a plurality of subtopics and corresponding topic class values includes: traversing the topic vector and searching for keywords to obtain keyword positions; extracting features from the words in the topic vector to obtain a plurality of topic feature vectors, wherein the topic feature vectors represent the features of the words in the topic vector; detecting, based on the plurality of topic feature vectors, the features of the words related to the keywords to obtain a topic keyword feature matrix, wherein the topic keyword feature matrix represents the overall features of the words related to the keywords in the topic vector; inputting the topic keyword feature matrix into a one-dimens