CN-121979989-A - Retrieval-augmented generation method, system, device, and storage medium based on process reward optimization
Abstract
The application relates to a retrieval-augmented generation method, system, device, and storage medium based on process reward optimization. The method comprises: decomposing a query question input by a user to obtain the sub-query question of the current sub-query; retrieving the sub-query question of the current sub-query in a preset database to obtain a plurality of recall documents; calculating the result reward of the current sub-query; and, once the result reward of the current sub-query is greater than a preset reward threshold, repeatedly executing the obtaining of the intermediate answer of the current sub-query question, the calculation and correction judgment of the result reward, and the further decomposition of the query question until a preset query termination condition is met, then outputting the target answer of the query question. In the scheme provided by the application, the query question is decomposed, the result reward corresponding to each sub-query is calculated, and the query process is optimized based on these rewards, so that intermediate answers produced along the way can be verified and corrected, erroneous answers are prevented from accumulating in subsequent query steps, and query accuracy is improved.
Inventors
- ZHAO XING
Assignees
- 深圳市跨越速运有限公司 (Shenzhen Kuayue Express Co., Ltd.)
Dates
- Publication Date
- 20260505
- Application Date
- 20260123
Claims (10)
- 1. A retrieval-augmented generation method based on process reward optimization, comprising: decomposing a query question input by a user to obtain the sub-query question of the current sub-query; retrieving the sub-query question of the current sub-query in a preset database to obtain a plurality of recall documents; calculating the result reward of the current sub-query according to the sub-query question, the recall documents, and the intermediate answer of the sub-query question obtained by retrieval over the recall documents; if the result reward of the current sub-query is smaller than a preset reward threshold, correcting the current sub-query until its result reward is greater than the preset reward threshold; when the result reward of the current sub-query is greater than the preset reward threshold, continuing to decompose the query question in combination with the intermediate answer of the current sub-query to obtain the sub-query question of the next sub-query, which then serves as the sub-query question of the current sub-query, until a preset query termination condition is met and the target answer of the query question is output; and, during the query process, constructing an objective optimization function containing the result reward of each sub-query, so as to optimize the intermediate answer of each sub-query with the goal of maximizing the expected reward of each sub-query.
- 2. The method of claim 1, wherein the objective optimization function is: \(J(\pi) = \max_{\pi}\, \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{k=0}^{T-t} \gamma^{k} R_{t+k} \,\middle|\, s_t\right]\), wherein \(J(\pi)\) is the average of the sum of the result rewards corresponding to all sub-queries; \(\max_{\pi}\) denotes solving for the maximum value of \(J(\pi)\); \(\mathbb{E}_{\tau \sim \pi}[\cdot]\) denotes following the policy \(\pi\) over all possible search paths \(\tau\) that obtain a target answer and taking the average; \(R_{t+k}\) is the result reward obtained by executing the sub-query at time \(t+k\), corresponding to the result reward of the current sub-query; \(s_t\) represents the search state at time \(t\) and comprises the current reasoning context, the historical sub-queries, and the intermediate answers; \(T\) is the ending time of the query process of the query question; \(\gamma\) is the discount factor; and \(k\) is the step difference from the current time \(t\) to the future time \(t+k\).
- 3. The method of claim 1, wherein correcting the current sub-query comprises: determining the search path corresponding to the intermediate answer of the current sub-query as an error path, performing path pruning on the search path, and then retrieving again to obtain the intermediate answer of the current sub-query; or retrieving the current sub-query question in the preset database again, and obtaining the intermediate answer of the current sub-query based on the plurality of newly obtained recall documents.
- 4. The method of claim 2, wherein the result reward comprises a relevance reward and a confidence reward, wherein the relevance reward is calculated according to the following formula: \(R_{\mathrm{rel}} = \mathrm{sim}(q_t, D_t)\), where \(R_{\mathrm{rel}}\) is the relevance reward, \(q_t\) is the sub-query question of the current sub-query, and \(D_t\) is the set of recall documents retrieved for the current sub-query; and the confidence reward is calculated according to the following formula: \(R_{\mathrm{conf}} = 1 - H\!\left(p(\cdot \mid s_t)\right)\), where \(R_{\mathrm{conf}}\) is the confidence reward, \(H(\cdot)\) is the entropy function, and \(p(\cdot \mid s_t)\) is the probability distribution of the output in state \(s_t\).
- 5. The method of claim 1, wherein the preset query termination condition comprises: every sub-query question obtained by decomposing the query question has obtained an intermediate answer and the result reward of every sub-query is greater than the preset reward threshold; or the number of sub-query steps reaches a preset maximum.
- 6. The method of claim 1, further comprising: if the result reward of the current sub-query is greater than the preset reward threshold, determining the search path corresponding to the intermediate answer of the current sub-query as a dominant search path.
- 7. The method of claim 1, wherein retrieving the sub-query question of the current sub-query in a preset database to obtain a plurality of recall documents comprises: vectorizing the sub-query question to obtain a first query vector; retrieving, from the preset vector database, a plurality of second query vectors whose similarity to the first query vector is greater than a preset similarity threshold; and sorting the second query vectors by similarity and, after format conversion, outputting a preset number of recall documents.
- 8. A retrieval-augmented generation system based on process reward optimization, comprising a retrieval enhancement module, a large model, and a reward calculation module, wherein: the large model is configured to decompose the query question input by a user into the sub-query question of each sub-query, obtain the intermediate answer of the sub-query question of the current sub-query based on the recall documents returned by the retrieval enhancement module, decompose the sub-query question of the next sub-query when the result reward of the current sub-query fed back by the reward calculation module is greater than a preset reward threshold, or correct the intermediate answer of the current sub-query when that result reward is smaller than the preset reward threshold, so as to obtain the target answer of the query question; the retrieval enhancement module is configured to retrieve in a preset database according to the sub-query question fed back by the large model and return a plurality of recall documents related to the sub-query question to the large model; and the reward calculation module is configured to calculate, for each sub-query, the result reward by combining the sub-query question with the corresponding recall documents and intermediate answer, judge whether the result reward is greater than the preset reward threshold, and feed the result back to the large model.
- 9. An electronic device, comprising: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method of any one of claims 1-7.
- 10. A computer-readable storage medium having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-7.
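The iterative loop across claims 1, 3, 5, and 6 (decompose, retrieve, score, correct below-threshold sub-queries, advance above-threshold ones) can be sketched as follows. This is an illustrative reading of the claims, not the patent's implementation: every function name (`decompose_next`, `retrieve`, `answer`, `score`) and every default parameter is a hypothetical stand-in supplied by the caller.

```python
def process_reward_rag(query, retrieve, answer, score, decompose_next,
                       reward_threshold=0.7, max_steps=5, max_retries=3):
    """Iteratively decompose `query`, retrieve for each sub-query, score the
    intermediate answer, and retry (correct) any sub-query whose result
    reward falls below `reward_threshold`."""
    answers = []                                    # intermediate answers so far
    sub_query = decompose_next(query, answers)      # sub-query of the current sub-query
    for _ in range(max_steps):                      # termination: max step count
        if sub_query is None:                       # termination: nothing left to ask
            break
        for _ in range(max_retries):                # correction loop
            docs = retrieve(sub_query)              # recall documents
            ans = answer(sub_query, docs)           # intermediate answer
            reward = score(sub_query, docs, ans)    # result reward
            if reward > reward_threshold:           # dominant search path: keep it
                break
            # otherwise: prune this path and re-retrieve / re-answer
        answers.append(ans)
        sub_query = decompose_next(query, answers)  # next sub-query, given answers
    return answers[-1] if answers else None         # target answer
```

With stub components, the loop simply walks the sub-queries in order and returns the last intermediate answer as the target answer.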
Description
Retrieval-Augmented Generation Method, System, Device, and Storage Medium Based on Process Reward Optimization

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a retrieval-augmented generation method, system, device, and storage medium based on process reward optimization.

Background

Retrieval-augmented generation technology generally comprises two processing stages: a retrieval stage, which converts the user query into a vector and performs semantic search in a pre-constructed knowledge base to obtain relevant text segments; and an augmented generation stage, which splices the retrieval results with the original query and inputs them as additional context into a large language model to generate the final answer. This processing mode effectively alleviates the hallucination problem that arises when the model relies only on its internal parameterized knowledge, and is well suited to factual, descriptive questions. However, in complex-question scenarios, errors easily arise in the answers retrieved by the large model, particularly when several sub-questions must be chained in series or logically combined: retrieval precision decays as question complexity grows, an intermediate verification link is lacking, and the one-way retrieve-then-generate path lets erroneous answers from earlier steps directly affect subsequent reasoning steps, so that the quality of the final answer drops significantly.

Disclosure of Invention

In order to solve or partially solve the problems in the related art, the application provides a retrieval-augmented generation method, system, device, and storage medium based on process reward optimization.
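As a rough illustration of the two-stage retrieve-then-generate pipeline described in the background, the sketch below embeds the query, takes the top-k most similar passages from a pre-embedded knowledge base, and splices them onto the query as context for the model. The embedding function, knowledge base, and language model are all hypothetical stand-ins passed in as plain callables; no specific library is assumed.

```python
def dot(a, b):
    """Inner product of two equal-length vectors (similarity score)."""
    return sum(x * y for x, y in zip(a, b))

def basic_rag(query, embed, knowledge_base, llm, top_k=3):
    """Stage 1: embed the query and rank pre-embedded (vector, text) pairs
    by similarity. Stage 2: splice the top-k passages onto the original
    query as additional context and hand the prompt to the model."""
    qv = embed(query)
    scored = sorted(knowledge_base,
                    key=lambda item: dot(qv, item[0]), reverse=True)
    context = "\n".join(text for _, text in scored[:top_k])
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)
```

The one-way nature of this pipeline is exactly what the background criticizes: nothing between the two stages verifies the retrieved context before it shapes the answer.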
The first aspect of the application provides a retrieval-augmented generation method based on process reward optimization, which comprises: decomposing a query question input by a user to obtain the sub-query question of the current sub-query; retrieving the sub-query question of the current sub-query in a preset database to obtain a plurality of recall documents; calculating the result reward of the current sub-query according to the sub-query question, the recall documents, and the intermediate answer of the sub-query question obtained by retrieval over the recall documents; if the result reward of the current sub-query is smaller than a preset reward threshold, correcting the current sub-query until its result reward is greater than the preset reward threshold; when the result reward of the current sub-query is greater than the preset reward threshold, continuing to decompose the query question in combination with the intermediate answer of the current sub-query to obtain the sub-query question of the next sub-query; repeatedly executing the obtaining of sub-query questions, the calculation of result rewards, and the correction judgment until a preset query termination condition is met, and outputting the target answer of the query question; and, during the query process, constructing an objective optimization function containing the result reward of each sub-query, so as to optimize the intermediate answer of each sub-query with the goal of maximizing the expected reward of each sub-query.
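The quantity inside the objective optimization function of the first aspect is a discounted sum of per-step result rewards along one search trajectory, whose average over trajectories the policy maximizes. A minimal sketch of that discounted sum, with illustrative reward values and discount factor (not taken from the patent):

```python
def discounted_return(rewards, gamma=0.9):
    """Computes sum over k of gamma**k * R[t+k]: the discounted sum of
    result rewards along a single search trajectory. The objective J(pi)
    maximizes the average of this quantity over all trajectories."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```

With `gamma` below 1, rewards earned by earlier sub-queries weigh more than later ones, which pushes the policy to get early decomposition steps right rather than relying on late corrections.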
With reference to the first aspect, in a possible implementation manner of the first aspect, the objective optimization function is: \(J(\pi) = \max_{\pi}\, \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{k=0}^{T-t} \gamma^{k} R_{t+k} \,\middle|\, s_t\right]\), wherein \(J(\pi)\) is the average of the sum of the result rewards corresponding to all sub-queries; \(\max_{\pi}\) denotes solving for the maximum value of \(J(\pi)\); \(\mathbb{E}_{\tau \sim \pi}[\cdot]\) denotes following the policy \(\pi\) over all possible search paths \(\tau\) that obtain a target answer and taking the average; \(R_{t+k}\) is the result reward obtained by executing the sub-query at time \(t+k\), corresponding to the result reward of the current sub-query; \(s_t\) represents the search state at time \(t\) and comprises the current reasoning context, the historical sub-queries, and the intermediate answers; \(T\) is the ending time of the query process of the query question; \(\gamma\) is the discount factor; and \(k\) is the step difference from the current time \(t\) to the future time \(t+k\). With reference to the first aspect, in one possible implementation manner of the first aspect, correcting the current sub-query comprises: determining the search path corresponding to the intermediate answer of the current sub-query as an error path, performing path pruning on the search path, and then retrieving again to obtain the intermediate answer of the current sub-query; or retrieving the current sub-query in the preset database again, and obtaining the intermediate answer of the current sub-query based on the plurality of newly obtained recall documents. With reference to the first aspect, in a possible implementation manner of the first aspect, the result reward includes a relevance reward and a confidence reward, wherein the relevance reward is calculated according to the following formula: \(R_{\mathrm{rel}} = \mathrm{sim}(q_t, D_t)\), where \(R_{\mathrm{rel}}\) is the relevance reward
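The two components of the result reward (a relevance term over the retrieved documents and a confidence term over the output distribution) can be sketched as follows. The patent's exact formulas are not reproduced in this text, so these are plausible readings based on the symbol definitions: relevance as mean cosine similarity between the sub-query vector and the recall-document vectors, and confidence as one minus the normalized entropy of the output distribution. All function names and the normalization choice are assumptions.

```python
import math

def relevance_reward(q_vec, doc_vecs):
    """sim(q_t, D_t): mean cosine similarity between the sub-query vector
    and the recall-document vectors (one plausible reading of the formula)."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    return sum(cos(q_vec, d) for d in doc_vecs) / len(doc_vecs)

def confidence_reward(probs):
    """1 - H(p(.|s_t)): high when the output distribution is peaked.
    Entropy is normalized by log(n) here so the reward lies in [0, 1];
    that normalization is an assumption, not stated in the patent."""
    n = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - h / math.log(n)
```

A peaked distribution (the model is sure of its answer) yields a confidence reward near 1, while a uniform distribution yields a reward near 0, matching the role this term plays in gating whether a sub-query must be corrected.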