CN-122019719-A - Language model training method, language model and research report generating method

CN122019719A

Abstract

The invention provides a language model training method, a language model, and a research report generation method in the technical field of artificial intelligence. The method comprises: performing cold-start fine-tuning of a model on reference trajectories containing explicit action labels, thereby constructing a single-model, multi-role cyclic framework; and performing writing reinforcement training based on multi-dimensional quality scores to improve the quality of long-form generation. The invention internalizes multi-agent interaction logic into the in-context reasoning of a single model. Through a staged curriculum-learning strategy, it resolves the trade-off between search and writing capability and the lack of effective supervision signals for long-form generation found in traditional methods, effectively suppresses model hallucination, and markedly improves the logical depth, evidence traceability, and content credibility of deep research reports.

Inventors

  • XIA PENG
  • XU FEIYANG
  • LI XIN
  • WANG SHIJIN
  • LIU CONG
  • HU GUOPING

Assignees

  • iFLYTEK CO., LTD. (科大讯飞股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-01-29

Claims (19)

  1. A language model training method, comprising: acquiring a cold-start training data set, wherein the cold-start training data set comprises a plurality of multi-agent interaction samples, each multi-agent interaction sample comprises a query input and a corresponding reference trajectory, and each reference trajectory comprises explicit action labels identifying different role behaviors; acquiring a search task training data set, wherein the search task training data set comprises a plurality of question-answer samples and each question-answer sample comprises a question input and a standard answer label; training the pre-trained model with the question input as model input and the standard answer label as a first feedback signal, to obtain a search optimization model; and training the search optimization model with complex query samples as model input and a scoring result determined from the long-form report generated by the model as a second feedback signal, to obtain a target language model.
  2. The language model training method of claim 1, wherein acquiring the cold-start training data set comprises: acquiring a multi-round interaction record for any query request; extracting task planning content, search query instructions, search result content, and final reply content from the interaction record; adding a planning label before the task planning content, a search label before the search query instruction, an observation label before the search result content, and a writing label before the final reply content; and splicing the labeled parts into the reference trajectory in the chronological order of the interaction record.
  3. The language model training method of claim 2, wherein splicing the labeled parts into the reference trajectory comprises: placing the planning label before the search label in the reference trajectory; placing the observation label immediately after the search label to form a search-observation pair; and splicing the writing label after the search-observation pair; wherein the reference trajectory comprises at least one search-observation pair.
  4. The language model training method of claim 2, wherein the final reply content extracted from the interaction record comprises an article outline and a text draft, and splicing the labeled parts into the reference trajectory further comprises: adding an outline label before the article outline; adding a draft label before the text draft; and placing the article outline with the outline label before the text draft with the draft label in the reference trajectory.
  5. The language model training method of claim 2, wherein training the pre-trained model to obtain the search optimization model comprises: feeding the question input into the pre-trained model to generate a search prediction sequence comprising search query instructions and search result content; identifying the search result content wrapped in observation labels within the search prediction sequence; and, when computing the policy gradient for updating the model parameters, masking the token positions corresponding to the search result content, so that the gradient is computed only over the parts of the search prediction sequence other than the search result content.
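The masking in claim 5 amounts to a standard loss mask: tokens between the observation tags come from the search engine, not from the policy, so they should not receive gradient. A minimal sketch, assuming illustrative `<observe>`/`</observe>` tag tokens and a token-level mask (the real system would operate on tokenizer IDs inside a training framework):

```python
# Sketch of observation-token masking for the policy-gradient loss.
# Tokens inside <observe>...</observe> are environment-provided, so they
# get mask 0 and contribute nothing to the gradient.

def observation_mask(tokens):
    """Return a 0/1 mask per token: 1 = token contributes to the loss."""
    mask, in_obs = [], False
    for tok in tokens:
        if tok == "<observe>":
            in_obs = True
            mask.append(0)      # the tag itself is treated as environment text here
        elif tok == "</observe>":
            in_obs = False
            mask.append(0)
        else:
            mask.append(0 if in_obs else 1)
    return mask

tokens = ["<plan>", "find", "data", "</plan>",
          "<observe>", "result", "text", "</observe>",
          "<write>", "report", "</write>"]
mask = observation_mask(tokens)
# The masked loss would then be: sum(m * nll for m, nll in zip(mask, per_token_nll))
```

Whether the tag tokens themselves are masked is a design choice not fixed by the claim; here they are masked along with the observation body.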
  6. The language model training method of claim 2, wherein training the search optimization model further comprises: when the model generates the writing label and the subsequent long-form prediction result, detecting whether the long-form prediction result contains a reference anchor; verifying whether the reference anchor points to search result content within an observation label; and reducing the value of the second feedback signal if the long-form prediction result contains no reference anchor or the reference anchor does not point to search result content.
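The anchor check in claim 6 can be sketched as a reward-shaping rule. The anchor syntax `[ref:N]`, the `citation_penalty` helper, and the multiplicative penalty factor are all illustrative assumptions; the claim only requires that the second feedback signal be reduced for missing or dangling anchors.

```python
import re

# Sketch of the citation check: a report must cite reference anchors, and
# every anchor must resolve to a collected observation; otherwise the
# writing reward is reduced. Anchor syntax [ref:N] is an assumption.

def citation_penalty(report, observations, base_reward, penalty=0.5):
    """observations: dict mapping anchor id -> search result content."""
    anchors = re.findall(r"\[ref:(\d+)\]", report)
    if not anchors:
        return base_reward * penalty          # no evidence cited at all
    if any(a not in observations for a in anchors):
        return base_reward * penalty          # dangling anchor
    return base_reward

obs = {"1": "Market report 2024", "2": "Analyst note"}
ok = citation_penalty("Growth was strong [ref:1].", obs, 1.0)
bad = citation_penalty("Growth was strong [ref:7].", obs, 1.0)
```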
  7. The language model training method of claim 1, wherein using the standard answer label as the first feedback signal comprises: extracting the final answer segment from the search prediction result generated by the pre-trained model; computing an exact-match score or an F1 score between the final answer segment and the standard answer label; and determining the exact-match score or the F1 score as the first feedback signal.
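The two first-feedback signals in claim 7 are standard QA metrics. A minimal sketch of token-level F1 and exact match, assuming whitespace tokenization and case-insensitive comparison (the claim does not fix the normalization):

```python
# Sketch of the first feedback signal: exact match or token-level F1
# between the model's final answer segment and the standard answer label.

def exact_match(prediction, reference):
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def f1_score(prediction, reference):
    pred, ref = prediction.lower().split(), reference.lower().split()
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred:                 # count overlapping tokens with multiplicity
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

em = exact_match("Paris", " paris ")
f1 = f1_score("the city of paris", "paris")
```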
  8. The language model training method of claim 1, wherein using the scoring result determined from the long-form report generated by the model as the second feedback signal comprises: sampling a plurality of candidate long-form reports with the search optimization model for the same complex query sample; inputting the complex query sample and the candidate long-form reports into a preset evaluation model; generating a relative dominance score for each candidate long-form report with the evaluation model according to preset scoring rules; and taking the relative dominance score as the second feedback signal.
  9. The language model training method of claim 8, wherein generating the relative dominance score for a candidate long-form report comprises: selecting one of the candidate long-form reports as a reference report and taking the rest as comparison reports; inputting the reference report and a comparison report into the evaluation model together for pairwise comparison, to judge the winning probability of the comparison report relative to the reference report; and normalizing the winning probability to obtain the relative dominance score.
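Claim 9's pairwise scheme can be sketched with the judge stubbed out. The `relative_dominance` helper, the stub judge, and normalizing by the sum of win probabilities are all assumptions for illustration; the patent does not specify the normalization and the real judge is the preset evaluation model.

```python
# Sketch of relative dominance scoring: each comparison report is judged
# pairwise against one reference report, and the win probabilities are
# normalized to relative dominance scores. The judge is a stub callable
# returning P(candidate beats reference).

def relative_dominance(candidates, judge, ref_index=0):
    ref = candidates[ref_index]
    probs = {}
    for i, cand in enumerate(candidates):
        if i == ref_index:
            continue
        probs[i] = judge(cand, ref)       # win probability vs. the reference
    total = sum(probs.values()) or 1.0
    return {i: p / total for i, p in probs.items()}   # normalized scores

# Stub judge for demonstration only: longer report "wins" in proportion
# to its length (a real judge would score content quality).
judge = lambda a, b: len(a) / (len(a) + len(b))
scores = relative_dominance(["short", "a much longer report", "medium one"], judge)
```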
  10. The language model training method of claim 8, wherein the preset scoring rules cover at least one of content comprehensiveness, logical depth, instruction compliance, and text readability; the content comprehensiveness characterizes whether a candidate long-form report covers all sub-questions of the complex query sample; the logical depth characterizes whether the candidate long-form report includes inferential analysis or trend prediction; the instruction compliance characterizes whether the candidate long-form report meets a preset format requirement; and the text readability characterizes the language fluency and paragraph structure of the candidate long-form report.
  11. The language model training method of claim 1, wherein training the search optimization model comprises iteratively performing the following steps until a preset stopping condition is reached: controlling the search optimization model to sample a group of long-form prediction results for the same complex query sample; computing the second feedback signal for each long-form prediction result in the group; computing the mean of all the second feedback signals as a baseline; computing the difference between each second feedback signal and the baseline to obtain an advantage value; and updating the model parameters of the search optimization model using the advantage values.
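The baseline in claim 11 is a group-relative one, similar in spirit to group-based policy-gradient baselines: sample several reports for the same query, score each, and subtract the group mean. A minimal sketch of just the advantage computation (the surrounding sampling and parameter update are omitted):

```python
# Sketch of the group-relative advantage: the baseline is the mean of the
# second feedback signals within one sampled group, and each trajectory's
# advantage is its reward minus that baseline.

def group_advantages(rewards):
    baseline = sum(rewards) / len(rewards)    # mean as the baseline
    return [r - baseline for r in rewards]

# Four sampled reports for one complex query, each already scored.
advantages = group_advantages([0.9, 0.5, 0.7, 0.3])
# Positive advantage -> reinforce that trajectory; negative -> suppress it.
```

Advantages within a group sum to zero by construction, which keeps the update centered without a learned value function.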
  12. A language model, wherein the language model is trained by the language model training method of any one of claims 1 to 11 and comprises a plurality of functional role modules sharing a single set of model parameters: a planning module for receiving query input and generating planning labels and task decomposition information; a search interaction module for generating a search label and query terms based on the planning label and receiving external search results encapsulated in observation labels; and a writing generation module for generating a long-form report containing writing labels and reference anchors based on the external search results in the observation labels; wherein the planning module, the search interaction module, and the writing generation module operate within a single context window and exchange data end to end by streaming the explicit action label sequence.
  13. The language model of claim 12, further comprising: a critique feedback module configured to review the long-form report generated by the writing generation module and produce critique labels and revision suggestions; wherein the writing generation module is further configured to regenerate a revised long-form report based on the revision suggestions in response to the critique label.
  14. The language model of claim 12, wherein the search interaction module is further configured to generate search labels multiple times in succession within the context window and correspondingly receive a plurality of observation labels, forming an alternating search-observation sequence; and the writing generation module is configured to perform comprehensive analysis based on all search result data in the search-observation sequence.
  15. A research report generation method, comprising: receiving a natural language query request from a user; invoking a target language model trained by the language model training method of any one of claims 1 to 11; obtaining, within a single context window, the explicit action label stream output by the target language model in response to the query request, wherein the explicit action label stream comprises sequentially generated planning labels, search labels, observation labels, and writing labels; in response to a search label, calling an external search engine to obtain search results, encapsulating the search results in an observation label, and feeding them back to the target language model; and extracting the long-form report generated by the target language model after the writing label as the research report output.
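The inference loop of claim 15 can be sketched as a simple host-side driver. Everything concrete here is an assumption for illustration: the tag syntax, the `run_report` helper, and the stubbed model and search engine stand in for the target language model and the external search engine.

```python
import re

# Sketch of the single-context inference loop: the model streams explicit
# action tags; when a <search> tag appears, the host calls a search engine
# and appends the result wrapped in <observe>; a <write> tag ends the loop
# and its content is the research report.

def run_report(model_step, search_engine, query, max_turns=8):
    context = query
    for _ in range(max_turns):
        chunk = model_step(context)            # next segment of model output
        context += chunk
        m = re.search(r"<search>(.*?)</search>", chunk, re.S)
        if m:                                   # tool call: fetch and inject results
            result = search_engine(m.group(1))
            context += f"<observe>{result}</observe>"
            continue
        m = re.search(r"<write>(.*?)</write>", chunk, re.S)
        if m:
            return m.group(1)                   # the extracted research report
    return None                                 # turn budget exhausted

# Stubs: a scripted "model" and a trivial "search engine" for demonstration.
script = iter(["<plan>one step</plan><search>q1</search>",
               "<write>Report body [ref:1]</write>"])
report = run_report(lambda ctx: next(script), lambda q: f"results for {q}", "query")
```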
  16. The research report generation method of claim 15, wherein the long-form report comprises reference anchors, and the research report generation method further comprises: parsing the reference anchors and establishing a hyperlink mapping between each reference anchor and the search result source in the corresponding observation label; and providing the hyperlink mapping to support evidence tracing when outputting the research report.
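The hyperlink mapping of claim 16 is essentially anchor resolution plus link rendering. The `[ref:N]` anchor syntax, the `sources` dict, and the Markdown-style link output are illustrative assumptions:

```python
import re

# Sketch of evidence tracing: resolve each reference anchor in the report
# to the source URL captured with its observation, then render anchors as
# hyperlinks in the final output.

def hyperlink_map(report, sources):
    """sources: anchor id -> source URL recorded with the observation."""
    mapping = {}
    for anchor in re.findall(r"\[ref:(\d+)\]", report):
        if anchor in sources:
            mapping[anchor] = sources[anchor]
    return mapping

def render_links(report, mapping):
    # Unresolved anchors fall back to '#' rather than breaking the output.
    return re.sub(r"\[ref:(\d+)\]",
                  lambda m: f"[{m.group(1)}]({mapping.get(m.group(1), '#')})",
                  report)

mapping = hyperlink_map("A [ref:1] B [ref:2]", {"1": "http://a", "2": "http://b"})
rendered = render_links("A [ref:1] B [ref:2]", mapping)
```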
  17. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the language model training method of any one of claims 1 to 11 or the research report generation method of claim 15 or 16.
  18. A non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the language model training method of any one of claims 1 to 11 or the research report generation method of claim 15 or 16.
  19. A computer program product comprising a computer program which, when executed by a processor, implements the language model training method of any one of claims 1 to 11 or the research report generation method of claim 15 or 16.

Description

Language model training method, language model and research report generating method

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a language model training method, a language model, and a research report generation method.

Background

With the growing capability of large language models (LLMs), demand for deep-research applications of the search-analysis-writing kind is increasing. Such applications require a model to behave like a human researcher on open, complex problems: autonomously planning tasks, performing multiple rounds of information search and evidence reading and comprehension, and finally writing a long research report that is logically rigorous, detailed, and compliant with the given instructions. To meet these requirements, the prior art generally adopts retrieval-augmented generation (RAG) or multi-agent collaboration (multi-agent system). A RAG scheme typically retrieves relevant documents, splices them directly into the prompt, and generates an answer in a single pass. A multi-agent scheme deploys several models with independent functions (such as a planning model, a search model, and a writing model), orchestrates the workflow through message queues or middleware, and has the models divide the work to complete complex tasks. For model training, supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF) is typically employed to enhance the general conversational capability of the model. However, these prior-art solutions still have limitations in practical applications.
On the one hand, a traditional multi-agent system depends on the engineering combination of several independent models, so the system consumes substantial resources, has long reasoning chains, and incurs high latency; and because the parameters of each model are independent, end-to-end joint optimization of the whole retrieve-then-write process by unified gradient descent is difficult. On the other hand, under existing training paradigms the retrieval capability and long-form writing capability of a model are hard to reconcile: simple retrieval-augmented training does not teach the model when to stop retrieving or how to discard false information in complex scenarios, and because fine-grained quality supervision signals for long-form generation are lacking, the long reports generated by the model tend to exhibit loose logical structure, core arguments without evidence support, or retrieved content pasted in verbatim, making high-quality deep research output difficult to produce.

Disclosure of the Invention

The invention provides a language model training method, a language model, and a research report generation method to overcome the defects of the prior art, namely: systems that depend on multi-model engineering combinations are complex and hard to optimize end to end; retrieval and writing capabilities are hard to balance because their training objectives conflict; and the lack of long-form quality supervision signals leads to low report quality. The invention unifies multi-role capability within a single model and markedly improves the logical depth and content credibility of deep research reports.
The invention provides a language model training method comprising the following steps: acquiring a cold-start training data set, wherein the cold-start training data set comprises a plurality of multi-agent interaction samples, each multi-agent interaction sample comprises a query input and a corresponding reference trajectory, and each reference trajectory comprises explicit action labels identifying different role behaviors; acquiring a search task training data set, wherein the search task training data set comprises a plurality of question-answer samples and each question-answer sample comprises a question input and a standard answer label; training the pre-trained model with the question input as model input and the standard answer label as a first feedback signal, to obtain a search optimization model; and training the search optimization model with complex query samples as model input and a scoring result determined from the long-form report generated by the model as a second feedback signal, to obtain a target language model. According to the language model training method provided by the invention, acquiring the cold-start training data set comprises: acquiring a multi-round interaction record for any query request; extracting task planning content, search query instructions, search result content, and final reply content …