
CN-121982614-A - Training method and related device for large video model

CN121982614A

Abstract

The application discloses a training method and a related device for a video large model, relating to the technical field of video recognition, and comprising the following steps: performing frame acquisition on training video data to obtain image frames, and inputting preset prompt words, user questions and the image frames into an image large model to obtain a thinking chain and question answers. The video large model is cold-started based on the thinking chain and the question answers, so that it acquires the capability of outputting a thinking chain, and spatial disorder data, time disorder data and thinking chain data are generated from the training video data and the questions. The three types of data are input into the model to obtain corresponding outputs; the accuracy of each output is calculated to obtain spatial and time accuracy reward values, and these are combined with a thinking chain consistency reward value to train the model through a group relative policy optimization algorithm, obtaining a reasoning video large model. The application enables the trained video large model to reason on the basis of a thinking chain.
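The reward combination and group relative policy optimization step summarized above can be sketched as follows. This is a minimal illustration only: the equal weighting of the three reward signals and the function names are assumptions for the sketch, not values taken from the application.

```python
import numpy as np

def grpo_advantages(rewards):
    """Normalize a group of scalar rewards into advantages by subtracting
    the group mean and dividing by the group standard deviation, as in
    group relative policy optimization (GRPO)."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std == 0:
        # All rewards equal: no preference signal within the group.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

def total_reward(spatial_acc, temporal_acc, cot_consistency,
                 w=(1.0, 1.0, 1.0)):
    """Combine the three reward signals described in the abstract.
    The weights `w` are illustrative, not specified by the patent."""
    return w[0] * spatial_acc + w[1] * temporal_acc + w[2] * cot_consistency
```

In a GRPO setup, each sampled response in a group would receive such a combined reward, and the normalized advantages would weight the policy-gradient update, typically under a KL divergence constraint to a reference model.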

Inventors

  • JIANG LE
  • CHENG HONGQIANG

Assignees

  • AsiaInfo Technologies (China), Inc.

Dates

Publication Date
2026-05-05
Application Date
2026-01-30

Claims (10)

  1. A training method for a video large model, comprising: performing a frame acquisition operation on training video data to obtain image frames; inputting a preset prompt word, a user question and the image frames into an image large model to obtain a thinking chain and a question answer, wherein the preset prompt word is used for guiding the image large model's analysis of the image frames, and the user question is a natural language question posed by a user about the training video data; performing a cold-start operation on the video large model to be trained based on the thinking chain and the question answer, wherein the cold-start operation is used for giving the video large model to be trained the capability of outputting a thinking chain; generating spatial disorder data, time disorder data and thinking chain data based on the training video data and the user question, wherein the spatial disorder data is video data lacking partial image data, the partial image data being image data whose similarity to the user question is greater than a threshold value, and the time disorder data is video data in which the order of the video frames of the training video data has been shuffled; inputting the training video data, the spatial disorder data and the time disorder data respectively into the video large model to be trained to obtain normal output data, spatial disorder output data and time disorder output data; calculating the data accuracy of the normal output data, the spatial disorder output data and the time disorder output data respectively, obtaining a spatial accuracy reward value and a time accuracy reward value based on the data accuracies, and obtaining a thinking chain consistency reward value based on the thinking chain data; and training the video large model to be trained with a group relative policy optimization algorithm based on the spatial accuracy reward value, the time accuracy reward value and the thinking chain consistency reward value to obtain a reasoning video large model.
  2. The training method for a video large model according to claim 1, wherein performing the frame acquisition operation on the training video data to obtain image frames comprises: if the length of the training video data is greater than K seconds, performing the frame acquisition operation on the training video data at a first acquisition rate, wherein the first acquisition rate is the ratio of the number of image frames acquired from the training video data to the total number of image frames contained in the training video data; and if the length of the training video data is not more than K seconds, performing the frame acquisition operation on the training video data at a second acquisition rate, wherein the second acquisition rate is likewise the ratio of the number of acquired image frames to the total number of image frames contained in the training video data, and the second acquisition rate is lower than the first acquisition rate.
  3. The training method for a video large model according to claim 1, wherein generating the spatial disorder data based on the training video data and the user question comprises: dividing each video frame of the training video data into D sub-images; calculating the cosine similarity between each sub-image of each video frame and the text features of the user question; and removing the sub-images whose cosine similarity is greater than a threshold value to obtain the spatial disorder data.
  4. The training method for a video large model according to claim 1, wherein calculating the data accuracy of the normal output data, the spatial disorder output data and the time disorder output data respectively comprises: comparing the normal output data, the spatial disorder output data and the time disorder output data respectively with standard output data of the training data to obtain the data accuracy of the normal output data, the spatial disorder output data and the time disorder output data.
  5. The training method for a video large model according to claim 1, wherein obtaining the spatial accuracy reward value and the time accuracy reward value based on the data accuracies and obtaining the thinking chain consistency reward value based on the thinking chain data comprises: generating a spatial accuracy reward value if the data accuracy of the normal output data is greater than that of the spatial disorder output data; generating a time accuracy reward value if the data accuracy of the normal output data is greater than that of the time disorder output data; identifying, with a large language model, whether repeated content or logic errors exist in the thinking chain data; and generating a thinking chain consistency reward value if neither repeated content nor logic errors exist.
  6. The training method for a video large model according to claim 1, wherein training the video large model to be trained with a group relative policy optimization algorithm based on the spatial accuracy reward value, the time accuracy reward value and the thinking chain consistency reward value to obtain a reasoning video large model comprises: calculating, for each of the spatial accuracy reward value, the time accuracy reward value and the thinking chain consistency reward value, the normalized difference between the reward value and the mean of the respective reward values; and training the video large model based on the normalized differences and a KL divergence constraint function.
  7. The training method for a video large model according to claim 2, wherein performing the frame acquisition operation on the training video data at the first acquisition rate comprises: if the frame rate of the training video data is less than 1 frame per second, acquiring all image frames of the training video data; and if the frame rate of the training video data is not less than 1 frame per second, acquiring image frames of the training video data at a rate of 1 frame per second.
  8. A computer program product comprising computer readable instructions which, when run on an electronic device, cause the electronic device to implement the training method for a video large model according to any one of claims 1 to 7.
  9. An electronic device comprising at least one processor and a memory coupled to the processor, wherein: the memory is used for storing a computer program; and the processor is configured to execute the computer program to enable the electronic device to implement the training method for a video large model according to any one of claims 1 to 7.
  10. A computer storage medium carrying one or more computer programs which, when executed by an electronic device, enable the electronic device to implement the training method for a video large model according to any one of claims 1 to 7.
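The spatial disorder construction in claim 3 can be sketched as follows. This is a minimal sketch under stated assumptions: `encode_patch` stands in for an unspecified feature extractor (e.g. a CLIP-style image encoder) supplied by the caller, and the grid size and similarity threshold are illustrative placeholders, not values fixed by the application.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def make_spatial_disorder_frame(frame, question_feat, encode_patch,
                                d=4, threshold=0.5):
    """Split a frame into a d x d grid of sub-images and zero out those
    whose feature is too similar to the question's text feature, removing
    the question-relevant image content (per claim 3).
    `encode_patch` is a hypothetical caller-supplied function mapping a
    patch to a feature vector."""
    h, w = frame.shape[:2]
    ph, pw = h // d, w // d
    out = frame.copy()
    for i in range(d):
        for j in range(d):
            patch = frame[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            if cosine_similarity(encode_patch(patch), question_feat) > threshold:
                # Remove this sub-image: it is too similar to the question.
                out[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = 0
    return out
```

Applied to every frame of a training clip, this yields the "spatial disorder" variant of the video whose answer accuracy is compared against that of the unmodified clip when computing the spatial accuracy reward.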

Description

Training method and related device for large video model

Technical Field

The application relates to the technical field of video recognition, and in particular to a training method and a related device for a video large model.

Background

With the development of computer vision and deep learning technology, large models are widely applied in fields such as video understanding and video question answering. In the prior art, video data is usually subjected to frame extraction and feature alignment before being input into a video large model for analysis and content generation; the training target of the video large model mainly focuses on the accuracy of the final answer, and existing training methods do not construct an effective thinking-reasoning training mechanism for video scenes. A training method for a video large model is therefore needed.

Disclosure of Invention

In view of the above problems, the present application provides a training method and a related device for a video large model, so as to effectively build a thinking-reasoning training mechanism for video scenes.
The specific scheme is as follows. A first aspect of the application provides a training method for a video large model, comprising the following steps: performing a frame acquisition operation on training video data to obtain image frames; inputting preset prompt words, user questions and the image frames into an image large model to obtain a thinking chain and question answers, wherein the preset prompt words are used for guiding the image large model's analysis of the image frames, and the user questions are natural language questions about the training video data posed by users; performing a cold-start operation on the video large model to be trained based on the thinking chain and the question answers, wherein the cold-start operation is used for giving the video large model to be trained the capability of outputting a thinking chain; generating spatial disorder data, time disorder data and thinking chain data based on the training video data and the user questions, wherein the spatial disorder data is video data lacking partial image data, the partial image data being image data whose similarity to the user questions is greater than a threshold value, and the time disorder data is video data in which the order of the video frames of the training video data has been shuffled; inputting the training video data, the spatial disorder data and the time disorder data respectively into the video large model to be trained to obtain normal output data, spatial disorder output data and time disorder output data; calculating the data accuracy of the normal output data, the spatial disorder output data and the time disorder output data respectively, obtaining a spatial accuracy reward value and a time accuracy reward value based on the data accuracies, and obtaining a thinking chain consistency reward value based on the thinking chain data; and training the video large model to be trained with a group relative policy optimization algorithm based on the spatial accuracy reward value, the time accuracy reward value and the thinking chain consistency reward value to obtain a reasoning video large model.

Optionally, performing the frame acquisition operation on the training video data to obtain image frames includes: if the length of the training video data is greater than K seconds, performing the frame acquisition operation on the training video data at a first acquisition rate, wherein the first acquisition rate is the ratio of the number of image frames acquired from the training video data to the total number of image frames contained in the training video data; and if the length of the training video data is not more than K seconds, performing the frame acquisition operation on the training video data at a second acquisition rate, wherein the second acquisition rate is likewise the ratio of the number of acquired image frames to the total number of image frames contained in the training video data, and the second acquisition rate is lower than the first acquisition rate.

Optionally, generating the spatial disorder data based on the training video data and the user questions includes: dividing each video frame of the training video data into D sub-images; calculating the cosine similarity between each sub-image of each video frame and the text features of the user questions; and removing the sub-images whose cosine similarity is greater than a threshold value to obtain the spatial disorder data.

Optionally, calculating the data accuracy of the normal output data, the spatial disorder output data and the time disorder output data respectively includes: comparing the normal output data, the spatial disorder output data and the time disorder output data respectively with standard output data of the training data to obtain the data accuracy of the normal output data, the spatial disorder output data and the time disorder output data.
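The adaptive frame acquisition rule described above (and refined in claims 2 and 7) can be sketched as follows. The value of K and the concrete acquisition rates are illustrative assumptions, since the application does not fix them; only the structure of the rule (a higher rate for videos longer than K seconds, every frame kept below 1 fps, and 1 frame per second otherwise) is taken from the claims.

```python
def sample_frame_indices(total_frames, fps, k_seconds=60.0):
    """Pick frame indices for training, following the adaptive rule in the
    claims: videos longer than K seconds are sampled at a higher rate than
    short ones, and for the first rate, sub-1-fps videos keep every frame
    while faster videos are sampled at 1 frame per second."""
    duration = total_frames / fps
    if fps < 1.0:
        # Fewer than one frame per second: keep all frames (claim 7).
        return list(range(total_frames))
    if duration > k_seconds:
        step = int(fps)       # long video: ~1 frame per second
    else:
        step = int(2 * fps)   # short video: lower acquisition rate (claim 2)
    return list(range(0, total_frames, max(step, 1)))
```

For example, a 10-second clip at 30 fps with K = 5 would yield one frame per second (10 frames), while a 3-second clip at the same frame rate would be sampled at the lower second rate.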