CN-122002079-A - Football match highlight generation method based on cloud-collaborative multi-modal large model
Abstract
The invention belongs to the technical field of artificial intelligence, and particularly relates to a football match highlight generation method based on a cloud-collaborative multi-modal large model, which comprises the following steps: an edge end performs standardized processing on multi-view football match videos uploaded by a user; the edge end calls a large model to perform event recognition on the standardized multi-view videos, extracts the occurrence time of a goal event in each channel of video, and outputs a corresponding target timestamp and frame number; the edge end cuts the standardized multi-view videos according to the target timestamp to obtain a front segment and a rear segment; the edge end extracts multi-view key frames from the standardized multi-view videos according to the target timestamp and uploads them to the cloud end; the cloud end processes the multi-view key frames through a key model to generate camera-motion segments and returns them to the edge end; the edge end splices the front segment, the camera-motion segments and the rear segment using the FFmpeg tool to obtain the football match highlight video. The method remarkably reduces the latency and computational cost of end-to-end inference.
Inventors
- Tang Bidi
- Du Yulu
Assignees
- Chongqing University of Posts and Telecommunications (重庆邮电大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-02-06
Claims (7)
- 1. A football match highlight generation method based on a cloud-collaborative multi-modal large model, characterized by comprising the following steps: S1, an edge end performs standardized processing on multi-view football match videos uploaded by a user, wherein the multi-view videos comprise N channels of video under different view angles, and N ≥ 2; S2, the edge end calls a large model to perform event recognition on the standardized multi-view videos, extracts the occurrence time of a goal event in each channel of video, and outputs a corresponding target timestamp and frame number; S3, the edge end cuts the standardized multi-view videos according to the target timestamp to obtain a front segment and a rear segment; S4, the edge end extracts multi-view key frames from the standardized multi-view videos according to the target timestamp and uploads them to the cloud end, wherein the multi-view key frames comprise the key frames of the N channels of standardized video; S5, the cloud end processes the multi-view key frames through a key model to generate N−1 camera-motion segments, wherein the key model comprises a diffusion-based image-to-video generation network, a text encoder and a variational autoencoder; S6, the edge end splices the front segment, the N−1 camera-motion segments and the rear segment using the FFmpeg tool to obtain the football match highlight video.
- 2. The football match highlight generation method based on the cloud-collaborative multi-modal large model according to claim 1, wherein the standardized processing of the multi-view videos in step S1 comprises: S11, reading the metadata of the N channels of video and executing step S12, wherein the metadata comprises the encapsulation format, encoding parameters, frame rate, resolution and pixel format, and the encoding parameters comprise the video encoder, audio sampling rate and duration; S12, uniformly transcoding the encapsulation format and encoding parameters of the N channels of video to a preset standard using the FFmpeg tool, and then executing step S13; S13, unifying the frame rate of the N channels of video to a target frame rate through normalization, unifying their resolution through scaling and padding operations, and then executing step S14; S14, performing duration alignment on the N channels of video to generate the standardized multi-view videos.
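The standardization of steps S12–S13 can be sketched as an FFmpeg invocation. The target values (H.264, 30 fps, 1920×1080, 44.1 kHz audio) are illustrative assumptions, not figures from the patent; the `scale`+`pad` filter chain performs the "scaling and filling" operation so the aspect ratio is preserved by letterboxing.

```python
import subprocess

# Illustrative standardization targets (assumptions, not from the patent).
TARGET_FPS = 30
TARGET_W, TARGET_H = 1920, 1080

def build_standardize_cmd(src: str, dst: str) -> list[str]:
    """Build an FFmpeg command unifying codec, frame rate and resolution.

    scale keeps the aspect ratio, pad letterboxes to the target canvas,
    fps resamples to the target frame rate (step S13).
    """
    vf = (
        f"scale={TARGET_W}:{TARGET_H}:force_original_aspect_ratio=decrease,"
        f"pad={TARGET_W}:{TARGET_H}:(ow-iw)/2:(oh-ih)/2,"
        f"fps={TARGET_FPS}"
    )
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", vf,
        "-c:v", "libx264", "-pix_fmt", "yuv420p",  # unify encoder / pixel format
        "-ar", "44100",                            # unify audio sampling rate
        dst,
    ]

def standardize(src: str, dst: str) -> None:
    """Run the transcode (step S12); requires FFmpeg on the PATH."""
    subprocess.run(build_standardize_cmd(src, dst), check=True)
```

Building the command separately from running it keeps the parameter logic testable without invoking FFmpeg.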
- 3. The football match highlight generation method based on the cloud-collaborative multi-modal large model according to claim 1, wherein the edge end calling the large model in step S2 to perform event recognition on each channel of standardized video comprises: S21, converting the current channel of standardized video into a chronologically ordered frame sequence based on a fixed-sampling-rate frame extraction strategy, and retaining the time information of each frame so as to establish a mapping between frames and timestamps; S22, constructing a recognition prompt that guides the large model to focus on the key visual evidence of a goal event and constrains the large model to output the recognition result in a structured form; S23, inputting the frame sequence and the recognition prompt into the large model for inference to obtain an output result, wherein the output result comprises at least two types of information: whether a goal event occurs, and the time at which the goal occurs; S24, taking the goal occurrence time as the target timestamp and extracting the corresponding frame number.
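Steps S21–S24 can be sketched as follows. The prompt wording is an illustrative assumption, and the actual model call is left out: any multimodal LLM endpoint that accepts frames plus text would sit between these two helpers. Only the frame-timestamp mapping and the structured-output parsing are shown.

```python
import json

# Hypothetical recognition prompt (S22): focuses the model on goal evidence
# and constrains the answer to a machine-parseable structured form.
RECOGNITION_PROMPT = (
    "You are analysing frames sampled from a football match video. "
    "Focus on key visual evidence of a goal: the ball crossing the line, "
    "net movement, player celebrations, scoreboard changes. "
    'Reply ONLY with JSON: {"goal": true|false, "goal_time_s": <seconds or null>}'
)

def frames_with_timestamps(n_frames: int, sample_fps: float):
    """S21: fixed-sampling-rate mapping from frame index to timestamp."""
    return [(i, i / sample_fps) for i in range(n_frames)]

def parse_recognition(output: str, video_fps: float):
    """S23/S24: parse the structured answer into (target timestamp, frame number).

    Returns None when no goal event was detected.
    """
    result = json.loads(output)
    if not result.get("goal"):
        return None
    t_goal = float(result["goal_time_s"])
    return t_goal, round(t_goal * video_fps)
```

The frame number is recovered from the timestamp via the standardized video's frame rate, which is why the frame-rate unification of claim 2 matters here.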
- 4. The football match highlight generation method based on the cloud-collaborative multi-modal large model according to claim 1, wherein step S3 comprises: S31, determining the clipping time windows of the 1st and Nth channels of standardized video: the clipping time window of the 1st channel is [t_goal,1 − T_pre, t_goal,1], where t_goal,1 denotes the target timestamp of the 1st channel and T_pre denotes the leading duration; the clipping time window of the Nth channel is [t_goal,N, t_goal,N + T_post], where t_goal,N denotes the target timestamp of the Nth channel and T_post denotes the trailing duration; when t_goal,1 < T_pre, the start time of the clipped segment of the 1st channel is corrected to the start of that video, i.e. the window start is clamped to 0; when t_goal,N + T_post exceeds the total duration of the Nth channel, the end time of the clipped segment of the Nth channel is corrected to the end of that video; S32, based on the clipping time windows, clipping the 1st and Nth channels of standardized video with the FFmpeg tool to obtain the front segment and the rear segment.
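The window arithmetic of step S31, including both boundary corrections, can be written directly; the FFmpeg cut command in `build_cut_cmd` is an illustrative sketch of step S32 (stream-copy cutting is an assumption; a real pipeline might re-encode for frame accuracy).

```python
def front_window(t_goal: float, t_pre: float) -> tuple[float, float]:
    """Window [t_goal - T_pre, t_goal], clamped to the video start (S31)."""
    start = max(0.0, t_goal - t_pre)   # correction when t_goal < T_pre
    return start, t_goal

def rear_window(t_goal: float, t_post: float, duration: float) -> tuple[float, float]:
    """Window [t_goal, t_goal + T_post], clamped to the video end (S31)."""
    end = min(duration, t_goal + t_post)  # correction past the total duration
    return t_goal, end

def build_cut_cmd(src: str, dst: str, start: float, end: float) -> list[str]:
    """Sketch of the S32 FFmpeg cut; '-c copy' avoids re-encoding."""
    return ["ffmpeg", "-y", "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
            "-i", src, "-c", "copy", dst]
```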
- 5. The football match highlight generation method based on the cloud-collaborative multi-modal large model according to claim 1, wherein the process of extracting the key frame of the nth channel of standardized video, for n = 1, 2, …, N, comprises: in the nth channel of standardized video, selecting the frame with the highest sharpness within the window [t_goal,n − Δ, t_goal,n + Δ] as the key frame K_n, where t_goal,n denotes the target timestamp of the nth channel of standardized video and Δ denotes a preset time interval.
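The claim does not define how "sharpness" is scored; a common choice, used here as an assumption, is the variance of a discrete Laplacian (a standard focus measure). A minimal NumPy sketch of the key-frame selection:

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Focus measure: variance of the 4-neighbour discrete Laplacian.

    Higher values indicate more high-frequency detail, i.e. a sharper frame.
    """
    g = gray.astype(np.float64)
    lap = (-4.0 * g[1:-1, 1:-1]
           + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

def pick_keyframe(frames, times, t_goal: float, delta: float) -> int:
    """Return the index of the sharpest frame in [t_goal - delta, t_goal + delta]."""
    candidates = [i for i, t in enumerate(times)
                  if t_goal - delta <= t <= t_goal + delta]
    return max(candidates, key=lambda i: laplacian_variance(frames[i]))
```

Restricting the search to a small window around the target timestamp keeps the key frame anchored to the goal moment while tolerating motion blur on individual frames.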
- 6. The football match highlight generation method based on the cloud-collaborative multi-modal large model according to claim 1, wherein, for n = 1, 2, …, N−1, the nth and (n+1)th channels of standardized video are grouped into one sample pair, yielding N−1 sample pairs; for each pair, the cloud end uses the key frame K_n of the nth channel and the key frame K_{n+1} of the (n+1)th channel as the start frame and the end frame respectively, and generates the corresponding camera-motion segment with the key model, comprising: S51, generating a positive condition vector and a negative condition vector through the text encoder; S52, the variational autoencoder generating an initial latent-variable sequence from the start frame and the end frame, and establishing a temporal boundary constraint from the start frame to the end frame in latent space according to the positive and negative condition vectors; S53, the diffusion-based image-to-video generation network receiving the initial latent-variable sequence and iterating on it to generate a final latent-variable sequence; S54, the decoder of the variational autoencoder decoding the final latent-variable sequence to obtain the camera-motion segment.
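The S51–S54 data flow can be illustrated with a purely schematic sketch. Every component below is a stub: a real system would use a trained text encoder, VAE and image-to-video diffusion network. Only the wiring mirrors the claim: start/end key frames are encoded into a boundary-constrained initial latent sequence, iteratively refined while the temporal boundaries are held fixed, then decoded.

```python
import numpy as np

LATENT_DIM, N_FRAMES, N_STEPS = 8, 5, 4  # toy sizes, illustration only

def vae_encode(frame: np.ndarray) -> np.ndarray:
    """Stub for the VAE encoder: projects a frame to a small latent vector."""
    return frame.reshape(-1)[:LATENT_DIM]

def init_latents(start: np.ndarray, end: np.ndarray) -> np.ndarray:
    """S52: initial latent sequence bounded by start and end key frames."""
    z0, z1 = vae_encode(start), vae_encode(end)
    ts = np.linspace(0.0, 1.0, N_FRAMES)[:, None]
    return (1.0 - ts) * z0 + ts * z1  # boundary-constrained initialization

def denoise(latents: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """S53 stand-in: iterative refinement of the interior frames only.

    A real diffusion sampler would denoise under the text condition vectors;
    here interior latents are merely perturbed while the temporal boundary
    constraint (first and last latent fixed) is preserved.
    """
    z = latents.copy()
    for _ in range(N_STEPS):
        z[1:-1] += 0.01 * rng.standard_normal(z[1:-1].shape)
    z[0], z[-1] = latents[0], latents[-1]  # keep the boundary frames exact
    return z

def vae_decode(z: np.ndarray) -> np.ndarray:
    """Stub for S54: identity decoder returning the 'camera-motion segment'."""
    return z
```

The point of the sketch is the invariant the claim relies on: whatever the diffusion network does in between, the generated segment must start at K_n and end at K_{n+1} so it splices seamlessly onto the real front and rear clips.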
- 7. The football match highlight generation method based on the cloud-collaborative multi-modal large model according to claim 6, characterized in that the video generation network includes: UNETLoader-1, for loading the high_noise video diffusion generation backbone; UNETLoader-2, for loading the low_noise video diffusion generation backbone; LoraLoaderModelOnly-1, for injecting a high_noise LoRA into the backbone loaded by UNETLoader-1; LoraLoaderModelOnly-2, for injecting a low_noise LoRA into the backbone loaded by UNETLoader-2; the algorithm executor includes: KSampler-1, for receiving the initial latent-variable sequence output by the VAE and calling the network output by LoraLoaderModelOnly-1 to generate an intermediate latent-variable sequence; KSampler-2, for receiving the intermediate latent-variable sequence output by KSampler-1 and calling the network output by LoraLoaderModelOnly-2 to generate the final latent-variable sequence; the text encoder adopts a CLIP-type encoder, specifically comprising: CLIPLoader, for loading the CLIP model; CLIPTextEncode-1, configured to receive the positive prompt, call the CLIP model loaded by CLIPLoader to encode it, and generate the positive condition vector; CLIPTextEncode-2, configured to receive the negative prompt, call the CLIP model loaded by CLIPLoader to encode it, and generate the negative condition vector; the variational autoencoder includes: VAELoader, for loading the VAE model; WanFirstLastFrameToVideo, for accessing the VAE model loaded by VAELoader, whose core function is to generate continuous transition frames between the specified start and end frames and construct the corresponding initial latent variables; VAEDecode, a VAE decoder accessing VAELoader to decode the final latent-variable sequence output by KSampler-2.
Description
Football match highlight generation method based on cloud-collaborative multi-modal large model
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a football match highlight generation method based on a cloud-collaborative multi-modal large model.
Background
In recent years, live broadcasting and short-video distribution of sports events have developed rapidly, and users' demands for obtaining key event segments, such as football goals, have become more real-time, more frequent and more personalized. To improve content production efficiency, the industry is increasingly adopting automation technology for event detection, segment clipping and content synthesis on event videos, so as to automatically generate event highlights and replays of key moments. The current mainstream approach analyses the full-match video, locates key events such as goals, and intercepts and splices segments before and after each event according to preset rules. With the rise of large vision models, their strong scene-semantic understanding capability is expected to further improve the robustness of event recognition.
However, several challenges remain in practical deployment. First, event videos are usually collected from multiple camera positions and view angles; videos from different views differ in encoding parameters, frame rate, resolution and duration, and their time axes are not strictly synchronized, so an event moment identified in one view is difficult to map accurately to the others, which in turn degrades the quality of segment clipping and content synthesis. Second, if inference on the full video stream relies entirely on a high-compute cloud model, bandwidth occupation, processing latency and cost all rise; if a lightweight edge model is used alone, missed or false detections readily occur in complex scenes such as lens occlusion, rapid shot switching and replay shots, making it hard to balance real-time performance and accuracy. In addition, traditional highlight generation mostly adopts a cut-and-splice approach with a single form of presentation. Camera-motion effects such as bullet time can significantly improve the visual expressiveness of key moments, but traditional implementations depend on multi-camera arrays or three-dimensional reconstruction, which are complex to deploy and costly. Generative video technology is emerging and offers a new way to produce dynamic camera-motion segments from key frames, but existing schemes still face engineering difficulties in subject consistency, temporal stability and natural connection with real video segments, and lack an integrated design that cooperates with the edge-side event detection pipeline.
Disclosure of Invention
In order to solve the above problems, the invention provides a football match highlight generation method based on a cloud-collaborative multi-modal large model, which comprises the following steps: S1, an edge end performs standardized processing on multi-view football match videos uploaded by a user, wherein the multi-view videos comprise N channels of video under different view angles, and N ≥ 2; S2, the edge end calls a large model to perform event recognition on the standardized multi-view videos, extracts the occurrence time of a goal event in each channel of video, and outputs a corresponding target timestamp and frame number; S3, the edge end cuts the standardized multi-view videos according to the target timestamp to obtain a front segment and a rear segment; S4, the edge end extracts multi-view key frames from the standardized multi-view videos according to the target timestamp and uploads them to the cloud end, wherein the multi-view key frames comprise the key frames of the N channels of standardized video; S5, the cloud end processes the multi-view key frames through a key model to generate N−1 camera-motion segments, wherein the key model comprises a diffusion-based image-to-video generation network, a text encoder and a variational autoencoder; S6, the edge end splices the front segment, the N−1 camera-motion segments and the rear segment using the FFmpeg tool to obtain the football match highlight video. The invention has the following beneficial effects: the multi-view match videos are uniformly standardized and temporally aligned, guaranteeing accurate mapping of the goal moment across different views and improving the stability of clipping and synthesis; a lightweight goal identification
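The final splicing step S6 can be sketched with FFmpeg's concat demuxer. File names are illustrative; building the list file and command as pure functions keeps the ordering logic (front segment, then the N−1 camera-motion segments, then the rear segment) inspectable without running FFmpeg.

```python
def build_concat_inputs(front: str, motion_segments: list[str], rear: str) -> str:
    """Content of the concat demuxer list file, in playback order (S6)."""
    parts = [front, *motion_segments, rear]
    return "".join(f"file '{p}'\n" for p in parts)

def build_concat_cmd(list_file: str, dst: str) -> list[str]:
    """FFmpeg concat-demuxer invocation; '-c copy' splices without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", dst]
```

Stream-copy concatenation assumes all segments share the same codec parameters, which is exactly what the standardization of step S1 establishes.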