
CN-120812352-B - Video advertisement intelligent insertion method, system and medium based on semantic segmentation and multi-mode fusion

CN 120812352 B

Abstract

The invention discloses an intelligent video advertisement insertion method and system based on semantic segmentation and multi-modal context fusion. The method automatically identifies spatially neutral regions in a video that are available for advertisement embedding, using pixel-level semantic masks and multi-frame temporal analysis, and judges background stability from indicators such as structural similarity and brightness change. A multi-modal context modeler fuses a character-emotion vector, a video-rhythm vector and a scene-structure vector, generates a context scene vector through a cross-attention mechanism, and outputs an insertion-suitability score. When the score reaches a set threshold and the neutral region meets the conditions, the system performs advertisement generation and visual fusion, achieving natural advertisement insertion consistent with the style of the original video content. The method is applicable to content platforms such as short video and live streaming; experiments show that it can significantly improve the click-through rate and play-completion rate while reducing the skip rate.

Inventors

  • XU GUOHUA
  • ZHENG XINYU

Assignees

  • 北京娱广科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2025-07-31

Claims (9)

  1. An intelligent video advertisement insertion method based on semantic segmentation and multi-modal fusion, characterized by comprising the following steps:
     S1, performing frame extraction on an input video stream to obtain an image-frame sequence;
     S2, performing pixel-level recognition on each image frame with a semantic segmentation model to generate a semantic mask map P_t and identify non-key semantic regions;
     S3, identifying an advertisement-neutral region that satisfies spatial and temporal continuity, based on the pixel density of the free region across consecutive frames combined with time-sliding-window statistics, structural-similarity and brightness-change analysis, wherein the identification specifically comprises:
     according to a preset key semantic label set S_key, judging which pixels in the semantic mask map P_t do not belong to S_key and generating a binary mask M_t, where M_t(x, y) = 1 represents an area available for advertisement insertion;
     calculating the duty cycle ρ_t of the binary mask M_t over the whole frame by the formula ρ_t = (1 / (W × H)) · Σ_(x,y) M_t(x, y), where W × H is the whole-frame image size;
     constructing a time sliding window of length T and counting the frames within the window that satisfy ρ_t ≥ ρ_th; if at least m = 10 frames satisfy the condition, determining the time period corresponding to the window of length T to be a continuous neutral candidate region;
     performing background-stability analysis on the continuous neutral candidate region by calculating, within the candidate region, the structural similarity SSIM between the current frame F_t and the previous frame F_(t-1) together with the brightness change ΔI; only when SSIM ≥ 0.90 and ΔI ≤ 5% is the area confirmed as an advertisement-neutral region;
     S4, for the frames corresponding to the neutral region, extracting multi-modal context features, including an emotion vector generated from facial expressions of the characters, a video-rhythm vector and an image-structure vector;
     S5, inputting the multi-modal features into a fusion network for context modeling and generating an insertion-suitability score; and
     S6, when the score exceeds a set threshold and the frame is a legal neutral region, performing advertisement generation and visual fusion and embedding the advertisement into the corresponding frame.
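The neutral-window statistics of step S3 (binary-mask duty cycle ρ_t plus a length-T sliding window requiring at least m qualifying frames) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `neutral_windows` and the density threshold `rho_th = 0.4` are assumptions, since the claim leaves the threshold symbolic; T = 15 and m = 10 follow the preferred embodiment.

```python
import numpy as np

def neutral_windows(mask_frames, rho_th=0.4, T=15, m=10):
    """Find time windows whose frames are mostly 'neutral' (insertable).

    mask_frames: list of HxW binary arrays, 1 = pixel is free for ad
                 insertion (its semantic label is outside the key set).
    rho_th:      duty-cycle threshold (assumed example value).
    T, m:        window length and minimum qualifying frames.
    Returns the start indices of candidate neutral windows.
    """
    # Duty cycle rho_t = (free pixels) / (total pixels) for each frame.
    rho = np.array([mk.mean() for mk in mask_frames])
    qualifies = rho >= rho_th
    starts = []
    for s in range(len(rho) - T + 1):
        # At least m of the T frames in the window must qualify.
        if qualifies[s:s + T].sum() >= m:
            starts.append(s)
    return starts
```

A sequence of all-free masks yields every possible window start, while a sequence of fully occupied masks yields none; real masks would come from the per-frame segmentation output.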
  2. The method of claim 1, wherein the semantic segmentation model is Mask R-CNN, and the labels in its output mask map comprise at least three semantic labels, namely character, commodity and background, used for defining the non-key regions.
  3. The method of claim 1, wherein the neutral-region identification is based on a time sliding window of T-frame length, counting whether at least M frames meet the insertable-region density threshold, where T = 15 and M = 10.
  4. The method of claim 1, wherein the emotion vector is a seven-dimensional probability vector corresponding to the recognition results of seven emotion categories.
  5. The method of claim 1, wherein the multi-modal fusion employs a cross-attention transformer structure, outputting a 25-dimensional context semantic vector for insertion-score prediction.
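A single-head cross-attention fusion of the three modality vectors into a 25-dimensional context vector, as named in claim 5, can be sketched as below. Everything beyond the claim text is an assumption: the function name `fuse_context`, the random projection weights, and the rhythm/structure input dimensions (6 and 5, loosely inferred from the feature lists in claim 8); a real system would use trained projections and a full transformer block.

```python
import numpy as np

rng = np.random.default_rng(0)  # stand-in for trained weights

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_context(emotion, rhythm, structure, d=25):
    """Fuse emotion/rhythm/structure vectors into one d-dim context vector
    via a minimal single-head attention (illustrative sketch only)."""
    # Project each modality into a shared d-dim token space.
    tokens = np.stack([
        rng.standard_normal((d, emotion.size)) @ emotion,
        rng.standard_normal((d, rhythm.size)) @ rhythm,
        rng.standard_normal((d, structure.size)) @ structure,
    ])                                          # shape (3, d)
    q = tokens.mean(axis=0)                     # pooled query
    attn = softmax(tokens @ q / np.sqrt(d))     # (3,) modality weights
    return attn @ tokens                        # weighted sum -> (d,)

# 7-dim emotion vector per claim 4; rhythm/structure dims are assumed.
ctx = fuse_context(np.ones(7), np.ones(6), np.ones(5))
```

The resulting 25-dimensional `ctx` would then feed the insertion-suitability score predictor of step S5.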
  6. A video advertisement insertion system for performing the method of any one of claims 1-5, comprising the following modules:
     a video preprocessing and framing module for receiving an original video stream and performing frame extraction to generate a continuous image-frame sequence;
     a target detection and semantic segmentation module for performing target recognition and semantic segmentation on each image frame and outputting a pixel-level mask map;
     a neutral-point identification module for analyzing the pixel ratio of non-key regions based on the mask map and, combined with the time sliding window and background stability, identifying advertisement-insertion neutral regions that meet the conditions;
     a context modeling and insertion-intention evaluation module, comprising expression recognition, rhythm analysis and structure modeling sub-modules, for extracting multi-modal context features and fusing them to generate an insertion-suitability score;
     a dynamic advertisement generation and fusion module for selecting or generating advertisement content based on the neutral region and context information and fusing the advertisement image into the video frames; and
     a real-time rendering and delivery control module for controlling the advertisement-insertion frequency, the presentation mode, and the final rendering and output on the user equipment.
  7. The video advertisement insertion system of claim 6, wherein the neutral-point identification module comprises the following sub-modules:
     a video input module for receiving the original video data and parsing its frame rate and resolution;
     a frame extraction module for converting the video data into an image-frame sequence;
     a semantic segmentation module for performing pixel-level semantic segmentation on each image frame based on a pre-trained Mask R-CNN model and outputting a semantic mask map P_t;
     a neutral-candidate mask generation module for identifying non-key semantic regions from the semantic mask map and generating a binary mask M_t;
     a region-area calculation module for calculating the pixel duty cycle ρ_t of the insertable region in the binary mask M_t;
     a consecutive-frame window judging module for judging whether, within a time sliding window of length T, at least M frames satisfy ρ_t ≥ ρ_th, thereby determining a continuous neutral candidate region;
     a background-stability analysis module for calculating the structural similarity SSIM and the brightness change ΔI within the neutral region to judge the stability of background variation; and
     a neutral-region confirmation module for performing a joint logical judgment on the results from the region-area calculation module, the consecutive-frame window judging module and the background-stability analysis module, and outputting the final mask frames into which an advertisement can be inserted.
  8. The video advertisement insertion system of claim 6, wherein the context modeling and insertion-intention evaluation module comprises:
     a facial-expression recognition module for extracting facial image features based on a pre-trained model and outputting an emotion vector containing the confidences of seven basic emotions, describing the emotional state of the current character;
     a dynamic rhythm analysis module for generating a multi-dimensional rhythm vector reflecting the rhythm characteristics of the picture, according to the target motion between image frames, the energy change of the audio signal, the movement speed of the characters and the mute proportion;
     a scene-structure modeling module for extracting the texture complexity, sharpness, scene crowding, color-information richness and left-right brightness symmetry of the image to describe its visual structure;
     a multi-modal fusion module for semantically fusing the emotion, rhythm and structure vectors through a cross-attention mechanism to obtain a comprehensive context representation of the current frame;
     an insertion-suitability score prediction module for evaluating, based on the fused context representation vector, the contextual fitness of performing advertisement insertion at the current moment and outputting a score representing insertion suitability; and
     an insertion-trigger control module for triggering the advertisement-insertion action on the premise that the score reaches a preset threshold and the current frame is marked as a neutral region.
  9. A computer-readable storage medium having stored thereon a computer program which, when run in a processor, causes the computer to perform all the steps of the method according to any one of claims 1-5.

Description

Video advertisement intelligent insertion method, system and medium based on semantic segmentation and multi-modal fusion

Technical Field

The invention relates to the fields of artificial intelligence, computer vision and digital-advertisement fusion, and in particular to an intelligent advertisement insertion method based on video-content understanding, suitable for online content scenarios such as live video and short-video streaming media.

Background

With the explosive development of short-video and live-streaming platforms, traditional advertisement formats (such as pre-roll patches and forced insertions on pause) are increasingly regarded by users as disruptive to the viewing experience, leading to high ad-skip and ad-blocking rates and hurting advertising conversion. Meanwhile, the current mainstream video-advertisement insertion approaches still rely on fixed time points or manual platform configuration: the insertion position cannot be dynamically adjusted according to the actual picture content, context awareness is lacking, and user attention breaks or emotional resistance are easily caused. In recent years, visual recognition algorithms such as object detection (e.g., YOLO, DETR) and semantic segmentation (e.g., DeepLab, Mask R-CNN) have made significant breakthroughs, making frame-level video understanding possible. However, there is still no systematic solution for effectively combining such visual techniques with advertisement-insertion logic to build a "natural, non-disruptive" ad-placement mechanism.
Disclosure of Invention

The invention provides an intelligent analysis and advertisement-insertion method for video content: target detection and semantic segmentation are performed on video frames to identify characters, objects, background elements and semantically neutral regions; combined with scene-rhythm analysis, optimal insertion points are identified automatically and advertisement content matching the scene is generated dynamically, achieving immersive advertisement insertion and overcoming at least one technical problem in the related art.

According to a first aspect of the embodiments of the present disclosure, there is provided a video advertisement intelligent insertion method based on semantic segmentation and multi-modal fusion, comprising the following steps: S1, performing frame extraction on an input video stream to obtain an image-frame sequence; S2, performing pixel-level recognition on the image frames with a semantic segmentation model, generating a semantic mask map, and identifying non-key semantic regions; S3, based on the pixel density of the free region across consecutive frames, combined with time-sliding-window statistics and structural-similarity (SSIM) and brightness-change (ΔI) analysis, identifying advertisement-neutral regions satisfying spatial and temporal continuity; S4, for the frames corresponding to the neutral regions, extracting multi-modal context features, including an emotion vector generated from facial expressions of the characters, a video-rhythm vector and an image-structure vector; S5, inputting the multi-modal features into a fusion network for context modeling and generating an insertion-suitability score; and S6, when the score exceeds a set threshold and the frame is a legal neutral region, performing advertisement generation and visual fusion and embedding the advertisement into the corresponding frame.
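The background-stability test of step S3 (SSIM between successive frames not below 0.90 and relative brightness change ΔI not above 5%) can be sketched with a simplified single-window SSIM. The names `ssim_global` and `is_stable` are illustrative assumptions: a production system would compute windowed SSIM over the candidate region only, not a global statistic over the whole frame.

```python
import numpy as np

def ssim_global(a, b, c1=6.5025, c2=58.5225):
    """Global (single-window) SSIM between two grayscale frames.
    c1, c2 are the standard 8-bit constants (0.01*255)^2 and (0.03*255)^2.
    A simplification of the usual locally windowed SSIM."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2) /
            ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)))

def is_stable(prev, cur, ssim_th=0.90, di_th=0.05):
    """Stability test per step S3: SSIM >= 0.90 and relative
    mean-brightness change <= 5% between consecutive frames."""
    d_i = abs(cur.mean() - prev.mean()) / max(prev.mean(), 1e-6)
    return ssim_global(prev, cur) >= ssim_th and d_i <= di_th
```

Identical consecutive frames pass the test (SSIM = 1, ΔI = 0), while a large brightness jump fails it; only windows passing this check would be confirmed as advertisement-neutral regions.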
Preferably, the semantic segmentation model is Mask R-CNN, and its output mask map comprises at least three semantic labels, namely character, commodity and background, used for defining the non-key regions. Preferably, the neutral-region identification is based on a time sliding window of length T frames, counting whether not less than M frames satisfy the insertable-region density threshold, where T = 15 and M = 10. Preferably, the background-stability determination requires that the structural similarity SSIM between successive frames is not less than 0.90 and the RGB brightness change ΔI is not more than 5%. Preferably, the emotion vector is a seven-dimensional probability vector corresponding to the recognition results of seven emotion categories. Preferably, the multi-modal fusion employs a cross-attention transformer (Cross-Attention Transformer) structure, outputting a 25-dimensional context semantic vector for insertion-score prediction. In another aspect, the present invention provides a video advertisement insertion system for performing the video advertisement intelligent insertion method based on semantic segmentation and multi-modal fusion, comprising the following modules: the video preprocessing and framing module is used for receiving an