
CN-119893159-B - Artificial intelligence-based foreground object barrage shielding prevention method

CN 119893159 B

Abstract

The invention discloses an artificial intelligence-based method for preventing bullet-screen (barrage) comments from shielding foreground objects, comprising the following steps: S10, sampling an original video at a preset frequency to obtain video stream images; S20, processing the video stream with a shot switching detection algorithm to judge whether a shot switch occurs in the current video frame; S30, processing the current image with a face detector to obtain target face information of the current image; S40, processing the current image with an instance segmenter to obtain target instance segmentation information of the current image; S50, processing the current image with a semantic segmenter to obtain human semantic segmentation information of the targets in the current image, wherein the human semantic segmentation information is a single-channel feature map with the same width and height as the original image, and the value at each pixel represents the probability that the pixel belongs to a human body; S60, processing the current image with a depth estimator to obtain depth estimation information of the current image; and S70, sending this information to a post-processing module for processing.

Inventors

  • Ke Shicheng
  • Tian Jianguo
  • He Xufeng
  • Gu Yuexin
  • Wang Jingqi

Assignees

  • 杭州当虹科技股份有限公司 (Hangzhou Arcvideo Technology Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2024-11-27

Claims (10)

  1. The foreground object barrage shielding prevention method based on artificial intelligence is characterized by comprising the following steps: S10, sampling an original video at a preset frequency to obtain video stream images, wherein the original video is a video whose foreground human figures need to be protected from barrage shielding; S20, processing the video stream with a shot switching detection algorithm to judge whether a shot switch occurs in the current video frame, and resetting the post-processing module cache if a shot switch occurs in the current frame, wherein the post-processing module cache comprises target tracking information, the main target depth information of the previous frame, and whether the previous frame produced a mask output; S30, processing the current image with a face detector to obtain target face information of the current image; S40, processing the current image with an instance segmenter to obtain target instance segmentation information of the current image; S50, processing the current image with a semantic segmenter to obtain human semantic segmentation information of the targets in the current image, wherein the human semantic segmentation information is a single-channel feature map with the same width and height as the original image, and the value at each pixel represents the probability that the pixel belongs to a human body; S60, processing the current image with a depth estimator to obtain depth estimation information of the current image, wherein the depth estimation information is a single-channel feature map with the same width and height as the original image, and the value at each pixel represents the depth of the object at that pixel; S70, sending the target face information, the target instance segmentation information, the target human semantic segmentation information and the depth estimation information of the current image into a post-processing module for processing, wherein the post-processing module comprises a foreground mask selection algorithm and a time-sequence mask compensation algorithm, and the foreground mask selection algorithm screens out the instances of the current picture that need to be output and constructs mask information from the current frame's information to obtain the foreground mask of the current frame; S80, merging the foreground mask of the current frame with the time-sequence compensation mask to obtain the final output mask of the current frame, which is output externally as the current frame's result; S90, saving the instance information used by the foreground mask selection algorithm and the time-sequence mask compensation algorithm of the current frame for processing the next frame image; S100, repeating the processing of S20 to S90 on the next frame image until processing of the video stream images is complete.
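The per-frame control flow of steps S20 to S100 can be sketched as a loop. The following is a minimal, hypothetical orchestration in Python, where `detect_shot_cut`, `run_models`, and `postprocess` are stand-in callables supplied by the caller, not the patented detectors themselves.

```python
# Hypothetical sketch of the per-frame pipeline in claim 1 (S20-S100).
# The three callables are stand-ins for the shot-cut detector, the four
# model stages (S30-S60), and the post-processing module (S70-S80).

def process_stream(frames, detect_shot_cut, run_models, postprocess):
    cache = {"tracking": None, "prev_depth": None, "prev_mask": None}
    outputs, prev = [], None
    for frame in frames:
        # S20: reset the post-processing cache when a shot switch is detected
        if prev is not None and detect_shot_cut(prev, frame):
            cache = {"tracking": None, "prev_depth": None, "prev_mask": None}
        info = run_models(frame)         # S30-S60: faces, instances, semantics, depth
        mask = postprocess(info, cache)  # S70-S80: select mask + temporal compensation
        cache["prev_mask"] = mask        # S90: save state for the next frame
        outputs.append(mask)
        prev = frame
    return outputs
```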
  2. The artificial intelligence-based foreground object barrage shielding prevention method of claim 1, wherein processing the video stream with the shot switching detection algorithm in S20 specifically comprises: S201, scaling the image to a fixed size; S202, differencing the scaled image with the retained previous-frame image, counting a pixel as changed when its difference is larger than a preset image change threshold and as unchanged otherwise, and judging that a shot switch occurs if the number of changed pixels is larger than a preset threshold and the number of unchanged pixels is smaller than a preset threshold; S203, converting the image to the hue-saturation-value (HSV) color space, extracting the histogram of the hue and saturation components, and comparing it with the previous frame's hue-saturation histogram to obtain the hue-saturation histogram correlation; simultaneously converting the image to the gray color space, extracting the gray-level histogram, and comparing it with the previous frame's gray-level histogram to obtain the gray-level histogram correlation; judging that a shot switch occurs in the current frame if both histogram correlations are smaller than their preset thresholds; and S204, updating the retained previous-frame image with the current frame.
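The two cues of claim 2, pixel-difference counting (S202) and histogram correlation (S203), can be sketched as below, assuming frames flattened to lists of gray values; the thresholds are illustrative placeholders, not the patent's preset values.

```python
# Sketch of the shot-cut cues in claim 2. Frames are assumed flattened to
# flat lists of 0-255 gray values; thresholds are illustrative.

def changed_pixel_cut(prev, cur, pix_thresh=30, frac_thresh=0.6):
    """S202: a cut is flagged when the fraction of changed pixels is high."""
    changed = sum(1 for a, b in zip(prev, cur) if abs(a - b) > pix_thresh)
    return changed / len(prev) > frac_thresh

def hist_correlation(h1, h2):
    """S203: Pearson correlation between two histograms (1.0 = identical shape)."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    den = (sum((a - m1) ** 2 for a in h1) *
           sum((b - m2) ** 2 for b in h2)) ** 0.5
    return num / den if den else 1.0
```

A shot switch would be declared when `hist_correlation` falls below a preset threshold for both the hue-saturation and gray-level histograms.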
  3. The artificial intelligence-based foreground object barrage shielding prevention method of claim 1, wherein the face detector in S30 uses the nano architecture of a YOLOv model, and its specific working process comprises: S301, preprocessing the current video stream image by scaling it to an image 640 pixels wide and 384 pixels high and converting the storage channel format of the image from [height, width, color channel] to [batch, color channel, height, width]; S302, after the preprocessed image is sent into the model, first performing feature extraction through a backbone network to generate several feature maps of different scales, which capture information of the image at different levels; the feature maps are sent into a neck network, which further processes and fuses the feature maps output by the backbone network; the feature maps processed by the neck network are sent into a prediction head, which is responsible for generating a detection result from the feature maps; the detection result is a matrix in which each row represents one detected object and each column represents an attribute of the object, in the order: abscissa of the bounding-box center point, ordinate of the bounding-box center point, width of the bounding box, height of the bounding box, YOLO detection confidence, and probability of the face class; S303, the neck network provides feature maps of several sizes, each feature map has a prediction head that generates a prediction matrix, and the output matrices of all prediction heads are concatenated to generate a new prediction matrix in which each row represents one detection result and each column represents an attribute of the detection result, in the same order as above; S304, filtering out from the obtained matrix the face detection results whose confidence is lower than the preset threshold of 0.5, then performing non-maximum suppression on the filtered detection results to remove overlapping bounding boxes; S305, converting the bounding-box coordinates of the final detection results from normalized coordinates back to coordinates in the original video stream image by multiplying the width and height of the bounding box, and correspondingly the coordinates of its center point, by the scaling ratios of the original image, and outputting the coordinate-converted detection results.
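The confidence filtering and non-maximum suppression of S304 can be sketched as below, assuming detections as (x1, y1, x2, y2, score) corner-format tuples (the claim itself describes a center/width/height layout); the 0.5 thresholds follow the claim.

```python
# Sketch of S304: confidence filtering followed by greedy non-maximum
# suppression. Detections are assumed as (x1, y1, x2, y2, score) tuples.

def iou(a, b):
    """Intersection-over-union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(dets, conf_thresh=0.5, iou_thresh=0.5):
    """Drop low-confidence detections, then suppress overlapping boxes."""
    dets = sorted((d for d in dets if d[4] >= conf_thresh),
                  key=lambda d: d[4], reverse=True)
    kept = []
    for d in dets:
        if all(iou(d, k) < iou_thresh for k in kept):
            kept.append(d)
    return kept
```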
  4. The artificial intelligence-based foreground object barrage shielding prevention method of claim 1, wherein the model used by the instance segmenter in S40 is SOLOv2, and processing the current image with the instance segmenter comprises: S401, preprocessing the current video stream image by scaling it to an image 640 pixels wide and 360 pixels high; S402, inputting the preprocessed image into the model to obtain feature maps: the preprocessed image is input into a backbone network for feature extraction, feature maps of different scales are then obtained through a feature pyramid network, the feature maps of different scales are up-sampled or down-sampled to a uniform resolution, and further feature fusion is performed through a convolution layer; S403, dividing the obtained feature map into a grid of 9 rows and 16 columns, wherein each grid cell is responsible for predicting targets in a certain range, and mask kernels and mask kernel parameters are learned from the feature map for each grid cell through a series of convolution operations; S404, for each grid cell, convolving the obtained mask kernel with the mask features to generate an instance mask, and generating an instance confidence score for each grid cell with a prediction head, wherein the score represents whether the grid cell contains a target instance, and the instance mask and the instance confidence score jointly form the output of the grid cell; S405, combining the outputs of all grid cells and filtering out grid outputs whose instance confidence score is less than 0.5; S406, performing operations including fusion and up-sampling on the feature maps output by the feature pyramid network to obtain the mask features, which contain information about the position and shape of the targets; S407, performing convolution operations on all mask kernels and the mask features to generate the instance masks output by the grid; S408, filtering the output instance masks using matrix non-maximum suppression; S409, combining and outputting the instances remaining after matrix non-maximum suppression, wherein the output matrix is (n, h, w, 2), n is the number of output instances and h, w are the original width and height, so that the n instances yield n binary mask maps with the same width and height as the original image, each mask map representing one instance, in which, for each pixel, 0 represents background and 1 represents the foreground of that instance.
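The mask-level suppression of S408–S409 can be illustrated with a greedy variant that drops heavily overlapping masks. True Matrix NMS instead decays scores in parallel, so this is a simplification; masks are assumed to be flat 0/1 lists.

```python
# Simplified (greedy) stand-in for the matrix non-maximum suppression of
# S408: masks with high mutual IoU against an already-kept, higher-scoring
# mask are dropped. Masks are assumed flattened to 0/1 lists.

def mask_iou(m1, m2):
    inter = sum(1 for a, b in zip(m1, m2) if a and b)
    union = sum(1 for a, b in zip(m1, m2) if a or b)
    return inter / union if union else 0.0

def suppress_masks(masks, scores, iou_thresh=0.5):
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) < iou_thresh for j in kept):
            kept.append(i)
    return [masks[i] for i in kept]
```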
  5. The artificial intelligence-based foreground object barrage shielding prevention method of claim 1, wherein the model used by the semantic segmenter in S50 is a PID neural network, and its specific working process includes: S501, preprocessing the current video stream image by scaling it to an image 640 pixels wide and 384 pixels high; S502, acquiring deep features of the preprocessed image through a backbone network, capturing local features and context information in the image; S503, extracting fused features by sending the deep features into a P branch, an I branch and a D branch respectively to obtain the features of each branch, wherein the P branch is responsible for analyzing and retaining detail information in the high-resolution feature map, and the I branch is responsible for aggregating local and global context information to analyze long-range dependencies between pixels, using an encoder-decoder structure in which the encoder gradually reduces the spatial resolution to capture context and the decoder gradually restores the resolution to localize details; S504, post-processing to obtain the semantic segmentation map, in which each pixel is assigned a class label: the feature map output by the boundary-attention-guided fusion (BAG) module is matched to the resolution of the original input image through an up-sampling operation, and the up-sampled feature map is sent to a semantic head, comprising a 1x1 convolution and a softmax operation, to obtain the semantic feature map, wherein the 1x1 convolution performs channel-number adjustment and feature fusion and the softmax assigns a class probability to each pixel; for each pixel in the semantic feature map the class with the highest probability is selected, and the pixel is output as human body when the human-body class probability is larger than the threshold of 0.5, all other pixels being background, thereby generating and outputting the semantic segmentation map; the output matrix is (1, h, w, 2), a binary matrix with the same width and height as the original image, in which 0 indicates that the pixel is background and 1 that the pixel is foreground.
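The per-pixel decision of S504, a softmax over a two-class output followed by the 0.5 human-body probability threshold, can be sketched as below; the (background, human) channel layout is an assumption.

```python
# Sketch of S504's per-pixel decision: two-class softmax, then the 0.5
# human-probability threshold. The (background, human) logit order is assumed.

import math

def human_mask(logits):
    """logits: list of (bg, human) pairs, one per pixel -> 0/1 foreground list."""
    out = []
    for bg, hu in logits:
        p = math.exp(hu) / (math.exp(bg) + math.exp(hu))  # softmax over 2 classes
        out.append(1 if p > 0.5 else 0)
    return out
```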
  6. The artificial intelligence-based foreground object barrage shielding prevention method of claim 1, wherein the model used by the depth estimator in S60 is Depth Anything, and its specific working process includes: S601, preprocessing the current video stream image by scaling it to an image 448 pixels wide and 224 pixels high; S602, sending the preprocessed picture into the encoder of the model for encoding, wherein the feature-extraction encoder is the ViT-S encoder of DINOv2, which extracts features from the preprocessed picture; S603, sending the obtained encoding into a DPT decoder for decoding, mapping the features back to a depth feature map; S604, up-sampling the depth feature map to the same size as the original image to obtain the depth map, where the value at each pixel of the depth map represents the depth at that position: the smaller the value, the farther the position, and the larger the value, the closer the position.
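The up-sampling of S604 can be illustrated with a simple nearest-neighbour resize of a low-resolution depth map to the original frame size; the encoder and decoder stages belong to the model itself and are not reproduced here.

```python
# Sketch of S604: nearest-neighbour up-sampling of a low-resolution depth
# map (list of row lists) to the original frame size.

def upsample_nearest(depth, out_h, out_w):
    in_h, in_w = len(depth), len(depth[0])
    return [[depth[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)] for r in range(out_h)]
```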
  7. The artificial intelligence-based foreground object barrage shielding prevention method of claim 1, wherein the foreground mask selection algorithm flow in S70 specifically includes: S7011, based on area, width and height information, placing ignore marks on instances whose area, width or height is smaller than a preset threshold; S7012, counting the number of instances carrying ignore marks; if the number exceeds a preset number, calling the post-processing module to check whether foreground information exists in the previous frame's mask, and if the previous frame's mask produced no foreground information, outputting for this frame a blank mask without foreground information; S7013, matching the face information with the instance information, and marking instances that contain face information with a face mark; S7014, if the current frame shows no face information, outputting no mask for this frame; S7015, excluding instances carrying ignore marks, selecting the instances with face information, scoring an optimal face based on the position information, area information and aspect-ratio information of each face, and taking the focal depth of the instance corresponding to the optimal face as the optimal depth of the current frame; S7016, when the optimal depth of the previous frame is not empty, reading it, and if the difference between the optimal depth of the current frame and that of the previous frame is greater than a dynamic threshold, taken as 0.2 times the optimal depth of the previous frame, traversing the depths of all instances and selecting the instance closest to the previous frame's optimal depth to define the optimal depth of the current frame; S7017, excluding instances carrying ignore marks and traversing the instances with face information, and if the difference between an instance's depth and the current frame's optimal depth is smaller than the dynamic threshold, taken as 0.2 times the optimal depth, marking the instance for output in the current frame; S7018, building a blank mask with the same width and height as the original image and superimposing the masks of the instances to be output onto it, namely setting to 1 the positions on the mask corresponding to positions with value 1 in the instance masks, to obtain the foreground mask of the current frame.
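The mask construction of S7018 can be sketched directly: build a blank mask the size of the original frame and set to 1 every position that is 1 in any selected instance mask (masks assumed to be 0/1 row lists).

```python
# Sketch of S7018: superimpose the selected instance masks onto a blank
# mask of the original frame size. Masks are assumed as 0/1 row lists.

def build_foreground_mask(instance_masks, h, w):
    out = [[0] * w for _ in range(h)]
    for m in instance_masks:
        for r in range(h):
            for c in range(w):
                if m[r][c]:
                    out[r][c] = 1  # foreground wherever any instance is 1
    return out
```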
  8. The artificial intelligence-based foreground object barrage shielding prevention method of claim 7, wherein selecting the optimal face in S7015 comprises traversing all face results of the current frame and calculating a position score, an area score and an aspect-ratio score for each face, wherein, with the coordinates of the face center point denoted cx, cy and the width and height of the current frame image denoted imgw and imgh, the position score is score1 = 1 if abs(0.5 × imgw − cx) / imgw < 0.3 else 1 − abs(0.5 × imgw − cx) / imgw; the area score traverses all faces, takes the maximum area as max_face_area and, for each face of area face_area, computes score2 = face_area / max_face_area; the aspect-ratio score is computed from the ratio of the face box's short side to its long side l; and the three scores are combined to select the optimal face.
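The three-part face score can be sketched as below. The position and area formulas follow the claim; the exact aspect-ratio term (short side over long side) and the equal-weight sum are assumptions, since that part of the translated claim is garbled.

```python
# Sketch of the optimal-face scoring in claim 8. Position and area scores
# follow the claim; the aspect-ratio term (short/long side) and equal-weight
# summation are assumptions.

def score_faces(faces, imgw):
    """faces: list of dicts with cx (center x), w, h; returns the best face."""
    max_area = max(f["w"] * f["h"] for f in faces)
    best, best_score = None, -1.0
    for f in faces:
        off = abs(0.5 * imgw - f["cx"]) / imgw
        s1 = 1.0 if off < 0.3 else 1.0 - off              # position score
        s2 = (f["w"] * f["h"]) / max_area                 # area score
        s3 = min(f["w"], f["h"]) / max(f["w"], f["h"])    # aspect-ratio (assumed)
        total = s1 + s2 + s3                              # equal weights (assumed)
        if total > best_score:
            best, best_score = f, total
    return best
```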
  9. The artificial intelligence-based foreground object barrage shielding prevention method of claim 1, wherein the time-sequence mask compensation algorithm in S70 includes: S7021, reading the instance information output for the previous frame and the instance information of the current frame, matching the previous frame's output instances with the current frame's instances based on center of gravity and bounding box, outputting the instance mask if a current-frame instance can be matched with a previous-frame output instance, and marking the matched previous-frame instance as matched; S7022, taking the intersection of the remaining previous-frame output instances with the semantic segmentation and depth information results of the current frame, and taking the intersection result as a new instance to be output if the intersection is larger than a preset threshold; and S7023, building a blank mask with the same width and height as the original image and superimposing all instance masks to be output onto the mask to obtain the time-sequence compensation mask.
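The compensation test of S7022 can be sketched as a mask intersection with a survival-ratio check, assuming flat 0/1 masks; the 0.5 keep ratio below stands in for the claim's unspecified preset threshold.

```python
# Sketch of S7022: intersect a carried-over previous-frame instance mask
# with the current frame's semantic mask and keep the result only if enough
# of the instance survives. The 0.5 keep ratio is an assumed placeholder.

def compensate(prev_mask, semantic_mask, keep_ratio=0.5):
    inter = [a & b for a, b in zip(prev_mask, semantic_mask)]
    prev_area = sum(prev_mask)
    if prev_area and sum(inter) / prev_area >= keep_ratio:
        return inter  # carried over as a new instance to output
    return None       # too little overlap: do not compensate
```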
  10. The artificial intelligence-based foreground object barrage shielding prevention method of claim 9, wherein the matching calculation in S7021 is as follows: the centers of gravity of the two instances are calculated respectively, the calculation being, for each instance, to take all points that are foreground and compute the average of their x and y coordinates as the center of gravity; if the center of gravity does not fall on a foreground pixel of the instance, all foreground points are arranged into a one-dimensional vector ordered by horizontal coordinate from small to large and then by vertical coordinate from small to large, and the middle point is taken as the center of gravity; the Euclidean distance between the centers of gravity of the two instances is then calculated, and if the distance is smaller than 0.25 times the width and 0.25 times the height of the previous-frame instance's bounding box, the two instances are considered matched.
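The centre-of-gravity rule of claim 10, including the median-point fallback when the centroid does not land on a foreground pixel, can be sketched as follows; the 0.25 distance bound relative to the previous instance's bounding box is left to the caller, since the translated claim repeats itself at that point.

```python
# Sketch of the centre-of-gravity rule in claim 10, with the fallback to the
# median foreground point when the centroid misses the instance.

def center_of_gravity(points):
    """points: list of (x, y) foreground pixels of one instance."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    if (round(cx), round(cy)) in set(points):
        return cx, cy
    # fallback: middle point in (x, y) sort order, as the claim describes
    mid = sorted(points)[len(points) // 2]
    return float(mid[0]), float(mid[1])
```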

Description

Artificial intelligence-based foreground object barrage shielding prevention method

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to an artificial intelligence-based foreground object barrage shielding prevention method.

Background

Bullet screen ("barrage") technology originated on Japanese video websites, which allow users to post scrolling subtitle comments that appear as a barrage over the video frame as the video plays. The technology was soon introduced into China, where numerous video platforms have further driven the development of barrage culture. The barrage not only enables real-time interaction between the audience and the work, but also promotes the audience's secondary creation on the screen, and has become an indispensable part of current film and television popular culture. However, barrage technology also has drawbacks. A large number of barrage comments can affect the user's viewing experience; in particular, when they are excessive they obscure the picture and interfere with viewing the video content. The invention aims to solve the shielding problem caused by excessive barrages during video watching: when barrages block important visual content in the video, such as people's faces, specific targets or key information, the user's viewing experience suffers. By developing intelligent barrage anti-shielding technology, the barrage can be made to accurately "walk around" the core region of the video, achieving the dual advantages of immersive viewing and barrage interaction and improving user satisfaction and participation. Anti-shielding barrage technology is an application of masking technology. Masking refers to creating a transparent "mask" that can be overlaid on a particular area of a video frame; in this way, the barrage can be restricted to display only in areas outside the mask, avoiding the obscuring of important video content. Existing anti-shielding barrage technology has the following problems: (1) during automatic detection and mask generation, false recognition may occur, so that areas that do not need to be protected are masked by mistake or areas that do need protection are not handled correctly; (2) no intelligent judgment is made on the detected targets, so many targets that do not need processing are masked, degrading the barrage viewing experience; and (3) the mask is detected and generated automatically from a single frame, without temporal semantic continuity between frames, so incomplete detection easily causes the barrage to flicker.

Disclosure of Invention

In view of these problems, the invention provides an artificial intelligence-based foreground object barrage shielding prevention method which, by combining depth information, semantic information and instance information, ensures from multiple dimensions that the target mask is tracked throughout, from the target's appearance to its disappearance, guarantees the continuity of the mask, prevents the barrage from flickering in and out, and improves the user's barrage viewing experience.
In order to solve the above technical problems, the invention adopts the following technical scheme. A foreground object barrage shielding prevention method based on artificial intelligence comprises the following steps: S10, sampling an original video at a preset frequency to obtain video stream images, wherein the original video is a video whose foreground human figures need to be protected from barrage shielding; S20, processing the video stream with a shot switching detection algorithm to judge whether a shot switch occurs in the current video frame, and resetting the post-processing module cache if a shot switch occurs in the current frame, wherein the post-processing module cache comprises target tracking information, the main target depth information of the previous frame, and whether the previous frame produced a mask output; S30, processing the current image with a face detector to obtain target face information of the current image; S40, processing the current image with an instance segmenter to obtain target instance segmentation information of the current image; S50, processing the current image with a semantic segmenter to obtain human semantic segmentation information of the targets in the current image, wherein the human semantic segmentation information is a single-channel feature map with the same width and height as the original image, and the value at each pixel represents the probability that the pixel belongs to a human body; S60, processing the current