
CN-122027848-A - Live broadcast segment editing and recommending method and system based on cross-modal understanding


Abstract

The invention relates to a live segment editing and recommending method and system based on cross-modal understanding, and belongs to the field of video editing. The method comprises: collecting original live stream data to obtain a feature vector sequence; constructing an interest hypergraph based on the feature vector sequence to obtain an activation intensity vector α(t); performing cross-modal gradient calibration based on the feature vector sequence to obtain aligned new feature vector sequences F'_a, F'_v and F'_x; performing ruminant attention suppression based on the new feature vector sequences F'_a, F'_v and F'_x to obtain a fusion feature H(t); performing ambiguous highlight determination based on the activation intensity vector α(t) and the fusion feature H(t) to obtain a series of highlight moments, each highlight moment being accompanied by its dominant interest dimension k* and a score for each dimension; and carrying out multi-version intelligent editing and personalized recommendation based on the highlight moments. The invention provides an intelligent method that understands the ambiguity of live content, aligns modalities in time, restrains modal bias and generates multi-version clips.

Inventors

  • MAO YU

Assignees

  • 上海点掌文化科技股份有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-03-25

Claims (8)

  1. A live segment editing and recommending method based on cross-modal understanding, characterized by comprising the following steps: Step S1, acquiring original live stream data to obtain a feature vector sequence, wherein the feature vector sequence comprises an audio feature stream F_a, a visual feature stream F_v and a text feature stream F_x; Step S2, constructing an interest hypergraph based on the feature vector sequence to obtain an activation intensity vector α(t); Step S3, performing cross-modal gradient calibration based on the feature vector sequence to obtain aligned new feature vector sequences F'_a, F'_v and F'_x; Step S4, performing ruminant attention suppression based on the new feature vector sequences F'_a, F'_v and F'_x to obtain a fusion feature H(t); Step S5, performing ambiguous highlight determination based on the activation intensity vector α(t) and the fusion feature H(t) to obtain a series of highlight moments, each highlight moment being accompanied by its dominant interest dimension k* and a score for each dimension; and Step S6, carrying out multi-version intelligent clipping and personalized recommendation based on the highlight moments.
  2. The live segment editing and recommending method based on cross-modal understanding according to claim 1, wherein the step S1 specifically comprises: accessing, in real time, the original live stream data of a live broadcast platform, which comprises an audio-video stream and a barrage stream, to obtain three groups of time sequences, namely an audio frame sequence, a video frame sequence and a barrage text set sequence; and converting the original live stream data into numerical vectors through pre-trained lightweight models to obtain three groups of feature vector sequences.
  3. The live segment editing and recommending method based on cross-modal understanding according to claim 1, wherein the construction of the interest hypergraph in step S2 is specifically: automatically learning from historical behavior data of users and defining K interest basis vectors; for each modality m, introducing a linear projection matrix W_m and a bias b_m, and defining a projection function φ_m(F_m(t)) = W_m·F_m(t) + b_m, wherein F_m(t) is the feature vector of modality m at moment t; for each interest dimension k and each modality m, introducing a learnable preference vector p_{k,m}; projecting the features of each modality into a common space through the projection function φ_m to obtain three projection vectors z_a(t), z_v(t) and z_x(t), multiplying each projection vector element-wise by the corresponding preference vector p_{k,m} to obtain three interest-aware feature vectors, and summing the interest-aware feature vectors to obtain a comprehensive feature vector c_k(t); taking the L2 norm of the comprehensive feature vector, e_k(t) = ‖c_k(t)‖₂; introducing a history term h_k(t), an exponentially decaying average representing the activation intensity of the interest dimension over a recent period; weighting and summing the current signal energy and the historical memory and compressing the result into the (0, 1) interval through a Sigmoid function to obtain the activation intensity, mathematically described as α_k(t) = Sigmoid(λ·e_k(t) + μ·h_k(t)), wherein λ and μ are learnable scalar parameters; and, for each moment t, obtaining the K-dimensional activation intensity vector α(t).
  4. The live segment editing and recommending method based on cross-modal understanding according to claim 3, wherein the cross-modal gradient calibration in step S3 is specifically: for each modality m, calculating its gradient g_m(t); for each pair of modalities m and n, obtaining a dynamic delay δ_{mn}(t) predicted from the current context by a small trained neural network, the neural network being trained to minimize an anchor loss function L_anchor = ((g_m(t) − μ_m)/σ_m − (g_n(t + δ_{mn}(t)) − μ_n)/σ_n)², wherein μ_m and σ_m are the mean and standard deviation of the gradient of modality m over the whole live broadcast, g_n(t + δ_{mn}(t)) is the gradient of modality n at moment t + δ_{mn}(t), and μ_n and σ_n are the mean and standard deviation of the gradient of modality n over the whole live broadcast; and resampling the feature sequence of each modality based on the dynamic delay δ_{mn}(t) to obtain the new feature vector sequences F'_a, F'_v and F'_x.
  5. The live segment editing and recommending method based on cross-modal understanding according to claim 4, wherein the ruminant attention suppression in step S4 is specifically: inputting the new feature vector sequences F'_a, F'_v and F'_x into a standard cross-modal attention layer, computing a first-round attention weight matrix A₁ = softmax(Q·Kᵀ/√d), and querying the keys and values of the other modalities with the query of each modality to obtain a first-round fusion feature H₁ = A₁·V; using the first-round attention weight matrix A₁ as a suppression template to obtain a second-round attention weight matrix A₂ and, simultaneously, a second-round fusion feature H₂, mathematically described as A₂ = softmax((Q·Kᵀ/√d) ⊙ (1 − sg(A₁))) and H₂ = A₂·V, wherein Q, K and V are respectively the query, key and value matrices obtained by linear transformation of the aligned multi-modal features, d is the feature dimension, ⊙ is element-wise multiplication, and sg(·) is the stop-gradient operation; and fusing the two rounds of output through a dynamic fusion formula to obtain the fusion feature H(t), mathematically described as H(t) = H₁ + tanh(γ/τ)·H₂, wherein γ is a learnable scalar parameter, τ is a temperature coefficient, and the gate tanh(γ/τ) amplifies the effective contribution of the second round.
  6. The live segment editing and recommending method based on cross-modal understanding according to claim 5, wherein the ambiguous highlight determination in step S5 is specifically: obtaining highlight scores based on the activation intensity vector α(t) and the fusion feature H(t), mathematically described as s(t) = α(t) ⊙ (W_h·H(t)) − ρ(t), wherein W_h is a learnable matrix mapping H(t) to a vector of the same dimension as α(t), and ρ(t) is a novelty penalty term; based on the highlight scores, obtaining a K-dimensional highlight score vector s(t) at each moment t; and selecting the moments at which s(t) exceeds a threshold as candidate highlight moments and performing non-maximum suppression to obtain a series of highlight moments, each highlight moment being accompanied by its dominant interest dimension k* and each dimension's score.
  7. The live segment editing and recommending method based on cross-modal understanding according to claim 6, wherein the multi-version intelligent clipping and personalized recommendation in step S6 is specifically: pre-learning a set of clipping parameters θ_k for each dominant interest dimension and generating a clipping preference vector r_k indicating the retention weights of audio, video and barrage in the clipped version; taking each highlight moment as the center, extending forwards and backwards by a basic duration, and performing weighted retention on the original multi-modal streams according to the clipping preference vector r_k, thereby generating a plurality of clipped versions, each clipped version carrying a corresponding interest tag; and carrying out personalized recommendation based on the clipped versions and the interest tags.
  8. A live segment editing and recommending system based on cross-modal understanding, characterized in that the system implements the live segment editing and recommending method based on cross-modal understanding according to any one of claims 1 to 7, and comprises a preprocessing module, an interest hypergraph construction module, a cross-modal gradient calibration module, a ruminant attention suppression module and an ambiguous highlight determination module; the preprocessing module is used for acquiring original live stream data to obtain a feature vector sequence, the feature vector sequence comprising an audio feature stream F_a, a visual feature stream F_v and a text feature stream F_x; the interest hypergraph construction module is used for constructing an interest hypergraph based on the feature vector sequence to obtain an activation intensity vector α(t); the cross-modal gradient calibration module is used for performing cross-modal gradient calibration based on the feature vector sequence to obtain aligned new feature vector sequences F'_a, F'_v and F'_x; the ruminant attention suppression module is used for performing ruminant attention suppression based on the new feature vector sequences F'_a, F'_v and F'_x to obtain a fusion feature H(t); and the ambiguous highlight determination module is used for performing ambiguous highlight determination based on the activation intensity vector α(t) and the fusion feature H(t) to obtain a series of highlight moments, each highlight moment being accompanied by its dominant interest dimension k*, and for carrying out multi-version intelligent editing and personalized recommendation based on the highlight moments.
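The two-round attention of claim 5 is the most distinctive mechanism in the method. The following is a minimal PyTorch sketch of one plausible reading of it, under stated assumptions: single-head attention, a single shared input sequence for Q, K and V, and the gating factor tanh(γ/τ) as reconstructed above. Function and parameter names are illustrative, not taken from the patent.

```python
# Illustrative sketch only, not the patented implementation.
import torch
import torch.nn.functional as F

def ruminant_attention(Q, K, V, gamma, tau):
    """Q, K, V: (T, d) tensors derived from the aligned multi-modal features."""
    d = Q.shape[-1]
    logits = Q @ K.transpose(-1, -2) / d ** 0.5   # scaled dot-product scores
    A1 = F.softmax(logits, dim=-1)                # first-round attention A1
    H1 = A1 @ V                                   # first-round fusion H1
    # Second round: suppress (element-wise) the positions round one already
    # emphasised; .detach() realises the stop-gradient sg(A1) in the claim.
    A2 = F.softmax(logits * (1.0 - A1.detach()), dim=-1)
    H2 = A2 @ V                                   # second-round fusion H2
    # Dynamic fusion: a tanh(gamma / tau) gate scales round two's contribution.
    return H1 + torch.tanh(gamma / tau) * H2

T, d = 16, 32
x = torch.randn(T, d)                             # stand-in aligned features
gamma = torch.nn.Parameter(torch.tensor(1.0))     # learnable scalar gamma
out = ruminant_attention(x, x, x, gamma, tau=0.5)
print(out.shape)                                  # torch.Size([16, 32])
```

Detaching A₁ means the suppression template shapes the second pass without itself receiving gradient, which is what keeps round two from simply re-learning round one's focus.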

Description

Live broadcast segment editing and recommending method and system based on cross-modal understanding

Technical Field

The invention belongs to the technical field of video editing, and particularly relates to a live segment editing and recommending method and system based on cross-modal understanding.

Background

With the rapid development of the live broadcast industry, the production of massive live content makes it difficult for users to find interesting highlights in a short time. Traditional video editing methods rely on manual work or simple rules (such as barrage density and volume peaks) for highlight extraction, and can hardly capture the semantic richness and ambiguity of the content. In recent years, automatic editing methods based on multi-modal understanding have increasingly emerged, identifying highlight clips by fusing audio, visual and text information. However, the existing methods still have the following defects: firstly, the interest differences of different audience groups cannot be effectively modeled, so the clipping result is single; secondly, the problem of temporal asynchrony between modalities is not properly handled, which affects the fusion effect; thirdly, the models easily over-rely on modalities with strong signals and ignore useful information in weak modalities; and fourthly, the capability of generating multiple versions of the same highlight moment is lacking, so personalized recommendation requirements cannot be met.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a live segment editing and recommending method and system based on cross-modal understanding. The aim of the invention is achieved by the following technical scheme.

A live segment editing and recommending method based on cross-modal understanding comprises the following steps: Step S1, acquiring original live stream data to obtain a feature vector sequence, wherein the feature vector sequence comprises an audio feature stream F_a, a visual feature stream F_v and a text feature stream F_x; Step S2, constructing an interest hypergraph based on the feature vector sequence to obtain an activation intensity vector α(t); Step S3, performing cross-modal gradient calibration based on the feature vector sequence to obtain aligned new feature vector sequences F'_a, F'_v and F'_x; Step S4, performing ruminant attention suppression based on the new feature vector sequences F'_a, F'_v and F'_x to obtain a fusion feature H(t); Step S5, performing ambiguous highlight determination based on the activation intensity vector α(t) and the fusion feature H(t) to obtain a series of highlight moments, each highlight moment being accompanied by its dominant interest dimension k* and a score for each dimension; and Step S6, carrying out multi-version intelligent clipping and personalized recommendation based on the highlight moments.

Preferably, step S1 specifically comprises: accessing, in real time, the original live stream data of the live broadcast platform, which comprises an audio-video stream and a barrage stream, to obtain three groups of time sequences, namely an audio frame sequence, a video frame sequence and a barrage text set sequence; and converting the original live stream data into numerical vectors through pre-trained lightweight models to obtain three groups of feature vector sequences, as sketched below.
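By way of illustration only, the following Python sketch shows the shape of the step S1 pipeline: per-second windows of each stream are encoded into fixed-dimension vectors and stacked into three time-aligned feature sequences. The encode_* functions are hypothetical stand-ins (here returning random vectors), not the patent's lightweight pretrained models.

```python
# Illustrative sketch of step S1 only; encode_* are hypothetical stand-ins
# for the "pre-trained lightweight models" named in the text.
import numpy as np

D = 128  # assumed shared feature dimension

def encode_audio(wav_window: np.ndarray) -> np.ndarray:
    return np.random.randn(D)  # stand-in for a compact audio encoder

def encode_frame(frame: np.ndarray) -> np.ndarray:
    return np.random.randn(D)  # stand-in for a compact image encoder

def encode_text(danmaku: list[str]) -> np.ndarray:
    return np.random.randn(D)  # stand-in for a compact text encoder

def preprocess(audio_windows, frames, danmaku_sets):
    """Turn time-aligned raw windows into feature sequences F_a, F_v, F_x,
    each of shape (T, D), one row per time step."""
    F_a = np.stack([encode_audio(w) for w in audio_windows])
    F_v = np.stack([encode_frame(f) for f in frames])
    F_x = np.stack([encode_text(s) for s in danmaku_sets])
    return F_a, F_v, F_x

# Toy usage: T = 5 one-second windows.
T = 5
F_a, F_v, F_x = preprocess(
    [np.zeros(16000) for _ in range(T)],        # 1 s of 16 kHz audio per window
    [np.zeros((64, 64, 3)) for _ in range(T)],  # one sampled frame per window
    [["example danmaku"] for _ in range(T)],    # barrage texts per window
)
print(F_a.shape, F_v.shape, F_x.shape)          # (5, 128) each
```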
Preferably, the construction of the interest hypergraph in step S2 specifically comprises: automatically learning from historical behavior data of users and defining K interest basis vectors; for each modality m, introducing a linear projection matrix W_m and a bias b_m, and defining a projection function φ_m(F_m(t)) = W_m·F_m(t) + b_m, wherein F_m(t) is the feature vector of modality m at moment t; for each interest dimension k and each modality m, introducing a learnable preference vector p_{k,m}; projecting the features of each modality into a common space through the projection function φ_m to obtain three projection vectors z_a(t), z_v(t) and z_x(t), multiplying each projection vector element-wise by the corresponding preference vector p_{k,m} to obtain three interest-aware feature vectors, and summing the interest-aware feature vectors to obtain a comprehensive feature vector c_k(t); taking the L2 norm of the comprehensive feature vector, e_k(t) = ‖c_k(t)‖₂; introducing a history term h_k(t), an exponentially decaying average representing the activation intensity of the interest dimension over a recent period; weighting and summing the current signal energy and the historical memory and compressing the result into the (0, 1) interval through a Sigmoid function to obtain the activation intensity, mathematically described as α_k(t) = Sigmoid(λ·e_k(t) + μ·h_k(t)), wherein λ and μ are learnable scalar parameters; and, for each moment t, obtaining the K-dimensional activation intensity vector α(t) (a minimal sketch of this computation is given after this passage). Preferably, the cross-modal gradient calibration in step S3 specifically comprises: for each modality m, calculating its gradient g_m(t); for each pair of modalities m and n, a dynamic delay is obtained
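To make the step S2 notation concrete, the following numpy sketch computes one time step of the activation intensity α(t) under the reconstructed formulas. All matrices, preference vectors and scalars are randomly initialised stand-ins rather than trained parameters, and the names (W, b, p, lam, mu, decay) are illustrative.

```python
# Illustrative sketch of the step S2 activation-intensity computation.
import numpy as np

rng = np.random.default_rng(0)
K, D, P = 4, 128, 64            # interest dims, feature dim, projection dim
mods = ["a", "v", "x"]          # audio, visual, text modalities

W = {m: rng.standard_normal((P, D)) * 0.05 for m in mods}   # projections W_m
b = {m: np.zeros(P) for m in mods}                          # biases b_m
p = {(k, m): rng.standard_normal(P) for k in range(K) for m in mods}
lam, mu, decay = 1.0, 0.5, 0.9  # stand-ins for learnable scalars / EMA decay

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def activation_step(F_t: dict, h: np.ndarray):
    """One moment t: F_t maps modality -> feature vector F_m(t); h holds the
    K-dimensional exponentially decaying history h_k(t). Returns (alpha, h)."""
    z = {m: W[m] @ F_t[m] + b[m] for m in mods}   # phi_m: project to common space
    alpha = np.empty(K)
    for k in range(K):
        c = sum(z[m] * p[(k, m)] for m in mods)   # interest-aware sum c_k(t)
        e = np.linalg.norm(c)                     # signal energy e_k(t)
        alpha[k] = sigmoid(lam * e + mu * h[k])   # alpha_k(t) in (0, 1)
    h = decay * h + (1.0 - decay) * alpha         # update the EMA history
    return alpha, h

h = np.zeros(K)
F_t = {m: rng.standard_normal(D) for m in mods}
alpha, h = activation_step(F_t, h)
print(alpha)  # K activation intensities, one per interest dimension
```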