
CN-122002030-A - Video processing method, device, electronic equipment, storage medium and product

CN 122002030 A

Abstract

The application relates to a video processing method, a video processing apparatus, an electronic device, a storage medium, and a program product. The method comprises: for each target video in a target video set, extracting a global visual feature and a plurality of local visual features corresponding to each target frame in the target video, to obtain a global feature group and a first local feature group corresponding to each target video; for each target video, compressing the first local feature group under the guidance of the global feature group to obtain a second local feature group; compressing a candidate global feature set to obtain a target global feature set, and compressing a candidate local feature set under the guidance of the candidate global feature set to obtain a target local feature set; and obtaining visual characterization features corresponding to the target video set based on the target global feature set and the target local feature set. The application thereby provides a more adaptive scheme for obtaining the features that correspond to a video set.

Inventors

  • CHEN SHIZHE

Assignees

  • Tencent Technology (Shenzhen) Co., Ltd.

Dates

Publication Date
2026-05-08
Application Date
2024-11-04

Claims (12)

  1. A method of video processing, the method comprising: for each target video in a target video set, extracting a global visual feature and a plurality of local visual features corresponding to each target frame in the target video, to obtain a global feature group and a first local feature group corresponding to each target video, wherein the target video set comprises a plurality of target videos having association relations; for each target video, compressing the first local feature group under the guidance of the global feature group to obtain a second local feature group; compressing a candidate global feature set to obtain a target global feature set, and compressing a candidate local feature set under the guidance of the candidate global feature set to obtain a target local feature set, wherein the candidate local feature set is composed of the second local feature groups respectively corresponding to the target videos, and the candidate global feature set is composed of the global feature groups respectively corresponding to the target videos; and obtaining visual characterization features corresponding to the target video set based on the target global feature set and the target local feature set.
  2. The method according to claim 1, wherein compressing the first local feature group under the guidance of the global feature group to obtain the second local feature group comprises: performing self-attention processing on the global feature group and at least one first preset feature to obtain a first guide feature group; and taking the first guide feature group as a first query vector, determining, for each first key vector corresponding to the first local feature group, a first attention weight based on the first key vector and the first query vector, and updating a first value vector corresponding to the first key vector based on the first attention weight, so as to obtain the second local feature group corresponding to the first local feature group (an illustrative sketch of this query-guided attention pattern is given after the claims).
  3. The method according to claim 1 or 2, wherein compressing the candidate local feature set under the guidance of the candidate global feature set to obtain the target local feature set comprises: performing self-attention processing on the candidate global feature set and at least one second preset feature to obtain a second guide feature set; and taking the second guide feature set as a second query vector, determining, for each second key vector corresponding to the candidate local feature set, a second attention weight based on the second key vector and the second query vector, and updating a second value vector corresponding to the second key vector based on the second attention weight, so as to obtain the target local feature set corresponding to the candidate local feature set.
  4. The method according to claim 1 or 2, wherein compressing the candidate global feature set to obtain the target global feature set comprises: taking at least one third preset feature as a third query vector, determining, for each third key vector corresponding to the candidate global feature set, a third attention weight based on the third key vector and the third query vector, and updating a third value vector corresponding to the third key vector based on the third attention weight, so as to obtain the target global feature set corresponding to the candidate global feature set.
  5. The method of claim 1, wherein the method is implemented by a video processing model comprising a feature extraction network, a first feature compression network, a second feature compression network, and a feature fusion network; the feature extraction network is configured to, for each target video in a target video set, extract a global visual feature and a plurality of local visual features corresponding to each target frame in the target video, to obtain a global feature group and a first local feature group corresponding to each target video, wherein the target video set comprises a plurality of target videos having association relations; the first feature compression network is configured to, for each target video, compress the first local feature group under the guidance of the global feature group to obtain a second local feature group; the second feature compression network is configured to compress a candidate global feature set to obtain a target global feature set, and to compress a candidate local feature set under the guidance of the candidate global feature set to obtain a target local feature set, wherein the candidate local feature set is composed of the second local feature groups respectively corresponding to the target videos, and the candidate global feature set is composed of the global feature groups respectively corresponding to the target videos; and the feature fusion network is configured to obtain visual characterization features corresponding to the target video set based on the target global feature set and the target local feature set.
  6. The method according to claim 1 or 5, wherein after obtaining the visual characterization features corresponding to the target video set based on the target global feature set and the target local feature set, the method further comprises: taking a preset prompt text, the visual characterization features corresponding to the target video set, and a text characterization feature as inputs, and performing content tag prediction on the target video set by using a large language model.
  7. The method of claim 6, wherein a target prediction model comprises the video processing model and the large language model, the target prediction model being obtained by training based on a plurality of sample video sets; during the training, parameters of each of the feature extraction network, the feature fusion network, and the large language model are fixed, and parameters of each of a first candidate network and a second candidate network are adjusted based on a preset difference, wherein the preset difference characterizes a difference between a content tag prediction result and a content tag labeling condition of the sample video sets, the first feature compression network indicates the first candidate network after parameter adjustment, and the second feature compression network indicates the second candidate network after parameter adjustment (an illustrative training sketch is given after the claims).
  8. The method of claim 1, wherein before extracting the global visual feature and the plurality of local visual features corresponding to each target frame in the target video, the method further comprises: extracting a preset number of target frames from the target video under the constraint of a preset interval, wherein the preset interval is determined based on prediction effect information and a plurality of candidate intervals, the plurality of candidate intervals correspond to a plurality of reference videos, each candidate interval is a constraint for extracting a plurality of reference video frames from the corresponding reference video, the prediction effect information is determined based on the effect of the visual feature groups respectively corresponding to the plurality of reference videos when participating in content tag prediction, and each visual feature group is formed by the visual features respectively corresponding to the plurality of reference video frames (a simple sampling sketch is given after the claims).
  9. A video processing apparatus, the apparatus comprising: a feature extraction module, configured to, for each target video in a target video set, extract a global visual feature and a plurality of local visual features corresponding to each target frame in the target video, to obtain a global feature group and a first local feature group corresponding to each target video, wherein the target video set comprises a plurality of target videos having association relations; a first feature compression module, configured to, for each target video, compress the first local feature group under the guidance of the global feature group to obtain a second local feature group; a second feature compression module, configured to compress a candidate global feature set to obtain a target global feature set, and to compress a candidate local feature set under the guidance of the candidate global feature set to obtain a target local feature set, wherein the candidate local feature set is composed of the second local feature groups respectively corresponding to the target videos, and the candidate global feature set is composed of the global feature groups respectively corresponding to the target videos; and a feature fusion module, configured to obtain visual characterization features corresponding to the target video set based on the target global feature set and the target local feature set.
  10. An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores at least one instruction or at least one program that is loaded and executed by the at least one processor to implement the video processing method of any one of claims 1 to 8.
  11. A computer readable storage medium having stored therein at least one instruction or at least one program that is loaded and executed by a processor to implement the video processing method of any one of claims 1 to 8.
  12. A computer program product comprising at least one instruction or at least one program that is loaded and executed by a processor to implement the video processing method of any one of claims 1 to 8.
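
The compression in claims 2 and 3 follows a query-guided attention pattern: a small set of learnable preset features is first refined together with the guiding features by self-attention, and the refined preset slots then act as queries whose keys and values are the features being compressed, so the output length equals the number of preset queries (claim 4 uses the preset features directly as queries, without the self-attention step). The following PyTorch sketch is purely illustrative; the class name GuidedCompressor, the dimensions, and the use of nn.MultiheadAttention are assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class GuidedCompressor(nn.Module):
    """Illustrative query-guided compression in the spirit of claims 2-4.

    Learnable "preset" features are refined together with the guiding features
    by self-attention; the refined preset slots then serve as queries over the
    features to be compressed, so the output length equals the number of
    preset queries regardless of the input length.
    """

    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.preset = nn.Parameter(torch.randn(num_queries, dim))   # preset features (learnable)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, guide_feats: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # guide_feats: (B, G, dim), e.g. the global feature group of a video
        # feats:       (B, N, dim), e.g. the first local feature group, with N >> num_queries
        batch = guide_feats.size(0)
        preset = self.preset.unsqueeze(0).expand(batch, -1, -1)     # (B, Q, dim)
        mixed = torch.cat([guide_feats, preset], dim=1)             # guiding + preset features
        mixed, _ = self.self_attn(mixed, mixed, mixed)              # self-attention -> guide feature group
        queries = mixed[:, -preset.size(1):]                        # keep the refined preset slots
        compressed, _ = self.cross_attn(queries, feats, feats)      # weights from query/key, update values
        return compressed                                           # (B, Q, dim)

# Usage: compress per-video local features (32 frames x 196 patches) into 16 tokens per video.
compressor = GuidedCompressor(dim=768, num_queries=16)
global_group = torch.randn(4, 32, 768)            # one global visual feature per target frame
first_local_group = torch.randn(4, 32 * 196, 768)
second_local_group = compressor(global_group, first_local_group)   # shape: (4, 16, 768)
```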
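Claim 7 describes a parameter-efficient training scheme: the feature extraction network, the feature fusion network, and the large language model stay frozen, and only the two compression networks are adjusted against the preset difference, i.e. a loss between predicted content tags and the tag labels of the sample video sets. A minimal PyTorch-style sketch of that freezing and optimization split is shown below; the tiny placeholder modules and the cross-entropy loss are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the networks named in claim 7 (the real ones are far larger).
feature_extractor    = nn.Linear(768, 768)
feature_fusion       = nn.Linear(768, 768)
large_language_model = nn.Linear(768, 1000)   # stands in for the content-tag prediction of the LLM
first_compressor     = nn.Linear(768, 768)    # first candidate network
second_compressor    = nn.Linear(768, 768)    # second candidate network

# Fix the parameters of extraction, fusion and the LLM; only the compressors are adjusted.
for frozen in (feature_extractor, feature_fusion, large_language_model):
    for p in frozen.parameters():
        p.requires_grad = False

trainable = list(first_compressor.parameters()) + list(second_compressor.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# One illustrative step on a "sample video set"; the preset difference is modelled
# here as a cross-entropy loss between predicted and labelled content tags.
features   = torch.randn(8, 768)              # stand-in features of a sample video set
tag_labels = torch.randint(0, 1000, (8,))
logits = large_language_model(
    feature_fusion(second_compressor(first_compressor(feature_extractor(features))))
)
loss = nn.functional.cross_entropy(logits, tag_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```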
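Claim 8 selects the preset interval from several candidate intervals according to how well features sampled at each candidate interval perform in content tag prediction on reference videos, and then extracts a preset number of target frames under that interval constraint. A small sketch of that selection and sampling logic follows; all helper names and numeric values are assumed.

```python
# All helper names and numeric values below are assumed for illustration.

def choose_preset_interval(prediction_effect: dict[int, float]) -> int:
    # prediction_effect maps each candidate interval to the content-tag prediction
    # effect obtained with reference videos sampled at that interval.
    return max(prediction_effect, key=prediction_effect.get)

def sample_target_frames(total_frames: int, preset_number: int, preset_interval: int) -> list[int]:
    # Extract at most `preset_number` frame indices under the `preset_interval` constraint.
    return list(range(0, total_frames, preset_interval))[:preset_number]

preset_interval = choose_preset_interval({15: 0.71, 30: 0.78, 60: 0.74})   # assumed scores
print(sample_target_frames(total_frames=1800, preset_number=32, preset_interval=preset_interval))
```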

Description

Video processing method, device, electronic equipment, storage medium and product

Technical Field

The present application relates to the field of internet communications technologies, and in particular, to a video processing method, a video processing apparatus, an electronic device, a storage medium, and a program product.

Background

Video is an important carrier for delivering information. In the related art, a single video is taken as the object to be processed: a plurality of video frames are extracted from the video, feature extraction is performed on each video frame, and the features corresponding to those video frames are taken as the features corresponding to the video. However, when the object to be processed is a video set containing a plurality of videos, obtaining the features of each video in this manner and taking them together as the features of the video set results in a large data volume, which affects the efficiency and effectiveness of any further processing of the video set based on those features. Accordingly, there is a need for a more adaptive scheme for obtaining the features corresponding to a video set.

Disclosure of Invention

In order to solve at least one of the technical problems set forth above, the present application provides a video processing method, apparatus, electronic device, storage medium, and product.

According to a first aspect of the present application, there is provided a video processing method, the method comprising: for each target video in a target video set, extracting a global visual feature and a plurality of local visual features corresponding to each target frame in the target video, to obtain a global feature group and a first local feature group corresponding to each target video, wherein the target video set comprises a plurality of target videos having association relations; for each target video, compressing the first local feature group under the guidance of the global feature group to obtain a second local feature group; compressing a candidate global feature set to obtain a target global feature set, and compressing a candidate local feature set under the guidance of the candidate global feature set to obtain a target local feature set, wherein the candidate local feature set is composed of the second local feature groups respectively corresponding to the target videos, and the candidate global feature set is composed of the global feature groups respectively corresponding to the target videos; and obtaining visual characterization features corresponding to the target video set based on the target global feature set and the target local feature set.
According to a second aspect of the present application, there is provided a video processing apparatus, the apparatus comprising: a feature extraction module, configured to, for each target video in a target video set, extract a global visual feature and a plurality of local visual features corresponding to each target frame in the target video, to obtain a global feature group and a first local feature group corresponding to each target video, wherein the target video set comprises a plurality of target videos having association relations; a first feature compression module, configured to, for each target video, compress the first local feature group under the guidance of the global feature group to obtain a second local feature group; a second feature compression module, configured to compress a candidate global feature set to obtain a target global feature set, and to compress a candidate local feature set under the guidance of the candidate global feature set to obtain a target local feature set, wherein the candidate local feature set is composed of the second local feature groups respectively corresponding to the target videos, and the candidate global feature set is composed of the global feature groups respectively corresponding to the target videos; and a feature fusion module, configured to obtain visual characterization features corresponding to the target video set based on the target global feature set and the target local feature set.

According to a third aspect of the present application, there is provided an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores at least one instruction or at least one program that is loaded and executed by the at least one processor to implement the video processing method as described in the first aspect.

According to a fourth aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction or at least one program that is loaded and executed by a processor to implement the video processing method as described in the first aspect.

According to a fifth aspect of the present application, there is provided a computer program product comprising at least one instruction or at least one program that is loaded and executed by a processor to implement the video processing method as described in the first aspect.
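
To make the data flow of the method of the first aspect concrete, the sketch below traces tensor shapes through the four stages: per-frame extraction, per-video compression guided by the global feature group, set-level compression guided by the candidate global feature set, and fusion. The compression steps are reduced to trivial placeholders so that only the shapes are meaningful; all dimensions (number of videos, frames, patches, channels, compressed tokens) are assumptions for illustration and do not come from the patent.

```python
import torch

V, T, P, D = 4, 32, 196, 768     # videos per set, frames per video, patches per frame, channels
Q_video, Q_set = 16, 32          # compressed tokens per video / per video set (assumed)

# Stage 1: per-frame extraction gives one global feature and P local features per target frame.
global_group = torch.randn(V, T, D)               # global feature group (per video)
first_local_group = torch.randn(V, T * P, D)      # first local feature group (flattened patches)

# Stage 2: per-video compression guided by the global group (placeholder: keep Q_video tokens).
second_local_group = first_local_group[:, :Q_video]              # (V, Q_video, D)

# Stage 3: set-level compression across all videos in the target video set.
candidate_global_set = global_group.reshape(V * T, D)            # candidate global feature set
candidate_local_set = second_local_group.reshape(V * Q_video, D) # candidate local feature set
target_global_set = candidate_global_set[:Q_set]                 # (Q_set, D)
target_local_set = candidate_local_set[:Q_set]                   # (Q_set, D), guided in the patent

# Stage 4: fusion yields the visual characterization features for the whole video set.
visual_characterization = torch.cat([target_global_set, target_local_set], dim=0)
print(visual_characterization.shape)                             # torch.Size([64, 768])
```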