CN-118214918-B - Method, apparatus, electronic device and computer program product for video processing
Abstract
Embodiments of the present disclosure relate to methods, apparatuses, electronic devices, and computer program products for video processing. The method includes acquiring a video and a prompt text. Further, the method includes generating an output indicated by the prompt text based on the video and the prompt text, wherein a plurality of visual markers of each image frame of the video are compressed by a gating process. Thus, according to the scheme of the embodiments of the disclosure, the visual markers of the image frames in the video can be compressed through the gating process, retaining the visual markers that are highly relevant and provide effective information. The number of visual markers is thereby reduced, the consumption of computing resources is reduced, and a balance is struck between the number of visual markers and the video processing effect, so that the model can process a large number of video image frames while improving the effect of the video processing.
Inventors
- WANG HAN
- WANG YANJIE
- YE YONGJIE
- NIE YUXIANG
- HUANG CAN
Assignees
- Douyin Vision Co., Ltd. (抖音视界有限公司)
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-03-15
Claims (12)
- 1. A method of video processing, comprising: acquiring a video and a prompt text; and generating an output indicated by the prompt text based on the video and the prompt text; wherein the prompt text specifies an object in each image frame of the video, and generating the output indicated by the prompt text comprises: compressing a first number of visual markers of each image frame of the video by a gating process to generate a second number of visual markers, the second number being less than the first number; and combining a plurality of text markers of the prompt text with the second number of visual markers to generate the output indicated by the prompt text associated with the object.
- 2. The method of claim 1, further comprising: adding a timestamp of the image frame to the second number of visual markers.
- 3. The method of claim 1, wherein generating the second number of visual markers comprises: generating a visual marker score for each of the first number of visual markers by the gating process; and selecting the second number of visual markers from the first number of visual markers based on the visual marker scores.
- 4. The method of claim 1, wherein the prompt text indicates at least one of: generating coordinates of the object in the video; and generating a description of the object at a given coordinate in the video.
- 5. The method of claim 1, wherein the output is generated via a video processing model, and the method further comprises: acquiring a first training set comprising image data; pre-training a compression module for the compression process in the video processing model based on the first training set; and pre-training the video processing model based on the first training set.
- 6. The method of claim 5, further comprising: acquiring a second training set comprising video-level data; generating a third training set comprising object-level data; and adjusting model parameters of the video processing model based on the second training set and the third training set.
- 7. The method of claim 6, wherein generating the third training set comprising the object-level data comprises: acquiring a training video and a video description corresponding to the training video; generating predicted positions of an object in a plurality of designated image frames of the training video based on the training video and the video description; and generating a predicted position of the object in each image frame of the training video based on the predicted positions of the object in the plurality of designated image frames.
- 8. The method of claim 7, wherein generating the predicted positions of the object in the plurality of designated image frames of the training video comprises: generating a text block corresponding to the object based on the video description; and determining the predicted positions of the object in the plurality of designated image frames of the training video using a positioning model based on the text block and the training video.
- 9. The method of claim 7, wherein generating the predicted position of the object in each image frame of the training video comprises: generating a position template based on the predicted position of the object in a first image frame of the training video; and generating the predicted position of the object in each image frame using a tracking model based on the position template and the predicted positions of the object in the plurality of designated image frames.
- 10. An apparatus for video processing, comprising: a text and video acquisition module configured to acquire a video and a prompt text; and an indication output generation module configured to generate an output indicated by the prompt text based on the video and the prompt text; wherein the prompt text specifies an object in each image frame of the video, and the indication output generation module is further configured to: compress a first number of visual markers of each image frame of the video by a gating process to generate a second number of visual markers, the second number being less than the first number; and combine a plurality of text markers of the prompt text with the second number of visual markers to generate the output indicated by the prompt text associated with the object.
- 11. An electronic device, comprising: a processor; and a memory coupled to the processor, the memory having instructions stored therein which, when executed by the processor, cause the electronic device to perform the method of any of claims 1-9.
- 12. A computer program product tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1 to 9.
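Claims 1 to 3 describe compressing the "first number" of per-frame visual markers to a smaller "second number" by scoring each marker with a gating process and selecting the highest-scoring ones, then concatenating the result with the text markers of the prompt text. A minimal sketch of that top-k selection, assuming a simple dot-product gate; the patent does not disclose the gate's exact form, so `gate_weights` and the function names here are hypothetical:

```python
def gate_compress(frame_tokens, gate_weights, keep):
    """Score each visual marker of one frame and keep the top `keep` of them.

    frame_tokens: list of token vectors (the "first number" of visual markers).
    gate_weights: weights of a hypothetical learned gating head (assumption:
                  a dot-product gate stands in for the patent's gating process).
    keep:         the "second number" of markers to retain (keep < len(frame_tokens)).
    """
    # One gating score per visual marker (claim 3).
    scores = [sum(t * w for t, w in zip(tok, gate_weights)) for tok in frame_tokens]
    ranked = sorted(range(len(frame_tokens)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:keep])  # keep the markers' original spatial order
    return [frame_tokens[i] for i in selected]

def build_model_input(text_tokens, video_frames, gate_weights, keep):
    """Compress every frame, then combine visual markers with the prompt's
    text markers (the combining step of claim 1)."""
    visual = [tok for frame in video_frames
              for tok in gate_compress(frame, gate_weights, keep)]
    return visual + text_tokens
```

Because only `keep` markers per frame survive, the input sequence grows linearly in `keep` rather than in the full per-frame marker count, which is the resource saving the abstract describes.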
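Claims 7 to 9 describe building object-level training data by predicting object positions only in a few designated image frames (via a positioning model) and then propagating those positions to every frame (via a tracking model). A minimal sketch of the propagation step, using linear interpolation between keyframe boxes as a stand-in for the tracking model of claim 9; the `interpolate_boxes` helper and the `(x1, y1, x2, y2)` box format are assumptions, not the patent's disclosed implementation:

```python
def interpolate_boxes(keyframe_boxes, num_frames):
    """Propagate object boxes from designated keyframes to every frame.

    keyframe_boxes: {frame_index: (x1, y1, x2, y2)} predicted positions in the
                    designated frames (claim 8); here assumed given.
    num_frames:     total frame count of the training video.
    """
    keys = sorted(keyframe_boxes)
    boxes = []
    for f in range(num_frames):
        if f <= keys[0]:
            boxes.append(keyframe_boxes[keys[0]])      # before first keyframe
        elif f >= keys[-1]:
            boxes.append(keyframe_boxes[keys[-1]])     # after last keyframe
        else:
            lo = max(k for k in keys if k <= f)         # nearest earlier keyframe
            hi = min(k for k in keys if k >= f)         # nearest later keyframe
            if lo == hi:
                boxes.append(keyframe_boxes[lo])        # f is itself a keyframe
            else:
                t = (f - lo) / (hi - lo)                # blend the two boxes
                a, b = keyframe_boxes[lo], keyframe_boxes[hi]
                boxes.append(tuple(round((1 - t) * x + t * y, 3)
                                   for x, y in zip(a, b)))
    return boxes
```

In practice a dedicated tracking model would replace the interpolation, but the shape of the pipeline is the same: sparse keyframe predictions in, a per-frame position label for every image frame out.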
Description
Method, apparatus, electronic device and computer program product for video processing

Technical Field

The present application relates to the field of computer technology, and in particular, to a method, an apparatus, an electronic device, and a computer program product for video processing.

Background

With the rapid development of digitization, video processing tasks are becoming increasingly critical, and the processing of various kinds of video content is involved in both business and personal life. Existing video processing tasks are divided into three types according to processing granularity: video-level tasks, which mainly focus on capturing global information in a video; frame-level tasks, which mainly distinguish and analyze the image frames of a video and emphasize temporal awareness among image frames; and object-level tasks, which require a model to locate objects in each image frame and to effectively distinguish and track those objects over time. With the continued advancement of video processing technology, processing video content at a finer granularity is becoming increasingly important. With the development of technology and the diversification of application scenarios, finer-grained video processing has become an important trend in the video technology field and is significant for promoting development and innovation in the field of video processing.

Disclosure of Invention

Embodiments of the present disclosure provide a method, apparatus, electronic device, computer program product, and medium for video processing. According to a first aspect of the present disclosure, a method of video processing is provided. The method includes acquiring a video and a prompt text.
Further, the method includes generating an output indicated by the prompt text based on the video and the prompt text, wherein a plurality of visual markers of each image frame of the video are compressed by a gating process. According to a second aspect of the present disclosure, an apparatus for video processing is provided. The apparatus includes a text and video acquisition module configured to acquire a video and a prompt text. The apparatus further includes an indication output generation module configured to generate an output indicated by the prompt text based on the video and the prompt text, wherein a plurality of visual markers of each image frame of the video are compressed by a gating process. According to a third aspect of the present disclosure, an electronic device is provided. The electronic device comprises a processor and a memory coupled to the processor, the memory having instructions stored therein that, when executed by the processor, cause the electronic device to perform the method according to the first aspect. In a fourth aspect of the present disclosure, a computer program product is provided. The computer program product is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions that, when executed, cause a computer to perform the steps of the method of the first aspect of the present disclosure. In a fifth aspect of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium has stored thereon one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement the method according to the first aspect. This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description.
This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Drawings

The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which like or similar reference numerals denote like or similar elements:

FIG. 1 illustrates a schematic diagram of an example environment in which devices and/or methods according to embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a flow chart of a method of video processing according to an embodiment of the present disclosure;
FIG. 3A illustrates a schematic diagram of a process of performing video processing tasks according to an embodiment of the present disclosure;
FIG. 3B illustrates a schematic diagram of an implementation of a compression module according to an embodiment of the present disclosure;
FIG. 3C illustrates a schematic diagram of a prompt text and corresponding visual output according to embodiments of the present disclosure;
FIG. 4 illustrates a schematic diagram of a process of constructing training data for object-level video p