CN-121985164-A - Video processing method and system based on audio-video synchronization and electronic equipment

CN121985164A

Abstract

The application relates to a video processing method based on audio-video synchronization. The method comprises: screening candidate segments from a video material library according to user input content; determining target segments from the candidate segments based on a deep interaction algorithm and semantic entity correlation analysis; determining the video duration of a target segment; determining the number of script words of the target segment based on a large language model; calculating a target playing duration from the word count and a preset dubbing speech speed; judging, based on the target playing duration and the video duration, whether the target segment has a duration deviation; and, when it does, performing adaptive variable-speed processing on the target segment based on the target playing duration and the video duration to generate a final video segment. The application solves the problem of audio-video asynchrony: it dynamically optimizes the variable-speed algorithm according to the degree of deviation between the target playing duration and the video duration, and preserves the integrity of the picture and the audio while guaranteeing audio-video synchronization.

Inventors

  • YANG YAZHENG
  • HUANG YABIN
  • LI ANG
  • HUANG YIFENG

Assignees

  • 杭州嘿库智能科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-17

Claims (10)

  1. A video processing method based on audio-video synchronization, the method comprising: screening candidate segments from a video material library according to user input content, and determining target segments from the candidate segments based on a deep interaction algorithm and semantic entity correlation analysis; determining the video duration of a target segment, determining the number of script words of the target segment based on a large language model, calculating a target playing duration according to the word count and a preset dubbing speech speed, and judging whether the target segment has a duration deviation based on the target playing duration and the video duration; and, when the target segment has a duration deviation, performing adaptive variable-speed processing on the target segment based on the target playing duration and the video duration to generate a final video segment.
  2. The method of claim 1, wherein performing adaptive variable-speed processing on the target segment based on the target playing duration and the video duration to generate a final video segment comprises: calculating a variable-speed factor according to the video duration and the target playing duration; and judging whether a speed change is required based on the variable-speed factor and a preset speed-change tolerance threshold, and if so, performing speed-change processing on the target segment based on the variable-speed factor to obtain the final video segment.
  3. The method of claim 2, wherein performing speed-change processing on the target segment based on the variable-speed factor to obtain the final video segment comprises: when the variable-speed factor does not exceed a preset speed-change range, performing time-scaling processing on the target segment according to the variable-speed factor to obtain the final video segment; and when the variable-speed factor exceeds the preset speed-change range, performing frame-extraction or frame-interpolation processing on the target segment based on a motion-compensation or frame-interpolation strategy to obtain the final video segment.
  4. The method of claim 2, further comprising: judging whether the absolute difference between the duration of the target segment after adaptive speed-change processing and the target playing duration is larger than a secondary verification threshold, and if so, fine-tuning the variable-speed factor and re-performing speed-change processing on the original target segment based on the fine-tuned variable-speed factor; and/or recording, for each target segment, the video duration, the target playing duration, the variable-speed factor, the time consumed by speed-change processing, and the output path.
  5. The method of claim 1, wherein screening candidate segments from the video material library according to the user input content comprises: analyzing and encoding the user input content to generate an input picture semantic vector and an input text semantic vector; determining first candidate segments from the video material library based on the input picture semantic vector, the input text semantic vector, and a semantic vector library, wherein the semantic vector library is constructed from the video material library; and performing multidimensional fusion scoring on the first candidate segments, and selecting a preset number of segments from the first candidate segments as second candidate segments according to the multidimensional fusion scoring result.
  6. The method of claim 5, wherein the semantic vector library comprises a picture semantic vector library and a text semantic vector library, and wherein determining first candidate segments from the video material library based on the input picture semantic vector, the input text semantic vector, and the semantic vector library comprises: generating a structured query language statement based on the user input content, and matching it against an SQLite database to obtain a preliminary candidate segment set; matching the preliminary candidate segment set against the picture semantic vector library and the text semantic vector library respectively to obtain a picture vector subset and a text vector subset; and performing, through FAISS, a k-nearest-neighbor search in the picture vector subset and the text vector subset according to the input picture semantic vector and the input text semantic vector, and determining the first candidate segments according to the search result.
  7. The method of claim 5, wherein the multidimensional fusion scoring of the first candidate segments comprises: determining a picture semantic score according to the input picture semantic vector and the picture semantic vectors of the first candidate segments; determining a script semantic score according to the input text semantic vector and the text semantic vectors of the first candidate segments; calculating a tag similarity between the user input content and the first candidate segments based on a preset tag system to obtain a dynamic tag matching score; and performing linear weighted fusion of the picture semantic score, the script semantic score, and the dynamic tag matching score to obtain a multidimensional weighted score.
  8. The method of claim 1, wherein determining target segments from the candidate segments based on the deep interaction algorithm and semantic entity correlation analysis comprises: performing, based on an interactive encoder, deep interaction between the picture description and the script of each second candidate segment and the user input content, respectively, to obtain a fusion score for each second candidate segment; extracting a first core entity from the user input content and a second core entity from the second candidate segment, and calculating the semantic similarity of the first and second core entities to obtain an object coincidence score; and determining a comprehensive score for each second candidate segment according to the fusion score and the object coincidence score, and determining target segments from the second candidate segments based on the comprehensive score.
  9. A video processing system based on audio-video synchronization, the system comprising: a segment selection module, configured to screen candidate segments from a video material library according to user input content, and to determine target segments from the candidate segments based on a deep interaction algorithm and semantic entity correlation analysis; a deviation judging module, configured to determine the video duration of a target segment, determine the number of script words of the target segment based on a large language model, calculate a target playing duration according to the word count and a preset dubbing speech speed, and judge whether the target segment has a duration deviation based on the target playing duration and the video duration; and an audio-video alignment module, configured to perform, when the target segment has a duration deviation, adaptive variable-speed processing on the target segment based on the target playing duration and the video duration to generate a final video segment.
  10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the audio-video-synchronization-based video processing method according to any one of claims 1 to 8.
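The adaptive speed-change decision described in claims 2-4 can be sketched as follows. This is a minimal illustration only: the definition of the variable-speed factor as the ratio of video duration to target playing duration, and all names and threshold values (`SPEED_TOLERANCE`, `SPEED_RANGE`, `VERIFY_THRESHOLD`), are assumptions for demonstration and are not taken from the patent.

```python
# Illustrative sketch of the speed-change logic of claims 2-4.
# All identifiers and threshold values here are assumptions.

SPEED_TOLERANCE = 0.05      # tolerance: skip speed change when factor is near 1.0
SPEED_RANGE = (0.75, 1.25)  # range inside which plain time scaling is acceptable
VERIFY_THRESHOLD = 0.1      # secondary-verification threshold, in seconds


def speed_factor(video_duration: float, target_play_duration: float) -> float:
    """One plausible variable-speed factor: source duration over target duration."""
    return video_duration / target_play_duration


def adapt_segment(video_duration: float, target_play_duration: float) -> dict:
    """Decide how (and whether) to retime a segment, per claims 2-3."""
    factor = speed_factor(video_duration, target_play_duration)

    if abs(factor - 1.0) <= SPEED_TOLERANCE:
        # Within tolerance: no speed change required.
        return {"action": "none", "factor": 1.0}

    if SPEED_RANGE[0] <= factor <= SPEED_RANGE[1]:
        # Mild deviation: uniform time scaling keeps every frame.
        return {"action": "time_scale", "factor": factor}

    # Large deviation: drop frames (speed up) or interpolate frames
    # (slow down), e.g. with motion compensation, to limit quality loss.
    action = "drop_frames" if factor > 1.0 else "interpolate_frames"
    return {"action": action, "factor": factor}


def passes_secondary_check(result_duration: float,
                           target_play_duration: float) -> bool:
    """Claim 4's verification: is the retimed segment close enough to target?"""
    return abs(result_duration - target_play_duration) <= VERIFY_THRESHOLD
```

Per claim 4, when `passes_secondary_check` fails, the factor would be fine-tuned and the speed change re-run on the original segment.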

Description

Video processing method and system based on audio-video synchronization and electronic equipment Technical Field The application relates to the technical field of video processing, in particular to a video processing method, a system and electronic equipment based on audio-video synchronization. Background In the fields of short video production, AI video generation, intelligent dubbing and the like, the realization of accurate synchronization of audio and pictures is a key for guaranteeing the quality of content. In the process of material retrieval and matching, the prior art focuses on semantic relevance of content mostly, and video clips which are matched with given speech semantics are retrieved from a material library usually based on keyword or vector similarity, and the retrieved clips often have significant deviation with the target audio length in time length. In the time sequence alignment processing level, the existing scheme is often used for directly carrying out simple overall speed change on the video, such as a conventional double-speed playing function of FFmpeg and other tools, and the simple speed change processing can cause the video quality to be damaged and influence the look and feel. Disclosure of Invention The embodiment of the application provides a video processing method, a system, electronic equipment and a storage medium based on audio-video synchronization, which are used for at least solving the problem of asynchronous video audio-video in the related technology. 
In a first aspect, an embodiment of the present application provides a video processing method based on audio-video synchronization, the method comprising: screening candidate segments from a video material library according to user input content, and determining target segments from the candidate segments based on a deep interaction algorithm and semantic entity correlation analysis; determining the video duration of a target segment, determining the number of script words of the target segment based on a large language model, calculating a target playing duration according to the word count and a preset dubbing speech speed, and judging whether the target segment has a duration deviation based on the target playing duration and the video duration; and, when the target segment has a duration deviation, performing adaptive variable-speed processing on the target segment based on the target playing duration and the video duration to generate a final video segment. In some embodiments, performing adaptive variable-speed processing on the target segment based on the target playing duration and the video duration to generate the final video segment includes: calculating a variable-speed factor according to the video duration and the target playing duration; and judging whether a speed change is required based on the variable-speed factor and a preset speed-change tolerance threshold, and if so, performing speed-change processing on the target segment based on the variable-speed factor to obtain the final video segment.
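The duration-deviation judgment described above can be sketched in a few lines. This is a hypothetical illustration: the patent does not give a formula, so the per-character speech rate (`CHARS_PER_SECOND`) and the deviation tolerance are assumed values.

```python
# Sketch of the duration-deviation judgment of the first aspect.
# The dubbing speech rate and deviation tolerance are assumed values.

CHARS_PER_SECOND = 4.0       # assumed preset dubbing speech speed
DEVIATION_TOLERANCE = 0.2    # assumed acceptable mismatch, in seconds


def target_play_duration(script_word_count: int,
                         chars_per_second: float = CHARS_PER_SECOND) -> float:
    """Audio length implied by the script length and the dubbing speech speed."""
    return script_word_count / chars_per_second


def has_duration_deviation(video_duration: float, script_word_count: int) -> bool:
    """True when the clip's duration and the implied audio duration disagree."""
    return abs(video_duration - target_play_duration(script_word_count)) > DEVIATION_TOLERANCE
```

A 40-character script at 4 characters per second implies a 10-second target; a 12-second clip would then be flagged for adaptive variable-speed processing.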
In some embodiments, performing speed-change processing on the target segment based on the variable-speed factor to obtain the final video segment includes: when the variable-speed factor does not exceed a preset speed-change range, performing time-scaling processing on the target segment according to the variable-speed factor to obtain the final video segment; and when the variable-speed factor exceeds the preset speed-change range, performing frame-extraction or frame-interpolation processing on the target segment based on a motion-compensation or frame-interpolation strategy to obtain the final video segment. In some of these embodiments, the method further comprises: judging whether the absolute difference between the duration of the target segment after adaptive speed-change processing and the target playing duration is larger than a secondary verification threshold, and if so, fine-tuning the variable-speed factor and re-performing speed-change processing on the original target segment based on the fine-tuned variable-speed factor; and/or recording, for each target segment, the video duration, the target playing duration, the variable-speed factor, the time consumed by speed-change processing, and the output path. In some embodiments, screening candidate segments from the video material library according to the user input content includes: analyzing and encoding the user input content to generate an input picture semantic vector and an input text semantic vector; determining first candidate segments from the video material library based on the input picture semantic vector, the input text semantic vector, and a semantic vector library, wherein the semantic vector library is constructed from the video material library; and performing multidimensional fusion scoring on the first candidate segments, and selecting a preset number of segments from the first candidate segments as second candidate segments according to the multidimensional fusion scoring result.
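The multidimensional fusion scoring used in the screening step (cf. claim 7) is a linear weighted combination of a picture semantic score, a script semantic score, and a dynamic tag matching score. A minimal sketch follows, using plain cosine similarity for the semantic scores; the fusion weights are illustrative assumptions, and the patent's actual retrieval uses FAISS k-nearest-neighbor search, which is not reproduced here.

```python
# Sketch of the multidimensional fusion scoring of the screening step.
# The fusion weights are assumed values, not taken from the patent.
import math

WEIGHTS = {"picture": 0.4, "script": 0.4, "tag": 0.2}  # assumed weights


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Semantic score between an input vector and a candidate's vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fusion_score(picture_score: float, script_score: float,
                 tag_score: float) -> float:
    """Linear weighted fusion of the three component scores."""
    return (WEIGHTS["picture"] * picture_score
            + WEIGHTS["script"] * script_score
            + WEIGHTS["tag"] * tag_score)
```

Candidates would then be ranked by `fusion_score`, and the top preset number kept as second candidate segments.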