CN-122027854-A - Video editing method and system based on scene recognition

CN122027854A

Abstract

The invention, applicable to the field of video editing, provides a video editing method and system based on scene recognition. The method comprises: obtaining an original video stream to be processed, and extracting, through cross-modal sampling, heterogeneous characterization vectors containing global spatio-temporal features, local motion vectors, and acoustic emotion envelopes; identifying semantic-logic associations among video segments and generating multi-dimensional scene tags; constructing a scene semantic map in a non-Euclidean space, calculating the narrative energy distribution, and screening out a highlight material set; executing automatic, audio-visually phase-synchronized synthesis of the highlight material set with an evolutionary search algorithm under aesthetic constraints; performing adaptive coding optimization; and performing closed-loop updating of the scene perception model based on user interaction feedback. Through spatio-temporal causal reasoning and aesthetic-manifold modeling, the method effectively overcomes the defects of weak semantic perception and homogenized arrangement, achieving precise coordination of audio-visual modalities and creative adaptive evolution.

Inventors

  • PAN WENJUN
  • YI ZHOU
  • SHEN AO
  • SU HONGYING

Assignees

  • Anhui Xiechuang Internet of Things Technology Co., Ltd. (安徽协创物联网技术有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-02-11

Claims (9)

  1. A video editing method based on scene recognition, the method comprising: acquiring an original video stream to be processed, and extracting, through cross-modal sampling, heterogeneous characterization vectors containing global spatio-temporal features, local motion vectors, and acoustic emotion envelopes; carrying out dynamic-evolution modeling on the heterogeneous characterization vectors with a preset scene perception model, identifying semantic-logic associations among video segments, and generating multi-dimensional scene tags; constructing a scene semantic map in a non-Euclidean space, and calculating the narrative energy distribution in combination with the trajectory of a core subject on a visual aesthetic manifold, so as to screen out a highlight material set; matching clipping operators corresponding to the scene tags with an evolutionary search algorithm under aesthetic constraints, and executing automatic, audio-visually phase-synchronized synthesis of the highlight material set; and performing adaptive coding optimization, and performing closed-loop updating of the scene perception model based on user interaction feedback.
  2. The video editing method based on scene recognition according to claim 1, wherein carrying out dynamic-evolution modeling on the heterogeneous characterization vectors with a preset scene perception model, identifying semantic-logic associations among video segments, and generating multi-dimensional scene tags specifically comprises: respectively extracting static texture details and inter-frame dynamic displacement trends from the original video stream with a spatio-temporal decoupling convolution operator in the scene perception model; introducing a multi-head self-attention mechanism to perform cross-modal interaction mapping between visual and audio features, and calculating an association confidence for audio-visual semantics; projecting segment features into a unified latent space with a contrastive learning algorithm, and determining scene segmentation boundaries by computing semantic distances; capturing long-range scene dependencies with a recurrent neural network, and outputting scene tags containing scene category and environmental atmosphere; and correcting the identified scene tags based on causal-intervention logic, eliminating the interference of irrelevant environment variables with the recognition process.
  3. The video editing method based on scene recognition according to claim 1, wherein constructing the scene semantic map in a non-Euclidean space, calculating the narrative energy distribution in combination with the trajectory of the core subject on the visual aesthetic manifold, and screening out the highlight material set specifically comprises: locking onto the core subject over time through target detection and re-identification algorithms, and calculating the visual saliency weight of the core subject within the frame; extracting Mel-frequency cepstral coefficients from the audio stream of the original video stream, and identifying emotional climax points and melodic turning points in the background music; constructing a semantic association model based on a graph attention network, and quantifying the contribution of each visual element to the scene narrative; mapping the motion intensity of the core subject and the emotional energy score of the audio stream onto a nonlinear manifold space, and evaluating the narrative value of each video segment; and applying a dynamic programming algorithm to search for and extract, along the time axis, the highlight material set satisfying a logical-continuity constraint (an illustrative dynamic-programming sketch follows the claims).
  4. The video editing method based on scene recognition according to claim 1, wherein matching clipping operators corresponding to the scene tags with an evolutionary search algorithm under aesthetic constraints (an illustrative search sketch follows the claims), and executing automatic, audio-visually phase-synchronized synthesis of the highlight material set comprises: searching a cloud art-strategy library according to the scene tags to obtain a clipping configuration containing a color-space conversion matrix and rhythm-guidance parameters; extracting the downbeat positions of the background music with audio waveform peak detection and phase tracking, and establishing a sound-picture synchronization time reference (see the beat-snapping sketch after the claims); automatically searching for the optimal shot-combination sequence with a reinforcement-learning reward mechanism, achieving resonant matching between the kinetic energy of the video frames and the rhythmic frequency of the music; calculating pixel-level optical-flow vectors between adjacent materials, and generating seamless transition effects based on the motion logic of the frames; and applying an adaptive color-correction operator during rendering, in combination with the depth map of the video, to generate the final candidate cut.
  5. The video editing method based on scene recognition according to claim 1, wherein performing adaptive coding optimization and performing closed-loop updating of the scene perception model based on user interaction feedback specifically comprises: identifying the hardware decoding specification of the target publishing platform, and dynamically configuring a perceptual video-coding operator to optimize visual quality; executing GPU-accelerated parallel rendering, and synchronously exporting video files meeting multi-terminal display requirements; embedding structured metadata containing the scene tags and the editing-logic rationale into the video container, supporting semantic retrieval; monitoring the user's retention rate, editing modifications, and sharing behavior for the finished video, and generating an implicit preference reward signal; and updating the scene perception model with knowledge distillation and incremental learning, achieving personalized artistic evolution for specific user groups.
  6. A video editing system based on scene recognition, applying the video editing method based on scene recognition according to any one of claims 1-5, the system comprising: a cross-modal sensing module, configured to execute synchronous sampling of audio-visual data and extract heterogeneous characterization vectors spanning spatial, temporal, and frequency dimensions; a semantic-logic reasoning module, configured to run a scene perception model built on spatio-temporal causal reasoning, model the heterogeneous characterization vectors with it, infer the semantic attribution of video segments, and generate scene tags; an intelligent material-evaluation module, configured to construct a temporal narrative-energy map from the aesthetic trajectory of the core subject and the flow of audio emotion, and screen out the highlight material set; an automated creative-synthesis module, configured to dynamically match clipping operators via an evolutionary search algorithm, realizing sound-picture-synchronized arrangement and transition rendering; and a closed-loop feedback-optimization module, configured to execute video encoding and distribution according to the target platform specification, and drive online parameter iteration of the scene perception model from user feedback.
  7. The video editing system based on scene recognition according to claim 6, wherein the semantic-logic reasoning module specifically comprises: a spatio-temporal feature-decoupling unit, configured to independently extract intra-frame static structure and inter-frame dynamic displacement, improving the scene perception model's precision on complex actions; an audio-visual association-mapping unit, configured to compute the mutual information between visual and audio modalities at the semantic level, achieving deep feature alignment; a causal-intervention correction unit, configured to eliminate the influence of environmental noise on recognition via causal-graph analysis, improving the stability of scene-tag generation; a hierarchical scene-classification unit, configured to support multi-level label inference from macroscopic environments down to microscopic events, providing fine-grained scene guidance for clipping; and a temporal smoothing unit, configured to correct scene-tag jitter with hidden-Markov logic, ensuring the logical continuity of the recognition results (see the Viterbi-smoothing sketch after the claims).
  8. The video editing system based on scene recognition according to claim 6, wherein the intelligent material-evaluation module specifically comprises: an aesthetic-manifold tracking unit, configured to track the motion profile of the core subject in a non-Euclidean space and evaluate the aesthetic score of its visual composition; an acoustic-emotion computing unit, configured to extract envelope energy and frequency fluctuations from the audio signal, providing a non-visual, psychological basis for highlight judgment; an attention-weight allocation unit, configured to dynamically adjust the contribution ratio of visual to acoustic saliency according to the scene-tag category; a narrative-integrity checking unit, configured to evaluate the before-and-after logical association of candidate materials and eliminate isolated frame fragments that would break the narrative logic; and a material-redundancy removal unit, configured to filter out candidate materials with highly repetitive content or poor image quality by computing feature-space distribution density (a de-duplication sketch follows the claims).
  9. The video editing system based on scene recognition according to claim 6, wherein the automated creative-synthesis module specifically comprises: a phase-alignment synchronization unit, configured to align instantaneous energy points of video action with the strong-beat positions of the audio to sub-second accuracy, improving sound-picture matching; an evolutionary-algorithm search unit, configured to automatically search a preset strategy space for the optimal clipping-sequence path satisfying the aesthetic constraints; an optical-flow transition-generation unit, configured to synthesize, in real time, natural transition effects simulating physical camera movement from the motion-vector distribution at segment edges; a depth-aware rendering unit, configured to apply adaptive color correction and filter enhancement at different scene depths using the video's depth map; and a visual-center-of-gravity prediction unit, configured to automatically lock onto the salient region of the frame during multi-aspect-ratio output, ensuring the core subject always remains at the visual center.
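
Claim 3 ends by invoking a dynamic programming search over the time axis. The patent does not disclose the recurrence, so the following is a minimal sketch under assumed inputs: each candidate segment carries a precomputed narrative-energy score, an integer duration in seconds, and a semantic feature vector, and cosine similarity stands in for the undisclosed logical-continuity measure. Selected segments stay in timeline order and the total duration respects a budget.

```python
import numpy as np

def continuity(a, b):
    # Cosine similarity stands in for the undisclosed logical-continuity measure.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_highlights(energy, duration, semantic, budget, jump_weight=0.5):
    """energy[i]: narrative-energy score of segment i; duration[i]: whole seconds;
    semantic[i]: feature vector of segment i; budget: max total seconds."""
    n, T = len(energy), budget + 1
    best = np.full((n, T), -np.inf)  # best[i, t]: best score with i picked last, t seconds used
    prev = np.full((n, T), -1, dtype=int)
    for i in range(n):
        d = duration[i]
        if d < T:
            best[i, d] = energy[i]               # open the cut with segment i alone
        for j in range(i):                       # extend a cut ending at j (j < i keeps timeline order)
            bonus = jump_weight * continuity(semantic[j], semantic[i])
            for t in range(T - d):
                if best[j, t] > -np.inf and best[j, t] + energy[i] + bonus > best[i, t + d]:
                    best[i, t + d] = best[j, t] + energy[i] + bonus
                    prev[i, t + d] = j
    if not np.isfinite(best).any():
        return []                                # nothing fits within the budget
    i, t = np.unravel_index(int(np.argmax(best)), best.shape)
    picks = []
    while i >= 0:                                # walk the back-pointers to recover the cut
        picks.append(int(i))
        i, t = prev[i, t], t - duration[i]
    return picks[::-1]
```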
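Claim 4 builds its sound-picture time reference on waveform peak detection, and claim 9 tightens this to sub-second alignment of action energy points with strong beats. Below is a plain-numpy sketch of one plausible reading, assuming an energy-derivative onset detector rather than the unspecified phase-tracking method; the 0.15 s tolerance is likewise an assumption.

```python
import numpy as np

def beat_times(audio, sr, hop=512, win=2048):
    # Short-time energy; peaks of the half-wave-rectified derivative approximate
    # the strong-beat (downbeat) positions the claims refer to.
    idx = range(0, len(audio) - win, hop)
    energy = np.array([float(np.sum(audio[i:i + win] ** 2)) for i in idx])
    onset = np.maximum(np.diff(energy), 0.0)
    thresh = onset.mean() + onset.std()
    peaks = [k for k in range(1, len(onset) - 1)
             if onset[k] > thresh and onset[k] >= onset[k - 1] and onset[k] >= onset[k + 1]]
    return np.asarray(peaks, dtype=float) * hop / sr   # beat times in seconds

def snap_cuts_to_beats(cut_times, beats, tolerance=0.15):
    # Move each provisional cut point to the nearest detected beat within
    # `tolerance` seconds (the claimed sub-second sound-picture alignment).
    snapped = []
    for t in cut_times:
        if len(beats) == 0:
            snapped.append(t)
            continue
        j = int(np.argmin(np.abs(beats - t)))
        snapped.append(float(beats[j]) if abs(beats[j] - t) <= tolerance else t)
    return snapped
```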
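The "evolutionary search algorithm based on aesthetic constraints" of claims 4 and 9 is likewise not spelled out. The sketch below shows one generic shape such a search could take over shot orderings, with a toy fitness function standing in for the undisclosed aesthetic constraints; everything here is illustrative, not the patented method.

```python
import numpy as np

rng = np.random.default_rng(0)

def evolve_sequence(n_shots, fitness, pop=40, gens=60, mut=0.3):
    """Searches shot orderings for a high aesthetic score. `fitness` maps an
    ordering (array of shot indices) to a scalar; the real constraint set is
    not disclosed, so it is injected as a callable."""
    population = [rng.permutation(n_shots) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)   # elitist selection
        parents = population[: pop // 2]
        children = []
        for p in parents:
            c = p.copy()
            if rng.random() < mut:                   # swap mutation keeps it a permutation
                i, j = rng.integers(0, n_shots, size=2)
                c[i], c[j] = c[j], c[i]
            children.append(c)
        population = parents + children
    return max(population, key=fitness)

# Toy fitness: prefer alternating long and short shots (a stand-in for
# the patent's aesthetic constraints, chosen only for illustration).
durations = np.array([2.0, 6.0, 3.0, 8.0, 1.5, 5.0])
order = evolve_sequence(len(durations),
                        lambda seq: float(np.abs(np.diff(durations[seq])).sum()))
```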
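Claim 7's temporal smoothing unit corrects scene-tag jitter with "hidden-Markov logic". A minimal Viterbi pass over per-segment tag posteriors with a sticky transition prior is one standard way to realize that; the `switch_cost` parameter is an assumption, not a disclosed value.

```python
import numpy as np

def smooth_scene_tags(log_post, switch_cost=2.0):
    """Viterbi smoothing of per-segment scene-tag posteriors.
    log_post: (T, K) array of log-probabilities from the classifier.
    switch_cost: assumed sticky-prior penalty for changing tags."""
    T, K = log_post.shape
    trans = -switch_cost * (1.0 - np.eye(K))   # log-domain: 0 to stay, -cost to switch
    dp = log_post[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = dp[:, None] + trans           # (previous tag, current tag)
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0) + log_post[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):              # trace back the best tag sequence
        path.append(int(back[t, path[-1]]))
    return path[::-1]                          # smoothed tag index per segment
```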
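Claim 8's material-redundancy removal unit filters near-duplicates by feature-space distribution density. A simple greedy variant, assuming per-clip feature vectors and quality scores that the patent takes as given; the similarity cutoff is an assumed value.

```python
import numpy as np

def deduplicate(features, quality, sim_thresh=0.92):
    """Greedy near-duplicate filter. features: (N, D) per-clip vectors;
    quality: per-clip image-quality scores; sim_thresh: assumed cosine cutoff."""
    feats = np.asarray(features, dtype=float)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8
    kept = []
    for i in np.argsort(-np.asarray(quality, dtype=float)):  # best quality first
        # Keep clip i only if it is not too similar to anything already kept.
        if all(feats[i] @ feats[k] < sim_thresh for k in kept):
            kept.append(int(i))
    return sorted(kept)                                      # indices of clips to keep
```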

Description

Video editing method and system based on scene recognition

Technical Field

The invention belongs to the field of video editing, and particularly relates to a video editing method and system based on scene recognition.

Background

With the spread of 5G communication technology and improvements in the shooting performance of mobile terminals, global video data has grown explosively; short-video social networking, online education, news dissemination, and live sports broadcasts have become core carriers of information exchange. Against this background, consumers of video content are no longer satisfied with a simple pile-up of raw material, but pursue deep authorship with narrative logic, emotional resonance, and artistic appeal. However, high-quality video editing often relies on manual scheduling by a professional editor, which is time-consuming and labor-intensive and ill-suited to the processing demands of massive real-time data. Although a batch of automated editing tools has emerged in recent years, attempting to assist material screening with AI technology, they still struggle to balance depth of content understanding with the flexibility of editing art when handling complex and diverse dynamic scenes. Existing automatic editing schemes focus on physical feature extraction or discrete object-label recognition, and share core defects: missing semantic-perception dimensions, imbalanced audio-visual multimodal collaboration, and rigid logical arrangement. They find it difficult to construct coherent narrative logic by modeling long-range temporal evolution, cannot achieve deep association between picture dynamics and audio rhythm or precise highlight capture, depend excessively on preset hard-coded templates, and lack adaptive adjustment for different scene environments. As a result, edited products often appear as severely homogenized material stacking and cannot meet the demands of high-quality, personalized artistic creation.

Disclosure of Invention

The invention aims to provide a video editing method and system based on scene recognition that solve the problems identified in the background. In one aspect, the invention is embodied as a video editing method based on scene recognition, the method comprising: acquiring an original video stream to be processed, and extracting, through cross-modal sampling, heterogeneous characterization vectors containing global spatio-temporal features, local motion vectors, and acoustic emotion envelopes; carrying out dynamic-evolution modeling on the heterogeneous characterization vectors with a preset scene perception model, identifying semantic-logic associations among video segments, and generating multi-dimensional scene tags; constructing a scene semantic map in a non-Euclidean space, and calculating the narrative energy distribution in combination with the trajectory of a core subject on a visual aesthetic manifold, so as to screen out a highlight material set; matching clipping operators corresponding to the scene tags with an evolutionary search algorithm under aesthetic constraints, and executing automatic, audio-visually phase-synchronized synthesis of the highlight material set; and performing adaptive coding optimization, and performing closed-loop updating of the scene perception model based on user interaction feedback.
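
The first claimed step, extracting heterogeneous characterization vectors by cross-modal sampling, is stated abstractly. The sketch below computes crude stand-ins (a global appearance trace, inter-frame motion magnitudes, and an RMS acoustic envelope); the patent's actual descriptors are not disclosed, so this shows only the general shape of such an extractor.

```python
import numpy as np

def heterogeneous_features(frames, audio, hop=1024):
    """frames: (T, H, W) grayscale video array; audio: mono waveform array.
    Returns crude stand-ins for the claimed heterogeneous characterization
    vector: a global per-frame signature, inter-frame motion magnitudes
    (local dynamics), and an acoustic energy envelope (emotion proxy)."""
    f = frames.astype(float)
    global_sig = f.mean(axis=(1, 2))                        # global appearance trace
    motion = np.abs(np.diff(f, axis=0)).mean(axis=(1, 2))   # inter-frame displacement proxy
    n = max((len(audio) - hop) // hop, 0)
    env = np.sqrt(np.array([(audio[k * hop:(k + 1) * hop] ** 2).mean()
                            for k in range(n)]))            # RMS acoustic envelope
    return {"global": global_sig, "motion": motion, "acoustic_envelope": env}
```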
As a further aspect of the invention, carrying out dynamic-evolution modeling on the heterogeneous characterization vectors with a preset scene perception model, identifying semantic-logic associations among video segments, and generating multi-dimensional scene tags specifically includes: respectively extracting static texture details and inter-frame dynamic displacement trends from the original video stream with a spatio-temporal decoupling convolution operator in the scene perception model; introducing a multi-head self-attention mechanism to perform cross-modal interaction mapping between visual and audio features, and calculating an association confidence for audio-visual semantics; projecting segment features into a unified latent space with a contrastive learning algorithm, and determining scene segmentation boundaries by computing semantic distances; capturing long-range scene dependencies with a recurrent neural network, and outputting scene tags containing scene category and environmental atmosphere; and correcting the identified scene tags based on causal-intervention logic, eliminating the interference of irrelevant environment variables with the recognition process. As a still further aspect of the invention, constructing the scene semantic map in a non-Euclidean space, calculating the narrative energy distribution in combination with the trajectory of the core subject on the visual aesthetic manifold, and further screening out the highlight material set
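
The contrastive-learning step above determines scene boundaries by semantic distance in a shared latent space. Assuming segment embeddings are already available from the undisclosed encoder, the thresholding itself can be sketched minimally; the z-score cutoff is an assumed parameter.

```python
import numpy as np

def scene_boundaries(embeddings, z=1.5):
    """Marks a scene cut wherever the cosine distance between consecutive
    segment embeddings exceeds mean + z * std. `z` is an assumed threshold."""
    e = np.asarray(embeddings, dtype=float)
    e /= np.linalg.norm(e, axis=1, keepdims=True) + 1e-8
    dist = 1.0 - np.sum(e[:-1] * e[1:], axis=1)            # neighbour cosine distance
    cut = dist.mean() + z * dist.std()
    return [i + 1 for i, d in enumerate(dist) if d > cut]  # indices where a new scene starts
```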