CN-121615662-B - Audio and video real-time intelligent shunt translation method and system combined with buffer strategy

CN121615662B

Abstract

The application relates to the technical field of real-time audio and video translation and discloses an audio and video real-time intelligent shunting translation method and system combined with a buffering strategy. The method comprises: analyzing audio and video data acquired in real time, and shunting the audio and video data through a preset data shunting model to obtain audio data and video data; configuring a dynamic buffering mechanism, and performing semantic analysis and data partitioning on the audio data in the audio buffer to obtain an audio data block set; analyzing the semantic importance of the data to compute the corresponding priorities, and performing key-frame division on the video data to obtain video key-frame priorities and audio-block priorities; translating the video key frames and the audio data blocks through a preset translation model to obtain a first translation result and a second translation result; and dynamically adjusting the display time of the translation results to obtain a real-time translation result. By dynamically adjusting the size and layered structure of the buffer, waiting time is reduced and the accuracy and real-time performance of the translation results are improved.
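As an illustration of the pipeline the abstract describes, the following minimal Python sketch demultiplexes a mixed frame stream, groups buffered audio into blocks at silence gaps, and translates each block keyed by its start time. All names and the frame format are hypothetical, and the key-frame prioritization and display-time adjustment steps are omitted; this is a sketch of the idea, not the patented implementation.

```python
# Hypothetical sketch of the shunting-and-translation pipeline: demux the
# stream, segment buffered audio at silence gaps, translate each block.
# Frame format ('audio'|'video', timestamp_seconds, payload) is illustrative.

def translate_stream(av_frames, translate, gap_threshold=0.3):
    """Return [(block_start_time, translated_text), ...] for the audio track."""
    # Step 1: shunt (demultiplex) the stream into audio and video tracks.
    audio = [(t, p) for kind, t, p in av_frames if kind == "audio"]
    video = [(t, p) for kind, t, p in av_frames if kind == "video"]
    _ = video  # video key-frame handling is omitted in this sketch

    # Step 2: group buffered audio into blocks wherever the silence
    # interval between adjacent frames reaches the threshold.
    blocks, current = [], []
    for t, p in audio:
        if current and t - current[-1][0] >= gap_threshold:
            blocks.append(current)
            current = []
        current.append((t, p))
    if current:
        blocks.append(current)

    # Step 3: translate each block, keyed by its start time so the display
    # timeline can later be aligned against the video key frames.
    return [(blk[0][0], translate(" ".join(p for _, p in blk))) for blk in blocks]

result = translate_stream(
    [("audio", 0.0, "hello"), ("audio", 0.1, "world"), ("audio", 1.0, "bye")],
    translate=str.upper,  # stand-in for a real translation model
)
# result == [(0.0, "HELLO WORLD"), (1.0, "BYE")]
```

The 0.9 s gap before "bye" exceeds the 0.3 s threshold, so the stream splits into two blocks that are translated independently.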

Inventors

  • WEN YAN
  • DONG ZHENJIANG

Assignees

  • 江苏智檬智能科技有限公司 (Jiangsu Zhimeng Intelligent Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-01-29

Claims (8)

  1. An audio and video real-time intelligent shunting translation method combined with a buffering strategy, characterized by comprising the following steps: analyzing audio and video data acquired in real time, and shunting the audio and video data through a preset data shunting model to obtain audio data and video data; according to network transmission quality parameters monitored in real time, analyzing the network state through a preset network quality analysis model to determine dynamic buffer configuration parameters; dividing the audio buffer into a first buffer layer, a second buffer layer, and a third buffer layer according to the dynamic buffer configuration parameters, wherein the first buffer layer stores the current audio data to be processed, the second buffer layer preloads audio data according to real-time audio semantics, and the third buffer layer expands when the network state changes, so as to obtain a layered buffer; combining the dynamic buffer configuration parameters and the real-time audio semantic analysis results, dynamically adjusting the boundaries of the buffer layers and the size of the audio buffer; performing semantic analysis and data partitioning on the audio data in the audio buffer to obtain a first audio unit set, analyzing the duration of the silence intervals between adjacent audio units, and merging and optimizing the first audio unit set to obtain an audio data block set; combining the video data and the audio data block set, analyzing the semantic importance of the data to compute the corresponding priorities, and performing key-frame division on the video data to obtain video key-frame priorities and audio-block priorities; according to the video key-frame priorities and the audio-block priorities, translating the video key frames and the audio data blocks respectively through a preset translation model to obtain a first translation result and a second translation result; and analyzing the mapping relation between the first translation result and the second translation result in time sequence, and dynamically adjusting the display time of the translation results to obtain a real-time translation result.
  2. The audio and video real-time intelligent shunting translation method combined with the buffering strategy according to claim 1, wherein the analyzing the audio and video data acquired in real time and shunting the audio and video data through a preset data shunting model to obtain audio data and video data comprises: analyzing the audio and video data acquired in real time, and determining data track codes; and according to the data track codes, shunting the audio and video data through the preset data shunting model to obtain the audio data and the video data.
  3. The audio and video real-time intelligent shunting translation method combined with the buffering strategy according to claim 1, wherein the dynamically adjusting the boundaries of the buffer layers and the size of the audio buffer by combining the dynamic buffer configuration parameters and the real-time audio semantic analysis results comprises: analyzing the buffer duration corresponding to the audio data in the first buffer layer by combining the dynamic buffer configuration parameters and the real-time audio semantic analysis results, and optimizing the boundary of the first buffer layer to obtain a first optimized buffer layer; performing semantic prediction on the real-time audio semantic analysis results through a preset semantic prediction model, computing the buffer duration corresponding to the preloaded audio data, and optimizing the boundary of the second buffer layer to obtain a second optimized buffer layer; according to the dynamic buffer configuration parameters and the real-time network state, computing the capacity expansion requirement under the real-time network state, and optimizing the boundary of the third buffer layer to obtain a third optimized buffer layer; and combining the first optimized buffer layer, the second optimized buffer layer, and the third optimized buffer layer to dynamically adjust the size of the audio buffer.
  4. The audio and video real-time intelligent shunting translation method combined with the buffering strategy according to claim 3, wherein the performing semantic analysis and data partitioning on the audio data in the audio buffer to obtain a first audio unit set, analyzing the duration of the silence intervals between adjacent audio units, and merging and optimizing the first audio unit set to obtain an audio data block set comprises: extracting acoustic features of the audio data in the first buffer layer of the audio buffer, and performing semantic analysis and data partitioning through a preset speech recognition model to obtain the first audio unit set; analyzing the silence interval duration between adjacent audio units in the first audio unit set, and merging adjacent audio units whose silence interval duration is smaller than a preset interval threshold to obtain a second audio unit set; and analyzing the semantic integrity of the audio units in the second audio unit set, and performing expansion optimization on the semantic boundaries to obtain the audio data block set.
  5. The audio and video real-time intelligent shunting translation method combined with the buffering strategy according to claim 4, wherein the analyzing the semantic integrity of the audio units in the second audio unit set and performing expansion optimization on the semantic boundaries to obtain the audio data block set comprises: analyzing the semantic integrity of the audio units in the second audio unit set through a preset semantic analysis model, and computing a corresponding semantic integrity score; for audio units whose semantic integrity score is smaller than a preset semantic threshold, expanding by a preset semantic length forwards and backwards until the semantic integrity score is greater than or equal to the preset semantic threshold, to obtain a first expansion unit; for a first expansion unit whose semantic length is greater than a preset length threshold, screening semantic segmentation points whose semantic integrity score is greater than or equal to the preset semantic threshold, and semantically segmenting the first expansion unit to obtain a second expansion unit; and combining the second expansion unit with the audio units in the second audio unit set whose semantic integrity score is greater than or equal to the preset semantic threshold to obtain the audio data block set.
  6. The audio and video real-time intelligent shunting translation method combined with the buffering strategy according to claim 1, wherein the combining the video data and the audio data block set, analyzing the semantic importance of the data to compute the corresponding priorities, and performing key-frame division on the video data to obtain the video key-frame priorities and the audio-block priorities comprises: combining the video data and the audio data block set, and analyzing the semantic importance of the data by analyzing semantic density, emotional intensity, and contextual association degree through a preset priority analysis model, fusing these to obtain the corresponding priorities; and performing key-frame division on the video data according to the priorities to obtain the video key-frame priorities and the audio-block priorities.
  7. The audio and video real-time intelligent shunting translation method combined with the buffering strategy according to claim 6, wherein the performing key-frame division on the video data according to the priorities to obtain the video key-frame priorities and the audio-block priorities comprises: dividing the video data into key frames according to the priorities to obtain a key frame set; and performing semantic consistency analysis on each key frame and each audio block, constructing a semantic association degree matrix between the key frames and the audio blocks, and analyzing the priorities corresponding to the key frames and the audio blocks through a preset dynamic priority analysis model to obtain the video key-frame priorities and the audio-block priorities.
  8. An audio and video real-time intelligent shunting translation system combined with a buffering strategy, characterized by being used to implement the audio and video real-time intelligent shunting translation method combined with the buffering strategy according to any one of claims 1-7, and comprising: a data shunting module, which analyzes the audio and video data acquired in real time and shunts the audio and video data through a preset data shunting model to obtain audio data and video data; an audio data block partitioning module, which configures a dynamic buffering mechanism, dynamically adjusts the size of the audio buffer according to network transmission quality parameters monitored in real time, and performs semantic analysis and data partitioning on the audio data in the audio buffer to obtain an audio data block set; a priority analysis module, which combines the video data and the audio data block set, analyzes the semantic importance of the data to compute the corresponding priorities, and performs key-frame division on the video data to obtain video key-frame priorities and audio-block priorities; a shunting translation module, which translates the video key frames and the audio data blocks respectively through a preset translation model according to the video key-frame priorities and the audio-block priorities to obtain a first translation result and a second translation result; and a real-time translation module, which analyzes the mapping relation between the first translation result and the second translation result in time sequence, and dynamically adjusts the display time of the translation results to obtain a real-time translation result.
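The silence-interval merge described in claims 4-5 can be sketched in a few lines: adjacent audio units whose silence gap falls below a threshold are merged into a single block. The (start, end) representation and the threshold value below are illustrative assumptions, not values from the patent.

```python
# Hypothetical sketch of the claim-4 merge step: audio units are
# (start_seconds, end_seconds) pairs; gaps shorter than gap_threshold merge.

def merge_audio_units(units, gap_threshold=0.3):
    """Merge adjacent (start, end) units whose silence gap < gap_threshold."""
    if not units:
        return []
    merged = [units[0]]
    for start, end in units[1:]:
        prev_start, prev_end = merged[-1]
        if start - prev_end < gap_threshold:   # silence interval too short
            merged[-1] = (prev_start, end)     # merge into the previous block
        else:
            merged.append((start, end))
    return merged

# Gaps of 0.1 s and 0.5 s: only the first pair merges.
blocks = merge_audio_units([(0.0, 1.2), (1.3, 2.0), (2.5, 3.1)])
# blocks == [(0.0, 2.0), (2.5, 3.1)]
```

The semantic-boundary expansion of claim 5 would then run over `blocks`, stretching any block whose semantic integrity score falls below the threshold; that model-dependent step is omitted here.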

Description

Audio and video real-time intelligent shunt translation method and system combined with buffer strategy

Technical Field

The application relates to the technical field of real-time audio and video translation, and in particular to an audio and video real-time intelligent shunting translation method and system combined with a buffering strategy.

Background

At present, demand for real-time audio and video translation technology is increasing in fields such as international conferences, online education, live streaming, and telemedicine. In the prior art, real-time audio and video translation generally adopts a streaming processing mode in which the audio and video stream is analyzed and translated as a whole. However, this approach processes audio and video data uniformly and lacks a shunting mechanism, so processing efficiency is low; it uses a fixed-size buffer that cannot adapt to dynamic changes in network transmission quality, which affects the real-time performance and accuracy of translation; and the translation results are not synchronized, which affects the accuracy and fluency of real-time communication.

The prior art therefore has the following problems: a fixed-size buffer is used that cannot be dynamically adjusted according to network transmission quality, causing data backlog or loss under poor network conditions and increasing translation delay; audio and video data are treated as a whole without considering the semantic association between them, so the translation results are asynchronous in time and poorly continuous; and priority evaluation of semantic importance is lacking, so the processing of key information is delayed, affecting the real-time performance and accuracy of translation. The application provides an audio and video real-time intelligent shunting translation method and system combined with a buffering strategy to solve at least one of the above problems.
Disclosure of the Invention

In view of the defects in the prior art, the application aims to provide an audio and video real-time intelligent shunting translation method and system combined with a buffering strategy, which can effectively solve the problems described in the Background. The specific technical scheme of the application is as follows.

An audio and video real-time intelligent shunting translation method combined with a buffering strategy comprises the following steps: analyzing audio and video data acquired in real time, and shunting the audio and video data through a preset data shunting model to obtain audio data and video data; configuring a dynamic buffering mechanism, dynamically adjusting the size of the audio buffer according to network transmission quality parameters monitored in real time, and performing semantic analysis and data partitioning on the audio data in the audio buffer to obtain an audio data block set; combining the video data and the audio data block set, analyzing the semantic importance of the data to compute the corresponding priorities, and performing key-frame division on the video data to obtain video key-frame priorities and audio-block priorities; according to the video key-frame priorities and the audio-block priorities, translating the video key frames and the audio data blocks respectively through a preset translation model to obtain a first translation result and a second translation result; and analyzing the mapping relation between the first translation result and the second translation result in time sequence, and dynamically adjusting the display time of the translation results to obtain a real-time translation result.
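One way the monitored network quality parameters could map onto dynamic buffer configuration parameters in the scheme above is sketched below: worse latency, jitter, and packet loss yield a larger total buffer, with the excess assigned to the elastic third layer. The weights, parameter names, and layer split are assumptions for illustration, not values from the patent.

```python
# Illustrative mapping from network quality metrics to per-layer buffer
# durations (ms). All weights and layer proportions are made-up examples.

def buffer_config(rtt_ms, jitter_ms, loss_pct, base_ms=200.0, max_ms=2000.0):
    """Return per-layer buffer durations derived from network quality."""
    # A simple additive penalty: higher RTT, jitter, or loss -> bigger buffer.
    penalty = rtt_ms * 0.5 + jitter_ms * 2.0 + loss_pct * 50.0
    total = min(max_ms, base_ms + penalty)
    return {
        "layer1_current": base_ms,                    # data being processed now
        "layer2_preload": total * 0.3,                # semantic preloading
        "layer3_elastic": max(0.0, total - base_ms),  # expands as the net degrades
    }

good = buffer_config(rtt_ms=40, jitter_ms=5, loss_pct=0.1)   # healthy network
bad = buffer_config(rtt_ms=300, jitter_ms=60, loss_pct=5.0)  # degraded network
# bad["layer3_elastic"] is much larger than good["layer3_elastic"]
```

Recomputing this configuration on every monitoring tick, and re-running it when the real-time semantic analysis shifts the layer boundaries, gives the dynamic adjustment the method describes.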
Specifically, the analyzing the audio and video data acquired in real time and shunting the audio and video data through a preset data shunting model to obtain audio data and video data includes: analyzing the audio and video data acquired in real time, and determining data track codes; and according to the data track codes, shunting the audio and video data through the preset data shunting model to obtain the audio data and the video data.

Specifically, the dynamic buffering mechanism includes: according to the network transmission quality parameters monitored in real time, analyzing the network state to determine dynamic buffer configuration parameters, constructing a layered buffer, and dynamically adjusting the size of the audio buffer; and performing semantic analysis and data partitioning on the audio data in the audio buffer to obtain a first audio unit set, analyzing the duration of the silence intervals between adjacent audio units, and merging and optimizing the first audio unit set to obtain an audio data block set.

Specifically, the analyzing the network state to determine the dynamic buffer configuration parameters according to the network transmission quality parameters monitored in real time, constructing a layered buffer, and dynamically adjusting the siz