CN-121191487-B - Business window multi-source recording data fusion intelligent analysis system driven by a digital tablet

Abstract

The invention discloses a business window multi-source recording data fusion intelligent analysis system driven by a digital tablet, and relates in particular to the technical field of voice processing. Working on asynchronous audio streams collected by a wearable digital tablet, an array microphone and an environment pickup device, the system uses anchor point detection, event graph construction, cross-source matching and elastic time warping alignment to achieve structured alignment of multi-source recording data and to construct a unified time reference. The system marks cross-source events using several classes of anchor point detectors (semantic keywords, abrupt acoustic changes and prompt tones). During anchor point pairing it combines a uniform clipping strategy with a cross sliding matching strategy to build an event alignment graph and grade the pairs by confidence. It then performs multi-scale alignment and resampling of the audio streams using a layered elastic time warping method, and finally outputs a frame-level synchronized voice data stream that provides a unified time base for downstream speaker separation, behavior audit and semantic mapping.
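
The confidence grading and the coarsest alignment scale mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes matched anchor pairs are already available; the threshold values, the boundary handling (`>=`), and the names `AnchorPair`, `grade` and `piecewise_affine` are hypothetical choices for illustration, not details taken from the patent. Only the coarse, piecewise affine scale is shown; the medium and fine scales are omitted.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class AnchorPair:
    t_ref: float       # anchor time in the main reference stream (seconds)
    t_src: float       # matched anchor time in the other stream (seconds)
    confidence: float  # joint confidence of the pair, assumed to lie in [0, 1]


# Hypothetical threshold values; the patent does not state concrete numbers.
STRONG_THRESHOLD = 0.8
SOFT_THRESHOLD = 0.5


def grade(pairs):
    """Split pairs into strong control anchors and soft guide anchors.

    Pairs below the soft threshold are treated as abnormal matches and dropped.
    """
    strong, soft = [], []
    for p in pairs:
        if p.confidence >= STRONG_THRESHOLD:
            strong.append(p)
        elif p.confidence >= SOFT_THRESHOLD:
            soft.append(p)
    return strong, soft


def piecewise_affine(strong):
    """Return a function mapping source time to reference time.

    Each segment between neighbouring strong anchors gets its own affine map,
    which absorbs a constant delay plus linear clock drift on that segment.
    Assumes at least two strong anchors with distinct source times.
    """
    pts = sorted((p.t_src, p.t_ref) for p in strong)
    if len(pts) < 2:
        raise ValueError("need at least two strong anchors")
    xs = [x for x, _ in pts]

    def to_ref(t):
        # Pick the segment containing t; extrapolate with the end segments.
        i = min(max(bisect_right(xs, t) - 1, 0), len(pts) - 2)
        (x0, y0), (x1, y1) = pts[i], pts[i + 1]
        return y0 + (t - x0) * (y1 - y0) / (x1 - x0)

    return to_ref


pairs = [
    AnchorPair(t_ref=1.0, t_src=1.5, confidence=0.90),
    AnchorPair(t_ref=11.0, t_src=11.6, confidence=0.95),
    AnchorPair(t_ref=5.0, t_src=5.5, confidence=0.60),
    AnchorPair(t_ref=7.0, t_src=9.0, confidence=0.20),  # rejected
]
strong, soft = grade(pairs)
to_ref = piecewise_affine(strong)
```

In this sketch the soft anchors are only collected; a fuller implementation would use them to constrain the elastic warping between neighbouring strong anchors.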

Inventors

WEI ZIRUI
HE RUITING
HAN SHUOCHEN
GUO LEI
SUN YI
YI ZHONGLIN
LIU XUEWU
WANG JIE
YUE WEIPENG
WANG YUFAN
WANG LI
YU HAIXIA
ZHANG RU
WANG TIANQI
LIU YAN
WANG LONGYU
LIU YING
ZHU MEIYING
HU HAO

Assignees

State Grid Jibei Electric Power Co., Ltd. Metering Center (国网冀北电力有限公司计量中心)

Dates

Publication Date: 2026-05-08
Application Date: 2025-09-16

Claims (5)

1. A business window multi-source recording data fusion intelligent analysis system driven by a digital tablet, characterized by comprising a multi-source anchor point detection and quadruple generation module, an event alignment graph construction and anchor point pairing strategy module, and a multi-scale elastic time warping and synchronous resampling module; the multi-source anchor point detection and quadruple generation module is used for executing anchor point detection in parallel on each audio stream to generate anchor point structure quadruples, each anchor point structure quadruple comprising the timestamp of the anchor point in its audio source, the type label of the anchor point, the anchor point detection confidence score, and a time scale tolerance range; the event alignment graph construction and anchor point pairing strategy module is used for defining a current sliding time window with the digital tablet audio stream as the main reference source, and synchronously clipping the anchor point structure quadruples of all audio sources according to a uniform time axis clipping strategy; when any audio source has zero anchor points after uniform time axis clipping, the cross sliding matching strategy is started directly and the original anchor point sets of all audio sources are retained; when every audio source retains anchor points after uniform time axis clipping, the short-time RMS energy fluctuation rate of the main reference source and the cross-source anchor point synchronization rate are calculated, and, according to the comparison between their weighted-sum composite score and a preset switching threshold, the module either continues to use the uniform time axis clipping strategy or switches to the cross sliding matching strategy; under the cross sliding matching strategy, the maximum matching sliding window width on source n is dynamically calculated for each anchor point in source m, the maximum matching sliding window width being proportional to the weighted sum of an anchor point time divergence index and an anchor point density variance; the multi-scale elastic time warping and synchronous resampling module is used for grading the joint confidence of candidate anchor point pairs, wherein candidate anchor point pairs above a strong constraint threshold are marked as strong alignment control anchor points and serve as key control points of the piecewise time mapping, candidate anchor point pairs between the soft constraint threshold and the strong constraint threshold are marked as soft alignment guide anchor points and provide weak guidance, and candidate anchor point pairs below the soft constraint threshold are judged to be abnormal matches and are removed; a time mapping function is built hierarchically based on the grading results: at the coarse scale, a piecewise affine time mapping function is built with the strong alignment control anchor points as reference points to correct global delay and linear drift; at the medium scale, constrained elastic time warping is executed between adjacent strong alignment control anchor points in combination with the soft alignment guide anchor points lying between them; at the fine scale, frame-level fine adjustment is performed within overlapping voice activity segments based on Mel spectrum distance or coherent phase difference; and a frame-level synchronized voice data stream is output.
2. The business window multi-source recording data fusion intelligent analysis system driven by a digital tablet according to claim 1, wherein the anchor point detection comprises three types of detectors: a semantic anchor detector, which detects preset business keywords using a lightweight keyword recognition model and generates semantic anchor points; an acoustic anchor detector, which identifies abrupt acoustic change events through short-time energy and spectral flux analysis; and a prompt tone anchor detector, which identifies fixed-frequency prompt tones using a template matching filtering algorithm.
3. The business window multi-source recording data fusion intelligent analysis system driven by a digital tablet, characterized in that the cross-source anchor point synchronization rate is defined as follows: the anchor point structure quadruples of all audio sources in the current sliding time window are traversed; for each anchor point in a given audio source, it is judged whether any other audio source contains an anchor point of the same type at a nearby time position; if so, the anchor point pair is regarded as one valid cross-source anchor point synchronization event; the number of anchor point pairs forming cross-source synchronization events among all anchor points in the current window is then counted, and the cross-source anchor point synchronization rate is calculated from that number.
4. The business window multi-source recording data fusion intelligent analysis system driven by a digital tablet according to claim 1, wherein the anchor point time divergence index is quantified by the mean of the maximum time differences between anchor points of the same type in different sources, and the anchor point density variance is quantified by the degree of dispersion of the anchor point counts across the sources.
5. The business window multi-source recording data fusion intelligent analysis system driven by a digital tablet according to claim 1, wherein the audio stream sources comprise a wearable digital tablet microphone, a counter array microphone and a business window environment pickup device.
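
The anchor point quadruple, the cross-source synchronization rate of claim 3, the anchor point density variance of claim 4 and the strategy switch of claim 1 can be sketched in Python as follows. This is a sketch under stated assumptions, not the patented implementation: the claims do not give the exact rate formula, the meaning of "close in time" (`max_gap`), the weights, the switching threshold, or which side of the threshold selects which strategy, so all of those are hypothetical here. The energy fluctuation rate and the time divergence index are passed in as plain numbers rather than computed.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass(frozen=True)
class Anchor:
    timestamp: float   # time of the anchor in its own audio source (seconds)
    kind: str          # type label, e.g. "semantic", "acoustic", "tone"
    confidence: float  # detection confidence score
    tolerance: float   # time scale tolerance range (seconds)


def clip(anchors, start, end):
    """Uniform time axis clipping: keep anchors inside the sliding window."""
    return [a for a in anchors if start <= a.timestamp < end]


def sync_rate(sources, max_gap=0.5):
    """Fraction of anchors with a same-type, nearby anchor in another source.

    Assumption: the rate is synced anchors divided by all anchors in the window.
    """
    total = synced = 0
    for name, anchors in sources.items():
        for a in anchors:
            total += 1
            if any(b.kind == a.kind and abs(b.timestamp - a.timestamp) <= max_gap
                   for other, bs in sources.items() if other != name
                   for b in bs):
                synced += 1
    return synced / total if total else 0.0


def density_variance(sources):
    """Dispersion of the per-source anchor counts (population variance)."""
    return pvariance([len(anchors) for anchors in sources.values()])


def max_window_width(divergence, dens_var, alpha=0.7, beta=0.3, k=1.0):
    """Width proportional to a weighted sum; alpha, beta and k are hypothetical."""
    return k * (alpha * divergence + beta * dens_var)


def choose_strategy(sources, energy_fluctuation,
                    w_energy=0.4, w_sync=0.6, switch_threshold=0.5):
    """Pick a pairing strategy for the current window."""
    # Any source left without anchors after clipping: go straight to cross
    # sliding matching (the caller then keeps the original, unclipped sets).
    if any(len(anchors) == 0 for anchors in sources.values()):
        return "cross_sliding"
    score = w_energy * energy_fluctuation + w_sync * sync_rate(sources)
    # Assumed direction: a score at or above the threshold keeps uniform clipping.
    return "uniform_clipping" if score >= switch_threshold else "cross_sliding"


window = {
    "tablet": [Anchor(1.0, "tone", 0.90, 0.1), Anchor(4.0, "semantic", 0.80, 0.2)],
    "counter": [Anchor(1.2, "tone", 0.85, 0.1), Anchor(6.0, "acoustic", 0.70, 0.3)],
    "ambient": [Anchor(1.3, "tone", 0.60, 0.1)],
}
rate = sync_rate(window)
strategy = choose_strategy(window, energy_fluctuation=0.5)
```

Here the three prompt tone anchors line up across sources while the semantic and acoustic anchors have no counterpart, so three of the five anchors count as synchronized.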

Description

Business window multi-source recording data fusion intelligent analysis system driven by a digital tablet

Technical Field

The invention relates to the technical field of voice processing, in particular to a business window multi-source recording data fusion intelligent analysis system driven by a digital tablet.

Background

With growing requirements for data traceability and service quality analysis in scenarios such as intelligent government affairs, financial services and customer service windows, multi-source audio acquisition and fusion processing have become key links in intelligent business systems. In practical deployments, such a system is often connected to multiple heterogeneous audio sources, including wearable devices (such as digital tablet microphones), fixed pickup devices (such as counter microphones) and environmental monitoring devices, which record speech, service prompt tones and user responses during business processes. However, because the acquisition devices are placed at different positions, sample asynchronously and suffer from clock drift, there are often obvious time offsets and structural misalignment between the audio streams, which severely limits the performance of downstream modules such as speech recognition, speaker separation and behavior audit. In the prior art, most multi-channel recording systems perform only coarse alignment based on timestamps or static templates, lack a structured registration method at the level of actual interaction events, tolerate asynchrony between audio sources poorly, and cannot cope effectively with complicating factors such as noisy environments, inconsistent clocks and changes in speech rate.
Meanwhile, alignment quality cannot be quantitatively evaluated and there is no dynamic adjustment mechanism based on confidence feedback, so system stability and the accuracy of dialogue reconstruction are difficult to guarantee. Therefore, a multi-source recording fusion analysis system with anchor point identification capability, a structured pairing mechanism and a multi-scale elastic alignment strategy is needed, one that can achieve robust, high-precision alignment and time unification of asynchronous audio streams in complex business environments and thereby meet the requirements of intelligent analysis and compliance audit.

Disclosure of Invention

To overcome the above drawbacks of the prior art, embodiments of the present invention provide a business window multi-source recording data fusion intelligent analysis system driven by a digital tablet, so as to solve the problems set forth in the background above.

To achieve this purpose, the present invention provides the following technical solution: the business window multi-source recording data fusion intelligent analysis system driven by a digital tablet comprises the following modules:

The multi-source anchor point detection and quadruple generation module is used for executing anchor point detection in parallel on each audio stream to generate anchor point structure quadruples, wherein each anchor point structure quadruple comprises the timestamp of the anchor point in its audio source, the type label of the anchor point, the anchor point detection confidence score and a time scale tolerance range;

The event alignment graph construction and anchor point pairing strategy module is used for selecting either the cross sliding matching strategy or the uniform time axis clipping strategy according to an anchor point time axis clipping strategy selection mechanism, constructing a cross-source event alignment graph, and searching for the anchor point pairs with the minimum matching cost;

The multi-scale elastic time warping and synchronous resampling module establishes a time mapping function based on layering of the anchor point pairs by confidence level, achieves multi-scale alignment through affine mapping, constrained elastic time warping and frame-level fine adjustment, and outputs a frame-level synchronized voice data stream.

In a preferred embodiment, the anchor point detection comprises three types of detectors: a semantic anchor detector, which detects preset business keywords using a lightweight keyword recognition model and generates semantic anchor points; an acoustic anchor detector, which identifies abrupt acoustic change events through short-time energy and spectral flux analysis; and a prompt tone anchor detector, which identifies fixed-frequency prompt tones using a template matching filtering algorithm.

In a preferred embodiment, the anchor point time axis clipping strategy selection mechanism is dynamically triggered as follows: the digital tablet audio stream is taken as the main reference source and a sliding time window is defined; if any audio source has zero anchor points after the uniform time axis clipping strategy is applied, the cross sliding matching strategy is started; otherwise