
CN-122024750-A - Cloud voice assistant system based on directional pickup and noise reduction of microphone array

CN122024750A

Abstract

The invention relates to the technical field of voice signal processing and discloses a cloud voice assistant system based on directional pickup and noise reduction of a microphone array. The system comprises a reference feature extraction module, which extracts service-personnel reference voice features; a direction determination module, which determines a candidate main direction, a direction score, and a direction attribute; a heuristic evaluation module, which determines a compatibility and a session state; a directional noise reduction module, which generates a main beam output signal and performs directional noise reduction; a dereverberation enhancement module, which obtains a main session enhanced voice signal; a segment packaging module, which generates encapsulated valid voice data packets; and a cloud processing module, which parses the data packets and obtains a structured return result. The invention achieves directional enhancement and extraction of the main conversation voice and accurate cloud voice assistant processing in complex service scenes.

Inventors

  • LIU TONGXU
  • LIU WEIMIN
  • YI FEI
  • WANG YUNCHUAN
  • WU JIAQI
  • WANG CHUANGYE
  • QIU DEGUI
  • WANG YU
  • SHEN SHILIN
  • JIN JIAJIA
  • LIU TONGTONG
  • HE ZHIFAN
  • SHANG SHUAI

Assignees

  • 国网安徽省电力有限公司蚌埠供电公司 (State Grid Anhui Electric Power Co., Ltd., Bengbu Power Supply Company)

Dates

Publication Date
2026-05-12
Application Date
2026-03-24

Claims (10)

  1. A cloud voice assistant system based on directional pickup and noise reduction of a microphone array, characterized in that it comprises: a reference feature extraction module, configured to acquire a near-end voice signal and a multi-channel far-end voice signal and to extract service-personnel reference voice features from the near-end voice signal; a direction determination module, configured to acquire service dialogue axis spatial sector parameters and to perform direction determination based on the multi-channel far-end voice signals and those parameters, obtaining a candidate main direction, a direction score, and a direction attribute; a heuristic evaluation module, configured to perform heuristic beamforming on the multi-channel far-end voice signals based on the candidate main direction to obtain a heuristic beam output signal, to determine a compatibility based on the heuristic beam output signal and the service-personnel reference voice features, and to determine a session state based on the direction attribute and the compatibility; a directional noise reduction module, configured to generate a main beam weight and a suppression beam weight based on the session state and the candidate main direction, to obtain a main beam output signal and a suppression beam output signal from those weights and the multi-channel far-end voice signal, and to perform directional noise reduction based on the two beam outputs, obtaining a directionally noise-reduced main beam output signal; a dereverberation enhancement module, configured to perform dereverberation on the directionally noise-reduced main beam output signal to obtain a main session enhanced voice signal; a segment packaging module, configured to determine session-valid segments based on the compatibility, the direction score, and the main session enhanced voice signal, and to obtain encapsulated valid voice data packets from the main session enhanced voice signal and the structured metadata corresponding to each session-valid segment; and a cloud processing module, configured to parse the encapsulated valid voice data packets into the main session enhanced voice signal and the structured metadata, and to process the main session enhanced voice signal according to the session state in the structured metadata, obtaining a structured return result.
  2. The cloud voice assistant system based on directional pickup and noise reduction of a microphone array according to claim 1, wherein acquiring the near-end voice signal and the multi-channel far-end voice signal and extracting the service-personnel reference voice features from the near-end voice signal comprises: Step 11, applying unified preprocessing to the near-end voice signal and the multi-channel far-end voice signal, the unified preprocessing comprising sampling-rate unification, amplitude normalization based on each voice signal's maximum amplitude, and DC component removal, to obtain a standardized near-end voice signal and standardized multi-channel far-end voice signals; Step 12, framing the standardized near-end voice signal and the standardized multi-channel far-end voice signals with identical framing parameters and transforming each to the frequency domain, to obtain a near-end frequency-domain signal and multi-channel far-end frequency-domain signals; and Step 13, extracting the service-personnel reference voice features from the valid near-end voice frames, the features comprising near-end frame energy, speech-band energy ratio, amplitude envelope, frequency-domain energy center position, and time-frequency occupation interval.
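The preprocessing and framing of Steps 11–12 in claim 2 can be illustrated with a minimal NumPy sketch. The frame length, hop size, window, and sampling rate below are illustrative assumptions; the patent only specifies that framing parameters are shared across signals.

```python
import numpy as np

def preprocess(x):
    """Step 11 (sketch): remove the DC component, then normalize by the peak amplitude."""
    x = x - np.mean(x)                     # DC component removal
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def frame_and_fft(x, frame_len=512, hop=256):
    """Step 12 (sketch): frame with fixed parameters and transform to the frequency domain."""
    n = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])
    return np.fft.rfft(frames * np.hanning(frame_len), axis=1)

# 440 Hz tone with a DC offset, 16 kHz sampling (synthetic test input)
x = preprocess(0.3 + 0.5 * np.sin(2 * np.pi * 440 * np.arange(4096) / 16000))
spec = frame_and_fft(x)                    # shape: (num_frames, num_bins)
```

After preprocessing, the DC offset is gone and the peak amplitude is 1; each row of `spec` is one frame's one-sided spectrum.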
  3. The cloud voice assistant system based on directional pickup and noise reduction of a microphone array according to claim 2, wherein extracting the frequency-domain energy center position and the time-frequency occupation interval in Step 13 comprises: taking the energy value of each frequency bin as a weight, computing the weighted sum over the frequency bins, and dividing the weighted sum by the total energy over the bins, to obtain the frequency-domain energy center position of the corresponding valid near-end voice frame; and, for the near-end frequency-domain signal of each valid near-end voice frame, determining the range of frequency bins whose energy meets a preset occupation threshold, taking that range as the frequency-domain occupation interval of the frame, determining the corresponding time interval from the span of consecutive valid near-end voice frames, and taking the time interval together with the frequency-domain occupation interval as the time-frequency occupation interval.
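The two quantities in claim 3 are an energy-weighted spectral centroid and a thresholded bin range. A minimal sketch, assuming a threshold of 10% of the peak bin energy (the patent leaves the occupation threshold as a preset):

```python
import numpy as np

def centroid_and_occupation(power, freqs, occ_ratio=0.1):
    """Energy-weighted centroid and occupation interval for one frame's power spectrum."""
    total = power.sum()
    centroid = (freqs * power).sum() / total          # weighted sum / total energy
    idx = np.nonzero(power >= occ_ratio * power.max())[0]
    return centroid, (freqs[idx.min()], freqs[idx.max()])

freqs = np.array([0.0, 100.0, 200.0, 300.0, 400.0])   # illustrative bin frequencies
power = np.array([0.0, 1.0, 8.0, 1.0, 0.0])
c, (f_lo, f_hi) = centroid_and_occupation(power, freqs)
```

For this symmetric spectrum the centroid lands at 200 Hz and the occupation interval spans the bins from 100 Hz to 300 Hz.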
  4. The cloud voice assistant system based on directional pickup and noise reduction of a microphone array according to claim 1, wherein acquiring the service dialogue axis spatial sector parameters and performing direction determination based on the multi-channel far-end voice signals and those parameters, obtaining the candidate main direction, the direction score, and the direction attribute, comprises: Step 21, acquiring the service dialogue axis spatial sector parameters, which comprise a service-personnel sector, a client sector, and a forbidden angle set, the service-personnel sector and the client sector together forming the allowed search angle set, wherein the service-personnel sector is the preset angular range from the fixed microphone array toward the service-personnel station area, the client sector is the preset angular range from the fixed microphone array toward the client station area, and the forbidden angle set is the preset angular range excluded from candidate main directions; Step 22, discretizing the allowed search angle set at a preset angular resolution to obtain a candidate direction set, determining, from the spatial positions of the channels of the fixed microphone array and each candidate direction, the path difference of the sound wave arriving at each channel, and determining from the path difference the theoretical propagation delay of each channel pair for each candidate direction; and Step 23, aligning the frequency-domain phase differences of each channel pair in the multi-channel far-end frequency-domain signals according to the theoretical propagation delays, accumulating the aligned frequency-domain correlation of each channel pair over the frequency-bin range, and summing over channel pairs to obtain the direction score of each candidate direction, wherein the candidate direction with the largest score is taken as the candidate main direction and its score as the direction score; the direction attribute is set to direction-pending when the gap between the largest and second-largest direction scores does not meet a preset distinguishing threshold, to service-personnel direction candidate when the gap meets the threshold and the candidate main direction lies in the service-personnel sector, and to client direction candidate when the gap meets the threshold and the candidate main direction lies in the client sector.
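The phase-aligned, pair-wise correlation scoring of Step 23 resembles steered-response power with phase-transform weighting (SRP-PHAT). Below is a minimal far-field, two-microphone sketch; the geometry, frequency grid, and PHAT normalization are illustrative assumptions, not the patent's presets.

```python
import numpy as np

def direction_scores(X, mic_pos, angles_deg, freqs, c=343.0):
    """Score each candidate direction by phase-aligning and summing the
    cross-spectra of every channel pair (claim 4, Step 23, simplified 2-D)."""
    scores = []
    for ang in np.deg2rad(angles_deg):
        u = np.array([np.cos(ang), np.sin(ang)])          # candidate source direction
        s = 0.0
        for i in range(len(mic_pos)):
            for j in range(i + 1, len(mic_pos)):
                tau = (mic_pos[i] - mic_pos[j]) @ u / c    # path difference / sound speed
                cross = X[i] * np.conj(X[j])
                cross = cross / (np.abs(cross) + 1e-12)    # PHAT-style normalization
                s += np.real(np.sum(cross * np.exp(-2j * np.pi * freqs * tau)))
        scores.append(s)
    return np.array(scores)

# Two mics 10 cm apart; simulate a far-field source at 60 degrees.
mic_pos = np.array([[0.0, 0.0], [0.1, 0.0]])
freqs = np.arange(1, 129) * 62.5
true_dir = np.array([np.cos(np.deg2rad(60.0)), np.sin(np.deg2rad(60.0))])
X = np.array([np.exp(2j * np.pi * freqs * (p @ true_dir) / 343.0) for p in mic_pos])
angles = np.arange(0, 181, 5)                              # discretized search set
scores = direction_scores(X, mic_pos, angles, freqs)
```

The candidate direction with the largest score recovers the simulated 60° arrival; the gap to the runner-up would then be checked against the distinguishing threshold.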
  5. The cloud voice assistant system based on directional pickup and noise reduction of a microphone array according to claim 1, wherein performing heuristic beamforming on the multi-channel far-end voice signals based on the candidate main direction to obtain the heuristic beam output signal, determining the compatibility based on the heuristic beam output signal and the service-personnel reference voice features, and determining the session state based on the direction attribute and the compatibility, comprises: Step 31, determining, from the candidate main direction and the spatial positions of the channels of the fixed microphone array, each channel's frequency-domain steering delay toward the candidate main direction, setting for each channel a phase compensation amount at each frequency bin according to that steering delay, and generating the heuristic beam weight from the phase compensation amounts; Step 32, extracting, from the heuristic beam output signal and its time-domain counterpart, the heuristic frame energy, heuristic speech-band energy ratio, heuristic amplitude envelope, heuristic frequency-domain energy center position, and heuristic time-frequency occupation interval, and comparing these five features one by one with the like-named features in the service-personnel reference voice features to obtain five consistency values; and Step 33, determining the session state from the direction attribute, the direction score, and the compatibility, wherein when the direction attribute is direction-pending, the direction score is below a preset threshold, or the compatibility falls in a preset fuzzy interval, the session state is determined to be the alternate-pending state; when none of those conditions is met, the direction attribute is a service-personnel direction candidate, and both the compatibility and the direction score reach their preset thresholds, the session state is determined to be the service-personnel main-speaking state; and when the direction attribute is a client direction candidate, the compatibility does not reach its preset threshold, and the direction score reaches its preset threshold, the session state is determined to be the client main-speaking state.
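The state decision of Step 33 in claim 5 is a small rule table. A sketch follows; all threshold values, the fuzzy interval, and the state/attribute names are illustrative assumptions, since the patent leaves them as presets.

```python
def session_state(direction_attr, direction_score, compatibility,
                  score_thresh=0.5, compat_thresh=0.6, fuzzy=(0.35, 0.55)):
    """Claim 5, Step 33 (sketch): map direction attribute, direction score,
    and compatibility to a session state."""
    if (direction_attr == "pending"
            or direction_score < score_thresh
            or fuzzy[0] <= compatibility <= fuzzy[1]):
        return "alternate_pending"          # any ambiguity -> pending state
    if direction_attr == "staff_candidate" and compatibility >= compat_thresh:
        return "staff_speaking"             # matches the staff reference features
    if direction_attr == "client_candidate" and compatibility < compat_thresh:
        return "client_speaking"            # confident direction, non-staff voice
    return "alternate_pending"

state = session_state("staff_candidate", 0.8, 0.9)
```

A high-compatibility, high-score staff candidate yields the staff main-speaking state; a low score or fuzzy compatibility always falls back to the alternate-pending state.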
  6. The cloud voice assistant system based on directional pickup and noise reduction of a microphone array according to claim 5, wherein the consistency comparison in Step 32 comprises: computing the difference between the heuristic frame energy and the near-end frame energy in the service-personnel reference voice features, and determining an energy consistency value from that difference; computing the difference between the heuristic speech-band energy ratio and the speech-band energy ratio in the reference features, and determining a band consistency value from that difference; comparing the heuristic amplitude envelope with the amplitude envelope in the reference features, and determining an envelope consistency value from the comparison result; computing the positional deviation between the heuristic frequency-domain energy center position and the reference frequency-domain energy center position, and determining a center-position consistency value from that deviation; and determining the degree of overlap between the heuristic time-frequency occupation interval and the reference time-frequency occupation interval, and determining an occupation-interval consistency value from that overlap.
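The last consistency value in claim 6 requires a degree of overlap between two occupation intervals. The patent does not fix the overlap measure; intersection-over-union is one natural choice, sketched here for one axis of the interval:

```python
def interval_overlap(a, b):
    """Overlap degree of two intervals (lo, hi), as intersection-over-union.
    IoU is an illustrative assumption; the patent only asks for an overlap degree."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    inter = max(0.0, hi - lo)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

For the time-frequency case the same measure can be applied to the time interval and the frequency interval and the two results combined.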
  7. The cloud voice assistant system based on directional pickup and noise reduction of a microphone array according to claim 1, wherein generating the main beam weight and the suppression beam weight based on the session state and the candidate main direction, obtaining the main beam output signal and the suppression beam output signal from those weights and the multi-channel far-end voice signal, and performing directional noise reduction based on the two beam outputs to obtain the directionally noise-reduced main beam output signal, comprises: Step 41, determining a main target direction range centered on the candidate main direction from the candidate main direction, the session state, and the service dialogue axis spatial sector parameters, using a first direction-range width when the session state is the service-personnel or client main-speaking state and a second width larger than the first when the session state is the alternate-pending state, and collecting the directions within the allowed search range that lie outside the main target direction range, together with the directions in the forbidden angle set, into a suppression direction set; Step 42, accumulating the array direction responses of the directions in the suppression direction set item by item and normalizing the result to obtain a suppression-direction constraint matrix, then taking the array direction response of the candidate main direction as a target pass constraint and jointly solving the target pass constraint with the suppression-direction constraint matrix to obtain the main beam weight; Step 43, converging the array direction responses of the directions in the suppression direction set along the direction dimension to obtain a suppression-direction representative response characterizing the overall directional behavior of the set; Step 44, weighting and summing the channels of the multi-channel far-end voice signal with the main beam weight to obtain the main beam output signal, and with the suppression beam weight to obtain the suppression beam output signal, the two outputs corresponding to the same processing frame, the same frequency-domain component, and the same candidate main direction; and Step 45, determining, from the two beam outputs, the ratio of the suppression beam output energy to the sum of the main and suppression beam output energies to obtain a directional interference ratio, selecting a suppression intensity according to the session state, namely a first intensity when the session state is the service-personnel or client main-speaking state and a second intensity lower than the first when the session state is the alternate-pending state, determining a directional noise-reduction gain from the directional interference ratio and the suppression intensity, and applying gain control to the main beam output signal to obtain the directionally noise-reduced main beam output signal.
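Step 45 of claim 7 maps a directional interference ratio and a state-dependent suppression intensity to a gain. The gain form `(1 - ratio) ** intensity` below is an assumption for illustration; the patent only specifies that the gain is derived from both quantities, with a stronger intensity in a settled main-speaker state.

```python
import numpy as np

def directional_gain(main, supp, state):
    """Claim 7, Step 45 (sketch): per-component directional noise-reduction gain."""
    e_main = np.abs(main) ** 2
    e_supp = np.abs(supp) ** 2
    ratio = e_supp / (e_main + e_supp + 1e-12)   # directional interference ratio
    intensity = 2.0 if state in ("staff_speaking", "client_speaking") else 1.0
    return (1.0 - ratio) ** intensity            # assumed gain law

main = np.array([1.0, 1.0])       # main beam components (clean / interfered)
supp = np.array([0.0, 1.0])       # suppression beam components
g_main = directional_gain(main, supp, "staff_speaking")
g_pend = directional_gain(main, supp, "alternate_pending")
```

A component with no suppression-beam energy passes unattenuated; one with equal energies is attenuated more in the main-speaking state than in the pending state, matching the claimed intensity ordering.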
  8. The cloud voice assistant system based on directional pickup and noise reduction of a microphone array according to claim 1, wherein performing dereverberation on the directionally noise-reduced main beam output signal to obtain the main session enhanced voice signal comprises: Step 51, determining, from the directionally noise-reduced main beam output signal, the current frame energy for each processing frame and each frequency-domain component, and building the delayed-frame energy sequence for that current frame energy from several consecutive historical processing frames delayed by a preset initial frame count relative to the current processing frame, the sequence consisting of the signal energies of those consecutive historical frames; Step 52, accumulating the signal energies in the delayed-frame energy sequence item by item to obtain a historical energy accumulation value, and normalizing it by the number of delayed processing frames to obtain the late-reverberation energy estimate for the current frame energy; Step 53, determining a direct-sound retention coefficient from the current frame energy, the late-reverberation energy estimate, and the session state, wherein a first state suppression coefficient is used when the session state is the service-personnel or client main-speaking state and a second state suppression coefficient lower than the first when the session state is the alternate-pending state; the state suppression coefficient is multiplied by the late-reverberation energy estimate to obtain the modulated late-reverberation energy estimate, and the direct-sound retention coefficient is the ratio of the current frame energy, as numerator, to the sum of the current frame energy and the modulated late-reverberation energy estimate, as denominator; Step 54, determining the smoothed direct-sound retention coefficient of the current processing frame from the current frame's coefficient and the previous frame's smoothed coefficient, wherein the previous smoothed coefficient is weighted by a preset smoothing coefficient, the current coefficient by the complementary weight, and the two weighted parts are summed into a recursive fusion result; that result is compared with a preset coefficient lower bound, and the value not lower than the bound is the smoothed direct-sound retention coefficient of the current frame; and Step 55, applying gain control to the directionally noise-reduced main beam output signal with the smoothed direct-sound retention coefficient, wherein for each processing frame and each frequency-domain component, the smoothed coefficient is multiplied by the corresponding component of the beam output to obtain the enhanced signal component, and the enhanced components are combined into the main session enhanced voice signal.
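Steps 52–54 of claim 8 amount to a late-reverberation estimate (mean of delayed-frame energies), a Wiener-like retention ratio, and first-order recursive smoothing with a floor. A sketch for one time-frequency component; the smoothing coefficient, floor, and state coefficients are illustrative presets.

```python
import numpy as np

def direct_retention(curr_energy, delayed_energies, state, prev_smooth,
                     alpha=0.7, floor=0.1):
    """Claim 8, Steps 52-54 (sketch): smoothed direct-sound retention coefficient."""
    late = np.mean(delayed_energies)                   # Step 52: normalized accumulation
    coef = 1.0 if state in ("staff_speaking", "client_speaking") else 0.5
    g = curr_energy / (curr_energy + coef * late)      # Step 53: retention ratio
    smooth = alpha * prev_smooth + (1 - alpha) * g     # Step 54: recursive fusion
    return max(smooth, floor)                          # floored at the lower bound

r = direct_retention(1.0, [0.5, 0.5, 0.5, 0.5], "staff_speaking", prev_smooth=0.8)
```

Step 55 then multiplies each spectral component by its smoothed coefficient; the floor keeps the gain from collapsing during reverberant tails.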
  9. The cloud voice assistant system based on directional pickup and noise reduction of a microphone array according to claim 1, wherein determining the session-valid segments based on the compatibility, the direction score, and the main session enhanced voice signal, and obtaining the encapsulated valid voice data packets from the main session enhanced voice signal and the structured metadata corresponding to the session-valid segments, comprises: Step 61, weighting the compatibility by a preset compatibility weight and the direction score by a preset direction-score weight, and summing the two weighted parts to obtain a frame-level validity decision value for each processing frame; Step 62, determining, from the main session enhanced voice signal, the enhanced-frame energy of each processing frame, comparing the frame-level validity decision value with a preset frame-level validity threshold, and, combined with whether the enhanced-frame energy is greater than zero, determining each processing frame's valid-frame flag; Step 63, taking runs of consecutive processing frames flagged valid as candidate continuous frame intervals, recording each interval's segment start and stop frames, comparing each interval's frame length with a preset minimum segment length threshold, retaining the intervals that meet the threshold as session-valid segments, and merging adjacent session-valid segments whose gap in frames does not exceed a preset segment-gap merging threshold; Step 64, cutting, according to each session-valid segment's start and stop frames, the corresponding processing frames and frequency-domain components out of the main session enhanced voice signal to obtain the segment enhanced voice, and generating the segment enhanced voice's structured metadata from the session states, candidate main directions, compatibilities, and direction scores of the processing frames in the session-valid segment; and Step 65, binding and packaging the segment enhanced voice and the structured metadata of the same session-valid segment in a unified data packet format, the packaging result being the encapsulated valid voice data packet of that session-valid segment.
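Step 63 of claim 9, run detection with a minimum length and gap merging, can be sketched directly. The minimum segment length and merging gap are illustrative presets.

```python
def valid_segments(flags, min_len=3, merge_gap=2):
    """Claim 9, Step 63 (sketch): runs of valid frames become candidate segments,
    short runs are dropped, and near-adjacent segments are merged."""
    segs, start = [], None
    for i, f in enumerate(flags + [0]):            # sentinel closes a final run
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start >= min_len:               # minimum segment length check
                segs.append([start, i - 1])
            start = None
    merged = []
    for s in segs:
        if merged and s[0] - merged[-1][1] - 1 <= merge_gap:
            merged[-1][1] = s[1]                   # merge across a short gap
        else:
            merged.append(s)
    return [tuple(s) for s in merged]

flags = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1]
segs = valid_segments(flags)
```

Here the one-frame run at index 10 is dropped, the first two runs merge across a two-frame gap, and the final run stays separate.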
  10. The cloud voice assistant system based on directional pickup and noise reduction of a microphone array according to claim 1, wherein parsing the encapsulated valid voice data packet into the main session enhanced voice signal and the structured metadata, and processing the main session enhanced voice signal according to the session state in the structured metadata to obtain the structured return result, comprises: Step 71, parsing the encapsulated valid voice data packet, extracting its voice content field and metadata field, taking the voice content field as the main session enhanced voice signal and the metadata field as the structured metadata, the structured metadata comprising at least the session state, the candidate main direction, the compatibility, the direction score, and the segment boundaries; Step 72, determining from the structured metadata the processing mode corresponding to the session state, namely the service response processing mode when the session state is the service-personnel main-speaking state, the client request processing mode when it is the client main-speaking state, and the pending buffer processing mode when it is the alternate-pending state, while the candidate main direction, compatibility, direction score, and segment boundaries in the structured metadata are extracted in parallel; Step 73, performing recognition on the main session enhanced voice signal according to the processing mode to obtain a recognition text, and performing semantic processing on the recognition text, the processing mode, and the structured metadata to obtain a semantic processing result, wherein when the processing mode is the client request processing mode, request intent recognition and request element extraction are performed on the recognition text; and Step 74, generating the structured return result from the recognition text, the semantic processing result, and the structured metadata, wherein the recognition text is taken as the recognition text field of the structured return result, the semantic processing result as the semantic processing result field, and the session state, candidate main direction, compatibility, direction score, and segment boundaries as their corresponding fields in the structured return result.
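The cloud-side routing of claim 10 can be sketched with plain dictionaries. The packet layout, field names, and mode names are assumptions for illustration; the recognition and semantic steps are stubbed out, since the patent does not specify their implementation.

```python
def build_return(packet):
    """Claim 10 (sketch): route by session state and assemble the structured
    return result. ASR and semantic processing are placeholders."""
    meta = packet["metadata"]
    mode = {"staff_speaking": "service_response",     # Step 72 mode mapping
            "client_speaking": "client_request",
            "alternate_pending": "pending_buffer"}[meta["session_state"]]
    text = "<recognized text>"                        # Step 73 ASR stub
    semantics = {"intent": "query"} if mode == "client_request" else {}
    # Step 74: metadata fields are copied into the structured return result.
    return {"text": text, "semantics": semantics, "mode": mode, **meta}

packet = {"voice": b"...", "metadata": {
    "session_state": "client_speaking", "main_direction": 60,
    "compatibility": 0.2, "direction_score": 0.9, "segment_bounds": (0, 8)}}
result = build_return(packet)
```

A client-speaking packet is routed to the client request mode, triggering the intent-extraction branch, and every metadata field reappears in the return result.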

Description

Cloud voice assistant system based on directional pickup and noise reduction of microphone array

Technical Field

The invention belongs to the technical field of voice signal processing and particularly relates to a cloud voice assistant system based on directional pickup and noise reduction of a microphone array.

Background

As voice recognition and voice assistant technology is increasingly applied in business halls, service counters, government service windows, hospital guidance desks, on-site reception, and similar scenes, obtaining clear, recognizable target voice in a complex acoustic environment has become a key problem affecting subsequent voice interaction. In such scenes, service personnel and customers typically communicate face to face around a fixed area, and the target speech is often accompanied by nearby speakers, environmental noise, equipment operating noise, and reverberation reflected from walls, desktops, counters, and the like. These factors mix a large amount of interference into the collected voice signals, so that the target voice becomes distorted, blurred, overlapped, and smeared, which in turn degrades the accuracy of voice enhancement, voice recognition, and semantic understanding. Common voice enhancement methods in the prior art focus on general noise reduction, echo suppression, or array-based directional enhancement. Although they can improve voice clarity to some extent, they often lack a targeted processing mechanism for service scenes in which service personnel and clients speak alternately, the main speaking direction switches frequently, and near-end and far-end pickup conditions differ markedly.
In particular, when several people talk in close proximity or their voices overlap for short periods, existing schemes find it difficult to continuously and stably highlight the current main conversation voice while effectively suppressing non-target directional voices and late reverberation components. Moreover, most existing schemes feed the enhanced voice directly into the subsequent processing flow and lack an information organization that combines session state, direction attribute, and segment validity, making it hard to further improve the overall performance of a voice assistant system in a complex service environment.

Disclosure of Invention

The invention provides a cloud voice assistant system based on microphone-array directional pickup and noise reduction, which solves the technical problems in the related art that the main conversation target voice is difficult to enhance stably in complex service scenes and is easily affected by non-target directional voice and reverberation interference, resulting in low accuracy of cloud voice assistant recognition and semantic processing.
The invention provides a cloud voice assistant system based on directional pickup and noise reduction of a microphone array, which comprises: a reference feature extraction module, used to acquire a near-end voice signal and a multi-channel far-end voice signal and to extract service-personnel reference voice features from the near-end voice signal; a direction determination module, used to acquire the service dialogue axis spatial sector parameters and to perform direction determination based on the multi-channel far-end voice signals and those parameters, obtaining a candidate main direction, a direction score, and a direction attribute; a heuristic evaluation module, used to perform heuristic beamforming on the multi-channel far-end voice signals based on the candidate main direction to obtain a heuristic beam output signal, to determine a compatibility based on the heuristic beam output signal and the service-personnel reference voice features, and to determine a session state based on the direction attribute and the compatibility; a directional noise reduction module, used to generate a main beam weight and a suppression beam weight based on the session state and the candidate main direction, to obtain a main beam output signal and a suppression beam output signal from those weights and the multi-channel far-end voice signal, and to perform directional noise reduction based on the two beam outputs, obtaining a directionally noise-reduced main beam output signal; a dereverberation enhancement module, used to perform dereverberation on the directionally noise-reduced main beam output signal to obtain a main session enhanced voice signal; and a segment packaging module, used to determine a session-valid segment based on