CN-122017729-A - Conference room speaker positioning method, system, device and storage medium


Abstract

The application provides a conference room speaker positioning method, system, device and storage medium. A radar device collects first spatial information of candidate targets, and a microphone device collects second spatial information of sounding targets. An acoustic positioning compensation strategy corrects acoustic positioning errors, a target speaker is selected based on the spatial matching relationship between the first spatial information and the second spatial information, and the final spatial information of the target speaker is determined by combining the two types of spatial information. This enables high-precision positioning of the conference room speaker and addresses the technical problems of the related art, in which sound source positioning schemes that rely on beamforming are easily affected by position deviation, have insufficient positioning accuracy, and suffer greatly reduced sound pickup performance.

Inventors

FENG WANJIAN
LIN LIFENG

Assignees

厦门亿联网络技术股份有限公司

Dates

Publication Date: 20260512
Application Date: 20251231

Claims (20)

1. A conference room speaker positioning method, the method comprising: acquiring first spatial information of at least one candidate target output by a radar device; acquiring second spatial information of at least one sounding target output by a microphone device; determining a target speaker according to a spatial matching relationship between the first spatial information and the second spatial information; and determining spatial information of the target speaker based on the first spatial information and the second spatial information corresponding to the target speaker.
2. The method of claim 1, wherein determining the target speaker according to the spatial matching relationship between the first spatial information and the second spatial information comprises: for each sounding target, determining a first matching degree between the second spatial information of the sounding target and the first spatial information of each candidate target; and determining a sounding target whose first matching degree reaches a first preset threshold as the target speaker.
3. The method of claim 2, wherein the method further comprises: if a plurality of first matching degrees of the sounding target reach the first preset threshold, determining the first spatial information corresponding to the highest first matching degree of the sounding target as the first spatial information corresponding to the target speaker.
4. The method of claim 3, wherein the method further comprises: if there are a plurality of highest first matching degrees, taking the first spatial information corresponding to each highest first matching degree as the first spatial information corresponding to one target speaker, to obtain a plurality of pieces of first spatial information corresponding to a plurality of target speakers, wherein the plurality of target speakers are in one-to-one correspondence with the plurality of pieces of first spatial information.
5. The method according to any one of claims 1-4, wherein determining the spatial information of the target speaker based on the first spatial information and the second spatial information corresponding to the target speaker comprises: fusing, based on a preset fusion algorithm, the first spatial information and the second spatial information corresponding to the target speaker to obtain the spatial information; wherein the preset fusion algorithm comprises one of a weighted fusion algorithm, a tightly coupled fusion algorithm and a Bayesian fusion algorithm.
6. The method of claim 5, wherein fusing the first spatial information and the second spatial information corresponding to the target speaker based on the weighted fusion algorithm to obtain the spatial information comprises: acquiring a first weight corresponding to the first spatial information and a second weight corresponding to the second spatial information; and performing weighted fusion on the first spatial information and the second spatial information corresponding to the target speaker based on the first weight and the second weight to obtain the spatial information.
7. The method of claim 6, wherein the relationship between the first weight and the second weight satisfies one of: if the target speaker is in a dynamic and sounding state, the first weight is greater than the second weight; if the target speaker is in a static and sounding state, the first weight is smaller than the second weight; if there are two sounding targets whose second spatial information has a matching degree that reaches a second preset threshold, the first weight is greater than the second weight; if there are no two sounding targets whose second spatial information has a matching degree that reaches the second preset threshold, the first weight is smaller than or equal to the second weight.
8. The method according to any one of claims 1-4, 6 and 7, wherein determining the target speaker according to the spatial matching relationship between the first spatial information and the second spatial information comprises: acquiring third spatial information of at least one second candidate target output by a camera; and determining the target speaker according to the first spatial information, the second spatial information and the third spatial information; and wherein determining the spatial information of the target speaker based on the first spatial information and the second spatial information corresponding to the target speaker comprises: determining the spatial information of the target speaker based on the first spatial information, the second spatial information and the third spatial information corresponding to the target speaker.
9. The method of claim 8, wherein the method further comprises: acquiring, by the camera, an image based on spatial indication information, the spatial indication information including the first spatial information and/or the second spatial information; analyzing, by the camera, the image to obtain facial motion features of at least one third candidate target; and determining, by the camera, the at least one second candidate target from the at least one third candidate target based on a matching degree between the facial motion features and features in a preset sounding facial motion feature library.
10. The method of claim 9, wherein determining, by the camera, the at least one second candidate target from the at least one third candidate target based on the matching degree between the facial motion features and the features in the preset sounding facial motion feature library comprises: determining, by the camera, at least one fourth candidate target from the at least one third candidate target based on the matching degree between the facial motion features and the features in the preset sounding facial motion feature library; determining, by the camera, whether each fourth candidate target is an interfering target based on pose features of each fourth candidate target; and determining, by the camera, each fourth candidate target that is not an interfering target as the at least one second candidate target.
11. The method of claim 10, wherein a degree of matching between any two of the first spatial information, the second spatial information, and the third spatial information corresponding to the target speaker reaches a first preset threshold.
12. The method of claim 11, wherein the method further comprises: for each sounding target, if a plurality of first matching degrees of the sounding target reach the first preset threshold and one second matching degree of the sounding target reaches the first preset threshold, determining the first spatial information corresponding to the highest first matching degree of the sounding target as the first spatial information corresponding to the target speaker; if one first matching degree of the sounding target reaches the first preset threshold and a plurality of second matching degrees of the sounding target reach the first preset threshold, determining the third spatial information corresponding to the highest second matching degree of the sounding target as the third spatial information corresponding to the target speaker; wherein the first matching degree represents the matching degree between the first spatial information and the second spatial information, and the second matching degree represents the matching degree between the second spatial information and the third spatial information.
13. The method of claim 12, wherein the method further comprises: if a plurality of first matching degrees of the sounding target reach the first preset threshold and a plurality of second matching degrees of the sounding target reach the first preset threshold, determining a third matching degree between the first spatial information corresponding to each first matching degree and the third spatial information corresponding to each second matching degree; taking the first spatial information and the third spatial information corresponding to a third matching degree that reaches the first preset threshold as the first spatial information and the third spatial information, respectively, corresponding to the same target speaker; and taking the first spatial information corresponding to a third matching degree that does not reach the first preset threshold as the first spatial information corresponding to one target speaker, and the third spatial information corresponding to the third matching degree that does not reach the first preset threshold as the third spatial information corresponding to another target speaker.
14. The method according to any one of claims 1-4 and 9-13, further comprising: analyzing, by the radar device, received echo signals to obtain sounding micro-motion features of at least one first candidate target; and determining, by the radar device, the at least one candidate target from the at least one first candidate target based on a matching degree between the sounding micro-motion features and features in a preset sounding micro-motion feature library.
15. The method of claim 14, wherein the method further comprises: acquiring, by the radar device, a first distance and a first direction of the at least one candidate target relative to the radar device; and determining, by the radar device, the first spatial information based on the first distance and the first direction.
16. The method of claim 15, wherein the method further comprises: analyzing, by the microphone device, received sound signals to obtain a second distance and a second direction of the at least one sounding target relative to the microphone device; and determining, by the microphone device, the second spatial information based on the second distance and the second direction.
17. The method of claim 16, wherein the method further comprises: transmitting, by the radar device, the first spatial information of the at least one candidate target to the microphone device; and performing, by the microphone device, signal acquisition based on the first spatial information of the at least one candidate target to obtain the sound signals.
18. The method of claim 17, wherein the radar device is integrated with the microphone device; in the case where there is one radar device, the radar device coincides with the geometric center of the microphone device; in the case where there are a plurality of radar devices, the distribution pattern of the plurality of radar devices on the microphone device includes any one of a circular uniform distribution, a linear uniform distribution, a matrix uniform distribution and a fan-shaped uniform distribution.
19. The method of claim 18, wherein the configuration of the microphone device comprises any one of a ceiling-mounted configuration, a wall-mounted configuration, a desktop configuration and a front-mounted configuration.
20. A conference room speaker positioning method, the method comprising: acquiring fourth spatial information of at least one fifth candidate target output by a radar device; analyzing, by a camera, acquired images to obtain facial motion features of at least one sixth candidate target; determining, by the camera, at least one seventh candidate target from the at least one sixth candidate target based on a matching degree between the facial motion features and features in a preset sounding facial motion feature library; acquiring fifth spatial information of the at least one seventh candidate target output by the camera; determining a first target speaker according to the fourth spatial information and the fifth spatial information; and determining spatial information of the first target speaker based on the fourth spatial information and the fifth spatial information corresponding to the first target speaker.
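
As an illustration only, the distance-and-direction conversion of claims 15-16 and the weighted fusion of claims 6-7 might be sketched as follows. The claims do not specify a coordinate representation or weight values, so the 2-D `Position` type, the function names, the co-located sensor origin (loosely suggested by the integrated arrangement of claim 18), and the 0.7/0.3 weights are all assumptions made for this sketch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """A 2-D position in a shared room frame (assumed representation)."""
    x: float
    y: float


def from_range_bearing(distance_m: float, azimuth_rad: float) -> Position:
    """Turn a distance and direction, measured from a sensor at the origin,
    into Cartesian coordinates."""
    return Position(distance_m * math.cos(azimuth_rad),
                    distance_m * math.sin(azimuth_rad))


def choose_weights(is_moving: bool) -> tuple[float, float]:
    """Return (radar_weight, mic_weight).

    Follows the first two conditions of claim 7: radar dominates for a
    moving speaker, the microphone dominates for a stationary one.
    The 0.7/0.3 split is a hypothetical value, not taken from the source.
    """
    return (0.7, 0.3) if is_moving else (0.3, 0.7)


def weighted_fuse(radar: Position, mic: Position,
                  w_radar: float, w_mic: float) -> Position:
    """Normalized weighted average of the radar and microphone estimates."""
    total = w_radar + w_mic
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return Position((w_radar * radar.x + w_mic * mic.x) / total,
                    (w_radar * radar.y + w_mic * mic.y) / total)


if __name__ == "__main__":
    radar_pos = from_range_bearing(2.0, 0.0)          # first spatial information
    mic_pos = from_range_bearing(2.2, math.radians(5))  # second spatial information
    print(weighted_fuse(radar_pos, mic_pos, *choose_weights(is_moving=True)))
```

Normalizing by the weight sum keeps the fused position inside the segment between the two estimates even when the weights do not add up to one.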

Description

Conference room speaker positioning method, system, device and storage medium

Technical Field

The application relates to the technical field of sound source positioning, and in particular to a conference room speaker positioning method, system, device and storage medium.

Background

Sound source localization technology aims to determine the physical position of one or more sound sources in space, and has broad application prospects in fields such as video conferencing, intelligent robots, security monitoring and voice interaction equipment. Currently, mainstream sound source localization schemes rely mainly on beamforming based on microphone arrays, and their localization accuracy is poor.

Disclosure of Invention

The application provides a conference room speaker positioning method, system, device and storage medium, and aims to solve the problem of poor sound source positioning accuracy.

In a first aspect, a conference room speaker positioning method is provided, the method including: acquiring first spatial information of at least one candidate target output by a radar device; acquiring second spatial information of at least one sounding target output by a microphone device; determining a target speaker according to a spatial matching relationship between the first spatial information and the second spatial information; and determining spatial information of the target speaker based on the first spatial information and the second spatial information corresponding to the target speaker.
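
The matching step of the first aspect can be illustrated with a minimal sketch. The source does not define how the matching degree is computed, so the exponential mapping from Euclidean distance to a score in (0, 1], the `scale` parameter, the 0.6 threshold, and the function names are assumptions chosen for illustration. Ties between equally good candidates are not handled here.

```python
import math

Point = tuple[float, float]


def matching_degree(a: Point, b: Point, scale: float = 1.0) -> float:
    """Assumed matching degree: 1.0 for identical positions, decaying
    toward 0 as the Euclidean distance between them grows."""
    return math.exp(-math.dist(a, b) / scale)


def match_speakers(radar_candidates: list[Point],
                   sounding_targets: list[Point],
                   threshold: float = 0.6) -> dict[int, int]:
    """Map each sounding-target index to its best radar-candidate index.

    A sounding target is kept only if at least one radar candidate reaches
    the threshold; among those, the highest matching degree wins.
    """
    matches: dict[int, int] = {}
    for s_idx, s_pos in enumerate(sounding_targets):
        scored = [(matching_degree(s_pos, c_pos), c_idx)
                  for c_idx, c_pos in enumerate(radar_candidates)]
        passing = [item for item in scored if item[0] >= threshold]
        if passing:
            best = max(passing, key=lambda item: item[0])
            matches[s_idx] = best[1]
    return matches


if __name__ == "__main__":
    radar = [(1.0, 2.0), (4.0, 0.5)]       # first spatial information
    microphone = [(1.1, 2.1), (8.0, 8.0)]  # second spatial information
    print(match_speakers(radar, microphone))
```

In this example only the first sounding target lies close enough to a radar candidate to be treated as a target speaker; the second has no radar candidate nearby and is dropped.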
In a second aspect, a conference room speaker positioning method is provided, which aims to solve the problem of poor sound source positioning accuracy, the method including: acquiring fourth spatial information of at least one fifth candidate target output by a radar device; analyzing, by a camera, acquired images to obtain facial motion features of at least one sixth candidate target; determining, by the camera, at least one seventh candidate target from the at least one sixth candidate target based on a matching degree between the facial motion features and features in a preset sounding facial motion feature library; acquiring fifth spatial information of the at least one seventh candidate target output by the camera; determining a first target speaker according to the fourth spatial information and the fifth spatial information; and determining spatial information of the first target speaker based on the fourth spatial information and the fifth spatial information corresponding to the first target speaker.

In some embodiments, the method further comprises: transmitting, by the radar device, the fourth spatial information to the camera; and acquiring, by the camera, the images based on the fourth spatial information.

In a third aspect, a conference room speaker positioning method is provided, which aims to solve the problem of poor sound source positioning accuracy, the method including: acquiring first spatial information of at least one candidate target output by a radar device; acquiring second spatial information of at least one sounding target output by a microphone device; determining the first spatial information corresponding to each sounding target according to a spatial matching relationship between the first spatial information and the second spatial information; and determining spatial information of each sounding target based on the first spatial information and the second spatial information corresponding to each sounding target.
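
The camera-side screening step of the second aspect above, in which faces are kept only if their motion features match a preset library of sounding features, might look like the following sketch. The source does not say how features are represented or compared, so the fixed-length feature vectors, the cosine similarity measure, the 0.8 threshold, and the function names are all assumptions.

```python
import math
from collections.abc import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def filter_sounding_faces(face_features: list[Sequence[float]],
                          library: list[Sequence[float]],
                          threshold: float = 0.8) -> list[int]:
    """Return indices of faces whose best match against the library
    reaches the (hypothetical) threshold."""
    kept = []
    for idx, feature in enumerate(face_features):
        best = max((cosine_similarity(feature, ref) for ref in library),
                   default=0.0)
        if best >= threshold:
            kept.append(idx)
    return kept


if __name__ == "__main__":
    library = [(1.0, 0.0)]             # preset sounding facial motion features
    faces = [(0.9, 0.1), (0.0, 1.0)]   # features extracted from the image
    print(filter_sounding_faces(faces, library))
```

The surviving indices would then play the role of the screened candidates whose spatial information is compared against the radar output.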
In a fourth aspect, a conference room speaker positioning system is provided, which aims to solve the problem of poor sound source positioning accuracy, the system including a processing device, a radar device and a microphone device; the radar device is configured to output first spatial information of at least one candidate target; the microphone device is configured to output second spatial information of at least one sounding target; the processing device is configured to determine a target speaker according to a spatial matching relationship between the first spatial information and the second spatial information; and the processing device is further configured to determine spatial information of the target speaker based on the first spatial information and the second spatial information corresponding to the target speaker.

In a fifth aspect, a conference room speaker positioning system is provided, which aims to solve the problem of poor sound source positioning accuracy, the system including a processing device, a radar device and a camera; the radar device is configured to output fourth spatial information of at least one fifth candidate target; the camera is configured to analyze acquired images to obtain facial motion features of at least one sixth candidate target; The camera is f