US-12621572-B2 - Systems and methods for talker tracking and camera positioning in the presence of acoustic reflections


Abstract

Systems and methods configured to generate talker coordinates for directing a camera towards an active talker in the presence of acoustic reflections are disclosed. One method comprises receiving sound location information for a detected audio source from a microphone; determining, based on the sound location information, a first set of coordinates representing an estimated talker location; determining, based on the sound location information and a height of the environment, a second set of coordinates representing a corrected talker location; calculating a weighted height coordinate based on a first height coordinate of the first set of coordinates, a second height coordinate of the second set of coordinates, and stored height coordinates from previously detected audio sources; and transmitting, to a camera, a third set of coordinates comprising the weighted height coordinate and representing a final talker location, to cause the camera to point its image capturing component towards the received location.

Inventors

Zachary Kane
Christopher George Rieger

Assignees

SHURE ACQUISITION HOLDINGS, INC.

Dates

Publication Date: 2026-05-05
Application Date: 2024-07-16

Claims (20)

1. A method performed by one or more processors in communication with each of a camera and at least one microphone disposed in an environment, the method comprising: receiving, from the at least one microphone, sound location information for an audio source detected by the at least one microphone; determining, based on the sound location information, a first set of coordinates representing an estimated talker location for the audio source; determining, based on the sound location information and a height measurement of the environment, a second set of coordinates representing a corrected talker location for the audio source; calculating a weighted height coordinate based on a first height coordinate of the first set of coordinates, a second height coordinate of the second set of coordinates, and stored height coordinates obtained for previously detected audio sources; and transmitting, to the camera, a third set of coordinates comprising the weighted height coordinate and representing a final talker location for the audio source, wherein receipt of the third set of coordinates causes the camera to point an image capturing component of the camera towards the final talker location.
2. The method of claim 1, further comprising: determining an amount of discrepancy between the first set of coordinates and the second set of coordinates; and upon determining that the discrepancy exceeds a threshold, generating the third set of coordinates by replacing the height coordinate of the first set of coordinates with the weighted height coordinate.
3. The method of claim 2, wherein the threshold is configured to identify whether an acoustic reflection is present in the environment.
4. The method of claim 1, wherein calculating the weighted height coordinate comprises: determining a first weight value for the first height coordinate based on the stored height coordinates; determining a second weight value for the second height coordinate based on the stored height coordinates; and calculating, using the first weight value and the second weight value, a weighted average of the first height coordinate and the second height coordinate.
5. The method of claim 1, wherein the camera is configured to point the image capturing component towards the final talker location by adjusting one or more of an angle, a tilt, a zoom, and a framing of the camera.
6. The method of claim 1, further comprising: determining the sound location information using an audio localization algorithm executed by an audio activity localizer.
7. The method of claim 1, wherein the height measurement comprises a height of the at least one microphone relative to a floor of the environment.
8. A system comprising: at least one microphone disposed in an environment and configured to determine sound location information for an audio source detected by the at least one microphone; a camera disposed in the environment and comprising an image capturing component; and one or more processors communicatively coupled to each of the at least one microphone and the camera, the one or more processors configured to: receive the sound location information from the at least one microphone; determine, based on the sound location information, a first set of coordinates representing an estimated talker location for the audio source; determine, based on the sound location information and a height measurement of the environment, a second set of coordinates representing a corrected talker location for the audio source; calculate a weighted height coordinate based on a first height coordinate of the first set of coordinates, a second height coordinate of the second set of coordinates, and stored height coordinates obtained for previously detected audio sources; and transmit, to the camera, a third set of coordinates comprising the weighted height coordinate and representing a final talker location for the audio source, wherein responsive to receiving the third set of coordinates, the camera is configured to point the image capturing component towards the final talker location.
9. The system of claim 8, wherein the one or more processors are further configured to: determine an amount of discrepancy between the first set of coordinates and the second set of coordinates; and upon determining that the discrepancy exceeds a threshold, generate the third set of coordinates by replacing the height coordinate of the first set of coordinates with the weighted height coordinate.
10. The system of claim 9, wherein the threshold is configured to identify whether an acoustic reflection is present in the environment.
11. The system of claim 8, wherein calculating the weighted height coordinate comprises: determining a first weight value for the first height coordinate based on the stored height coordinates; determining a second weight value for the second height coordinate based on the stored height coordinates; and calculating, using the first weight value and the second weight value, a weighted average of the first height coordinate and the second height coordinate.
12. The system of claim 8, wherein the camera is configured to point the image capturing component towards the final talker location by adjusting one or more of an angle, a tilt, a zoom, and a framing of the camera.
13. The system of claim 8, further comprising an audio activity localizer configured to determine the sound location information using an audio localization algorithm executed by the audio activity localizer.
14. The system of claim 8, wherein the height measurement comprises a height of the at least one microphone relative to a floor of the environment.
15. A non-transitory computer-readable storage medium comprising instructions that, when executed by one or more processors in communication with each of at least one microphone, and a camera, cause the one or more processors to perform the following: receive sound location information for an audio source detected by the at least one microphone; determine, based on the sound location information, a first set of coordinates representing an estimated talker location for the audio source; determine, based on the sound location information and a height measurement of the environment, a second set of coordinates representing a corrected talker location for the audio source; calculate a weighted height coordinate based on a first height coordinate of the first set of coordinates, a second height coordinate of the second set of coordinates, and stored height coordinates obtained for previously detected audio sources; and transmit, to the camera, a third set of coordinates comprising the weighted height coordinate and representing a final talker location for the audio source, wherein receipt of the third set of coordinates causes the camera to point an image capturing component of the camera towards the final talker location.
16. The non-transitory computer-readable storage medium of claim 15, wherein the instructions further cause the one or more processors to: determine an amount of discrepancy between the first set of coordinates and the second set of coordinates; and upon determining that the discrepancy exceeds a threshold, generate the third set of coordinates by replacing the height coordinate of the first set of coordinates with the weighted height coordinate.
17. The non-transitory computer-readable storage medium of claim 16, wherein the threshold is configured to identify whether an acoustic reflection is present in the environment.
18. The non-transitory computer-readable storage medium of claim 15, wherein calculating the weighted height coordinate comprises: determining a first weight value for the first height coordinate based on the stored height coordinates; determining a second weight value for the second height coordinate based on the stored height coordinates; and calculating, using the first weight value and the second weight value, a weighted average of the first height coordinate and the second height coordinate.
19. The non-transitory computer-readable storage medium of claim 15, wherein the camera is configured to point the image capturing component towards the final talker location by adjusting one or more of an angle, a tilt, a zoom, and a framing of the camera.
20. The non-transitory computer-readable storage medium of claim 15, wherein the instructions further cause the one or more processors to determine the sound location information using an audio localization algorithm executed by an audio activity localizer.
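
Illustrative sketch (not part of the claims): the weighted-average step recited in claims 4, 11, and 18 and the threshold test recited in claims 2, 9, and 16 might be combined roughly as follows. The claims do not state how the weight values are derived from the stored height coordinates, how the discrepancy is measured, or what happens when the threshold is not exceeded. The Gaussian weighting rule, the `sigma` parameter, the Euclidean discrepancy, the pass-through of the first set of coordinates, and all function names below are assumptions made only for illustration.

```python
import math
from statistics import mean
from typing import Sequence, Tuple

Coords = Tuple[float, float, float]  # (x, y, z), with z as the height coordinate


def height_weights(z_first: float, z_second: float,
                   stored_heights: Sequence[float],
                   sigma: float = 0.3) -> Tuple[float, float]:
    """Hypothetical rule: a candidate height that lies closer to the mean of
    previously stored talker heights receives a larger weight."""
    if not stored_heights:
        return 0.5, 0.5  # no history yet: weight both candidates equally
    typical = mean(stored_heights)
    w_first = math.exp(-0.5 * ((z_first - typical) / sigma) ** 2)
    w_second = math.exp(-0.5 * ((z_second - typical) / sigma) ** 2)
    total = w_first + w_second
    if total == 0.0:
        return 0.5, 0.5  # both candidates far from history: avoid dividing by zero
    return w_first / total, w_second / total


def weighted_height(z_first: float, z_second: float,
                    stored_heights: Sequence[float]) -> float:
    """Weighted average of the first and second height coordinates."""
    w_first, w_second = height_weights(z_first, z_second, stored_heights)
    return w_first * z_first + w_second * z_second


def final_coordinates(first: Coords, second: Coords,
                      stored_heights: Sequence[float],
                      threshold: float) -> Coords:
    """Build the third set of coordinates from the first and second sets."""
    discrepancy = math.dist(first, second)  # assumed Euclidean distance
    if discrepancy <= threshold:
        return first  # assumption: no reflection suspected, keep the estimate
    z = weighted_height(first[2], second[2], stored_heights)
    return (first[0], first[1], z)  # replace only the height coordinate
```

For example, with stored heights near 1.2 m, an estimated height of -0.5 m (as a reflection might produce) and a corrected height of 1.3 m, this sketch returns a final height very close to 1.3 m while leaving the x and y coordinates of the first set unchanged.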

Description

CROSS-REFERENCE

This application claims priority to U.S. Provisional Patent Application No. 63/514,046, filed on Jul. 17, 2023, the contents of which are incorporated by reference herein in their entirety.

TECHNICAL FIELD

This disclosure generally relates to talker tracking and camera positioning, and more specifically, to systems and methods for positioning a camera towards a talker based on a talker location determined in the presence of acoustic reflections using one or more microphones.

BACKGROUND

Various audio-visual environments, such as conference rooms, boardrooms, classrooms, video conferencing settings, performance venues, and more, typically involve the use of microphones (including microphone arrays) for capturing sound from one or more audio sources (e.g., human speakers or talkers) in the environment and one or more image capture devices (e.g., cameras) for capturing images and/or videos of the one or more audio sources or other persons and/or objects in the environment. The captured audio and video may be disseminated to a local audience in the environment through loudspeakers (for sound reinforcement) and display screens (for visual reinforcement), and/or transmitted to a remote location for listening and viewing by a remote audience (such as via a telecast, webcast, or the like). For example, the transmitted audio and video may be used by persons in a conference room to conduct a conference call with other persons at the remote location. In some cases, it can be difficult for the viewers at the remote location to see particular talkers, for example, when the camera is configured to show the entire room, or is fixed on a specific portion of the room while the talkers move in and out of view.
Some existing camera systems are configured to actively move or point a camera towards the direction of a detected talker, such as a human in the environment that is speaking, singing, or otherwise making sounds, so that viewers, locally or remotely, can better see who is talking. Some cameras use motion sensors and/or facial recognition software in order to guess which person is talking for camera tracking purposes. Some camera systems use multiple cameras to optimally capture persons located at different parts of the environment or otherwise capture video of the whole environment.

SUMMARY

The techniques of this disclosure provide systems and methods designed to, among other things: (1) determine coordinates for positioning a camera towards a talker based on a talker location identified by at least one microphone in an environment; and (2) adjust the talker coordinates based on a height of the environment and previously detected talker heights to account for acoustic reflections in the environment.
In an embodiment, a method, performed by one or more processors in communication with each of a camera and at least one microphone disposed in an environment, comprises: receiving, from the at least one microphone, sound location information for an audio source detected by the at least one microphone; determining, based on the sound location information, a first set of coordinates representing an estimated talker location for the audio source; determining, based on the sound location information and a height measurement of the environment, a second set of coordinates representing a corrected talker location for the audio source; calculating a weighted height coordinate based on a first height coordinate of the first set of coordinates, a second height coordinate of the second set of coordinates, and stored height coordinates obtained for previously detected audio sources; and transmitting, to the camera, a third set of coordinates comprising the weighted height coordinate and representing a final talker location for the audio source, wherein receipt of the third set of coordinates causes the camera to point an image capturing component of the camera towards the final talker location. 
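As a rough illustration of the last step, pointing the image capturing component towards the final talker location reduces to a geometry problem once the camera's own position is known in the same coordinate system. The sketch below is not taken from the disclosure: the function name `pan_tilt_degrees`, the axis convention (x and y in the floor plane, z as height), and the assumption that the camera position is available are all hypothetical, and a real camera would receive its commands through its own control interface.

```python
import math
from typing import Tuple

Coords = Tuple[float, float, float]  # (x, y, z), with z as the height coordinate


def pan_tilt_degrees(camera: Coords, talker: Coords) -> Tuple[float, float]:
    """Return hypothetical (pan, tilt) angles, in degrees, that would aim a
    camera located at `camera` towards the point `talker`."""
    dx = talker[0] - camera[0]
    dy = talker[1] - camera[1]
    dz = talker[2] - camera[2]
    # Pan: rotation in the floor plane, measured from the +x axis.
    pan = math.degrees(math.atan2(dy, dx))
    # Tilt: elevation above (positive) or below (negative) the horizontal.
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return pan, tilt


# Usage: a camera mounted 2 m high aiming at a talker whose final height is 1.2 m.
pan, tilt = pan_tilt_degrees((0.0, 0.0, 2.0), (3.0, 3.0, 1.2))
```

Under these assumptions, an error in the height coordinate shifts only the tilt angle, which is why a reflection-corrupted height estimate would mainly cause the camera to aim too high or too low.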
In another embodiment, a system comprises at least one microphone disposed in an environment and configured to determine sound location information for an audio source detected by the at least one microphone; a camera disposed in the environment and comprising an image capturing component; and one or more processors communicatively coupled to each of the at least one microphone and the camera, the one or more processors configured to: receive the sound location information from at least one microphone; determine, based on the sound location information, a first set of coordinates representing an estimated talker location for the audio source; determine, based on the sound location information and a height measurement of the environment, a second set of coordinates representing a corrected talker location for the audio source; calculate a weighted height coordinate based on a first height coordinate of the first set of coordinates, a second height coordinate of the second set of coordinates, and stored height coordinates obtained for pr