US-20260129396-A1 - SOURCE STATE DETERMINATION USING MACHINE-LEARNING MODELS
Abstract
An audiovisual system uses a spatial position detection module to process audio signals and determine the location and orientation of a speaking participant. Based on this data, the system dynamically controls sensors to optimize audio and video capture. Behavioral and contextual information may also be used to train intelligence models for improved system performance. Further, a sensor array may be utilized to identify gaze vectors of participants, which are used to select a camera sensor of the array. Meeting content collected with the sensor array may be organized into meeting metrics in accordance with an analytics strategy before training an intelligence module with the meeting metrics. Meeting content may also be configured into digital tiles in accordance with a tile strategy. At least one of the digital tiles may be altered in response to a meeting condition detected by the sensor array.
Inventors
- James Michael Dallas
- Matthew Skogmo
- Damian Andrea FRICK
- Ryan Pring
- Pranav BAROT
Assignees
- QSC, LLC
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-06-06
Claims (20)
- 1. A computer-implemented method to determine a position of a person in an environment using a machine-learning (“ML”) model, the method comprising: capturing one or more audio signals using one or more microphones positioned within the environment; supplying the one or more audio signals to the ML model; and processing the one or more audio signals, using the ML model, to determine a spatial position of the person, the spatial position being an x, y and z coordinate of the person inside the environment.
- 2. The computer-implemented method as defined in claim 1, wherein the audio signals are further processed, using the ML model, to determine a head pose of the person.
- 3. The computer-implemented method as defined in claim 1, wherein the spatial position is an x, y and z coordinate of a head of the person.
- 4. The computer-implemented method as defined in claim 1, wherein one or more cameras are operated based upon the spatial position of the person.
- 5. The computer-implemented method as defined in claim 1, wherein the audio signals are further processed, using the ML model, to determine a pitch of a head of the person.
- 6. The computer-implemented method as defined in claim 1, wherein the audio signals are further processed, using the ML model, to determine a yaw of a head of the person.
- 7. The computer-implemented method as defined in claim 1, wherein the spatial position is used to determine a context of the environment.
- 8. The computer-implemented method as defined in claim 1, further comprising: identifying two or more persons in the environment with a sensor array; determining relationships between the two or more persons; and utilizing the relationship data along with one or more video data streams to train at least one intelligence model.
- 9. A system to determine a position of a person in an environment using a machine-learning (“ML”) model, the system comprising: one or more microphones positioned within the environment; and a processing device communicably coupled to the one or more microphones, the processing device having an audio optimization and control (“AOC”) operating system executable thereon to manage and control functionality of the one or more microphones, the processing device being configured to perform operations comprising: capturing one or more audio signals using the one or more microphones; supplying the one or more audio signals to the ML model; and processing the one or more audio signals, using the ML model, to determine a head pose of the person.
- 10. The system as defined in claim 9, wherein the audio signals are further processed, using the ML model, to determine a spatial position of the person, the spatial position being an x, y and z coordinate of the person inside the environment.
- 11. The system as defined in claim 10, wherein the spatial position is an x, y and z coordinate of a head of the person.
- 12. The system as defined in claim 10, further comprising one or more cameras communicably coupled to the processing device, wherein the one or more cameras are operated based upon the spatial position of the person.
- 13. The system as defined in claim 9, wherein the audio signals are further processed, using the ML model, to determine a pitch of a head of the person.
- 14. The system as defined in claim 13, wherein the audio signals are further processed, using the ML model, to determine a yaw of the head.
- 15. The system as defined in claim 10, wherein the spatial position is used to determine a context of the environment.
- 16. The system as defined in claim 9, wherein the processing device is further configured to perform operations comprising: identifying two or more persons in the environment with a sensor array; determining relationships between the two or more persons; and utilizing the relationship data along with one or more video data streams to train at least one intelligence model.
- 17. The system as defined in claim 9, wherein the processing device is further configured to perform operations comprising: identifying a gaze of the person; and operating the one or more microphones or one or more cameras based on the gaze.
- 18. A non-transitory computer-readable storage medium storing instructions that, when executed by a computing system, cause the computing system to perform operations comprising: capturing one or more audio signals using one or more microphones positioned within an environment; supplying the one or more audio signals to a machine-learning (“ML”) model; and processing the one or more audio signals, using the ML model, to determine a spatial position of a person, the spatial position being an x, y and z coordinate of the person inside the environment.
- 19. The computer-readable storage medium as defined in claim 18, wherein the spatial position is an x, y and z coordinate of a head of the person.
- 20. The computer-readable storage medium as defined in claim 18, wherein the audio signals are further processed, using the ML model, to determine at least one of a pitch or yaw of a head of the person.
Description
PRIORITY

This non-provisional application claims priority to U.S. Provisional Application No. 63/716,521, filed Nov. 5, 2024, entitled “CONFERENCING SYSTEM WITH MULTI-MODAL SENSING AND CONTEXTUAL MODEL TRAINING”, naming James M. Dallas et al. as inventors, the disclosure of which is hereby incorporated by reference in its entirety.

FIELD OF THE INVENTION

The present disclosure is generally directed to, but not limited to, the optimization of audiovisual environments and, more specifically, to audiovisual systems that determine a position of a person in an environment using machine-learning (“ML”) models, and to other related methods.

BACKGROUND

Conferencing systems are commonly used to facilitate communication between individuals located in different physical locations. These systems often incorporate microphones, cameras, and other sensors to capture and transmit audio and video content from a meeting environment to remote participants. While such systems can support basic conferencing functions, they frequently rely on static sensor configurations and manually controlled settings, which can lead to suboptimal content capture, particularly in dynamic or multi-participant settings.

Challenges arise when participants move within the meeting space, speak simultaneously, or exhibit non-verbal behaviors such as gestures or changes in body orientation. Traditional systems may struggle to determine which sensor inputs are most relevant at any given time, or to interpret participant behavior in a meaningful context. Additionally, current systems often lack the capability to adapt in real time based on the spatial position or orientation of speakers, resulting in degraded audiovisual fidelity and reduced situational awareness for remote attendees, and thus a sub-optimal audiovisual experience for those users.

SUMMARY

Embodiments of the present disclosure are generally directed to a conferencing system employing multi-modal sensing and spatial audio analysis to intelligently understand a conferencing environment. The system may be utilized to gather participant behavior, determine the spatial position and orientation of participants, assign context to the gathered information, and train one or more intelligence models using the participants' contextual and spatially derived actions.

A conferencing system, in accordance with some embodiments, includes a sensor array positioned in a meeting space. An initial set of operating parameters is installed for the sensor array prior to detecting characteristics of the meeting space using the array. Meeting participants are identified, and a relationship strategy is generated by a computing device connected to the sensor array. The relationship strategy prescribes a set of operating parameters that enable the detection of interpersonal relationships between meeting participants. Based on this strategy, the computing device may designate an initial relationship status to a pair of participants. A context strategy is then generated by the computing device that prescribes a set of operating parameters to detect the behavior of at least one meeting participant. The computing device assigns one or more identifiers to the detected behavior that indicate the meaning or emotional state corresponding to that behavior. A conferencing strategy is further generated that prescribes customized audio and video collection settings.
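As one illustration of the relationship strategy described above, the sketch below shows how a computing device might designate an initial relationship status for a pair of participants from simple interaction counts. Every name, threshold, and status label here is a hypothetical assumption chosen for illustration; the disclosure does not prescribe this implementation.

```python
# Hypothetical sketch of a relationship strategy: designate an initial
# relationship status for a pair of participants based on the number of
# observed exchanges between them. Thresholds and labels are illustrative.
from collections import Counter

def initial_relationship_status(interactions: Counter, pair: tuple[str, str]) -> str:
    """Map observed exchanges between two participants to a coarse label."""
    count = interactions[frozenset(pair)]  # order-independent pair key; 0 if unseen
    if count >= 10:
        return "frequent-collaborators"
    if count >= 3:
        return "acquainted"
    return "unknown"

# Usage: twelve exchanges observed between two participants.
interactions = Counter({frozenset(("alice", "bob")): 12})
print(initial_relationship_status(interactions, ("alice", "bob")))
# -> "frequent-collaborators"
```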
In addition to these functions, the conferencing system may employ a spatial position detection module to process audio signals captured from microphones to determine the (x, y, z) location of a speaker, as well as head pose including pitch, yaw, and roll. These spatial parameters may influence which sensors are activated, deactivated, or dynamically adjusted based on speaker position and direction. The computing device formats the behavioral and spatial identifiers to train at least one intelligence model with enhanced contextual and spatial precision.

In other embodiments, the conferencing system positions a sensor array in a meeting space before measuring the meeting space with the sensor array. At least one participant within the meeting space is detected with the sensor array, and a gaze vector of the participant is then detected and employed to select a camera sensor of the sensor array. Meeting content collected with the sensor array is organized into meeting metrics in accordance with an analytics strategy before an intelligence module is trained with the meeting metrics. Meeting content is additionally configured into a plurality of digital tiles in accordance with a tile strategy. At least one of the plurality of digital tiles is then altered in response to a meeting condition detected by the sensor array.

Other embodiments of a conferencing system position a sensor array in a meeting space with one or more video cameras and directional microphones. Meeting participants are identified, and their spatial positions and orientations are detected using the spatial position detection module.
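The spatial determination described above, regressing a speaker's (x, y, z) coordinates and head pose (pitch, yaw, roll) from multi-channel audio, can be sketched as follows. This is a minimal illustration under stated assumptions: the network shape, layer sizes, and the PyTorch framing are my own, not the disclosed implementation.

```python
# Minimal sketch (assumptions throughout): a model mapping multi-channel
# microphone frames to a speaker's (x, y, z) position and head pose.
import torch
import torch.nn as nn

class SourceStateModel(nn.Module):
    def __init__(self, num_mics: int = 8, frame_len: int = 4096):
        super().__init__()
        # Convolutional front end over the raw multi-channel waveform;
        # inter-channel time and level differences carry the spatial cues.
        self.encoder = nn.Sequential(
            nn.Conv1d(num_mics, 64, kernel_size=64, stride=16),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=16, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Two regression heads: 3 outputs for (x, y, z),
        # 3 for head pose (pitch, yaw, roll).
        self.position_head = nn.Linear(128, 3)
        self.pose_head = nn.Linear(128, 3)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, num_mics, frame_len)
        feats = self.encoder(frames).squeeze(-1)  # (batch, 128)
        return self.position_head(feats), self.pose_head(feats)

# Usage on a dummy 8-channel frame:
model = SourceStateModel()
audio = torch.randn(1, 8, 4096)
position, pose = model(audio)  # position: (1, 3); pose: (1, 3)
```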
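As one way to realize the gaze-driven camera selection described in the summary, the sketch below picks the camera whose viewing direction most directly opposes the participant's gaze vector, i.e. the camera the participant is most nearly facing. The function name, camera-direction dictionary, and coordinate convention are hypothetical.

```python
# Hypothetical sketch: select the sensor-array camera a participant is most
# nearly facing. A dot product near -1 between the unit gaze vector and a
# camera's unit viewing direction means the camera sees the face head-on.
import numpy as np

def select_camera(gaze: np.ndarray, camera_dirs: dict[str, np.ndarray]) -> str:
    gaze = gaze / np.linalg.norm(gaze)
    best_id, best_score = None, np.inf
    for cam_id, direction in camera_dirs.items():
        d = direction / np.linalg.norm(direction)
        score = np.dot(gaze, d)  # -1.0 means directly opposing, i.e. head-on
        if score < best_score:
            best_id, best_score = cam_id, score
    return best_id

# Usage with two illustrative cameras:
cameras = {
    "cam_front": np.array([0.0, -1.0, 0.0]),  # points toward the table
    "cam_side":  np.array([-1.0, 0.0, 0.0]),
}
print(select_camera(np.array([0.1, 0.95, 0.0]), cameras))  # -> "cam_front"
```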
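Similarly, the tile strategy and its condition-driven alteration might look like the following sketch, where detection of an active speaker (one possible meeting condition) promotes that participant's tile. The Tile fields, dimensions, and the apply_meeting_condition helper are illustrative assumptions, not the patented tile strategy.

```python
# Hypothetical sketch of a tile strategy: arrange meeting content into digital
# tiles, then alter a tile when the sensor array reports a meeting condition
# (here, a change of active speaker).
from dataclasses import dataclass

@dataclass
class Tile:
    participant_id: str
    width: int
    height: int
    highlighted: bool = False

def apply_meeting_condition(tiles: list[Tile], active_speaker: str) -> None:
    """Enlarge and highlight the active speaker's tile; shrink the rest."""
    for tile in tiles:
        if tile.participant_id == active_speaker:
            tile.width, tile.height = 1280, 720  # promote to a large tile
            tile.highlighted = True
        else:
            tile.width, tile.height = 320, 180   # demote to a thumbnail
            tile.highlighted = False

# Usage: the sensor array reports that "alice" is speaking.
tiles = [Tile("alice", 640, 360), Tile("bob", 640, 360)]
apply_meeting_condition(tiles, active_speaker="alice")
```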