US-20260129388-A1 - System and Method for Synthesizing a Spatial Auditory Network via Ray-traced Multipath Sound Propagation
Abstract
A method of training a machine learning artificial intelligence system that includes generating scenario realizations each having a virtual spatial layout of sound-influencing features, and generating acoustic recordings of sounds moving through each scenario realization, where each acoustic recording is based on propagation effects associated with a corresponding virtual spatial layout. The method may include identifying isolated sounds in the acoustic recordings, and training a machine learning model comprising a multi-layer convolutional recurrent neural network (CRNN), with the one or more isolated sounds, wherein the training is via rectified linear unit (ReLU) activation and max pooling along a frequency axis, wherein the trained machine learning model generates output event activity probabilities. The method may include receiving a subsequent acoustic recording of one or more subsequent sound sources, and classifying, via the trained machine learning model, the one or more subsequent sound sources based on the generated output event activity probabilities.
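The abstract describes training a multi-layer convolutional recurrent neural network (CRNN) with ReLU activation and max pooling along the frequency axis, with a sigmoid output producing event activity probabilities that are thresholded to binary predictions. The sketch below is a minimal, hedged illustration of such an architecture, not the architecture actually claimed; it assumes PyTorch, a log-mel spectrogram input of shape (batch, 1, time, frequency), and the layer counts, channel widths, and GRU sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Illustrative CRNN for sound event detection (a sketch, not the claimed model).

    Convolutional blocks use ReLU activation and max pooling along the
    frequency axis only, preserving time resolution. The convolutional
    output is stacked per frame and fed to recurrent layers; a feed-forward
    layer with sigmoid activation produces per-frame event activity
    probabilities.
    """

    def __init__(self, n_mels=64, n_classes=10, conv_channels=(32, 64, 64)):
        super().__init__()
        blocks, in_ch, freq = [], 1, n_mels
        for out_ch in conv_channels:
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=(1, 2)),  # pool along frequency only
            ]
            in_ch, freq = out_ch, freq // 2
        self.cnn = nn.Sequential(*blocks)
        self.rnn = nn.GRU(in_ch * freq, 128, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, x):
        # x: (batch, 1, time, n_mels) log-mel spectrogram
        h = self.cnn(x)                        # (batch, ch, time, freq')
        h = h.permute(0, 2, 1, 3).flatten(2)   # stack conv output per time frame
        h, _ = self.rnn(h)                     # (batch, time, 2 * 128)
        return torch.sigmoid(self.fc(h))       # event activity probabilities

# Binary event activity predictions by thresholding at 0.5 (cf. claim 3):
# preds = (model(spectrogram) > 0.5)
```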
Inventors
- Steven M. Dennis
- Bradley M. Landreneau
- Christopher J. Michael
Assignees
- THE GOVERNMENT OF THE UNITED STATES OF AMERICA, AS REPRESENTED BY THE SECRETARY OF THE NAVY
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-11-07
Claims (20)
- 1. A method of training a machine learning artificial intelligence system, comprising: generating, by a computing device, one or more scenario realizations, each scenario realization comprising a virtual spatial layout of one or more sound-influencing features; generating, by the computing device, a set of one or more acoustic recordings of one or more sounds moving through each scenario realization, the one or more sounds originating from a sound source in the scenario realization, wherein each acoustic recording is based on a set of one or more propagation effects associated with a corresponding virtual spatial layout of the one or more sound-influencing features, wherein each sound-influencing feature causes an audio effect on the one or more sounds; identifying, by the computing device, one or more isolated sounds in the set of one or more acoustic recordings; training, by the computing device, a machine learning model comprising a multi-layer convolutional recurrent neural network (CRNN), with the one or more isolated sounds, wherein the training is via rectified linear unit (ReLU) activation and max pooling along a frequency axis, wherein the trained machine learning model generates output event activity probabilities; receiving, by the computing device, a subsequent acoustic recording of one or more subsequent sound sources; and classifying, by the computing device, via the trained machine learning model, the one or more subsequent sound sources based on the generated output event activity probabilities.
- 2. The method of claim 1, wherein output of a convolutional layer of the CRNN is stacked and fed into recurrent layers before a feed-forward layer with sigmoid activation produces the output event activity probabilities.
- 3. The method of claim 1, wherein binary event activity predictions are produced by thresholding the output event activity probabilities at 0.5.
- 4. The method of claim 1, further comprising identifying, in the subsequent acoustic recording, via the trained machine learning artificial intelligence system, one or more subsequent isolated frequencies.
- 5. The method of claim 1, wherein the classifying comprises determining a level of correspondence between the one or more subsequent isolated frequencies and at least one of the one or more isolated sounds.
- 6. The method of claim 1, wherein the level of correspondence is based on one or more of the generated output event activity probabilities.
- 7. The method of claim 1, wherein the one or more sound-influencing features comprise one or more physical attributes.
- 8. The method of claim 7, wherein the one or more physical attributes affect sound via occlusion, reflection, transmission, scattering, absorption, reverberation, or the Doppler effect.
- 9. The method of claim 1, wherein the one or more sound-influencing features comprise a surface type.
- 10. The method of claim 1, wherein the one or more sound-influencing features comprise a surface geometry.
- 11. The method of claim 1, wherein the set of one or more propagation effects comprises ray-traced multipath sound propagation.
- 12. The method of claim 1, wherein each scenario realization comprises a set of one or more constraints that influence sound propagation.
- 13. The method of claim 1, wherein the training further comprises training, validating, and testing a machine learning classifier implementation for a sound event detection application.
- 14. The method of claim 1, further comprising generating a spatialized ensemble sound file for a first scenario realization comprising a plurality of sounds and one or more isolated spatial recordings of respective constituent sound sources of the scenario realization.
- 15. The method of claim 1, wherein each scenario realization comprises one or more sensors for detecting audio associated with the one or more sounds.
- 16. The method of claim 1, wherein the one or more scenario realizations comprise one or more locations of listeners, sound sources, tracks of sound sources, or ambiance.
- 17. The method of claim 1, further comprising receiving user input to generate the one or more scenario realizations.
- 18. The method of claim 17, wherein the user input comprises a human-readable text file that specifies locations of listeners, sound sources, tracks of sound sources, and ambiance (an illustrative sketch of such a file follows the claims).
- 19. The method of claim 1, wherein each scenario realization comprises sound sources, motion characteristics, environmental geometry, or environmental acoustic properties.
- 20. The method of claim 1, further comprising performing a water-based operation based on the classification.
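Claim 18 recites a human-readable text file that specifies locations of listeners, sound sources, tracks of sound sources, and ambiance (see also FIG. 5). The sketch below shows one hypothetical way such a specification could be expressed and written to disk; the JSON structure, field names, file names, and values are illustrative assumptions, not the format actually disclosed.

```python
import json

# Hypothetical scenario specification; all fields and values are illustrative.
scenario = {
    "listeners": [
        {"name": "sensor_1", "position": [0.0, 1.5, 0.0]},
        {"name": "sensor_2", "position": [25.0, 1.5, -10.0]},
    ],
    "sound_sources": [
        {
            "name": "small_boat",
            "clip": "sounds/small_boat_engine.wav",
            # A track: timestamped waypoints the moving source follows.
            "track": [
                {"t": 0.0,  "position": [-50.0, 0.5, 40.0]},
                {"t": 30.0, "position": [10.0, 0.5, 5.0]},
                {"t": 60.0, "position": [60.0, 0.5, -30.0]},
            ],
        }
    ],
    "ambiance": {"clip": "sounds/harbor_ambiance.wav", "level_db": -18.0},
}

# Write the specification as a human-readable text file.
with open("scenario_realization_001.json", "w") as f:
    json.dump(scenario, f, indent=2)
```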
Description
CROSS-REFERENCE

This application is a nonprovisional application of, and claims the benefit of priority under 35 U.S.C. § 119 based on, U.S. Provisional Patent Application No. 63/596,722 filed on Nov. 7, 2023. The Provisional Application and all references cited herein are hereby incorporated by reference into the present disclosure in their entirety.

FEDERALLY-SPONSORED RESEARCH AND DEVELOPMENT

The United States Government has ownership rights in this invention. Licensing inquiries may be directed to Office of Technology Transfer, US Naval Research Laboratory, Code 1004, Washington, DC 20375, USA; +1.202.767.7230; nrltechtran@us.navy.mil, referencing Navy Case #211587.

TECHNICAL FIELD

The present disclosure is related to machine learning, and more specifically, but not limited, to training a machine learning model via ray-traced multipath sound propagation.

BACKGROUND

Automatic detection and categorization of certain classes of sounds recorded by an auditory network is useful for several applications ranging from surveillance to mission planning. Modern supervised machine learning techniques are effective in other automatic detection applications, but they require very large amounts of highly curated data to yield favorable results. Unfortunately, no such dataset of curated auditory examples currently exists in the state of the art. In such situations, data synthesis may be used to rapidly create such a dataset without the need for costly field collection, staggering amounts of manual data labeling, and rigorous quality assurance.

SUMMARY

This summary is intended to introduce, in simplified form, a selection of concepts that are further described in the Detailed Description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Instead, it is merely presented as a brief overview of the subject matter described and claimed herein.

The present disclosure provides for a method of training a machine learning artificial intelligence system. The method may include generating, by a computing device, one or more scenario realizations, each scenario realization comprising a virtual spatial layout of one or more sound-influencing features. The method may include generating, by the computing device, a set of one or more acoustic recordings of one or more sounds moving through each scenario realization, the one or more sounds originating from a sound source in the scenario realization, wherein each acoustic recording is based on a set of one or more propagation effects associated with a corresponding virtual spatial layout of the one or more sound-influencing features, wherein each sound-influencing feature causes an audio effect on the one or more sounds. The method may include identifying, by the computing device, one or more isolated sounds in the set of one or more acoustic recordings. The method may include training, by the computing device, a machine learning model comprising a multi-layer convolutional recurrent neural network (CRNN), with the one or more isolated sounds, wherein the training is via rectified linear unit (ReLU) activation and max pooling along a frequency axis, wherein the trained machine learning model generates output event activity probabilities. The method may include receiving, by the computing device, a subsequent acoustic recording of one or more subsequent sound sources.
The method may include classifying, by the computing device, via the trained machine learning model, the one or more subsequent sound sources based on the generated output event activity probabilities.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block schematic illustration of an example flow diagram of a Spatial Auditory Network Dataset Synthesis (SANDS) embodiment in accordance with disclosed aspects.
FIG. 2 illustrates an example scenario realization in accordance with one or more disclosed aspects.
FIG. 3 illustrates an example scenario realization in accordance with one or more disclosed aspects.
FIG. 4 illustrates the Unity game engine and the Steam Audio plugin examples in accordance with one or more disclosed aspects.
FIG. 5 illustrates a human-readable text file in accordance with one or more disclosed aspects.
FIG. 6 illustrates example sound recordings in accordance with one or more disclosed aspects.
FIG. 7 illustrates an example scenario realization in accordance with one or more disclosed aspects.
FIG. 8 illustrates an example scenario realization in accordance with one or more disclosed aspects.
FIG. 9 illustrates example sound recordings in accordance with one or more disclosed aspects.
FIG. 10 illustrates F-score, Precision, and Recall results in accordance with one or more disclosed aspects.
FIG. 11 illustrates the performance of the model as a function of the number of active sounds present in the scenario in accordance with one or more disclosed aspects.
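The Summary and claim 11 describe rendering each acoustic recording from ray-traced multipath sound propagation through the virtual spatial layout, and FIG. 4 references the Unity game engine with the Steam Audio plugin for that purpose. As a rough, hedged illustration of the underlying idea only, the sketch below builds a multipath impulse response from a handful of assumed ray paths, each contributing a delayed and attenuated copy of the source signal, and convolves it with a dry source recording; the path lengths, gains, and speed-of-sound handling are simplifying assumptions and not the ray tracer actually used in the disclosed embodiment.

```python
import numpy as np
from scipy.signal import fftconvolve

SPEED_OF_SOUND = 343.0  # m/s in air; an assumption for this sketch

def multipath_impulse_response(path_lengths_m, path_gains, sample_rate):
    """Build a sparse impulse response from assumed ray paths.

    Each path contributes one tap: a delay of length / c seconds and a gain
    combining spreading and reflection/absorption losses. A real ray tracer
    would produce many such paths per source/listener pair.
    """
    delays = np.asarray(path_lengths_m) / SPEED_OF_SOUND
    n_taps = int(np.ceil(delays.max() * sample_rate)) + 1
    ir = np.zeros(n_taps)
    for delay, gain in zip(delays, path_gains):
        ir[int(round(delay * sample_rate))] += gain
    return ir

def render_recording(dry_source, sample_rate, path_lengths_m, path_gains):
    """Apply the multipath impulse response to a dry source recording."""
    ir = multipath_impulse_response(path_lengths_m, path_gains, sample_rate)
    return fftconvolve(dry_source, ir)[: len(dry_source)]

# Example: a direct path plus two reflections (illustrative values only).
sr = 16000
dry = np.random.randn(sr * 2)  # stand-in for a dry source clip
wet = render_recording(dry, sr,
                       path_lengths_m=[12.0, 19.5, 33.0],
                       path_gains=[1.0 / 12.0, 0.6 / 19.5, 0.3 / 33.0])
```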