EP-4736161-A1 - FUSING AUDIO, VISUAL AND SENSOR CONTEXT INFORMATION IN MOBILE CAPTURE
Abstract
The disclosed systems and methods include a context detection module that detects a current context of an environment of a mobile device. Audio and video processing of audio and images captured by a microphone and camera of the device, respectively, in the environment is determined based on the detected context. The context detection module contains at least one audio classifier and at least one visual classifier. In some embodiments, the context detection module can be extended to use sensor information in place of, or in addition to, the audio and visual information. The captured audio, visual and sensor information are aligned on a time axis based on outputs of the audio classifier, the visual classifier and timestamps associated with the sensor information. One or more fusion methods are used to combine the context detection results.
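A minimal sketch of the time-axis alignment described in the abstract is shown below. The `align_on_time_axis` helper, the fixed 0.5-second analysis grid, and the toy timestamp arrays are illustrative assumptions, not the disclosed implementation.

```python
# Hedged sketch: align timestamped per-modality outputs onto a common
# time axis so they can be fused. Names and grid are assumptions.
import numpy as np

def align_on_time_axis(timestamps: np.ndarray,
                       values: np.ndarray,
                       grid: np.ndarray) -> np.ndarray:
    """Nearest-neighbor alignment of timestamped samples onto grid (seconds)."""
    idx = np.abs(timestamps[:, None] - grid[None, :]).argmin(axis=0)
    return values[idx]

grid = np.arange(0.0, 2.0, 0.5)  # common analysis grid: 0.0, 0.5, 1.0, 1.5 s
# Audio classifier outputs at 1 Hz; light-sensor readings at irregular times.
audio_conf = align_on_time_axis(np.array([0.0, 1.0, 2.0]),
                                np.array([0.6, 0.7, 0.8]), grid)
lux = align_on_time_axis(np.array([0.2, 1.7]),
                         np.array([300.0, 900.0]), grid)
print(audio_conf, lux)  # both modality streams now share one time axis
```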
Inventors
- SUN, Jundai
- SHUANG, Zhiwei
- MA, Yuanxing
- LIU, Yang
Assignees
- Dolby Laboratories Licensing Corporation
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2024-06-26
Claims (20)
- 1. A method comprising: receiving, with at least one processor of a mobile device, input signals including audio, video and sensor signals captured by the mobile device; extracting, with the at least one processor, an audio feature vector from the audio signal; extracting, with the at least one processor, a visual feature vector from the video signal; extracting, with the at least one processor, a sensor feature vector from the sensor signal; and generating, with a classifier, a classification decision indicating an environment context type based on the audio, visual and sensor feature vectors.
- 2. The method of claim 1, further comprising: segmenting, using a sliding window, the audio signal into overlapping segments and performing the extracting and generating steps on the overlapping segments.
- 3. The method of claim 1 or 2, wherein the audio, video and sensor signals are continuously captured by the mobile device.
- 4. The method of any preceding claim, further comprising resampling the visual and sensor feature vectors to match a length of the audio feature vector.
- 5. The method of any preceding claim, further comprising normalizing at least one of the audio, visual or sensor feature vectors to limit the feature vectors to a specified range.
- 6. The method of any preceding claim, further comprising generating a confidence score for the classification decision.
- 7. The method of any preceding claim, wherein at least one audio feature in the audio feature vector comprises Mel-frequency cepstral coefficients (MFCCs), at least one visual feature in the visual feature vector is a color model value, and at least one feature in the sensor feature vector is a measure of ambient light of the environment.
- 8. A method comprising: receiving, with at least one processor of a mobile device, input signals including audio, video and sensor signals captured in an operating environment of the mobile device; extracting, with the at least one processor, an audio feature vector from the audio signal; detecting, with the at least one processor, a first environment context type and first confidence score for the first environment context type with an audio classifier and based on the audio feature vector; extracting, with the at least one processor, a visual feature vector from the video signal; detecting, with the at least one processor, a second environment context type and second confidence score for the second environment context type with a visual classifier and based on the visual feature vector; extracting, with the at least one processor, a sensor feature vector from the sensor signal; detecting, with the at least one processor, a third environment context type and third confidence score for the third environment context type with a sensor classifier and based on the sensor feature vector; and generating a final classification decision of the environment context type based on the first, second and third confidence scores.
- 9. The method of claim 8, wherein the first, second and third confidence scores are weighted by first, second and third weights, respectively.
- 10. The method of claim 9, wherein a weight for the second confidence score is reduced when a user of the mobile device zooms a camera of the mobile device into a specific object.
- 11. The method of claim 9 or 10, wherein a weight for the second confidence score is adjusted based on a quality of an image captured by a camera of the mobile device.
- 12. The method of any one of claims 9 to 11, further comprising: determining that there is music playing on the mobile device; and reducing a weight of the audio classifier.
- 13. The method of any one of claims 9 to 12, wherein a weight for the second confidence score is adjusted based on a time of day.
- 14. The method of any one of claims 9 to 13, wherein a weight for the first confidence score is adjusted if a microphone of the device is occluded.
- 15. The method of any one of claims 9 to 14, wherein a weight for the first confidence score is adjusted based on whether the input audio signal was edited.
- 16. The method of any one of claims 9 to 15, further comprising: determining, by a motion sensor of the device, that the device is shaking while the method is performed; purging the final classification decision of the environment context type; determining, by the motion sensor, that the device is no longer shaking; detecting new first, second and third environment context types and corresponding first, second, and third confidence scores; and generating a new final classification decision of the environment context type based on the new first, second, and third corresponding confidence scores.
- 17. The method of any one of claims 9 to 16, wherein the visual classifier estimates if the mobile device is indoors or outdoors.
- 18. The method of claim 9 or any claim dependent thereon, wherein the weights are time varying.
- 19. The method of claim 9 or any claim dependent thereon, wherein at least one of the weights is adjusted when a change in the environment is detected.
- 20. The method of claim 9 or any claim dependent thereon, wherein the first weight for the first confidence score is reduced.
Description
FUSING AUDIO, VISUAL AND SENSOR CONTEXT INFORMATION IN MOBILE CAPTURE

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of priority from U.S. Provisional Patent Application No. 63/550,287, filed February 6, 2024, and PCT International Patent Application No. PCT/CN2023/102812, filed June 27, 2023, each of which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

[0002] The disclosed embodiments relate to audio, image and video processing, and in particular to combining audio, visual and sensor context information in mobile capture.

BACKGROUND

[0003] User-generated content (UGC) is typically created by consumers and can include any form of content (e.g., images, videos, text, audio). One trend related to UGC is personal moment sharing in variable environments (e.g., indoors, outdoors, by the sea) by recording video and audio using a personal mobile device (e.g., smart phone, tablet computer, wearable device).

SUMMARY

[0004] Various embodiments are disclosed for fusing audio, visual and sensor context information in mobile capture.

[0005] In some embodiments, a method comprises: receiving, with at least one processor of a mobile device, input signals including audio, video and sensor signals captured by the mobile device; extracting, with the at least one processor, an audio feature vector from the audio signal; extracting, with the at least one processor, a visual feature vector from the video signal; extracting, with the at least one processor, a sensor feature vector from the sensor signal; and generating, using a classifier, a classification decision indicating an environment context type based on the audio, visual and sensor feature vectors.

[0006] In some embodiments, the method further comprises segmenting, using a sliding window, the audio signal into overlapping segments and performing the extracting and generating steps on the overlapping segments.

[0007] In some embodiments, the audio, video and sensor signals are continuously captured by the mobile device.

[0008] In some embodiments, the method further comprises resampling the visual and sensor feature vectors to match a length of the audio feature vector.

[0009] In some embodiments, the method further comprises normalizing at least one of the audio, visual or sensor feature vectors to limit the feature vectors to a specified range.

[00010] In some embodiments, the method further comprises generating a confidence score for the classification decision.

[00011] In some embodiments, at least one audio feature in the audio feature vector comprises Mel-frequency cepstral coefficients (MFCCs), at least one visual feature in the visual feature vector is a color model value, and at least one feature in the sensor feature vector is a measure of ambient light of the environment.
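The single-classifier pipeline summarized in paragraphs [0005] through [00011] can be sketched as follows. This is a minimal illustration only: the helper names are hypothetical, log frame energy stands in for MFCCs, per-frame brightness stands in for a color model value, and a brightness threshold stands in for a trained classifier.

```python
# Hedged sketch of the fused feature pipeline: extract per-modality
# features, resample to a common length, normalize, and classify.
import numpy as np

def extract_audio_features(audio, frame_len=1024, hop=512):
    """Stand-in audio feature: log frame energy per hop. A real system
    might use MFCCs, as paragraph [00011] suggests."""
    n_frames = 1 + (len(audio) - frame_len) // hop
    feats = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop : i * hop + frame_len]
        feats[i] = np.log(np.sum(frame ** 2) + 1e-10)
    return feats

def resample_to_length(x, target_len):
    """Linearly resample a feature sequence so the visual and sensor
    features match the audio feature length (cf. paragraph [0008])."""
    src = np.linspace(0.0, 1.0, num=len(x))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(dst, src, x)

def normalize(x, lo=0.0, hi=1.0):
    """Min-max normalize features into a specified range (cf. [0009])."""
    span = x.max() - x.min()
    if span == 0:
        return np.full_like(x, lo)
    return lo + (hi - lo) * (x - x.min()) / span

# Toy captured signals: 1 s of audio at 16 kHz, 30 video frames,
# 10 ambient-light samples.
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)
brightness = rng.uniform(0, 255, size=30)    # per-frame color-model value
ambient_lux = rng.uniform(0, 1000, size=10)  # light-sensor readings

a = normalize(extract_audio_features(audio))
v = normalize(resample_to_length(brightness, len(a)))
s = normalize(resample_to_length(ambient_lux, len(a)))

fused = np.stack([a, v, s], axis=1)  # one fused feature vector per frame
# A trained classifier would map `fused` to an environment context type;
# a simple brightness/light threshold stands in for that decision here.
context = "outdoors" if (v.mean() + s.mean()) / 2 > 0.5 else "indoors"
print(context)
```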
[00012] In some embodiments, a method comprises: receiving, with at least one processor of a mobile device, input signals including audio, video and sensor signals captured in an operating environment of the mobile device; extracting, with the at least one processor, an audio feature vector from the audio signal; detecting, with the at least one processor, a first environment context type and first confidence score for the first environment context type with an audio classifier and based on the audio feature vector; extracting, with the at least one processor, a visual feature vector from the video signal; detecting, with the at least one processor, a second environment context type and second confidence score for the second environment context type with a visual classifier and based on the visual feature vector; extracting, with the at least one processor, a sensor feature vector from the sensor signal; detecting, with the at least one processor, a third environment context type and third confidence score for the third environment context type with a sensor classifier and based on the sensor feature vector; and generating a final classification decision of the environment context type based on the first, second and third confidence scores.

[00013] In some embodiments, the first, second and third confidence scores are weighted by first, second and third weights, respectively.

[00014] In some embodiments, a weight for the second confidence score is reduced when a user of the mobile device zooms a camera of the mobile device into a specific object.

[00015] In some embodiments, a weight for the second confidence score is adjusted based on a quality of an image captured by a camera of the mobile device.

[00016] In some embodiments, the method further comprises determining that there is music playing on the mobile device and reducing a weight of the audio classifier.

[00017] In some embodiments, a weight for the second confidence score is adjusted based on a time of day.

[00018] In some embodiments, a weight for the first confidence score is adjusted if a microphone of the device is occluded.

[00019] In some embodiments, a weight for the first confidence score is adjusted based on whether the input audio signal was edited.

[00020] In some embodiments, a weight for the fir
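The weighted late fusion of paragraphs [00012] through [00019] can be sketched as below. The `fuse_contexts` helper, the condition flags, and the specific weight values are illustrative assumptions; the description leaves the exact weight-adjustment and fusion rules open.

```python
# Hedged sketch of weighted late fusion: each modality's classifier emits
# (context, confidence), and per-modality weights are adjusted by simple
# heuristics like those in paragraphs [00014] through [00018].
from collections import defaultdict

def fuse_contexts(detections, weights):
    """detections: {"audio": ("indoors", 0.7), ...}; weights: per-modality
    weights. Returns the context with the largest weighted confidence
    mass, plus a normalized fused score."""
    mass = defaultdict(float)
    for modality, (context, confidence) in detections.items():
        mass[context] += weights[modality] * confidence
    best = max(mass, key=mass.get)
    return best, mass[best] / sum(mass.values())

# Baseline weights; a deployed system would tune or learn these.
weights = {"audio": 1.0, "visual": 1.0, "sensor": 1.0}

# Heuristic adjustments mirroring the description (flags are assumptions):
camera_zoomed_in = True    # user zoomed into a specific object ([00014])
music_playing = False      # media playback on the device ([00016])
mic_occluded = False       # microphone blocked ([00018])
if camera_zoomed_in:
    weights["visual"] *= 0.5  # frame no longer shows the environment
if music_playing:
    weights["audio"] *= 0.5   # captured audio reflects playback, not the scene
if mic_occluded:
    weights["audio"] *= 0.3

detections = {
    "audio": ("indoors", 0.60),
    "visual": ("outdoors", 0.80),  # down-weighted due to zoom
    "sensor": ("indoors", 0.70),
}
context, score = fuse_contexts(detections, weights)
print(context, round(score, 2))  # indoors wins: 1.3 vs 0.4 weighted mass
```

Multiplicative down-weighting is only one simple design choice; any rule that lowers the contribution of an unreliable modality before combining the confidence scores would serve the same purpose.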