US-12620236-B2 - Data processing system, data processing method, and information providing system
Abstract
A data processing system efficiently identifies a target object pointed to by a vehicle occupant. The data processing system includes: a position detection unit that detects the position of the vehicle; an occupant state recognition unit that recognizes motion of an occupant pointing to outside of the vehicle; a target object database that indicates the positions of target objects that may be pointed to by an occupant; an appearance feature database that indicates appearance features of the target objects; a speech recognition unit that recognizes words indicative of appearance features in the speech of the occupant; an object recognition unit that extracts target object candidates pointed at by the occupant by searching the target object database and the appearance feature database using a direction pointed at by the occupant and a word included in the speech recognized by the speech recognition unit; and an output unit that outputs the target object candidates.
Inventors
- Yuki Matsushita
Assignees
- Faurecia Clarion Electronics Co., Ltd.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-06-21
- Priority Date: 2022-07-29
Claims (10)
- 1 . A system for multi-modal object identification, the system comprising: an interior camera that captures images of an interior of a vehicle including an occupant of the vehicle; an exterior camera that captures images of an area adjacent to an exterior of the vehicle; a microphone that captures sounds in the interior of the vehicle including words spoken by the occupant of the vehicle; a memory that stores a database of objects, wherein the database of objects includes, for each respective object, words describing an appearance of the respective object and a geographic position of the respective object; and one or more processors communicatively coupled to the interior camera, the exterior camera, the microphone, and the memory, wherein the one or more processors are collectively configured to: recognize, using the interior camera, motion of the occupant of the vehicle pointing to outside of the vehicle, recognize, using the microphone, a word or phrase comprising one or more appearance descriptors spoken by the occupant, match the recognized appearance descriptor to appearance-descriptor words stored for respective objects in the database, extract target object candidates pointed at by the occupant by searching the database based on a geographic position of the vehicle, the motion of the occupant, and the matched appearance descriptor, and discarding objects that lack the matched appearance descriptor, and filtering to objects that would be visible from the occupant's viewpoint at the time of the utterance based on the geographic position of the vehicle and positions stored in the database, identify a specific target object among the target object candidates based on a dialogue with the occupant that outputs, for each candidate, a feature of that candidate that was not included in the word or phrase spoken by the occupant and receives a confirmation, utilizing the appearance of the respective target object, and output an indication of the specific target 
object to the occupant.
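The candidate-extraction step recited in claim 1 (discard objects lacking the matched appearance descriptor, then keep those near the pointed direction) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the `PoiRecord` layout, the flat-earth bearing approximation, and the 20° angular tolerance are not taken from the patent.

```python
import math
from dataclasses import dataclass

@dataclass
class PoiRecord:
    name: str
    lat: float
    lon: float
    descriptors: set  # appearance-descriptor words, e.g. {"red", "tall"}

def bearing_deg(lat1, lon1, lat2, lon2):
    """Approximate compass bearing from point 1 to point 2 (flat-earth, short range)."""
    d_east = (lon2 - lon1) * math.cos(math.radians(lat1))
    d_north = lat2 - lat1
    return math.degrees(math.atan2(d_east, d_north)) % 360.0

def extract_candidates(vehicle_lat, vehicle_lon, pointing_bearing_deg,
                       spoken_descriptors, database, tolerance_deg=20.0):
    """Keep objects that match a spoken descriptor AND lie near the pointed bearing."""
    candidates = []
    for poi in database:
        if not (spoken_descriptors & poi.descriptors):
            continue  # discard objects lacking the matched appearance descriptor
        b = bearing_deg(vehicle_lat, vehicle_lon, poi.lat, poi.lon)
        # smallest signed angle between object bearing and pointing bearing
        diff = abs((b - pointing_bearing_deg + 180.0) % 360.0 - 180.0)
        if diff <= tolerance_deg:
            candidates.append(poi)
    return candidates
```

A real implementation would replace the flat-earth bearing with proper geodesic math and derive the pointing bearing from the interior-camera gesture recognition, but the two-stage filter (descriptor match, then direction match) is the structure the claim describes.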
- 2 . The system of claim 1 , wherein the one or more processors are further collectively configured to: estimate, from the occupant's viewpoint, a relative positional relationship of a target object using the geographic position and direction of the vehicle and positions stored in the database; and extract the target object candidates based on a condition that the target object would be visible to the occupant.
- 3 . The system of claim 1 , wherein the one or more processors are further collectively configured to: estimate a relative positional relationship of the target object candidates from a viewpoint of the occupant; recognize a word or a phrase that indicates the relative positional relationship of the target object candidates; and use the relative positional relationship to extract the target object candidates.
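Claims 2 and 3 add a geometric condition: the relative positional relationship from the occupant's viewpoint, and a visibility test. A minimal sketch of that estimate follows; the flat-earth approximation, the left/right labeling, and the 90° frontal field-of-view threshold are illustrative assumptions, not values from the patent.

```python
import math

def relative_position(vehicle_lat, vehicle_lon, vehicle_heading_deg, obj_lat, obj_lon):
    """Describe an object's position relative to the occupant's forward view.

    Returns (side, visible): which side of the heading the object lies on,
    and whether it falls within a crude frontal field of view.
    """
    d_east = (obj_lon - vehicle_lon) * math.cos(math.radians(vehicle_lat))
    d_north = obj_lat - vehicle_lat
    bearing = math.degrees(math.atan2(d_east, d_north)) % 360.0
    # signed angle relative to the vehicle heading: -180..180, 0 = dead ahead
    rel = (bearing - vehicle_heading_deg + 180.0) % 360.0 - 180.0
    side = "left" if rel < 0 else "right"
    visible = abs(rel) < 90.0  # crude frontal field-of-view check
    return side, visible
```

The `side` label is the kind of relational word ("the one on the left") that claim 3 matches against the occupant's speech, and the `visible` flag implements the claim 2 condition that only objects visible from the occupant's viewpoint are kept as candidates.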
- 4 . The system of claim 1 , wherein the dialogue includes: outputting audio expressing a respective target object candidate using words or phrases that indicate a feature of the respective target object candidate that was not included in the word or phrase spoken by the occupant, and in response to the audio, receiving a confirmation of the respective target object candidate as the specific target object.
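The dialogue of claim 4 hinges on one detail: the system confirms a candidate by describing a stored feature the occupant did *not* already say. A minimal sketch, with hypothetical function names and a deterministic feature choice assumed for illustration:

```python
def distinguishing_feature(candidate_features, spoken_words):
    """Choose a stored appearance feature the occupant did not already mention."""
    unused = candidate_features - spoken_words
    return min(unused) if unused else None  # deterministic pick for the sketch

def confirmation_prompt(candidate_name, feature):
    """Phrase the confirmation question around the unmentioned feature."""
    return f"Do you mean the {feature} one ({candidate_name})?"
```

If the occupant said "the red building" and the database stores {"red", "brick", "tower"} for a candidate, the prompt is built from "brick" or "tower" rather than repeating "red", which is what makes the confirmation informative enough to disambiguate between candidates.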
- 5 . The system of claim 1 , wherein the memory further stores a map database used to find and guide a route for the vehicle.
- 6 . A method for multi-modal object identification, the method comprising: acquiring a position of a vehicle; recognizing motion of an occupant of the vehicle pointing outside of the vehicle from image data captured inside the vehicle; recognizing, from audio data collected in the vehicle, a word or phrase comprising one or more appearance descriptors spoken by the occupant; matching the recognized appearance descriptor(s) to appearance-descriptor words stored for respective objects in a database that stores, for each respective object, a geographic position and appearance-descriptor words; searching the database using a geographic position of the vehicle, a direction pointed to by the occupant, and the matched appearance descriptor; extracting target object candidates pointed to by the occupant by discarding objects that lack the matched appearance descriptor(s) and filtering to objects that would be visible from the occupant's viewpoint at the time of the utterance based on the geographic position of the vehicle and positions stored in the database; identifying a specific target object among the target object candidates based on a dialogue with the occupant that outputs, for each candidate, a feature not included in the recognized word or phrase and receives a confirmation; and outputting an indication of the specific target object to the occupant.
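The method of claim 6 can be read as a single pipeline: match descriptors, filter by pointed direction, then run the disambiguating dialogue until a confirmation is received. The sketch below strings those steps together; the database tuple layout, the flat-earth bearing, the 20° tolerance, and the `confirm` callback standing in for the spoken dialogue are all illustrative assumptions.

```python
import math

def identify_target(vehicle, pointing_bearing, spoken, database, confirm, tol=20.0):
    """End-to-end sketch of the claimed method. `database` holds
    (name, (lat, lon), descriptor_words) tuples; `confirm` answers prompts."""
    lat, lon = vehicle
    candidates = []
    for name, (olat, olon), words in database:
        if not (spoken & words):
            continue  # step: discard objects lacking the matched descriptor
        d_east = (olon - lon) * math.cos(math.radians(lat))
        d_north = olat - lat
        bearing = math.degrees(math.atan2(d_east, d_north)) % 360.0
        if abs((bearing - pointing_bearing + 180.0) % 360.0 - 180.0) <= tol:
            candidates.append((name, words))  # step: direction filter
    for name, words in candidates:  # step: disambiguating dialogue
        extra = sorted(words - spoken)  # feature not in the recognized phrase
        prompt = f"the {extra[0]} one?" if extra else f"{name}?"
        if confirm(prompt):
            return name
    return None
```

In the claimed system the pointing bearing would come from interior-camera gesture recognition and `confirm` from speech output and recognition; here they are plain parameters so the control flow of the method stands alone.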
- 7 . The method of claim 6 , wherein the dialogue includes: outputting audio expressing the respective target object using words or phrases that indicate a feature of the respective target object that was not included in the word or phrase spoken by the occupant and in response to the audio, receiving a confirmation of the respective target object as the specific target object from the occupant.
- 8 . A system that communicates with a vehicle for performing multi-modal object identification, the system comprising: a memory that stores a database of objects, wherein the database of objects includes, for each respective object, words describing an appearance of the respective object and a geographic position of the respective object; a communication interface communicatively coupled to the vehicle via a communication network; and one or more processors communicatively coupled to the memory and the communication interface, wherein the one or more processors are collectively configured to: receive, using the communication interface, audio data from a microphone located in the vehicle, wherein the microphone captures sounds in an interior of the vehicle including words spoken by an occupant of the vehicle, recognize, from the audio data, a word or phrase comprising one or more appearance descriptors spoken by the occupant, match the recognized appearance descriptor(s) to appearance-descriptor words stored for respective objects in the database, receive, using the communication interface, image data from an interior camera located in the vehicle, wherein the interior camera captures images of the interior of the vehicle including the occupant of the vehicle, recognize a direction of a pointing gesture by the occupant from the image data, extract target object candidates pointed at by the occupant by searching the database based on a geographic position of the vehicle, the pointing gesture of the occupant, and the matched appearance descriptor, and discarding objects that lack the matched appearance descriptor, identify a specific target object among the target object candidates based on a dialogue with the occupant that outputs, for each candidate, a feature of that candidate that was not included in the word or phrase spoken by the occupant and receives a confirmation, utilizing the appearance of the respective target object, and output, using the communication 
interface, an indication of the specific target object to the occupant.
- 9 . The system of claim 8 , wherein the one or more processors are collectively configured to: estimate a relative positional relationship of the target object candidates from a viewpoint of the occupant using a geographic position of the vehicle and positions stored in the database; and extract the target object candidates based on a condition that the target object candidates would be visible to the occupant.
- 10 . The system of claim 9 , wherein the dialogue includes: outputting audio expressing a respective target object using words or phrases that indicate a feature of the respective target object that was not included in the word or phrase spoken by the occupant and in response to the audio, receiving a confirmation of the respective target object as the specific target object from the occupant.
Description
TECHNICAL FIELD

The present invention relates to a data processing system, a data processing method, and an information providing system.

BACKGROUND TECHNOLOGY

Conventionally, there is a technology for identifying target objects pointed to by a user who is an occupant of a vehicle. For example, Patent Document 1 states, "a target object identifying device that accurately identifies a target object that exists in a direction to which a user's hand or finger is pointing is provided," and also states that "positioning unit 13 detects a current vehicle position and vehicle orientation. An imaging unit 18 images the surroundings of the vehicle. A pointing direction detection unit 16 detects a pointing direction pointed toward by the user in the vehicle using their hand. A target object extraction unit extracts target objects that exist in the indicated direction detected by the pointing direction detection unit 16 from the image captured by the imaging unit 18. The target object position identification unit identifies the position of the target object extracted by the target object extraction unit with respect to the vehicle."

PRIOR ART DOCUMENTS

Patent Documents
[Patent Document 1] JP2007080060 A

SUMMARY OF THE INVENTION

Problem to Be Solved by the Invention

With the conventional technology, it is difficult to identify the target object intended by the user when there are a plurality of candidates for the target object in the direction pointed toward by the user. In particular, when pointing far away, there may be target object candidates in front of or behind, as well as to the left or right of, the pointing direction and the recognized position. When there are a plurality of target object candidates, the candidates can be enumerated and presented to the occupant, and the target object intended by the user can be identified if a selection operation is received from the occupant. However, insufficient narrowing down of candidates forces the occupant to make cumbersome decisions and perform cumbersome operations. In-vehicle devices should not use an interface that requires cumbersome operations, as this may compromise safe driving. Therefore, an object of the present invention is to efficiently identify target objects pointed to by an occupant of a vehicle.

Means for Solving the Problem

In order to achieve the aforementioned object, a representative data processing system of the present invention provides: a position detection unit that detects the position of the vehicle; an occupant state recognition unit that recognizes motion of an occupant of the vehicle pointing to outside of the vehicle; a target object database that indicates the positions of target objects that may be pointed to by an occupant; an appearance feature database that indicates appearance features of the target objects; a speech recognition unit that recognizes words indicative of appearance features in the speech of the occupant; an object recognition unit that searches the target object database and the appearance feature database using a direction pointed at by an occupant and a word included in the speech recognized by the speech recognition unit, to extract target object candidates pointed at by the occupant; and an output unit that outputs the target object candidates.

Effects of the Invention

According to the present invention, target objects pointed to by an occupant of a vehicle can be efficiently identified. The following description of embodiments will elucidate problems, configurations, and effects other than those described above.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an explanatory diagram depicting an overview of data processing of Embodiment 1. FIG. 2 is a configuration diagram of the data processing device of Embodiment 1. FIG. 3 is an explanatory diagram for determining target object candidates. FIG. 4 is a flowchart depicting the processing steps of the data processing device. FIG. 5 is an explanatory diagram of the interface between dialogue and the features of the target object. FIG. 6 is an explanatory diagram of a modified example that includes an in-vehicle device and a server. FIG. 7 is an explanatory diagram of another modified example that includes an in-vehicle device and a server.

EMBODIMENTS OF THE INVENTION

Next, embodiments of the present invention will be described using the drawings.

Embodiment 1

FIG. 1 is an explanatory diagram depicting an overview of data processing of Embodiment 1. FIG. 1 depicts a user, who is an occupant of a vehicle, pointing outside the vehicle. The data processing device 10 provided in the vehicle is equipped with an interior camera that captures images of the interior of the vehicle and an exterior camera that captures images of the surroundings of the vehicle. The data processing device 10 acquires images from the interior camera, analyzes the images to identify the eye position and finger position of the occupant, and determines that a straight line connecting the eye position