US-12626490-B2 - Cross-modal manifold alignment across different data domains

US 12626490 B2

Abstract

A method and system for cross-modal manifold alignment of different data domains includes determining for a shared embedding space a first embedding function for data of a first domain and a second embedding function for data of a second domain using a triplet loss, wherein triplets of the triplet loss include an anchor data point from the first domain, and a positive and a negative data point from the second domain; creating a first mapping for the data of the first domain using the first embedding function in the shared embedding space; creating a second mapping for the data of the second domain using the second embedding function in the shared embedding space; and generating a cross-modal manifold alignment for the data of the first domain and the data of the second domain.
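As a hedged illustration of the triplet construction described in the abstract (an anchor embedding from the first domain, a positive and a negative embedding from the second domain, all in the shared space), the loss can be sketched in NumPy. This is a minimal sketch, not the patented implementation; the function name, margin value, and array shapes are assumptions:

```python
import numpy as np

def cross_modal_triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss in the shared embedding space (illustrative sketch).

    anchor:   embeddings of first-domain data, shape (n, d)
    positive: matching second-domain embeddings, shape (n, d)
    negative: non-matching second-domain embeddings, shape (n, d)
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # anchor-to-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # anchor-to-negative distance
    # Hinge loss: push positives closer to the anchor than negatives by `margin`.
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```

In practice the anchor would be the first embedding function applied to first-domain data and the positive/negative the second embedding function applied to second-domain data, with both functions trained jointly to minimize this loss.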

Inventors

  • Andre Tai NGUYEN
  • Luke Edward RICHARDS
  • Edward Simon Paster RAFF

Assignees

  • BOOZ ALLEN HAMILTON INC.

Dates

Publication Date
2026-05-12
Application Date
2021-06-09

Claims (20)

  1. A method for cross-modal manifold alignment of different data domains, the method comprising: determining for a shared embedding space a first embedding function for data of a first domain and a second embedding function for data of a second domain using a triplet loss, wherein triplets of the triplet loss include an anchor data point from the first domain, and a positive and a negative data point from the second domain, wherein the first domain includes RGB color data and RGB depth data; colorizing plural RGB depth features of the RGB depth data by distributing the plural RGB depth features across plural RGB color features of the RGB color data; creating a first mapping for the data of the first domain using the first embedding function in the shared embedding space; creating a second mapping for the data of the second domain using the second embedding function in the shared embedding space; generating a cross-modal manifold alignment for the data of the first domain and the data of the second domain; and training a classifier model based on the cross-modal manifold alignment of the data of the first domain and the data of the second domain, wherein the generating the cross-modal manifold alignment includes: superimposing one of the first mapping or the second mapping on the other of the first mapping or the second mapping to generate the cross-modal manifold alignment, the superimposing including one or more of: translating the first mapping and/or the second mapping in the shared embedding space, scaling the first mapping and/or the second mapping in the shared embedding space, and/or rotating the first mapping and/or the second mapping in the shared embedding space by generating a rotation matrix based on a difference between a first term and a second term, the first term including a first ratio of a first difference between the first embedding function and an average of the first embedding function to a Frobenius norm of the first difference, and the second term including a second ratio of a second difference between the second embedding function and an average of the second embedding function to a Frobenius norm of the second difference.
  2. The method of claim 1, wherein the first domain and the second domain are from the same modality.
  3. The method of claim 1, wherein the first domain and the second domain are from different modalities.
  4. The method of claim 1, including: inputting a first data input file and a second data input file into the cross-modal manifold alignment, the first data input file being of the first domain and the second data input file being of the second domain; determining a relationship between the first data input file and the second data input file, wherein the relationship indicates the first data input file and the second data input file represent the same object based on the cross-modal manifold alignment; and storing in a database the first data input file and the second data input file and the relationship between the first data input file and the second data input file.
  5. The method of claim 1, further comprising: extracting the plural RGB color features and the plural RGB depth features from the first domain; generating feature vectors from the RGB color features and the colorized RGB depth features; and concatenating the RGB color feature vectors and the colorized RGB depth feature vectors.
  6. The method of claim 1, wherein the superimposing further comprises at least one of: translating the first mapping and/or the second mapping in the shared embedding space; and scaling the first mapping and/or the second mapping in the shared embedding space.
  7. A system for cross-modal manifold alignment of different data domains, the system comprising: a processor configured to: determine for a shared embedding space a first embedding function for data of a first domain and a second embedding function for data of a second domain using a triplet loss, wherein triplets of the triplet loss include an anchor data point from the first domain and a positive and a negative data point from the second domain, wherein the data of the first domain includes RGB color data and RGB depth data; colorize plural RGB depth features of the RGB depth data by distributing the plural RGB depth features across plural RGB color features of the RGB color data; create a first mapping for the data of the first domain using the first embedding function in the shared embedding space; create a second mapping for the data of the second domain using the second embedding function in the shared embedding space; generate a cross-modal manifold alignment for the data of the first domain and the data of the second domain; and train a classifier model based on the cross-modal manifold alignment of the data of the first domain and the data of the second domain, wherein to generate the cross-modal manifold alignment the processor is configured to: superimpose one of the first mapping or the second mapping on the other of the first mapping or the second mapping to generate the cross-modal manifold alignment, wherein to superimpose one of the first mapping or the second mapping on the other of the first mapping or the second mapping, the processor is configured to at least rotate the first mapping and/or the second mapping in the shared embedding space by generating a rotation matrix based on a difference between a first term and a second term, the first term including a first ratio of a first difference between the first embedding function and an average of the first embedding function to a Frobenius norm of the first difference, and the second term including a second ratio of a second difference between the second embedding function and an average of the second embedding function to a Frobenius norm of the second difference.
  8. The system of claim 7, wherein the first domain and the second domain are from the same modality.
  9. The system of claim 7, wherein the first domain and the second domain are from different modalities.
  10. The system of claim 7, wherein the processor is configured to: input a first data input file and a second data input file into the cross-modal manifold alignment, the first data input file being of the first domain and the second data input file being of the second domain; determine a relationship between the first data input file and the second data input file, wherein the relationship indicates the first data input file and the second data input file represent the same object based on the cross-modal manifold alignment; and store in a database the first data input file and the second data input file and the relationship between the first data input file and the second data input file.
  11. The system of claim 7, wherein the data of the first domain is Red, Green, Blue, Depth (RGB-D) sensor data.
  12. The system of claim 7, wherein to superimpose one of the first mapping or the second mapping on the other of the first mapping or the second mapping, the processor is further configured to: translate the first mapping and/or the second mapping in the shared embedding space; and/or scale the first mapping and/or the second mapping in the shared embedding space.
  13. A computer-implemented method comprising: creating a representation of a cross-modal manifold alignment, in a latent space, for robotics data spanning plural heterogeneous domains including a first domain comprising domain data from a robotics language model and a second domain comprising domain data from a robotics imaging model; determining a data pair for the robotics data across the first domain and the second domain based on an analysis of the cross-modal manifold alignment, wherein the determining of the data pair comprises correlating words from the robotics language model of the first domain with robotics sensor image data including RGB-D image data of the second domain based on the analysis of the cross-modal manifold alignment; training, based on the determining of the data pair, a cross-modal alignment algorithm to identify a plurality of data pairs for the robotics data across at least the first domain and the second domain using the cross-modal manifold alignment; generating the representation of the cross-modal manifold alignment for the robotics data by superimposing a first manifold for the first domain with a second manifold for the second domain, which includes optimizing a translation, a scaling, and a rotation of the first manifold and the second manifold to generate the representation, the first manifold and the second manifold being obtained from a first manifold mapping file for the first domain and a second manifold mapping file for the second domain; and wherein optimizing a translation, a scaling, and a rotation of the first manifold and the second manifold includes generating a rotation matrix based on a difference between a first term and a second term of a shared embedding space, the first term including a first ratio of a first difference between a first embedding function and an average of the first embedding function to a Frobenius norm of the first difference, and the second term including a second ratio of a second difference between a second embedding function and an average of the second embedding function to a Frobenius norm of the second difference.
  14. The computer-implemented method of claim 13, further comprising: using the cross-modal alignment algorithm to identify a new data pair for the robotics data.
  15. The computer-implemented method of claim 13, further comprising: generating a robotics domain-specific data set based on the plurality of data pairs for the robotics data identified in the training.
  16. The computer-implemented method of claim 15, further comprising: storing the robotics domain-specific data set in a database for access and use.
  17. The computer-implemented method of claim 13, wherein the creating of the representation of the cross-modal manifold alignment further comprises: analyzing a first manifold mapping file for the first domain and a second manifold mapping file for the second domain, generating, based on the analyzing of the first manifold mapping file and the second manifold mapping file, the representation of the cross-modal manifold alignment for the robotics data by superimposing a first manifold for the first domain with a second manifold for the second domain, which includes optimizing a translation, a scaling, and a rotation of the first manifold and the second manifold to generate the representation, and displaying the representation of the cross-modal manifold alignment, via a graphical user interface, for visual inspection of the cross-modal manifold alignment.
  18. The computer-implemented method of claim 17, further comprising: recasting the representation of the cross-modal manifold alignment based on an evaluation of an output from the creating of the representation of the cross-modal manifold alignment.
  19. The computer-implemented method of claim 17, wherein the robotics data spanning the plural heterogeneous domains includes robotics sensor data in the form of at least one of: a third domain comprising domain data from a model of audio data, a fourth domain comprising domain data from a model of pressure data, a fifth domain comprising domain data from a model of temperature data, a sixth domain comprising domain data from a model of haptic data, and a seventh domain comprising domain data from a model of motion data, and wherein the robotics sensor data is included in the representation of the cross-modal manifold alignment.
  20. The computer-implemented method of claim 19, wherein the training further comprises identifying one or more data pairs between the robotics sensor data and one of the first domain and the second domain.
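The rotation step recited in claims 1, 7, and 13 (centering each embedding by its average and dividing the centered matrix by its Frobenius norm before deriving a rotation matrix) reads like an orthogonal-Procrustes-style alignment. The NumPy sketch below is one plausible reading under that assumption, not the patent's actual implementation; the SVD-based solution and function name are the editor's choices:

```python
import numpy as np

def rotation_matrix(map_a, map_b):
    """One plausible reading of the claimed rotation step (assumption):
    center each mapping by its mean, divide by the Frobenius norm of the
    centered matrix, then solve orthogonal Procrustes via SVD."""
    a0 = map_a - map_a.mean(axis=0)   # first difference (first term's numerator)
    b0 = map_b - map_b.mean(axis=0)   # second difference (second term's numerator)
    a0 = a0 / np.linalg.norm(a0)      # np.linalg.norm on a matrix is the Frobenius norm
    b0 = b0 / np.linalg.norm(b0)
    # SVD of the cross-covariance yields the orthogonal matrix aligning b onto a.
    u, _, vt = np.linalg.svd(b0.T @ a0)
    return u @ vt
```

With this reading, if `map_b` is a rotated copy of `map_a`, applying the returned matrix to the centered, normalized `map_b` recovers the centered, normalized `map_a`.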

Description

FIELD

The present disclosure relates to methods and systems for cross-modal manifold alignment of data from different domains, and more particularly to using triplet loss for manifold alignment in the context of grounded language.

BACKGROUND

Artificial intelligence-enabled devices are becoming increasingly advanced and affordable, and thus ever more present in our daily lives. Therefore, there is great interest in making such devices as intuitive and easy to interact with as possible. Language offers an approachable and relatively accessible interface without requiring prior training on the part of the user. The integration of voice-assistant speakers in homes has increased drastically in recent years, and language may become a preferred method for interacting with AI-enabled assistants. However, understanding how such devices' recognition of natural language can best be grounded to the physical world is still very much an open problem. Combining language and robotics creates unique challenges that much of the current work on grounded language learning has not addressed. One such way of combining language and robotics is the use of manifold alignment, which finds a mapping from heterogeneous representations to a shared structure in latent space. Manifold alignment makes the assumption that there is an underlying latent manifold that datasets share, which is obtained by leveraging correspondences between paired data elements. Current work in the area of manifold alignment as applied to learning groundings between language and physical context relies on extensive databases such as the Recipe1M dataset, which contains one million cooking recipes and eight hundred thousand food images. In the robotics domain, current approaches to language grounding are very limited in the number of object classes and are restricted to learning joint embeddings.
Thus, there is a need for a novel and more effective approach to language grounding, particularly where only smaller datasets of ground truth are available and where the data spans different domains.

SUMMARY

A method for cross-modal manifold alignment of different data domains is disclosed. The method includes determining for a shared embedding space a first embedding function for data of a first domain and a second embedding function for data of a second domain using a triplet loss, wherein triplets of the triplet loss include an anchor data point from the first domain, and a positive and a negative data point from the second domain; creating a first mapping for the data of the first domain using the first embedding function in the shared embedding space; creating a second mapping for the data of the second domain using the second embedding function in the shared embedding space; and generating a cross-modal alignment for the data of the first domain and the data of the second domain. The generating of the cross-modal alignment can include: superimposing the first mapping and the second mapping to generate a cross-modal manifold alignment. The superimposing of the first mapping and the second mapping can include one or more of the following: translating the first mapping and the second mapping in the shared embedding space, scaling the first mapping and the second mapping in the shared embedding space, and/or rotating the first mapping and the second mapping in the shared embedding space.
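The superimposition summarized above (translation, scaling, and rotation of one mapping onto the other) can be sketched as a similarity-transform alignment in the Procrustes style. This is an illustrative assumption about how the three operations compose, not the disclosed implementation; the function name and the least-squares scale estimate are the editor's choices:

```python
import numpy as np

def superimpose(map_a, map_b):
    """Align map_b onto map_a by translation, scaling, and rotation
    (a Procrustes-style sketch; an assumption, not the disclosed method)."""
    mean_a, mean_b = map_a.mean(axis=0), map_b.mean(axis=0)
    a0, b0 = map_a - mean_a, map_b - mean_b      # translation: center both mappings
    u, s, vt = np.linalg.svd(b0.T @ a0)          # SVD of the cross-covariance
    rot = u @ vt                                 # rotation aligning b0 onto a0
    scale = s.sum() / np.linalg.norm(b0) ** 2    # least-squares scale factor
    return scale * (b0 @ rot) + mean_a           # map_b expressed in map_a's frame
```

Under this sketch, if `map_b` is a scaled, rotated, and translated copy of `map_a`, the returned array recovers `map_a` exactly.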
The method can also include inputting a first data input file and a second data input file into the cross-modal manifold alignment, the first data input file being of the first domain and the second data input file being of the second domain; determining a relationship between the first data input file and the second data input file, wherein the relationship indicates the first data input file and the second data input file represent the same object based on the cross-modal manifold alignment; and storing in a database the first data input file and the second data input file and the relationship between the first data input file and the second data input file.

A system for cross-modal manifold alignment of different data domains is disclosed. The system includes a processor configured to: determine for a shared embedding space a first embedding function for data of a first domain and a second embedding function for data of a second domain using triplet loss, wherein triplets of the triplet loss include an anchor data point from the first domain, and a positive and a negative data point from the second domain; create a first mapping for the data of the first domain using the first embedding function in the shared embedding space; create a second mapping for the data of the second domain using the second embedding function in the shared embedding space; and generate a cross-modal alignment for the data of the first domain and the data of the second domain. The generating the cross-modal alignment involves the processor being configured to: superimpose the first mapping