CN-112970035-B - System and method for domain adaptation in a neural network using a domain classifier

CN112970035B

Abstract

The domain adaptation module (1800) optimizes a first domain (1802), derived from a second domain (1804), using respective outputs from respective parallel hidden layers of the two domains.

Inventors

  • R. Chen
  • M.-H. Chen
  • J. Yu
  • X. Liu

Assignees

  • Sony Interactive Entertainment Inc.

Dates

Publication Date
2026-05-12
Application Date
2019-08-27
Priority Date
2018-10-31

Claims (20)

  1. An apparatus for performing domain adaptation, the apparatus comprising: at least one processor; and at least one computer storage device that is not a transient signal and that includes instructions executable by the at least one processor to: access a first neural network for action recognition, the first neural network associated with a first data type, wherein the first data type is real-world video from a real-world video recording; access a second neural network for action recognition, the second neural network associated with a second data type different from the first data type, wherein the second data type is video game video from a rendering of the video game; provide first training data as input to the second neural network; select a first layer from a plurality of hidden layers of the second neural network; identify a spatial output from the first layer, the spatial output generated based on the first training data; determine, using a third neural network, whether the spatial output from the first layer is from the first neural network, the third neural network being different from the first neural network and the second neural network; based on determining that the spatial output from the first layer is not from the first neural network, adjust one or more weights of the first layer; select a second layer from the plurality of hidden layers of the second neural network; identify a temporal output from the second layer, the temporal output generated based on the first training data; determine, using a fourth neural network, whether the temporal output from the second layer is from the first neural network, the fourth neural network being different from the first, second, and third neural networks; and adjust one or more weights of the second layer based on determining that the temporal output from the second layer is not from the first neural network.
  2. The apparatus of claim 1, wherein the instructions are executable by the at least one processor to: initially establish the second neural network by replicating the first neural network.
  3. The apparatus of claim 1, wherein the instructions are executable by the at least one processor to: decline to adjust one or more weights of the first layer based on determining that the spatial output from the first layer is from the first neural network.
  4. The apparatus of claim 3, wherein the spatial output is a first spatial output, and wherein the instructions are executable by the at least one processor to: select a third layer from the plurality of hidden layers of the second neural network based on determining that the first spatial output from the first layer is from the first neural network; identify a second spatial output from the third layer; determine, using the third neural network, whether the second spatial output is from the first neural network; and adjust one or more weights of the third layer based on determining that the second spatial output is not from the first neural network.
  5. The apparatus of claim 4, wherein the first layer and the third layer of the second neural network are randomly selected.
  6. The apparatus of claim 1, wherein the instructions are executable by the at least one processor to: before using the third neural network to determine whether the spatial output from the first layer is from the first neural network, adjust one or more weights of one or more layers of the third neural network such that the third neural network learns to correctly classify spatial outputs from layers of either of the first and second neural networks.
  7. The apparatus of claim 6, wherein the third neural network operates in an unsupervised mode to learn to correctly classify spatial output from layers of either of the first and second neural networks using unlabeled data.
  8. A method for performing domain adaptation, the method comprising: accessing a first neural network for action recognition, the first neural network associated with a first data type, wherein the first data type is real-world video from a real-world video recording; accessing a second neural network for action recognition, the second neural network associated with a second data type different from the first data type, wherein the second data type is video game video from a rendering of the video game; providing first training data as input to the second neural network; selecting a first layer from a plurality of hidden layers of the second neural network; identifying a spatial output from the first layer, the spatial output generated based on the first training data; determining, using a third neural network, whether the spatial output from the first layer is from the first neural network, the third neural network being different from the first neural network and the second neural network; based on determining that the spatial output from the first layer is not from the first neural network, adjusting one or more weights of the first layer; selecting a second layer from the plurality of hidden layers of the second neural network; identifying a temporal output from the second layer, the temporal output generated based on the first training data; determining, using a fourth neural network, whether the temporal output from the second layer is from the first neural network, the fourth neural network being different from the first, second, and third neural networks; and adjusting one or more weights of the second layer based on determining that the temporal output from the second layer is not from the first neural network.
  9. The method of claim 8, comprising: determining, using the third neural network, whether the spatial output from the first layer is from the first neural network at least in part by using the third neural network to identify the spatial output from the first layer as being related to the first data type.
  10. The method of claim 8, comprising: declining to adjust one or more weights of the first layer based on determining that the spatial output from the first layer is from the first neural network.
  11. The method of claim 10, wherein the spatial output is a first spatial output, and wherein the method comprises: selecting a third layer from a plurality of hidden layers of the second neural network based on determining that the first spatial output from the first layer is from the first neural network; identifying a second spatial output from the third layer; determining, using the third neural network, whether the second spatial output is from the first neural network; and adjusting one or more weights of the third layer based on determining that the second spatial output is not from the first neural network.
  12. The method of claim 8, wherein the first layer is selected based on a command from a human supervisor.
  13. The method of claim 8, comprising: before using the third neural network to determine whether the spatial output from the first layer is from the first neural network, adjusting one or more weights of one or more layers of the third neural network such that the third neural network learns to correctly classify spatial outputs from layers of either of the first and second neural networks.
  14. The method of claim 8, wherein the third neural network operates in an unsupervised mode to learn to correctly classify spatial output from layers of either of the first and second neural networks using unlabeled data.
  15. The method of claim 8, comprising: initially establishing the second neural network by replicating the first neural network.
  16. An apparatus for performing domain adaptation, the apparatus comprising: at least one computer storage device that is not a transient signal and that includes instructions executable by at least one processor to: determine, using a first domain classifier, whether a spatial output from a first hidden layer of a first model is from the first model or from a second model different from the first model, the first model and the second model being related to different data domains; based on determining that the spatial output is from the second model, adjust one or more weights of the first hidden layer; determine, using a second domain classifier, whether a temporal output from a second hidden layer of the first model is from the first model or the second model, the second domain classifier being different from the first domain classifier; and adjust one or more weights of the second hidden layer based on determining that the temporal output is from the second model.
  17. The apparatus of claim 16, wherein the different data domains include a first domain related to real-world video from a real-world video recording and a second domain related to rendered computer game video from a video game.
  18. The apparatus of claim 16, wherein the different data domains include a first domain related to information derived from a first voice and a second domain related to information derived from a second voice.
  19. The apparatus of claim 16, wherein the different data domains include a first domain related to standard font text and a second domain related to cursive script.
  20. The apparatus of claim 16, wherein the first domain classifier and the second domain classifier use a gradient reversal layer (GRL) that receives data from a spatial model and a temporal model to reverse gradients.
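The gradient reversal layer named in claim 20 can be illustrated with a minimal NumPy sketch (not the patented implementation): the forward pass is the identity, while the backward pass negates and scales the gradient flowing back to the spatial or temporal model, so that model learns features that confuse the domain classifier. The helper names and the `lam` scaling factor are illustrative assumptions.

```python
import numpy as np

def grl_forward(x):
    # Forward pass: identity -- features pass through unchanged.
    return np.asarray(x, dtype=float)

def grl_backward(grad_output, lam=1.0):
    # Backward pass: negate and scale the gradient flowing back to the
    # spatial/temporal model, training it to confuse the domain classifier.
    return -lam * np.asarray(grad_output, dtype=float)
```

For example, `grl_backward(np.array([0.2, -0.4]), lam=0.5)` yields `[-0.1, 0.2]`.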

Description

System and method for domain adaptation in a neural network using a domain classifier

Technical Field

The present application relates generally to technically inventive, non-conventional solutions that are necessarily rooted in computer technology and that lead to specific technical improvements.

Background

Machine learning (sometimes referred to as deep learning) may be used in a variety of useful applications related to data understanding, detection, and/or classification, including image classification, optical character recognition (OCR), object recognition, action recognition, speech recognition, and emotion recognition. However, as understood herein, a machine learning system trained on a dataset from one domain (e.g., movie video) may not be sufficient to identify actions in another domain (such as a computer game). For example, in the computer gaming industry, video and audio are two independent production processes. First a game is designed and produced without audio; the audio team then surveys the entire game video and inserts the corresponding sound effects (SFX) from an SFX database, which is time-consuming. As understood herein, machine learning may be used to accelerate the process, but current action recognition models are trained on real-world video datasets, so they suffer from dataset shift (dataset bias) when applied to game video.

Disclosure of Invention

To overcome the domain mismatch problem described above, at least two generic domains of training data (image, video, or audio) are used to classify the target dataset. A pair of training data domains may be created from, for example, real-world video and computer game video, first and second speaker voices (for voice recognition), standard font text and cursive script (for handwriting recognition), and the like.
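Before turning to the module itself, the role of a domain classifier over such a domain pair can be made concrete. The following sketch trains a logistic-regression classifier to tell apart feature vectors standing in for hidden-layer outputs of a "real-world video" network and a "game video" network; all data, dimensions, and hyperparameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden-layer features from the two domains.
feats_real = rng.normal(0.0, 1.0, size=(200, 8))   # domain 0: real video
feats_game = rng.normal(1.5, 1.0, size=(200, 8))   # domain 1: game video
X = np.vstack([feats_real, feats_game])
y = np.array([0] * 200 + [1] * 200)

# Logistic-regression domain classifier trained by gradient descent.
w, b = np.zeros(8), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

pred = ((X @ w + b) > 0.0).astype(int)
accuracy = np.mean(pred == y)  # near-perfect on this well-separated toy data
```

Once such a classifier can reliably name the domain of a feature vector, its decisions can drive the weight adjustments described in the claims.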
Thus, the generic domain adaptation module, established by a loss function and/or an actual neural network, receives inputs from a plurality of output points of two training domains of deep learning and provides output metrics so that one, and possibly both, of the two trajectories of the neural network can be optimized. A generic cross-domain feature normalization module may also be used and inserted into any layer of the neural network.

Thus, in one aspect, an apparatus includes at least one processor and at least one computer storage device that is not a transient signal and that includes instructions executable by the at least one processor. The instructions are executable to access a first neural network associated with a first data type, access a second neural network associated with a second data type different from the first data type, and provide first training data as input to the second neural network. The instructions are also executable to select a first layer, wherein the first layer is a hidden layer of the second neural network. The instructions are then executable to identify an output from the first layer generated based on the first training data and determine, using a third neural network, whether the output from the first layer is from the first neural network. The third neural network is different from the first neural network and the second neural network. The instructions are further executable to adjust one or more weights of the first layer based on determining that the output from the first layer is not from the first neural network.

In some examples, the instructions may be executable to initially establish the second neural network by replicating the first neural network. Also in some examples, the instructions may be executable to refuse to adjust one or more weights of the first layer based on determining that the output from the first layer is from the first neural network.
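The adjustment step just described (nudging a selected hidden layer until the classifier no longer flags its output as foreign) can be sketched as follows. The frozen classifier weights, layer shape, input activation, and learning rate are all invented for illustration.

```python
import numpy as np

# Frozen, pre-trained domain classifier (weights assumed given).
clf_w, clf_b = np.array([1.0, -0.5, 0.5, 1.0]), 0.0

def p_not_first(feat):
    # Probability the classifier assigns to "not from the first network".
    return 1.0 / (1.0 + np.exp(-(feat @ clf_w + clf_b)))

layer_w = np.eye(4)                      # selected hidden layer's weights
x = np.array([0.5, -1.0, 1.0, 0.5])      # activation entering that layer

for _ in range(200):
    p = p_not_first(layer_w @ x)
    if p <= 0.5:
        break  # output now passes as the first network's; stop adjusting
    # Gradient of the loss -log(1 - p) w.r.t. layer_w, by the chain rule.
    layer_w -= 0.1 * np.outer(p * clf_w, x)
```

After the loop, `p_not_first(layer_w @ x)` is at most 0.5: the classifier mistakes the second network's layer output for the first network's, which is the stopping condition the claims describe.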
In some embodiments, the output may be a first output, and the instructions may be executable to select a second layer based on determining that the first output from the first layer is from the first neural network, wherein the second layer is also a hidden layer of the second neural network. The instructions may also be executable to identify a second output, wherein the second output is from the second layer, and determine, using the third neural network, whether the second output is from the first neural network. The instructions may then be executable to adjust one or more weights of the second layer based on determining that the second output is not from the first neural network. The first layer and the second layer of the second neural network may be randomly selected.

Additionally, prior to determining, using the third neural network, whether the output from the first layer is from the first neural network, the instructions may be executable to adjust one or more weights of one or more layers of the third neural network such that the third neural network learns to correctly classify the output from the layers of either of the first and second neural networks. The third neural network may even operate in an unsupervised mode.