CN-121260149-B - Training method for a humanoid robot language identification model, and language identification method

Abstract

The invention discloses a training method for a humanoid robot language identification model, and a language identification method. The training method comprises: acquiring a multi-modal training sample, wherein the multi-modal training sample comprises an audio modal sample, a mouth shape modal sample and an environment modal sample; performing multi-layer cross-modal coding on the multi-modal training sample to obtain a plurality of cross-modal coding features, wherein each cross-modal coding feature comprises an acoustic coding feature, a mouth shape coding feature and an environment coding feature; performing cross-modal fusion on all the cross-modal coding features to obtain a first fusion feature; and updating parameters of an initialized humanoid robot language identification model according to the environment modal sample and the first fusion feature to obtain the trained humanoid robot language identification model. The model provided by the training method helps improve the accuracy and robustness of language identification. The invention relates to the technical field of intelligent robots.

Inventors

LI WEICHONG
LI WEISHEN

Assignees

广州里工实业有限公司

Dates

Publication Date: 20260512
Application Date: 20251024

Claims (10)

1. A training method for a humanoid robot language identification model, characterized by comprising the following steps: acquiring a multi-modal training sample, wherein the multi-modal training sample comprises an audio modal sample, a mouth shape modal sample and an environment modal sample; performing multi-layer cross-modal coding on the multi-modal training sample to obtain a plurality of cross-modal coding features, wherein each cross-modal coding feature comprises an acoustic coding feature, a mouth shape coding feature and an environment coding feature, the coding hierarchy of the acoustic coding feature in a first cross-modal feature is different from the coding hierarchy of the acoustic coding feature in a second cross-modal feature, the first cross-modal feature is any one of all the cross-modal coding features, and the second cross-modal feature is any one of all the cross-modal coding features except the first cross-modal feature; performing cross-modal fusion on all the cross-modal coding features to obtain a first fusion feature; and updating parameters of an initialized humanoid robot language identification model according to the environment modal sample and the first fusion feature to obtain a trained humanoid robot language identification model.
2. The method of claim 1, wherein performing multi-layer cross-modal coding on the multi-modal training sample to obtain a plurality of cross-modal coding features comprises: performing multi-layer acoustic cascade coding on the audio modal sample to obtain a plurality of acoustic coding features, wherein each acoustic coding feature has a different feature level; performing mouth shape visual coding on the mouth shape modal sample to obtain the mouth shape coding feature; performing environment visual coding on the environment modal sample to obtain the environment coding feature; and performing feature stitching on each acoustic coding feature with the mouth shape coding feature and the environment coding feature to obtain the plurality of cross-modal coding features.
3. The method according to claim 2, wherein performing multi-layer acoustic cascade coding on the audio modal sample to obtain a plurality of acoustic coding features comprises: acquiring an intermediate feature, wherein the intermediate feature is the audio modal sample or the acoustic coding feature of the previous feature level; and performing acoustic feature extraction on the intermediate feature to obtain the acoustic coding feature of the current feature level.
4. The method according to claim 1, wherein performing cross-modal fusion on all the cross-modal coding features to obtain a first fusion feature comprises: performing cross-attention analysis on all the cross-modal coding features to obtain an attention weight for each cross-modal coding feature; weighting each cross-modal coding feature by its corresponding attention weight to obtain a plurality of cross-modal weighted features; and performing feature fusion on all the cross-modal weighted features to obtain the first fusion feature.
5. The method of claim 1, wherein updating parameters of the initialized humanoid robot language identification model according to the environment modal sample and the first fusion feature to obtain a trained humanoid robot language identification model comprises: performing environment classification on the environment modal sample to obtain an environment prediction label; performing dynamic weight analysis according to the environment prediction label to obtain an acoustic feature weight and a visual feature weight; performing weighted refinement on the first fusion feature according to the acoustic feature weight and the visual feature weight to obtain a target fusion feature; and updating parameters of the initialized humanoid robot language identification model according to the target fusion feature to obtain the trained humanoid robot language identification model.
6. The method of claim 5, wherein performing weighted refinement on the first fusion feature according to the acoustic feature weight and the visual feature weight to obtain a target fusion feature comprises: performing weighted fusion on the first fusion feature according to the acoustic feature weight and the visual feature weight to obtain a second fusion feature; performing temporal attention fusion on the second fusion feature to obtain a third fusion feature; and performing modal attention fusion on the third fusion feature to obtain the target fusion feature.
7. A language identification method, comprising: acquiring multi-modal data collected by a humanoid robot for language identification; and inputting the multi-modal data into a humanoid robot language identification model trained by the method according to any one of claims 1-6 for language identification, to obtain a language identification result.
8. A training system for a humanoid robot language identification model, comprising: a first processing unit, configured to acquire a multi-modal training sample, wherein the multi-modal training sample comprises an audio modal sample, a mouth shape modal sample and an environment modal sample; a second processing unit, configured to perform multi-layer cross-modal coding on the multi-modal training sample to obtain a plurality of cross-modal coding features, wherein each cross-modal coding feature comprises an acoustic coding feature, a mouth shape coding feature and an environment coding feature, the coding hierarchy of the acoustic coding feature in a first cross-modal feature is different from the coding hierarchy of the acoustic coding feature in a second cross-modal feature, the first cross-modal feature is any one of all the cross-modal coding features, and the second cross-modal feature is any one of all the cross-modal coding features except the first cross-modal feature; a third processing unit, configured to perform cross-modal fusion on all the cross-modal coding features to obtain a first fusion feature; and a fourth processing unit, configured to update parameters of an initialized humanoid robot language identification model according to the environment modal sample and the first fusion feature to obtain a trained humanoid robot language identification model.
9. An electronic device, comprising: at least one processor; and at least one memory for storing at least one program; wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-7.
10. A computer-readable storage medium storing a processor-executable program, characterized in that the processor-executable program, when executed by a processor, implements the method according to any one of claims 1-7.
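
The fusion and weighting steps recited in claims 4 and 5 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the attention score (a scaled dot product against the mean feature), the environment labels and weight table `ENV_WEIGHTS`, and the layout in which the first `acoustic_dim` entries of a feature are acoustic and the rest are visual. The temporal and modal attention steps of claim 6 are omitted.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features: List[Vector]) -> Tuple[Vector, Vector]:
    """Weight each cross-modal feature by an attention weight, then sum.

    Assumption: the score of a feature is its scaled dot product with the
    mean of all features; the patent does not specify the scoring function.
    """
    dim = len(features[0])
    query = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    scores = [sum(q * x for q, x in zip(query, f)) / math.sqrt(dim) for f in features]
    weights = softmax(scores)
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


# Hypothetical mapping from an environment prediction label to
# (acoustic feature weight, visual feature weight).
ENV_WEIGHTS: Dict[str, Tuple[float, float]] = {
    "quiet_bright": (0.5, 0.5),
    "noisy": (0.2, 0.8),   # trust the visual cues more when audio is noisy
    "dark": (0.8, 0.2),    # trust the audio more when vision is degraded
}


def refine(fused: Vector, env_label: str, acoustic_dim: int) -> Vector:
    """Scale the acoustic and visual parts of the fused feature separately."""
    w_a, w_v = ENV_WEIGHTS[env_label]
    acoustic_part = [w_a * x for x in fused[:acoustic_dim]]
    visual_part = [w_v * x for x in fused[acoustic_dim:]]
    return acoustic_part + visual_part


# Two toy cross-modal features of equal length (2 acoustic + 2 visual entries).
features = [[1.0, 0.0, 1.0, 1.0], [0.0, 1.0, 1.0, 1.0]]
first_fusion, attn = attention_fuse(features)
target = refine(first_fusion, "noisy", acoustic_dim=2)
```

In this toy input both features score equally, so they receive equal attention weights; the "noisy" label then shrinks the acoustic half of the fused feature relative to the visual half.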

Description

Training method for a humanoid robot language identification model, and language identification method

Technical Field

The invention relates to the technical field of intelligent robots, in particular to a training method for a humanoid robot language identification model and a language identification method.

Background

In recent years, humanoid robots have been used increasingly widely in industrial and household scenes by virtue of their human-like motion and interaction capabilities. In industrial scenes, a humanoid robot is required to complete tasks such as equipment inspection, fault reporting and cooperation with workers from multiple countries, and must accurately recognize speech in multiple languages such as Chinese and English. In household scenes, a humanoid robot is required to provide services such as companionship, education and home control, and must adapt to the languages commonly spoken at home, such as Chinese and English.

Currently, the related art generally relies on a single acoustic feature or a fixed combination of modalities (e.g., fixed acoustic + single visual modality) to train a neural network model, and implements language identification through the trained model. However, a humanoid robot is mobile and operates in complex environments, while the model parameters obtained by training in this way are fixed. The model is therefore easily disturbed by changes of scene, its anti-interference capability in language identification is weak, and the accuracy and robustness of language identification are poor.

Accordingly, the problems of the related art still need to be solved and optimized.

Disclosure of Invention

The present invention aims to solve, at least to some extent, one of the technical problems existing in the related art.
Therefore, an object of the embodiments of the present invention is to provide a training method for a humanoid robot language identification model and a language identification method, wherein the model provided by the training method helps improve the accuracy and robustness of language identification.

To achieve the above technical object, the technical solutions adopted by the embodiments of the present application include the following.

In a first aspect, an embodiment of the present application provides a training method for a humanoid robot language identification model, comprising:
acquiring a multi-modal training sample, wherein the multi-modal training sample comprises an audio modal sample, a mouth shape modal sample and an environment modal sample;
performing multi-layer cross-modal coding on the multi-modal training sample to obtain a plurality of cross-modal coding features, wherein each cross-modal coding feature comprises an acoustic coding feature, a mouth shape coding feature and an environment coding feature, the coding hierarchy of the acoustic coding feature in a first cross-modal feature is different from the coding hierarchy of the acoustic coding feature in a second cross-modal feature, the first cross-modal feature is any one of all the cross-modal coding features, and the second cross-modal feature is any one of all the cross-modal coding features except the first cross-modal feature;
performing cross-modal fusion on all the cross-modal coding features to obtain a first fusion feature; and
updating parameters of an initialized humanoid robot language identification model according to the environment modal sample and the first fusion feature to obtain a trained humanoid robot language identification model.
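
As a rough illustration of the multi-layer cross-modal coding step, the following Python sketch cascades a toy acoustic encoder over several levels and joins each level's output with one mouth shape feature and one environment feature, so that any two resulting cross-modal features differ in the hierarchy of their acoustic part. The function names, the pair-averaging "encoder" and the vector sizes are hypothetical, and feature stitching is read here as plain concatenation, which is an assumption; the text above does not specify the encoder architecture.

```python
from typing import List

Vector = List[float]


def toy_acoustic_layer(x: Vector) -> Vector:
    """Hypothetical stand-in for one acoustic encoding layer.

    Averages adjacent pairs, so each level is a coarser view of its input.
    """
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def cascade_acoustic_encode(audio: Vector, num_levels: int) -> List[Vector]:
    """Level 1 encodes the audio sample; level k encodes the output of level k-1."""
    features: List[Vector] = []
    intermediate = audio
    for _ in range(num_levels):
        intermediate = toy_acoustic_layer(intermediate)
        features.append(intermediate)
    return features


def build_cross_modal_features(
    acoustic_levels: List[Vector], mouth: Vector, env: Vector
) -> List[Vector]:
    """Concatenate each acoustic level with the shared mouth and environment features."""
    return [level + mouth + env for level in acoustic_levels]


# Toy inputs standing in for an audio sample and already-encoded visual features.
audio_sample = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
mouth_feature = [0.5, 0.5]
env_feature = [1.0]

acoustic_levels = cascade_acoustic_encode(audio_sample, num_levels=3)
cross_modal = build_cross_modal_features(acoustic_levels, mouth_feature, env_feature)
```

Each entry of `cross_modal` carries the same mouth shape and environment features but a different acoustic level. In this sketch the entries have different lengths; a real model would typically project them to a common size before fusion.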
In addition, the method according to the above embodiment of the present application may further have the following additional technical features.

Further, in one embodiment of the present application, the method further comprises: performing dynamic-scene-adaptive data enhancement on the multi-modal training sample to obtain an enhanced multi-modal training sample.

Performing dynamic-scene-adaptive data enhancement on the multi-modal training sample to obtain an enhanced multi-modal training sample comprises: adding scene noise to the audio modal sample based on a preset distance attenuation function to obtain a noise-added audio modal sample; randomly occluding the mouth shape modal sample to obtain an enhanced mouth shape modal sample; and performing illumination adjustment and angle rotation on the environment modal sample to obtain an enhanced environment modal sample.

Further, in an embodiment of the present application, performing multi-layer cross-modal coding on the multi-modal training sample to obtain a plurality of cross-modal coding features comprises: performing multi-layer acoustic cascade coding on the audio modal sample to obtain a plurality of acoustic coding features, wherein the feature levels of each acoustic c