CN-121996771-A - Large language model interaction method based on expression information
Abstract
The invention discloses an intelligent robot based on a large language model, relating to the field of large language models. A diversified English corpus is collected, extraneous content is removed from the text, and a data set is built from the cleaned material; a camera module acquires expression information and an audio adapter board acquires voice information, together forming a multimodal training model. The data set is imported into the model for training and testing, improving the model's generalization ability and enabling efficient, smooth voice interaction across a variety of scenes. A main controller combined with encoder motors provides motion capability, adapts to dynamic changes in real scenes, and delivers accurate, smooth human-body tracking, greatly improving the robot's interaction experience.
Inventors
- Xia Dong
- Li Lang
- Chen Zixu
- Tian Zishan
- Gao Xinrui
- He Shenyu
- Guo Dongxin
- Pang Juehui
Assignees
- Civil Aviation University of China (中国民航大学)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2024-11-05
Claims (5)
- 1. An interaction method of a large language model based on expression information, comprising the following steps: step one, collecting facial image data covering various emotional expressions, locating the face region in each image with a face detection algorithm, and cropping and aligning the face region so that face position, size, and angle are consistent across images; step two, normalizing the aligned face images and extracting expression features at facial key points such as the eye corners and mouth corners to generate feature vectors of the expression information; step three, collecting a diversified English corpus, including text data such as dialogues, speeches, and articles, to cover different contexts, accents, and language styles; step four, removing connecting words and stop words with low information content, and applying emotion labels to words and sentences to obtain emotion categories and intensities; step five, introducing timestamps into the text data, pairing the expression information with the language information, and mapping the expression feature vectors and language feature vectors into the same vector space through a multimodal embedding layer to complete the fusion of visual and language information; and step six, in the multimodal model, using a model based on the Transformer architecture that takes the expression information as a pre-input to the voice information, and generating comprehensive feedback by combining the expression and language information.
- 2. The method according to claim 1, wherein, in the expression feature extraction step, the extracted expression feature vector is fed into an emotion analysis model to provide emotion-recognition input for the expression.
- 3. The method according to claim 1, wherein, in the feature fusion step, an emotion recognition management module is built on an emotion interaction layer to generate emotional responses to the expression information.
- 4. The method according to claim 1, wherein, in the emotion recognition and generation step, retrieval-augmented generation (RAG) and LLM techniques are integrated through a collaboration enhancement framework: each part of the text input by the user is queried against a specific domain of a professional knowledge base, and complete emotional feedback is formed by combining the retrieved content with answers generated by the large language model.
- 5. The interaction method of a large language model based on expression information according to claim 4, wherein the collaboration enhancement framework splits the text input by the user into several parts, queries the knowledge base for each part separately, and uses the generation capability of the large language model to integrate the retrieved expertise with the partial generated answers, realizing combined output of emotion and expertise (see the sketch after this list).
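The split-retrieve-merge flow of claims 4 and 5 can be illustrated with a minimal Python sketch. This is an illustration only, not the patent's implementation: `retrieve` and `llm_generate` are hypothetical stand-ins for the knowledge-base retriever and the large language model, and the sentence-level splitting is an assumed splitting rule.

```python
# Minimal sketch of the claim-4/5 collaboration enhancement framework.
# `retrieve` and `llm_generate` are hypothetical stand-ins, not real APIs.

from typing import List


def retrieve(query: str, knowledge_base: dict) -> List[str]:
    """Stand-in retriever: return entries whose key appears in the query."""
    return [text for key, text in knowledge_base.items() if key in query.lower()]


def llm_generate(prompt: str) -> str:
    """Stand-in for the large language model's generation call."""
    return f"[LLM answer conditioned on: {prompt}]"


def collaborative_answer(user_text: str, emotion: str, knowledge_base: dict) -> str:
    # Claim 5: split the user's input into several parts (here, by sentence).
    parts = [p.strip() for p in user_text.split(".") if p.strip()]
    partial_answers = []
    for part in parts:
        # Claim 4: query the domain knowledge base for each part separately.
        expertise = " ".join(retrieve(part, knowledge_base))
        # Combine the retrieved expertise with the LLM's partial answer.
        partial_answers.append(llm_generate(f"{part} | expertise: {expertise}"))
    # Combined output of emotion and expertise (claim 5).
    return llm_generate(f"emotion={emotion}; " + " ".join(partial_answers))


if __name__ == "__main__":
    kb = {"runway": "Runway incursions are reported via ATC procedures."}
    print(collaborative_answer("Tell me about runway safety.", "anxious", kb))
```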
Description
Large language model interaction method based on expression information

Technical Field
The invention relates to the field of large language models, in particular to the technical field of computer vision, and specifically to the secondary development of large language models.

Background
Natural language processing techniques cover the lexical, syntactic, and semantic levels of language, with the goal of enabling a computer to understand, generate, and process human language. Traditional NLP methods rely on statistics, rules, and manually labeled corpora, and their limitations make it difficult for them to handle complex and diverse language expressions. Emotion analysis is one of the important applications of natural language processing; it judges the emotional tendency of a text by analyzing elements such as vocabulary, sentence patterns, and context. While conventional emotion analysis methods rely on emotion dictionaries and simple machine learning algorithms, modern emotion analysis can capture complex emotional information more accurately by means of deep learning and large language models. With the development of multimodal technology, emotion recognition has gradually expanded from single-text analysis to multimodal information processing covering voice, images, and more, and more comprehensive emotional information can be obtained by jointly analyzing a user's expression, voice, and text content.

Disclosure of Invention
The invention provides a new large language model interaction method, aiming to solve the problem that answers generated by current large language models from text input alone may contain feedback that does not match the user's actual situation, and to improve the pertinence and adaptability of the large language model's feedback. To achieve this object, the invention provides a large language model interaction method based on expression information, as follows.

Image data rich in facial expression information and covering various emotional expressions is obtained; a face detection algorithm locates the face region in each image, and the region is cropped and aligned so that face position, size, and angle are consistent across images. After normalization, features are extracted at facial key points such as the eye corners and mouth corners to obtain feature vectors representing the expression information.

A diverse English corpus is collected, including dialogues, lectures, and articles, ensuring that the data covers different contexts, accents, and language styles, and serves as the language data set. Text data is processed using natural language processing techniques: the text is segmented into words, word boundaries are identified, and each word is given a part-of-speech tag according to its context. Common connecting words and stop words with low information content are removed to raise the proportion of affective information in the data set. Words and sentences in the text are then labeled with emotions based on existing emotion dictionaries and emotion classification models to obtain emotion categories and intensities.
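As an illustration of the key-point feature extraction described above, the following minimal sketch turns eye-corner and mouth-corner landmarks into a normalized expression feature vector. It assumes a 68-point landmark layout (the dlib convention) already produced by a face detector; the patent does not name a specific detector or feature set, so the chosen geometric features are illustrative.

```python
# Minimal sketch: expression feature vector from facial key points.
# Assumes 68-point landmarks in the dlib convention, already aligned.

import numpy as np


def expression_features(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (68, 2) array of aligned face key points (x, y)."""
    # Inter-ocular distance as a scale reference for normalization.
    left_eye_outer, right_eye_outer = landmarks[36], landmarks[45]
    scale = np.linalg.norm(right_eye_outer - left_eye_outer) + 1e-8

    # Geometric cues at the corners of the eyes and mouth.
    mouth_left, mouth_right = landmarks[48], landmarks[54]
    lip_top, lip_bottom = landmarks[51], landmarks[57]
    mouth_width = np.linalg.norm(mouth_right - mouth_left) / scale
    mouth_open = np.linalg.norm(lip_bottom - lip_top) / scale
    # Mouth-corner lift relative to the lip center: positive when the
    # corners sit above the lip center in image coordinates (smile-like).
    corner_lift = ((lip_top[1] + lip_bottom[1]) / 2
                   - (mouth_left[1] + mouth_right[1]) / 2) / scale

    return np.array([mouth_width, mouth_open, corner_lift])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_landmarks = rng.uniform(0, 128, size=(68, 2))  # placeholder points
    print(expression_features(fake_landmarks))
```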
Timestamps are introduced into the text data, and the expression information at the corresponding time is paired with it, completing feature fusion in the format of expression information plus language information. The text data is analyzed and converted into vectors through a pre-trained word vector model, and the expression information vectors are combined with them and used as pre-input to the voice information. Visual and linguistic features are mapped into the same vector space through a multimodal embedding layer and incorporated into the database.

Further, a model based on the Transformer architecture, suited to generation tasks, is adopted. To meet the fusion requirement of multimodal information, a model framework combining visual and text input is selected, and the two kinds of information are input with the expression information serving as a pre-input placed before the voice information (a sketch follows this section).

Further, the multimodal resources and the related text fragments undergo association analysis and logical organization; an emotion recognition management module is constructed in the emotion interaction layer, and emotional responses are made to the emotions contained in the expression information.

Further, RAG and LLM techniques are integrated under a collaboration enhancement framework. The user's original input text is split into several parts, and each part is queried against a specific domain of the professional knowledge base. The retrieved expertise is then organically combined with the partial answers generated by the LLM by utilizing the generation capability of the large language model.
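Returning to the multimodal embedding layer and the Transformer pre-input described above, the following PyTorch sketch shows one way the fusion could look: expression and text features are projected into a shared space, and the expression vector is prepended as a single token before the language sequence. This is my own minimal illustration, not the patent's code; the dimensions (EXPR_DIM, TEXT_DIM, D_MODEL) and the encoder configuration are assumed values.

```python
# Minimal sketch of the multimodal embedding layer with expression pre-input.
# Dimensions and encoder settings are assumptions, not from the patent.

import torch
import torch.nn as nn

EXPR_DIM, TEXT_DIM, D_MODEL = 3, 300, 128


class MultimodalFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.expr_proj = nn.Linear(EXPR_DIM, D_MODEL)   # expression -> shared space
        self.text_proj = nn.Linear(TEXT_DIM, D_MODEL)   # word vectors -> shared space
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, expr_vec: torch.Tensor, text_vecs: torch.Tensor) -> torch.Tensor:
        # expr_vec: (batch, EXPR_DIM); text_vecs: (batch, seq_len, TEXT_DIM)
        expr_token = self.expr_proj(expr_vec).unsqueeze(1)   # (batch, 1, D_MODEL)
        text_tokens = self.text_proj(text_vecs)              # (batch, seq, D_MODEL)
        # Expression token first: the "pre-input" before the language sequence.
        fused = torch.cat([expr_token, text_tokens], dim=1)
        return self.encoder(fused)


if __name__ == "__main__":
    model = MultimodalFusion()
    out = model(torch.randn(2, EXPR_DIM), torch.randn(2, 5, TEXT_DIM))
    print(out.shape)  # torch.Size([2, 6, 128])
```

Prepending the expression vector as a token lets every language position attend to the emotional state through self-attention, which is one straightforward reading of "expression information as pre-input" in the claims.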