CN-121981220-A - Server and method for generating multi-mode information


Abstract

The application provides a server and a method for generating multi-modal information. The method comprises: receiving a user request input by a user; encoding the user request into a shared semantic representation with a multi-modal encoder to obtain semantic information; retrieving, from a cultural knowledge graph, cultural knowledge information corresponding to cultural metadata; generating a cultural adaptation vector based on the cultural knowledge information; and generating multi-modal information corresponding to the user request according to the semantic information and the cultural adaptation vector. By receiving the user request, identifying whether it contains cultural metadata, encoding the cultural knowledge information associated with the request into a cultural adaptation vector, and generating the multi-modal information from both the semantic information and the cultural adaptation vector, the method improves the fit between the multi-modal information and the cultural metadata, and addresses the problems that generated multi-modal information lacks a deep understanding of a specific regional cultural background and fits the regional culture poorly.

Inventors

CHEN CHANGXU
XIA WENHAN

Assignees

VIDAA (Netherlands) International Holdings Co., Ltd.

Dates

Publication Date: 20260505
Application Date: 20251208

Claims (10)

1. A server, characterized by comprising: a memory module configured to store program instructions; and a control module running the program instructions, the control module configured to: receive a user request input by a user, and identify whether the user request contains cultural metadata; in the case that the user request contains the cultural metadata, encode the user request into a shared semantic representation using a multi-modal encoder to obtain semantic information corresponding to the user request; retrieve, based on the cultural metadata, cultural knowledge information corresponding to the cultural metadata from a cultural knowledge graph; generate a cultural adaptation vector based on the cultural knowledge information, wherein the cultural adaptation vector is a vectorized representation obtained by extracting features of the cultural knowledge information; and generate multi-modal information corresponding to the user request according to the semantic information and the cultural adaptation vector.
2. The server according to claim 1, wherein the server is provided with a multi-modal large model comprising a shared backbone, an understanding branch and a generating branch; the shared backbone is used for, in the case that the user request contains the cultural metadata, encoding the user request into a shared semantic representation using a multi-modal encoder to obtain the semantic information corresponding to the user request; the understanding branch is used for receiving the semantic information, parsing the cultural metadata, and retrieving, based on the cultural metadata, the cultural knowledge information corresponding to the cultural metadata from the cultural knowledge graph; and the generating branch is used for receiving the semantic information and the cultural adaptation vector and generating the multi-modal information corresponding to the user request according to the semantic information and the cultural adaptation vector.
3. The server according to claim 1, wherein the server is provided with a cultural adaptation module, and the control module, when generating the cultural adaptation vector based on the cultural knowledge information, is specifically configured to: input the cultural knowledge information to the cultural adaptation module; and encode the cultural knowledge information by the cultural adaptation module to generate the cultural adaptation vector.
4. The server according to claim 3, wherein the control module, when encoding the cultural knowledge information by the cultural adaptation module to generate the cultural adaptation vector, is specifically configured to: encode, by the cultural adaptation module, the cultural knowledge information into a cultural adaptation vector of a preset dimension using a graph neural network or a Transformer encoder.
5. The server according to claim 2, wherein the control module, when encoding the user request into a shared semantic representation using a multi-modal encoder to obtain the semantic information corresponding to the user request in the case that the user request contains the cultural metadata, is specifically configured to: encode, by the shared backbone, the user request into the shared semantic representation using the multi-modal encoder, so as to generate the semantic information corresponding to the user request.
6. The server according to claim 2, wherein the control module, when retrieving, based on the cultural metadata, the cultural knowledge information corresponding to the cultural metadata from the cultural knowledge graph, is specifically configured to: parse, by the understanding branch, the cultural metadata contained in the user request; generate a structured query statement according to the cultural metadata; send the structured query statement to the cultural knowledge graph through a retriever in the understanding branch; and execute the structured query statement by the cultural knowledge graph to retrieve the cultural knowledge information corresponding to the cultural metadata.
7. The server according to claim 2, wherein the control module, when generating the multi-modal information corresponding to the user request according to the semantic information and the cultural adaptation vector, is specifically configured to: receive the semantic information and the cultural adaptation vector through the generating branch; and perform a cross-attention calculation between the semantic information and the cultural adaptation vector to obtain multi-modal information corresponding to the cultural metadata.
8. The server according to claim 1, wherein a multi-modal vector database is provided in the server and is used in place of the cultural knowledge graph, and the control module, when retrieving the cultural knowledge information corresponding to the cultural metadata, is specifically configured to: encode the cultural metadata into a query vector; search the multi-modal vector database for the cultural feature vector with the highest similarity to the query vector; and take the cultural feature vector as the cultural knowledge information corresponding to the cultural metadata.
9. The server according to claim 1, wherein the cultural knowledge graph includes an entity type of the cultural knowledge information and a relationship type defining relationships between items of the cultural knowledge information, the entity type including at least one of culture information, country information, social norm information, values information, taboo information, symbol information, slang information, and aesthetic preference information.
10. A method for generating multi-modal information, applied to the server of any one of claims 1-9, characterized in that the method comprises: receiving a user request input by a user, and identifying whether the user request contains cultural metadata; in the case that the user request contains the cultural metadata, encoding the user request into a shared semantic representation using a multi-modal encoder to obtain semantic information corresponding to the user request; retrieving, based on the cultural metadata, cultural knowledge information corresponding to the cultural metadata from a cultural knowledge graph; generating a cultural adaptation vector based on the cultural knowledge information, wherein the cultural adaptation vector is a vectorized representation obtained by extracting features of the cultural knowledge information; and generating multi-modal information corresponding to the user request according to the semantic information and the cultural adaptation vector.
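
As a non-limiting illustration, the similarity search of claim 8 and the cross-attention calculation of claim 7 might be sketched in Python as follows. The function names, the toy three-dimensional vectors, and the database labels are hypothetical and do not come from the application. The sketch also assumes cosine similarity as the similarity measure and omits the learned projection matrices that a trained model would normally apply before attention.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_cultural_vector(query, database):
    """Return the (label, vector) entry most similar to the query vector."""
    return max(database.items(), key=lambda item: cosine(query, item[1]))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(semantic_tokens, cultural_vectors):
    """Scaled dot-product cross attention without learned projections.

    Queries come from the semantic information; keys and values both come
    from the cultural vectors (an assumption made for this sketch).
    """
    dim = len(cultural_vectors[0])
    scale = math.sqrt(dim)
    fused = []
    for q in semantic_tokens:
        scores = [sum(x * y for x, y in zip(q, k)) / scale for k in cultural_vectors]
        weights = softmax(scores)
        fused.append([
            sum(w * v[i] for w, v in zip(weights, cultural_vectors))
            for i in range(dim)
        ])
    return fused


# Hypothetical toy data standing in for a multi-modal vector database.
database = {
    "festival_colors": [0.9, 0.1, 0.0],
    "taboo_symbols": [0.0, 0.8, 0.6],
}
query_vector = [1.0, 0.2, 0.0]  # stand-in for the encoded cultural metadata
label, cultural_vector = retrieve_cultural_vector(query_vector, database)

# Stand-in semantic information: two tokens of dimension 3.
semantic_tokens = [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0]]
fused = cross_attention(semantic_tokens, list(database.values()))
```

In this sketch each fused row is a weighted average of the cultural vectors, with weights determined by how strongly the corresponding semantic token matches each of them. A real generating branch would pass such fused representations on to a decoder rather than use them directly.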

Description

Server and method for generating multi-mode information

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a server and a method for generating multi-modal information.

Background

In the field of artificial intelligence, Artificial Intelligence Generated Content (AIGC) is developing rapidly. The multi-modal large model, as the core of AIGC technology, can generate high-quality text, image, audio and video content, namely multi-modal information, from complex natural language instructions, and provides an efficient, large-scale content production solution for globalized services. However, in real scenarios of globalized content production, different regions differ significantly in cultural background, social norms, values, symbolic meanings and so on, which poses a serious challenge to the cross-cultural adaptability of AIGC systems.

AIGC has currently made significant progress in both generation capability and knowledge retrieval. In terms of generation capability, taking the multi-modal large model as an example, the boundary of content creation can be expanded through unified processing and generation of multi-modal content. In terms of knowledge retrieval, retrieving related information from an external knowledge base provides a factual basis for model generation, which effectively mitigates the hallucination or factual errors of large language models on tasks requiring professional knowledge. Despite these breakthroughs in generation capability and knowledge retrieval, the prior art still has the following drawbacks in cross-cultural content creation.
The knowledge bases used by existing retrieval-augmented generation (RAG) technology, such as general knowledge graphs, mainly store objective and universal factual knowledge and lack content that defines and relates abstract cultural concepts. As a result, cultural knowledge is systematically missing and fragmented when the knowledge base is constructed, and no systematic knowledge structure for cultural reasoning can be formed.

The generation process of existing multi-modal generation models lacks an effective mechanism for deeply and dynamically integrating external cultural background knowledge into generation decisions. Conventional RAG techniques typically treat the retrieved knowledge as a one-time, static context prompt that is simply concatenated with the user instruction and then input into the model.

Therefore, in the field of artificial intelligence generation, currently generated multi-modal information lacks a deep understanding of a specific regional cultural background and fits the regional culture poorly.

Disclosure of Invention

The application provides a server and a method for generating multi-modal information, which are used to solve the problems that, in the field of artificial intelligence generation, currently generated multi-modal information lacks a deep understanding of a specific regional cultural background and fits the regional culture poorly.

In a first aspect, some embodiments of the present application provide a server, including:
a memory module configured to store program instructions; and a control module running the program instructions, the control module configured to: receive a user request input by a user, and identify whether the user request contains cultural metadata; in the case that the user request contains the cultural metadata, encode the user request into a shared semantic representation using a multi-modal encoder to obtain semantic information corresponding to the user request; retrieve, based on the cultural metadata, cultural knowledge information corresponding to the cultural metadata from a cultural knowledge graph; generate a cultural adaptation vector based on the cultural knowledge information, wherein the cultural adaptation vector is a vectorized representation obtained by extracting features of the cultural knowledge information; and generate multi-modal information corresponding to the user request according to the semantic information and the cultural adaptation vector.

The beneficial effect of this technical solution is as follows: by receiving the user request input by the user, identifying whether the user request contains cultural metadata, encoding the cultural knowledge information associated with the user request into the cultural adaptation vector, and generating the multi-modal information related to the user request by combining the semantic information corresponding to the user request with the cultural adaptation vector, the fit between the multi-modal information and the cultural metadata is improved, which addresses the problems that the multi-modal information lacks a deep understanding of a specific regional cultural background and has low fit