CN-121997935-A - Multi-modal information visual content generation method based on semantic understanding
Abstract
The invention relates to the technical field of semantic processing and discloses a method and system for generating multi-modal information visual content based on semantic understanding. The method comprises the steps of: obtaining natural language instructions, structured data and context information; performing semantic parsing, data annotation and context encoding on them respectively; generating a unified representation through a multi-modal semantic fusion encoder; constructing a visual semantic decision tree; providing an interactive adjustment interface through which the user dynamically adjusts semantic node parameters; updating the semantic representation in response to adjustment operations; and driving a rendering engine to generate the final visual content. The system comprises a multi-modal input acquisition module, a semantic parsing module, a fusion encoding module, a decision tree construction module, an interaction adjustment module, a content generation module and the like. The system achieves interpretability and real-time controllability at the semantic level, and significantly improves the consistency between the visual result and the user's intention.
Inventors
- CHEN XINGYAN
- BAI LEI
Assignees
- Henan University (河南大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-27
Claims (10)
- 1. A multi-modal information visual content generation method based on semantic understanding, characterized by comprising the following steps: acquiring a user's natural language input instruction, a structured dataset to be visualized, and contextual environment information; performing semantic parsing on the natural language input instruction, extracting theme keywords, intention categories, focuses of attention and modifying/limiting words from it, and forming an initial semantic element set; performing field identification and semantic annotation on the structured dataset, establishing a mapping relation among field names, data types, numerical distribution characteristics and semantic roles, and generating a data semantic description vector; performing context-aware encoding on the contextual environment information, wherein the contextual environment information comprises user history interaction records, a current task scene identifier, device display capability parameters and a time or geographic position context, and generating a contextual semantic embedding vector; inputting the initial semantic element set, the data semantic description vector and the contextual semantic embedding vector into a multi-modal semantic fusion encoder, and performing feature alignment and semantic coupling through a cross-modal attention mechanism to generate a unified multi-modal semantic fusion representation; constructing a visual semantic decision tree based on the multi-modal semantic fusion representation; providing the user with an interactive adjustment interface for the visual semantic decision tree, wherein the interface graphically presents each semantic node and its current value, and allows the user to directly modify or weight the parameter value of any semantic node; responding to the user's adjustment operation, updating the corresponding semantic components in the multi-modal semantic fusion representation, and re-executing the subsequent visual generation flow; and, based on the updated multi-modal semantic fusion representation, invoking a visual rendering engine to generate final visual content and output it to the user's terminal device for display.
- 2. The method for generating multi-modal information visual content based on semantic understanding according to claim 1, wherein the visual semantic decision tree comprises a plurality of levels of semantic nodes, each corresponding to a visual design decision, including chart type selection, coordinate axis mapping, color scheme configuration, label density control and interactive animation strategy.
- 3. The semantic-understanding-based multi-modal information visual content generation method according to claim 2, wherein the semantic parsing of the natural language input instruction includes: performing word segmentation on the input instruction with a pre-trained language model, and identifying noun phrases as theme keywords, verb phrases as intention categories, and adjective or adverb modifiers as focus-of-attention intensity indicators; and determining the logical relationships between the semantic elements through dependency syntax analysis, forming a structured set of semantic element triples.
- 4. The method for generating multi-modal information visual content based on semantic understanding according to claim 3, wherein performing field identification and semantic annotation on the structured dataset comprises: traversing all fields of the dataset, and identifying geographic information fields, time fields, categorical fields or numerical fields by matching field-name character patterns against a predefined semantic dictionary; calculating the mean, variance, skewness and kurtosis of each numerical field, and counting its number of unique values and its distribution entropy; and encoding the statistical features and the field types together into a data semantic description vector of fixed dimension.
- 5. The semantic-understanding-based multi-modal information visual content generation method according to claim 4, wherein context-aware encoding of the contextual environment information comprises: aggregating the user history interaction records into a behavior statistics vector, which comprises the frequency proportions of the visual content types generated in the past thirty days, preference indices for common chart configuration parameters, and the average adjustment amplitude on specific semantic dimensions; mapping the current task scene identifier into a one-hot encoded vector; quantizing the device display capability parameters into a numerical vector, including screen resolution, color depth and a touch-support flag bit; normalizing the time or geographic position context and converting it into a fixed-length character string through a geographic hash (geohash) algorithm or time-period encoding; and, after normalization, mapping all the context sub-items to the unified dimension of the contextual semantic embedding vector through a fully connected layer.
- 6. The semantic-understanding-based multi-modal information visual content generation method according to claim 5, wherein the multi-modal semantic fusion encoder adopts a three-layer stacked cross-modal Transformer structure, each layer comprising a self-attention sub-layer and a cross-modal cross-attention sub-layer; in the self-attention sub-layer, associative modeling is performed on the semantic units within each modality; in the cross-modal cross-attention sub-layer, the natural language semantic units are used as query vectors to perform key-value retrieval over the data semantic description vector and the contextual semantic embedding vector respectively; and the final output fusion representation is a weighted aggregation of all modal semantic units.
- 7. The semantic-understanding-based multi-modal information visual content generation method according to claim 6, wherein constructing a visual semantic decision tree based on the multi-modal semantic fusion representation comprises: determining a chart-type candidate set at the root node according to the intention category in the multi-modal semantic fusion representation; generating child nodes for each candidate chart type according to the field types and distribution characteristics in the data semantic description vector, the child nodes corresponding to coordinate axis mapping schemes; generating leaf nodes for the color scheme, label density and animation strategy according to the focus-of-attention intensity indicator and the device capability parameters in the contextual semantic embedding vector; and associating each node with an adjustable semantic weight parameter whose initial value is determined by the corresponding component in the fusion representation.
- 8. The semantic-understanding-based multi-modal information visual content generation method according to claim 7, wherein providing the user with an interactive adjustment interface for the visual semantic decision tree comprises: displaying the visual semantic decision tree as a tree diagram, wherein each node is represented by a rectangular box in which the current parameter value and an adjustment slider are displayed; the user activates an adjustment panel by clicking a node, and the panel provides a discrete option list or a continuous numerical slider; and after a user modification, the system immediately computes the new semantic components and triggers a local re-render, updating only the affected visual elements.
- 9. The semantic-understanding-based multi-modal information visual content generation method according to claim 8, wherein updating the corresponding semantic components in the multi-modal semantic fusion representation in response to a user's adjustment operation comprises: mapping the user-adjusted parameter value back to the original semantic space to generate a corrected semantic vector; performing weighted fusion of the corrected vector with the vector at the corresponding position in the original fusion representation, wherein the weight is determined by the adjustment amplitude and the historical adjustment confidence; and feeding the fused representation as new input to the parameter configuration module of the visual rendering engine.
- 10. The semantic-understanding-based multi-modal information visual content generation method according to claim 9, wherein invoking the visual rendering engine to generate final visual content based on the updated multi-modal semantic fusion representation comprises: sequentially executing chart-type instantiation, data binding, visual channel mapping, layout optimization and interaction event registration; generating visual content in a scalable vector graphics (SVG) or bitmap format conforming to Web standards; and adjusting the output size and interaction mode through a device adaptation layer to match the characteristics of the terminal device.
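Claim 3's mapping from part-of-speech tags to semantic roles can be illustrated with a minimal sketch. The function name, role labels and POS tag set below are illustrative assumptions, not part of the patent; tagging itself is assumed to have been done upstream by any POS tagger.

```python
def parse_instruction(tagged_tokens: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Toy version of claim 3's role assignment: noun phrases become theme
    keywords, verbs become intention categories, and adjectives/adverbs
    become focus-of-attention intensity indicators."""
    roles = {"theme": [], "intent": [], "focus": []}
    bucket = {"NOUN": "theme", "VERB": "intent", "ADJ": "focus", "ADV": "focus"}
    for word, pos in tagged_tokens:
        if pos in bucket:
            roles[bucket[pos]].append(word)
    return roles
```

For example, `parse_instruction([("show", "VERB"), ("monthly", "ADJ"), ("sales", "NOUN")])` assigns "sales" as a theme keyword, "show" as the intention, and "monthly" as a focus indicator; a real implementation would additionally use dependency parsing to link these into triples.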
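Claim 4's statistical annotation of a numerical field can be sketched as follows; the function name and the ordering of components in the descriptor are assumptions for illustration.

```python
import math
from collections import Counter

def field_semantic_vector(values: list[float]) -> list[float]:
    """Encode one numerical field's statistics into a fixed-dimension
    descriptor per claim 4: mean, variance, skewness, kurtosis,
    unique-value count, and distribution entropy."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    # Population skewness/kurtosis; guard against constant fields (std == 0).
    skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3) if std else 0.0
    kurt = sum((v - mean) ** 4 for v in values) / (n * var ** 2) if var else 0.0
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return [mean, var, skew, kurt, float(len(counts)), entropy]
```

The claim additionally concatenates the field-type label (geographic, time, categorical, numerical) onto this vector to obtain the full data semantic description vector.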
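Claim 5's assembly of the contextual embedding (behavior statistics, one-hot scene identifier, device capability numbers) can be sketched as below. Max-normalization and plain concatenation stand in for the fully connected projection layer, and all names are illustrative.

```python
def one_hot(index: int, size: int) -> list[float]:
    """One-hot encode the current task scene identifier (claim 5)."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def context_embedding(behavior: list[float], scene_id: int, n_scenes: int,
                      device: list[float]) -> list[float]:
    """Concatenate normalized behavior statistics, the one-hot scene id,
    and normalized device-capability numbers into one context vector."""
    def norm(v: list[float]) -> list[float]:
        m = max(abs(x) for x in v) or 1.0
        return [x / m for x in v]
    return norm(behavior) + one_hot(scene_id, n_scenes) + norm(device)
```

In the patent the concatenated sub-items are further projected to a unified dimension by a fully connected layer; here concatenation alone conveys the structure.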
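The cross-modal cross-attention sub-layer of claim 6, where language-side semantic units query another modality's key/value vectors, amounts to standard scaled dot-product attention. This sketch omits the learned query/key/value projections and the stacked layers.

```python
import numpy as np

def cross_attention(queries: np.ndarray, keys: np.ndarray,
                    values: np.ndarray) -> np.ndarray:
    """Scaled dot-product cross-attention (claim 6 sketch): each language
    query attends over the other modality's keys and aggregates its values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values
```

With one query aligned to the first of two keys, the output is pulled toward the first value vector, which is the "key-value retrieval" behavior the claim describes.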
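Claim 7's tree of weighted, adjustable design decisions can be modeled as a small node structure. The intent-to-chart mapping, node values and initial weights below are invented placeholders; in the patent the initial weights come from components of the fusion representation.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticNode:
    name: str      # decision level, e.g. "chart_type", "axis_mapping"
    value: str     # current decision at this node
    weight: float  # adjustable semantic weight parameter (claim 7)
    children: list["SemanticNode"] = field(default_factory=list)

def build_decision_tree(intent: str) -> SemanticNode:
    """Toy construction of claim 7's tree: the root fixes a chart type from
    the intention category; children fix axis mapping and style leaves."""
    chart = {"compare": "bar", "trend": "line"}.get(intent, "table")
    root = SemanticNode("chart_type", chart, 1.0)
    axes = SemanticNode("axis_mapping", "x:category,y:value", 0.8)
    axes.children = [SemanticNode("color_scheme", "categorical", 0.5),
                     SemanticNode("label_density", "medium", 0.5)]
    root.children = [axes]
    return root
```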
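Claim 9's weighted fusion of a user-corrected semantic vector into the original representation can be sketched as a convex blend. The particular rule combining adjustment magnitude and confidence into one blend weight is an assumption; the patent only states that both factors determine the weight.

```python
def update_semantic_component(original: list[float], corrected: list[float],
                              magnitude: float, confidence: float) -> list[float]:
    """Blend a user-corrected semantic vector into the original fusion
    representation (claim 9 sketch). The blend weight grows with the
    adjustment magnitude and the historical adjustment confidence,
    clamped to [0, 1]."""
    alpha = max(0.0, min(1.0, magnitude * confidence))
    return [(1 - alpha) * o + alpha * c for o, c in zip(original, corrected)]
```

A half-magnitude, fully confident adjustment thus moves each component halfway toward the corrected value, after which the updated representation is fed back into the rendering engine's parameter configuration.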
Description
Multi-modal information visual content generation method based on semantic understanding
Technical Field
The invention belongs to the technical field of semantic processing, and particularly relates to a multi-modal information visual content generation method based on semantic understanding.
Background
With the wide application of multi-modal data in intelligent interaction, digital content creation, human-machine collaborative analysis and other scenarios, information visualization technology based on semantic understanding has become a key bridge connecting complex data with user cognition. This technology aims to fuse text, image, audio and other modal information and visually present its internal semantic structure through graphical means, thereby improving the user's understanding efficiency and decision-making capability for high-dimensional data. In this process, the semantic understanding module is responsible for parsing the deep meaning of the original input, while the visualization engine generates a visual expression conforming to cognitive logic according to the parsing result; together they form the multi-modal information visualization infrastructure. The multi-modal information visual content generation method based on semantic understanding focuses on mapping natural language instructions or contexts into executable visual element configurations, its essence being the establishment of a precise mapping from abstract semantics to specific visual attributes (such as layout, color, shape and dynamic effects). Ideally, the method should be able to dynamically adjust semantic weights according to user intent, so that the generated visual content is faithful to the essence of the data while also meeting the user's personalized cognitive preferences or task objectives.
The prior art generally adopts a pre-trained end-to-end model to realize semantic-to-visual conversion, in which the semantic parsing and visual generation processes are highly coupled and the parameters are frozen. Although such methods can achieve good results on standard test sets, they lack a real-time intervention mechanism for semantic parameters in actual interactive scenarios: during visual generation, the user cannot dynamically adjust the importance of keywords, the association strength between concepts, or the emphasis proportions of modal fusion through visual operations (such as gesture sliding, dragging or voice instructions). As a result, once the generated result deviates from the user's expectations, the complete instruction must be re-entered or the underlying code adjusted, which greatly limits the interactive flexibility and usability of the system. Especially in applications that emphasize instant feedback and iterative optimization, such as exploratory data analysis, creative design assistance or educational demonstration, a static semantic mapping mechanism can hardly support fine-grained semantic refinement, forming a key breakpoint in the understanding-expression-feedback closed loop.
Disclosure of Invention
The invention provides a multi-modal information visual content generation method based on semantic understanding, and aims to solve the technical problem that the user cannot make real-time fine adjustments to semantic parameters during visual generation, so that deviation exists between the generated content and the actual requirements.
The method constructs a semantically adjustable multi-modal fusion representation space, uniformly encodes natural language instructions, structured data and context, and introduces an interactive adjustment mechanism with controllable semantic granularity, so that the user can precisely intervene on key semantic dimensions at different stages of visual generation and produce visual content that closely fits the user's intention. The invention provides a multi-modal information visual content generation method based on semantic understanding, which comprises the following steps: acquiring a user's natural language input instruction, a structured dataset to be visualized, and contextual environment information; performing semantic parsing on the natural language input instruction, extracting theme keywords, intention categories, focuses of attention and modifying/limiting words from it, and forming an initial semantic element set; performing field identification and semantic annotation on the structured dataset, establishing a mapping relation among field names, data types, numerical distribution characteristics and semantic roles, and generating a data semantic description vector; performing context-aware encoding on the contextual environment information, wherein the contextual environment information comprises user history interaction records, a current task scene