CN-122021550-A - Image-text semantic proofreading and generating typesetting method and system based on improved diffusion model

CN 122021550 A

Abstract

The application relates to the technical fields of artificial intelligence and computer graphics, in particular to an image-text semantic proofreading and generative typesetting method and system based on an improved diffusion model. The multi-modal image-text data are first encoded; joint generation is then carried out by a dual-stream semantic-layout coupling diffusion network, in which semantics and layout are dynamically aligned through cross-attention and energy functions during reverse denoising. An innovative reconstruction checking mechanism is introduced in the middle stage of denoising: semantic conflicts and text errors are detected by computing the KL divergence, a large language model is called for correction, and finally the layout is adaptively adjusted and an optimized document is output. The application realizes checking during generation, and solves the prior-art problems of content being divorced from form, poor semantic consistency, and the lack of automatic proofreading capability.

Inventors

  • TIAN HONGJUAN
  • GUO ZINING
  • LI WENZE
  • PENG JIAXI
  • CHEN GANG

Assignees

  • Henan Normal University (河南师范大学)

Dates

Publication Date
2026-05-12
Application Date
2026-02-25

Claims (10)

  1. An image-text semantic proofreading and generative typesetting method based on an improved diffusion model, characterized by comprising the following steps: acquiring multi-modal image-text data to be processed, wherein the image-text data comprise a text sequence, image materials and preset layout constraint conditions; inputting the text sequence and the image materials into a multi-granularity semantic alignment encoder to obtain a text semantic vector space and a visual perception vector space, and constructing an initial layout structure based on the layout constraint conditions; inputting the text semantic vector space, the visual perception vector space and the initial layout structure into a dual-stream semantic-layout coupling diffusion network for forward diffusion and noise addition to obtain a noisy latent-space state; executing a reverse denoising generation process, computing a layout prediction field and a semantic consistency field through a cross-attention fusion mechanism at each denoising time step, and jointly updating the coordinates of the layout elements and the image-text features; in an intermediate state of reverse denoising, starting a reverse reconstruction checking mechanism, computing the divergence between the text generation probability distribution and the original text sequence, and identifying semantic conflict regions and text errors; adaptively correcting the text sequence based on the divergence, and dynamically adjusting the layout according to the corrected text length and semantic weight; and, when the denoising process is finished, restoring by a decoder to obtain the target document after semantic proofreading and generative typesetting.
  2. The method of claim 1, wherein the dual-stream semantic-layout coupling diffusion network comprises a content generation stream and a layout generation stream in parallel; the content generation stream is used for generating or enhancing image details to match the text semantics, and the layout generation stream is used for generating the coordinate parameters of text boxes and image boxes; the joint loss function L_total of the dual-stream semantic-layout coupling diffusion network is expressed as: L_total = L_diff + λ₁·L_layout + λ₂·L_sem + λ₃·L_align; wherein L_diff is the standard diffusion denoising loss, used for optimizing the accuracy of noise prediction; L_layout is the regression loss of the layout bounding boxes, used for optimizing the accuracy of the generated layout coordinates; L_sem is the text semantic reconstruction loss, used for ensuring the semantic fidelity of the generated content; L_align is the image-text layout alignment loss, used for enhancing the semantic relevance of text and images in the spatial layout; and λ₁ to λ₃ are adaptive weight coefficients used to balance the contributions of the different loss terms to model training.
  3. The method of claim 2, wherein the image-text alignment loss L_align is constructed on the basis of an energy function, expressed as: L_align = Σ_k D(c(b_k), A(v_k, t_k)) / τ; wherein k is the index of a layout element; b_k is the bounding-box coordinate vector of the k-th layout element, specifically b_k = (x_k, y_k, w_k, h_k); v_k and t_k are the visual feature vector and the text feature vector of the k-th element, respectively; A(·) is an attention mapping function which takes the visual and text features as inputs and computes the "semantic barycentric coordinates" in the image space on which the text semantics should focus most; D(·) is the Euclidean distance function, used for computing the spatial distance between the centre c(b_k) of the generated bounding box and the semantic barycentric coordinates; and τ is a temperature coefficient used to adjust the smoothness of the energy function and control the gradient magnitude of the loss; this penalty term aims to penalize layout schemes that place text boxes in image regions unrelated to the text semantics.
  4. The method of claim 1, wherein the reverse reconstruction checking mechanism specifically comprises: at a denoising time step t belonging to a preset interval [t₁, t₂], using the currently generated content-stream features and layout-stream features as conditions, computing the conditional generation probability P_model(w_i) of each token w_i of the original text sequence; computing the Kullback-Leibler divergence between the empirical distribution P_data of the original text and the model-predicted distribution P_model: D_KL(P_data ∥ P_model) = Σ_i P_data(w_i) · log(P_data(w_i) / P_model(w_i)); and, if the KL divergence value of a token exceeds an anomaly threshold dynamically computed from the global divergence distribution, generating one or more corrected text sequence candidates by using a pre-trained large language model (LLM) as a proofreader, with the potential semantic error markers, their context text and the related visual context information of the image as inputs.
  5. The method of claim 1, wherein the multi-granularity semantic alignment encoder comprises a dual-tower structure; the text encoding tower adopts a BERT model based on the Transformer architecture, and hierarchical semantic features at word level, sentence level and paragraph level are extracted from its bottom, middle and top layers respectively through its multi-layer self-attention mechanism; the visual encoding tower adopts a ViT model, divides the input image into a plurality of patches, and extracts the feature of each patch together with a [CLS] token feature representing the global image information, so as to obtain multi-scale visual features; and the text encoding tower and the visual encoding tower are pre-trained end to end with a contrastive learning loss function such as InfoNCE, whose goal is to pull semantically related image-text feature pairs closer and push semantically unrelated image-text feature pairs apart in the latent embedding space, thereby realizing semantic alignment across modalities.
  6. The method of claim 1, wherein the sampling step in the reverse denoising generation process adopts gradient-guided Langevin-dynamics sampling, and the single-step update formula is: z_{t−1} = (1/√α_t) · (z_t − ((1 − α_t)/√(1 − ᾱ_t)) · ε_θ(z_t, t)) + s · ∇_{z_t} log p(y | z_t) + σ_t · ξ; wherein z_t is the noisy latent-space state at time step t; ε_θ(z_t, t) is the noise term predicted by the dual-stream coupled diffusion network; α_t and ᾱ_t are parameters of a preset noise scheduling scheme; s is the guidance scale coefficient controlling the strength of the semantic guidance; ∇_{z_t} log p(y | z_t) is the gradient of the log-likelihood of semantic matching with respect to the current state, which points in the direction that makes the image and text semantics better matched, thereby actively guiding the layout and the content to evolve towards higher semantic consistency during generation; σ_t is the noise standard deviation of that time step; and ξ is a standard Gaussian noise vector used to maintain the randomness of the sampling.
  7. An image-text semantic proofreading and generative typesetting system based on an improved diffusion model, characterized by comprising: a data acquisition module for receiving multi-modal image-text data, uploaded by a user or retrieved from a database, together with the related layout constraint conditions; a semantic coding module, the core of which is the multi-granularity semantic alignment encoder, responsible for converting the original image-text data into high-dimensional, semantically aligned feature vectors; a diffusion generation module comprising the dual-stream semantic-layout coupling diffusion network, responsible for executing the forward noising and reverse denoising processes to generate a preliminary typesetting layout and optimized image features; an intelligent correction module, the core of which is the reverse reconstruction checking mechanism, responsible for performing semantic consistency detection on the text content in the middle stage of reverse denoising and calling a large language model for correction; and a rendering output module for generating and exporting the target document through a graphics rendering engine according to the final layout coordinates output by the diffusion generation module and the text content corrected by the intelligent correction module.
  8. The system of claim 7, wherein the diffusion generation module has embedded in it a spatial transformation network which, in each denoising step, receives the text-length change information fed back by the intelligent correction module and computes an affine transformation matrix based on it, for adaptively adjusting the width and height of the corresponding text box while fine-tuning the layout of the surrounding elements, so as to maintain the stability and visual harmony of the overall layout topology.
  9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.
  10. A computer-readable storage medium having instructions stored thereon which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 6.
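The weighted joint loss described in claim 2 can be sketched as follows. This is a minimal illustration only: the function names, the MSE form of the denoising loss, and the default weights are assumptions for exposition, not taken from the patent.

```python
import numpy as np

def diffusion_loss(eps_true, eps_pred):
    # Standard denoising loss: mean squared error between the true
    # forward-process noise and the network's predicted noise.
    return float(np.mean((np.asarray(eps_true) - np.asarray(eps_pred)) ** 2))

def joint_loss(l_diff, l_layout, l_sem, l_align, lambdas=(1.0, 1.0, 1.0)):
    # Weighted sum of the four terms of claim 2; lambda_1..lambda_3
    # balance the layout-regression, text-reconstruction and
    # image-text alignment terms against the denoising loss.
    lam1, lam2, lam3 = lambdas
    return l_diff + lam1 * l_layout + lam2 * l_sem + lam3 * l_align
```

In the patent the weights are adaptive; a training loop would update them per step rather than keep them fixed as here.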
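The energy-based alignment penalty of claim 3 amounts to measuring how far each generated box centre falls from the attention-derived "semantic barycentre". The sketch below assumes the simple distance-over-temperature form; the exact functional form in the patent is not reproduced here and the barycentres are taken as precomputed inputs.

```python
import numpy as np

def align_loss(boxes, barycentres, tau=1.0):
    # boxes: one (x, y, w, h) row per layout element.
    # barycentres: one (cx, cy) row per element, i.e. the output of the
    # attention mapping A(v_k, t_k) in claim 3 (assumed precomputed).
    boxes = np.asarray(boxes, dtype=float)
    barycentres = np.asarray(barycentres, dtype=float)
    centres = boxes[:, :2] + boxes[:, 2:] / 2.0   # centre of each box
    d = np.linalg.norm(centres - barycentres, axis=1)  # Euclidean distance
    return float(np.sum(d / tau))                 # temperature-scaled energy
```

A box whose centre coincides with its semantic barycentre contributes zero, so minimizing this term pulls text boxes toward the image regions their text refers to.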
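The per-token KL check of claim 4 can be illustrated as below. The mean-plus-k-standard-deviations threshold is one plausible reading of "dynamically computed from the global divergence distribution"; the factor k is an assumption.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i), with eps for stability.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def flag_suspect_tokens(per_token_kl, k=1.5):
    # Dynamic anomaly threshold over the global divergence distribution:
    # mean + k * std (the form and k are illustrative assumptions).
    kl = np.asarray(per_token_kl, dtype=float)
    threshold = kl.mean() + k * kl.std()
    return [i for i, v in enumerate(kl) if v > threshold]
```

Flagged token indices would then be handed, with their context, to the LLM proofreader the claim describes.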
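The guided single-step update of claim 6 follows the shape of standard ancestral sampling with an added gradient-guidance term. The coefficients below are the usual DDPM ones and are an assumption; the patent's exact scheduling parameters are not given.

```python
import numpy as np

def guided_denoise_step(z_t, eps_pred, grad_logp, alpha_t, alpha_bar_t,
                        sigma_t, s, rng=None):
    # One reverse step: denoise z_t with the predicted noise eps_pred,
    # nudge the mean along the semantic-matching gradient (scaled by s),
    # then re-inject sigma_t-scaled Gaussian noise for sample diversity.
    if rng is None:
        rng = np.random.default_rng(0)
    mean = (z_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps_pred)
    mean = mean / np.sqrt(alpha_t)
    mean = mean + s * grad_logp          # semantic guidance term
    return mean + sigma_t * rng.standard_normal(z_t.shape)
```

Setting s = 0 recovers plain unguided sampling, which makes the role of the guidance scale easy to ablate.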
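Claim 8's affine adjustment of a text box after a length-changing correction can be sketched as a scaling about the box's top-left corner. The choice to scale width in proportion to character count while keeping the height fixed is an illustrative assumption; the patent leaves the transform's exact parameterization open.

```python
import numpy as np

def resize_text_box(box, old_len, new_len):
    # box: (x, y, w, h). Build a 2x3 affine matrix that scales x about
    # the box's left edge by new_len/old_len, then map both corners.
    x, y, w, h = box
    scale = new_len / old_len
    affine = np.array([[scale, 0.0, x * (1.0 - scale)],
                       [0.0,   1.0, 0.0]])
    corners = np.array([[x, y, 1.0], [x + w, y + h, 1.0]])
    p0, p1 = (affine @ corners.T).T
    return (p0[0], p0[1], p1[0] - p0[0], p1[1] - p0[1])
```

A fuller system would also re-solve the positions of neighbouring elements, as the claim's mention of fine-tuning the surrounding layout implies.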

Description

Image-text semantic proofreading and generative typesetting method and system based on an improved diffusion model

Technical Field

The invention relates to the technical fields of artificial intelligence and computer graphics, in particular to an image-text semantic proofreading and generative typesetting method and system based on an improved diffusion model.

Background

With the rapid development of the digital information age, image-text documents have become a mainstream carrier in fields such as information transmission, content marketing, news publishing, and popular-science education. Completing the typesetting design of such complex documents efficiently and with high quality requires not only visual beauty and harmony, but also a layout that accurately reflects and reinforces the inherent semantic logic of the image-text content. However, the current mainstream typesetting production and automation technologies, in the face of increasingly personalized and highly time-sensitive content demands, expose a number of bottlenecks and challenges. Traditional layout design is highly dependent on professional designers operating desktop publishing (DTP) software manually. This process is not only time-consuming, laborious and costly, but is also severely limited by the designer's personal experience and aesthetic level, making large-scale, batch production of style-consistent documents difficult to achieve. To improve efficiency, template-based automation schemes have evolved. Although such methods can quickly apply preset formats, the results they produce are uniform and monotonous, lack deep understanding of and flexible adaptation to the semantics of specific content, and cannot meet the demands of creative expression and personalized customization. With the progress of artificial-intelligence technology, especially the development of deep generative models, new ideas have been brought to automatic typesetting.
Researchers have attempted to use models such as generative adversarial networks (GAN) and variational autoencoders (VAE) to learn the layout rules of numerous excellent design samples, so as to automatically generate new layout schemes. However, these existing deep-learning-based generative typesetting techniques still have several core problems that have not been fundamentally solved. Most prior methods simplify layout generation into a pure geometric-space optimization problem: the model attends only to visual properties such as the positions, sizes and alignment relations of elements like text boxes and image boxes, and completely ignores the deep semantics of the text and image content those elements carry. This results in generated layouts that are often separated in form and spirit, for example presenting serious academic report content in a relaxed, lively cartoon style, or placing the key image subject in a secondary visual area, severely compromising the effectiveness and professionalism of information transfer. They also lack the ability to perceive and correct content errors: existing automated typesetting systems assume by default that the input image-text material is accurate and consistent. In practical applications, however, the original material often contains misspellings, grammatical errors, ambiguities, and even direct contradictions between the text and image semantics (such as a blazing sun in the description matched with an ice-and-snow scene). The prior art lacks an inherent mechanism for actively detecting, identifying and correcting such content-level errors during typesetting, and may therefore generate invalid documents with an attractive layout but spurious content, i.e. "garbage in, garbage out".
The limitations and controllability of generative models pose further challenges: the widely used GAN models suffer from unstable training, mode collapse and similar problems, so the diversity of what they generate is limited and the results are difficult to control. Although diffusion models show remarkable advantages in image generation quality and training stability, when they are directly applied to the joint generation task over multi-modal heterogeneous data (continuous image features and discrete layout coordinates), a series of technical problems remain, such as how to effectively fuse semantic and geometric information, how to precisely constrain element boundaries, and how to realize controllable generation under complex conditions. In view of the foregoing, the current technical field needs a new generation of intelligent typesetting solution that integrates deep semantic understanding, intelligent content proofreading and high-quality creative generation into one whole. Such a scheme must fundamentally break the barrier between content and form, so that an automatic typesetting system can not only design a layout, but also understand and correct the content, thereby