CN-121981127-A - Interpretable text-semantics-driven time-series generation method based on a Diffusion Transformer model
Abstract
The invention discloses an interpretable, text-semantics-driven time-series generation method based on a Diffusion Transformer model, in the technical field of artificial intelligence and time-series generation. The method converts natural-language conditions into control signals for the time-series generation process; decomposes the time-series features into three parts, trend, seasonal, and residual; captures long-term temporal dependencies and periodic patterns with a time-series Transformer; obtains a time series conforming to the text semantics through the iterative DDPM sampling process while outputting an interpretable trend/seasonal/residual decomposition; fuses the text-condition encoding and the timestep embedding into a joint condition vector; and lets the text semantics dynamically guide the generation process by modulating the intermediate features of the denoising network through adaptive normalization, producing an interpretable time series that follows the textual instructions. The method improves the stability and diversity of the generation process as well as the interpretability and analyzability of the generated results.
Inventors
- PENG YAXIN
- DUAN YUTING
- PENG YAN
- WEI HONGYU
- KONG HAO
- ZHOU YANG
- ZHENG JIANYONG
Assignees
- Shanghai University (上海大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-04-07
Claims (5)
- 1. An interpretable text-semantics-driven time-series generation method based on a Diffusion Transformer model, characterized by comprising the following steps: S1, text/time-series alignment: converting an input natural-language text condition into a control signal for the time-series generation process, achieving global modulation of the feature channels by the text semantics through channel gating, and achieving local dynamic alignment between the text description and specific time steps of the series through a cross-modal attention mechanism; S2, interpretable time-series decomposition: arranging a trend modeling module and a seasonal modeling module in parallel inside the denoising network, and extracting the trend and seasonal components of the generated sequence at every denoising step to realize an interpretable decomposition; S3, text-conditional diffusion training and interpretable output: using DDPM as the framework, having the model directly predict the clean sample and computing the noise from that prediction, obtaining a time series conforming to the text semantics through the iterative DDPM sampling process, and simultaneously outputting the interpretable trend, seasonal, and residual decomposition; S4, modulated diffusion: fusing the text-condition encoding and the timestep embedding into a joint condition vector, and letting the text semantics dynamically guide the generation process by modulating the intermediate features of the denoising network through adaptive normalization, so as to generate an interpretable time series that follows the instructions.
- 2. The method for generating an interpretable text-semantics-driven time series based on a Diffusion Transformer model as recited in claim 1, wherein the text/time-series alignment comprises the following steps: the text vector $c$ is first passed through a linear projection $W_p$ to obtain a representation matching the feature-channel dimension, and is then multiplied element-wise by a learnable channel-gating vector $\gamma_g$ to generate the final gating signal $g = (W_p c) \odot \gamma_g$; the time-series features serve as the Query and the text embedding as the Key/Value, and the attention weights determine which descriptions in the text relate to which moments of the time series.
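The alignment step of claim 2 can be sketched in plain NumPy. This is a minimal illustration, not the patented implementation: the function names (`channel_gate`, `cross_attention`), the sigmoid that bounds the gate, and the single-head, unprojected attention are all simplifying assumptions.

```python
import numpy as np

def channel_gate(text_vec, W_proj, gamma):
    """Project the text vector to the feature-channel dimension, then
    multiply by a learnable per-channel gate vector.  The sigmoid that
    bounds the gate in (0, 1) is an assumption, not stated in the claim."""
    projected = text_vec @ W_proj                  # (d_text,) @ (d_text, C) -> (C,)
    return 1.0 / (1.0 + np.exp(-projected * gamma))

def cross_attention(ts_feat, text_emb):
    """Single-head cross-modal attention: time-series features are the
    Query, text token embeddings are the Key/Value, so the attention
    weights align text tokens with specific time steps."""
    d = ts_feat.shape[-1]
    scores = ts_feat @ text_emb.T / np.sqrt(d)     # (T, N_tokens)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over text tokens
    return weights @ text_emb                      # (T, C) aligned text context
```

The gate broadcasts over time steps (global modulation), while the attention output differs per time step (local alignment), matching the two mechanisms the claim distinguishes.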
- 3. The method for generating an interpretable text-semantics-driven time series based on a Diffusion Transformer model as set forth in claim 1, wherein the interpretable time-series decomposition comprises the following steps: first, the U-Net backbone of the diffusion model is replaced with a time-series Transformer, so that the denoising network is a stack of pure Transformer blocks; a three-branch parallel decomposition structure is designed, and the time series $x$ is approximately decomposed as $x \approx T + S + R$, where $T$ is the trend, $S$ is the seasonal component, and $R$ is the residual, and the model is forced to actively separate and reconstruct these components from noise. The trend $T$ is modeled by a polynomial regressor: the channel-wise mean output by the $l$-th layer at the $t$-th timestep is passed through a linear mapping and multiplied element-wise with the polynomial basis $[\tau^0, \tau^1, \dots, \tau^p]$ of the normalized time vector $\tau$, where $\tau$ is the normalized time-axis coordinate and $p$ is a prescribed polynomial order, so as to construct a smooth low-frequency trend component $T$. The seasonal component $S$ is obtained by applying a discrete Fourier transform to the output of the $l$-th layer at the $t$-th timestep, dynamically selecting the $K$ frequency components of largest amplitude together with their conjugate pairs, and reconstructing the periodic waveform $S$, which explicitly models complex seasonal patterns; the residual $R$ is left to self-attention processing. Combining the trend $T$, the seasonal component $S$, and the residual $R$ recovers the original signal, with the formula $\hat{x}_0 = f_\theta(x_t, t) = T + S + R$, where $x_t$ denotes the sequence obtained from the original sequence $x_0$ after $t$ forward noising steps and $\theta$ denotes the learnable parameters of the denoising network. Component representations are extracted in parallel from multiple layers of blocks so that structured dynamics are actively reconstructed from noise, and the joint time-frequency-domain loss is computed over the joint expectation as $\mathcal{L} = \mathbb{E}_{t, x_0}\big[\, w_t \big( \lVert x_0 - \hat{x}_0 \rVert^2 + \lambda \lVert \mathcal{F}(x_0) - \mathcal{F}(\hat{x}_0) \rVert^2 \big) \big]$, where $\lambda$ is a weight balancing the time-domain and frequency-domain reconstruction errors, $w_t$ is the timestep weighting coefficient, and $\mathcal{F}$ denotes the fast Fourier transform.
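The three-branch decomposition of claim 3 can be illustrated with a small NumPy sketch. Note the assumptions: the polynomial trend here is fitted by least squares rather than predicted by the claim's learned linear mapping of channel means, and the residual is simply the remainder rather than a self-attention output; the top-$K$ FFT selection and the time-frequency loss follow the claim, with the timestep weight $w_t$ dropped for a single sample.

```python
import numpy as np

def decompose(x, poly_order=2, k=2):
    """Split a 1-D series into trend + seasonal + residual.
    Trend: least-squares fit on a normalized polynomial basis (a stand-in
    for the claim's learned linear mapping of channel means).
    Seasonal: keep the k largest-amplitude FFT bins and invert.
    Residual: the remainder (the claim routes it through self-attention)."""
    n = len(x)
    tau = np.linspace(0.0, 1.0, n)                       # normalized time axis
    basis = np.stack([tau**p for p in range(poly_order + 1)], axis=1)
    coeffs, *_ = np.linalg.lstsq(basis, x, rcond=None)
    trend = basis @ coeffs

    spec = np.fft.rfft(x - trend)
    keep = np.argsort(np.abs(spec))[::-1][:k]            # top-k frequency bins
    masked = np.zeros_like(spec)
    masked[keep] = spec[keep]
    seasonal = np.fft.irfft(masked, n=n)

    residual = x - trend - seasonal
    return trend, seasonal, residual

def tf_loss(x0, x0_hat, lam=0.5):
    """Joint time- and frequency-domain reconstruction loss of claim 3,
    with lam playing the role of the balancing weight."""
    time_err = np.mean((x0 - x0_hat) ** 2)
    freq_err = np.mean(np.abs(np.fft.rfft(x0) - np.fft.rfft(x0_hat)) ** 2)
    return time_err + lam * freq_err
```

By construction the three components sum back to the input exactly, which is the interpretability property the claim relies on: each generated sample ships with its own decomposition.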
- 4. The method for generating an interpretable text-semantics-driven time series based on a Diffusion Transformer model as recited in claim 3, wherein the text-conditional diffusion training and interpretable output comprise the following steps: at the $t$-th step of the diffusion process, the original time series $x_0$ is forward-noised to obtain the noisy sequence $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, where $\epsilon$ is standard Gaussian noise and $\sqrt{\bar{\alpha}_t}$, $\sqrt{1 - \bar{\alpha}_t}$ are derived from the signal-to-noise parameter $\bar{\alpha}_t$ accumulated over $t$ steps; the noisy sequence $x_t$, the timestep $t$, and the text-condition vector $c$ are input into the time-series Transformer backbone, and the model predicts the original signal as the sum of the three decomposition components, $\hat{x}_0 = T + S + R$, where $T$, $S$, and $R$ are the outputs of the trend, seasonal, and residual modeling processes, respectively; from the prediction $\hat{x}_0$, the corresponding noise is computed as $\hat{\epsilon} = (x_t - \sqrt{\bar{\alpha}_t}\, \hat{x}_0)/\sqrt{1 - \bar{\alpha}_t}$, and the noise error is $\mathcal{L}_\epsilon = \lVert \epsilon - \hat{\epsilon} \rVert^2$. In the sampling phase, $x_{t-1}$ is generated by iterative DDPM denoising: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \hat{\epsilon} \big) + \sigma_t z$, $z \sim \mathcal{N}(0, I)$. Classifier-free guidance is introduced, and the guiding noise $\tilde{\epsilon}$ is computed as $\tilde{\epsilon} = (1 + w)\, \epsilon_\theta(x_t, t, c) - w\, \epsilon_\theta(x_t, t, \varnothing)$, where $w$ is the guidance-strength coefficient and $\epsilon_\theta(x_t, t, c)$ and $\epsilon_\theta(x_t, t, \varnothing)$ are the noises predicted by the model at timestep $t$ with and without the text condition, respectively. Finally, a high-quality time series conforming to the text semantics is obtained through the iterative DDPM sampling process, while the interpretable trend, seasonal, and residual decomposition is output.
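The forward-noising, noise-recovery, and classifier-free-guidance formulas of claim 4 can be checked with a few NumPy helpers. This is a sketch under assumptions: the function names are illustrative, and the denoising model itself (the Transformer backbone producing $\hat{x}_0$) is omitted.

```python
import numpy as np

def forward_noise(x0, alpha_bar_t, rng):
    """q(x_t | x_0): scale the clean sample and add Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def noise_from_x0(x_t, x0_hat, alpha_bar_t):
    """Invert the forward formula: the noise implied by an x0-predicting
    model, as used in claim 4 to compute the noise error."""
    return (x_t - np.sqrt(alpha_bar_t) * x0_hat) / np.sqrt(1.0 - alpha_bar_t)

def cfg_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance with strength w: extrapolate from the
    unconditional toward the text-conditional noise prediction."""
    return (1.0 + w) * eps_cond - w * eps_uncond
```

Note the round-trip property the loss depends on: if the model predicts $\hat{x}_0 = x_0$ exactly, `noise_from_x0` returns the very noise that was injected, so the noise error vanishes.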
- 5. The method for generating an interpretable text-semantics-driven time series based on a Diffusion Transformer model according to claim 1, wherein the modulated diffusion comprises the following steps: through an adaptive layer-normalization (AdaLN) mechanism, the joint condition vector $c_{\text{joint}}$ is projected into a per-feature-channel scaling parameter $\gamma$ and shifting parameter $\beta$, i.e. $(\gamma, \beta) = \mathrm{MLP}(c_{\text{joint}})$, and the normalized result of each Transformer layer is conditioned according to the formula $\mathrm{AdaLN}(h) = \gamma \odot \frac{h - \mu}{\sigma} + \beta$, where $\mu$ and $\sigma$ are the mean and standard deviation of $h$ over the feature dimension, so that the feature distribution at every denoising step is jointly modulated by the text and the timestep.
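The AdaLN modulation of claim 5 reduces to a few lines. This sketch assumes the condition vector is projected by plain linear maps (`W_gamma`, `W_beta`, hypothetical names) rather than the MLP the claim describes:

```python
import numpy as np

def ada_layer_norm(h, cond, W_gamma, W_beta, eps=1e-5):
    """AdaLN: normalize h over the feature (last) dimension, then scale
    and shift with parameters projected from the joint condition vector.
    Plain linear projections stand in for the claim's MLP."""
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    gamma = cond @ W_gamma                    # (C,) per-channel scale
    beta = cond @ W_beta                      # (C,) per-channel shift
    return gamma * (h - mu) / (sigma + eps) + beta
```

Because `cond` fuses the text encoding and the timestep embedding, the same mechanism delivers both conditioning signals to every Transformer layer at once, which is the point of step S4.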
Description
Interpretable text-semantics-driven time-series generation method based on a Diffusion Transformer model
Technical Field
The invention relates to the technical field of artificial intelligence and time-series generation methods, and in particular to an interpretable text-semantics-driven time-series generation method based on a Diffusion Transformer model.
Background
Existing time-series data generation methods fall into three types: methods based on variational autoencoders (VAE), methods based on generative adversarial networks (GAN), and methods based on diffusion models. VAE-based methods typically use learned approximate inference to generate synthetic samples efficiently, where the inference problem uses the values or probability distributions of certain variables to predict others. GAN-based methods implicitly model complex distributions through adversarial training of a generator and a discriminator; time-series GANs can be roughly divided into discrete variants (for discrete-time-point data such as event sequences and transaction records) and continuous variants (for continuous time series such as sensor signals and meteorological data). Diffusion-model-based methods reconstruct high-quality samples from random noise through a gradual noising and reverse denoising process. Existing diffusion-based text-to-time-series methods can likewise be divided into label-conditioned methods, text-conditioned methods, and cross-domain text-guided generation, as researchers attempt to introduce high-level semantic information into the time-series generation process.
VAE-based methods tend to produce over-smoothed results when generating time series with complex long-range dependencies or strong noise disturbances. GAN-based methods suffer from unstable training and are prone to mode collapse, which leaves the generated sequences lacking diversity. Diffusion-based methods have problems of low generation efficiency, high inference cost, and difficulty of direct semantic-level control; their interpretable modeling of real dynamics in complex scenarios is limited, and they fall short of the strict real-time and controllability requirements of practical applications. Diffusion-based text-to-time-series methods also have shortcomings: label-conditioned methods lack deep semantic understanding, text-conditioned methods are often applicable only to specific domains, and existing cross-domain text-guided generation methods lack an interpretable generation process and sufficient generation accuracy.
Disclosure of Invention
The invention aims to solve the technical problem of providing an interpretable text-driven time-series generation method that can strengthen long-term dependency modeling, improve the stability and diversity of the generation process, achieve accurate cross-domain control of text-semantics-driven time-series generation, and improve the interpretability and analyzability of the generated results.
In order to solve the above technical problems, the technical scheme adopted by the invention is an interpretable text-semantics-driven time-series generation method based on a Diffusion Transformer model, comprising the following steps: S1, text/time-series alignment: converting an input natural-language text condition into a control signal for the time-series generation process, achieving global modulation of the feature channels by the text semantics through channel gating, and achieving local dynamic alignment between the text description and specific time steps of the series through a cross-modal attention mechanism; S2, interpretable time-series decomposition: arranging a trend modeling module and a seasonal modeling module in parallel inside the denoising network, and extracting the trend and seasonal components of the generated sequence at every denoising step to realize an interpretable decomposition; S3, text-conditional diffusion training and interpretable output: using DDPM as the framework, having the model directly predict the clean sample and computing the noise from that prediction, obtaining a time series conforming to the text semantics through the iterative DDPM sampling process, and simultaneously outputting the interpretable trend, seasonal, and residual decomposition; S4, modulated diffusion: fusing the text-condition encoding and the timestep embedding into a joint condition vector, and letting the text semantics dynamically guide the generation process by modulating the intermediate features of the denoising network through adaptive normalization, so as to generate an interpretable time series that follows the instructions.