KR-20260066727-A - Data item generation using a diffusion neural network


Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating data items using a diffusion neural network or other generative neural network.

Inventors

내시 찰리 토마스 커티스
딜레만 샌더 에티엔 레아
가닌 이아로슬라브
더칸 코너 마이클
딩 펭닝
반 덴 오드 애론 제라드 안토니우스

Assignees

지디엠 홀딩 엘엘씨

Dates

Publication Date: 20260512
Application Date: 20240821
Priority Date: 20230821

Claims (14)

A method performed by one or more computers, the method comprising: receiving a first input that identifies a plurality of context data items for generating a first output data item; obtaining an individual embedding of each context data item of the plurality of context data items; generating a combined embedding by combining the individual embeddings of the plurality of context data items; and processing an input including the combined embedding using a diffusion neural network to generate the first output data item.
A method according to claim 1, further comprising: receiving a second input that identifies a specific context data item for generating a second output data item; obtaining an embedding of the specific context data item; and processing an input including the embedding of the specific context data item using the diffusion neural network to generate the second output data item.
A method according to claim 2, wherein obtaining the embedding of the specific context data item comprises: processing the specific context data item using an embedding neural network to generate the embedding of the specific context data item.
A method according to any one of claims 1 to 3, wherein combining the individual embeddings of the plurality of context data items to generate the combined embedding comprises: averaging the individual embeddings.
A method according to any one of claims 1 to 3, wherein combining the individual embeddings of the plurality of context data items to generate the combined embedding comprises: receiving user input that specifies an individual weight for each context data item of the plurality of context data items; and calculating a weighted sum of the individual embeddings of the context data items according to the individual weights for the corresponding context data items.
A method according to any one of claims 1 to 5, wherein obtaining an individual embedding of each context data item of the plurality of context data items comprises: processing each context data item of the plurality of context data items using an embedding neural network to generate the individual embeddings.
A method according to any one of claims 1 to 6, wherein the diffusion neural network has been trained by performing operations comprising: maintaining an individual embedding for each training context data item of a plurality of training context data items; maintaining data that clusters the individual embeddings for the plurality of training context data items into a plurality of clusters; obtaining an input specifying a target data item from the plurality of training context data items; selecting a context embedding for training the diffusion neural network, wherein selecting the context embedding comprises (i) selecting the embedding of the target context data item or (ii) selecting the center of the cluster to which the embedding of the target context data item belongs; and training the diffusion neural network using the selected context embedding and the target data item.
A method according to claim 7, wherein selecting (i) the embedding of the target context data item or (ii) the center of the cluster to which the embedding of the target context data item belongs comprises: (i) selecting the embedding of the target context data item with probability p; and (ii) selecting the center of the cluster to which the embedding of the target context data item belongs with probability 1 - p.
A method according to any one of claims 1 to 8, wherein processing an input including the combined embedding using the diffusion neural network to generate the first output data item comprises: updating a current data item at each of a plurality of update iterations by processing, at each update iteration, a first diffusion input for the update iteration that includes the current data item and the combined embedding using the diffusion neural network to generate a first denoising output for the update iteration.
A method according to claim 9, wherein the first denoising output defines a prediction of the residual error between (a) an analytic estimate of the noise component of the current data item given the first diffusion input and (b) the noise component of the current data item.
A method according to claim 9 or 10, wherein updating the current data item at each of the plurality of update iterations further comprises: generating, at each update iteration, an additional denoising output for the update iteration by processing an additional diffusion input for the update iteration, which includes the current data item and does not include the combined embedding, using the diffusion neural network; generating a denoising output by combining the first denoising output for the update iteration and the additional denoising output according to guidance weights for the update iteration; and updating the current data item using the denoising output.
A method according to any one of claims 1 to 11, wherein the data item or target data item comprises image, video, or audio data.
A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 12.
One or more non-transitory computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 12.
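
The two ways of combining context embeddings recited in the claims above (averaging, and a weighted sum with user-supplied weights) can be illustrated with a minimal sketch. The function name `combine_embeddings`, the use of plain Python lists as embeddings, and the error handling are assumptions made for illustration only; the claims do not prescribe any of them.

```python
from typing import List, Optional, Sequence


def combine_embeddings(
    embeddings: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Combine per-item context embeddings into a single vector.

    With weights=None the result is the plain average of the embeddings.
    Otherwise it is the weighted sum using the caller-supplied weights
    (hypothetical stand-in for the user-specified weights in the claims).
    """
    if not embeddings:
        raise ValueError("at least one context embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    if weights is None:
        # Averaging is a weighted sum with equal weights 1/n.
        weights = [1.0 / len(embeddings)] * len(embeddings)
    if len(weights) != len(embeddings):
        raise ValueError("need exactly one weight per embedding")
    return [
        sum(w * e[i] for w, e in zip(weights, embeddings))
        for i in range(dim)
    ]


if __name__ == "__main__":
    items = [[1.0, 2.0], [3.0, 4.0]]
    print(combine_embeddings(items))                # average
    print(combine_embeddings(items, [0.25, 0.75]))  # weighted sum
```

In this sketch a single context data item passes through unchanged, which is one way the same conditioning input could serve both the multi-item and single-item cases described in claims 1 and 2.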

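The training-time choice between a target item's own embedding and the center of its cluster (claims 7 and 8) might look like the following sketch. It assumes that cluster assignments are already available as integer ids and that the "center" of a cluster is the mean of its member embeddings; both are assumptions, as are the names `cluster_centers` and `select_context_embedding`.

```python
import random
from typing import Dict, List, Sequence


def cluster_centers(
    embeddings: Sequence[Sequence[float]],
    cluster_ids: Sequence[int],
) -> Dict[int, List[float]]:
    """Mean embedding per cluster (assumed definition of a cluster center)."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for emb, cid in zip(embeddings, cluster_ids):
        acc = sums.setdefault(cid, [0.0] * len(emb))
        for i, value in enumerate(emb):
            acc[i] += value
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: [s / counts[cid] for s in acc] for cid, acc in sums.items()}


def select_context_embedding(
    target_index: int,
    embeddings: Sequence[Sequence[float]],
    cluster_ids: Sequence[int],
    centers: Dict[int, List[float]],
    p: float,
    rng: random.Random,
) -> List[float]:
    """Return the target's own embedding with probability p, otherwise
    (with probability 1 - p) the center of the cluster it belongs to."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if rng.random() < p:
        return list(embeddings[target_index])
    return list(centers[cluster_ids[target_index]])


if __name__ == "__main__":
    embs = [[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]]
    ids = [0, 0, 1]
    centers = cluster_centers(embs, ids)
    rng = random.Random(0)
    # Each call returns either embs[1] or the center of cluster 0.
    print(select_context_embedding(1, embs, ids, centers, 0.5, rng))
```

The selected vector would then be paired with the target data item as one training example for the diffusion neural network; the training step itself is outside the scope of this sketch.
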
Description

Data item generation using a diffusion neural network
Cross-reference to related applications
This application claims priority to U.S. Provisional Application No. 63/533,906, filed on August 21, 2023, the disclosure of which is incorporated herein by reference in its entirety.
This specification relates to generating an output conditioned on a conditioning input using a neural network.
A neural network is a machine learning model that uses one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to the output layer. The output of each hidden layer is used as an input to one or more other layers within the network, namely one or more other hidden layers, the output layer, or both. Each layer of the network generates an output from the received input based on the current values of a respective set of parameters.
This specification describes a system, implemented as computer programs on one or more computers in one or more locations, that generates an output data item conditioned on a conditioning input. Generally, the conditioning input characterizes one or more desired attributes of the data item, that is, one or more attributes that the final data item generated by the system must possess. More specifically, the system can generate output data items using a diffusion neural network.
Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages.
Compared to conventional techniques that use a diffusion neural network to generate data items, the described system can increase the quality of the output data items generated by the diffusion neural network by modifying one or more of the following: (i) the training of the diffusion neural network, (ii) the input to the diffusion neural network, or (iii) the way the diffusion neural network is used to generate output data items after training.
As one example, at each update iteration the system can process a diffusion input for the update iteration that includes a noisy data item using the diffusion neural network to generate a denoising output that, unlike in other techniques, defines an estimate of the residual error between an analytic estimate of the noise component of the noisy data item and the actual noise component of the noisy data item. Other techniques generally use a diffusion neural network to generate a different type of denoising output, for example, a denoising output that is an estimate of the noise component or a denoising output that is an estimate of the ground-truth data item. Generating a denoising output that defines an estimate of the residual error between the analytic estimate of the noise component of a noisy data item and the actual noise component of the noisy data item can increase the quality of the output data item, particularly when high guidance weights for classifier-free guidance are used as part of the generation process. Using high guidance weights can result in generated data items that align more strongly with the provided conditioning input, which is advantageous, but it can also reduce the overall quality of the generated data item. For example, when the data item is an image, using high guidance weights can produce an over-saturated image.
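One way to picture how a residual-style denoising output and classifier-free guidance could fit together is the sketch below. It rests on several assumptions that this passage does not state: the noisy item is modeled as `z_t = alpha * x + sigma * eps`; the analytic estimate is the closed-form posterior mean of the noise under a standard-normal data prior; the residual is taken as actual noise minus analytic estimate; and guidance uses the common linear rule `uncond + w * (cond - uncond)`. All function names are hypothetical.

```python
from typing import List, Sequence


def analytic_noise_estimate(
    z_t: Sequence[float], alpha: float, sigma: float
) -> List[float]:
    """E[eps | z_t], ASSUMING x ~ N(0, I), eps ~ N(0, I), and
    z_t = alpha * x + sigma * eps. Illustrative choice only."""
    scale = sigma / (alpha * alpha + sigma * sigma)
    return [scale * z for z in z_t]


def apply_guidance(
    cond: Sequence[float], uncond: Sequence[float], weight: float
) -> List[float]:
    """Combine conditional and unconditional denoising outputs.
    weight = 0 gives the unconditional output, weight = 1 the conditional
    one, and weight > 1 extrapolates toward the conditioning input."""
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


def noise_from_residual(
    z_t: Sequence[float],
    residual: Sequence[float],
    alpha: float,
    sigma: float,
) -> List[float]:
    """Recover a noise estimate from a predicted residual, assuming
    residual = actual noise - analytic estimate."""
    base = analytic_noise_estimate(z_t, alpha, sigma)
    return [b + r for b, r in zip(base, residual)]


if __name__ == "__main__":
    z_t = [1.0, -2.0]
    alpha, sigma = 0.6, 0.8
    # Stand-ins for two network outputs: with and without the embedding.
    residual_cond = [0.1, 0.2]
    residual_uncond = [0.0, -0.1]
    guided = apply_guidance(residual_cond, residual_uncond, weight=3.0)
    print(noise_from_residual(z_t, guided, alpha, sigma))
```

The resulting noise estimate would feed whatever sampler update rule is in use; that rule, the noise schedule, and the actual form of the analytic estimate are not specified here and are left out of the sketch.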
By using the denoising output described above, the system can generate data items that align with the conditioning input while maintaining overall quality, for example, by reducing the saturation of a generated image to a realistic-looking level.
As another example, the diffusion neural network can be configured to receive a context embedding that can represent either multiple different context data items or a single context data item. Context data items are typically of the same type as the output data items and are used to guide the generation process. By allowing the user to condition the generation process on a variable number of context data items, the system can use the diffusion neural network to generate output data items that accurately reflect the context provided by the user, for example, by matching specific attributes of a single data item or by reflecting attributes aggregated across multiple data items.
As another example, the system can use bound conditioning. With bound conditioning, the system conditions the diffusion neural network on a lower bound or an upper bound of an input scalar value that represents the value of a specific attribute of the generated data item. That is, rather than requiring the diffusion neural network to generate an output data item having the exact value of the specific attribute represented by the input scalar value, the system can give the diffusion neural network the flexibility to generate any appropriate data item having a value of the specific attribute that is appropriatel