KR-102962337-B1 - Apparatus and method for AI-based single-cell transcriptome generation using bulk transcriptome and metadata
Abstract
The present invention relates to an apparatus and method for generating single-cell transcriptomes from a bulk transcriptome. A conditional latent representation reflecting biological context is extracted from an input bulk transcriptome, and, using it as a guide, a denoiser derives an optimal latent vector from random Gaussian noise. Finally, cellular heterogeneity within a tissue is reconstructed by restoring the latent vector into a plurality of single-cell transcriptomes through a pre-trained decoder.
Inventors
- 이춘경
- 주재일
Assignees
- 바이오리버트 주식회사
Dates
- Publication Date: 20260508
- Application Date: 20260123
Claims (15)
- A single-cell transcriptome generation device that generates single-cell transcriptomes from a bulk transcriptome, comprising: one or more processors; and a memory coupled to the one or more processors, wherein the memory stores instructions that, when executed by the one or more processors, cause the one or more processors to perform: a step of receiving a bulk transcriptome (402) containing expression information for a plurality of genes and metadata (401) derived from the bulk transcriptome; a step of extracting a conditional latent representation (450) containing biological context information from the bulk transcriptome using a pre-trained single-cell bulk interpreter (200); a step of generating random Gaussian noise (411) and inputting it into a pre-trained denoiser (300); a step of generating, using the denoiser, a denoised latent vector (430) from the random Gaussian noise by using the conditional latent representation as condition information; and a step of generating, using a pre-trained decoder (140), a plurality of single-cell transcriptomes (455) in which the cellular heterogeneity of the bulk transcriptome is reconstructed from the generated latent vector.
- The single-cell transcriptome generation device of claim 1, wherein the single-cell bulk interpreter is trained by receiving, as input, a pseudo-bulk transcriptome (202) generated by mean pooling of training single-cell transcriptomes (110), and by performing (i) a self-supervised learning process that minimizes the reconstruction error between the input pseudo-bulk transcriptome and the pseudo-bulk transcriptome (271) restored through the data restoration unit (270) based on feature information extracted from the pseudo-bulk transcriptome; and (ii) a classification learning process that determines the tissue type of the data through the tissue classifier (290) from the extracted feature information, whereby the parameters of the single-cell bulk interpreter are determined so as to generate the conditional latent representation reflecting the biological context information from bulk transcriptome data input to the single-cell bulk interpreter.
- The single-cell transcriptome generation device of claim 1, wherein the denoiser is trained using, as input data, a noise-mixed latent vector (310) obtained by adding noise (301) to a latent vector (130) into which a training single-cell transcriptome (110) is encoded by a pre-trained encoder (120), the training comprising (i) a process of receiving the noise-mixed latent vector and the conditional latent representation derived from the single-cell bulk interpreter; and (ii) a learning process in which the denoiser uses the conditional latent representation as guide information to predict the noise component included in the noise-mixed latent vector and minimizes the error between the predicted noise (351) and the added noise (301), whereby the parameters of the denoiser are determined so as to derive, from input random Gaussian noise, the latent vector (430) corresponding to the biological characteristics of the bulk transcriptome.
- The single-cell transcriptome generation device of claim 3, wherein the denoiser includes a multi-head cross-attention unit and updates the latent vector by using the noise-mixed latent vector as a query and the conditional latent representation extracted by the single-cell bulk interpreter as a key and a value.
- The single-cell transcriptome generation device of claim 1, wherein the decoder (140) is trained as a component of a variational autoencoder (VAE)-based single-cell compressor (100) that takes a training single-cell transcriptome (110) as input, the training comprising (i) a process of minimizing the reconstruction error between the training single-cell transcriptome (110) and the single-cell transcriptome (155) restored by the decoder (140) from the latent vector (130) compressed by the encoder (120); (ii) a process of performing principal component analysis (PCA) on the training single-cell transcriptome (110) and the restored single-cell transcriptome (155) and minimizing the perceptual loss between the extracted feature values; and (iii) a process of improving the data fidelity of the restored single-cell transcriptome (155) through adversarial learning using a discriminator (160), whereby the parameters of the decoder are determined so as to generate the plurality of single-cell transcriptomes (455) from the latent vector generated by the denoiser.
- The single-cell transcriptome generation device of claim 5, wherein the perceptual loss is calculated based on the Euclidean distance or cosine similarity between the projected coordinates obtained when the training single-cell transcriptome and the restored single-cell transcriptome are projected into the space of the top k principal components (where k is an integer greater than or equal to 2).
- The single-cell transcriptome generation device of claim 1, wherein the conditional latent representation (450) is a feature vector generated by the interaction, through a multi-head attention mechanism, of a gene expression token embedding the gene expression pattern of the bulk transcriptome and a meta token embedding tissue information included in the metadata.
- The single-cell transcriptome generation device of claim 7, wherein the single-cell bulk interpreter includes a learnable latent array having a preset size and parameters that are updated during the learning process, and the learnable latent array serves as a query to extract biological features from the gene expression token and the meta token.
- The single-cell transcriptome generation device of claim 1, wherein the plurality of single-cell transcriptomes (455) generated in the step of generating the plurality of single-cell transcriptomes form an expression matrix at the individual-cell level restored from n (n > 1) different latent vectors preset for one item of input bulk transcriptome data, and statistically reproduce the expression distribution of a plurality of different cell clusters inherent in the bulk transcriptome.
- A single-cell transcriptome generation method performed by a single-cell transcriptome generation device, the method comprising: a step in which the single-cell transcriptome generation device receives a bulk transcriptome (402) containing expression information for a plurality of genes and metadata (401) derived from the bulk transcriptome; a step in which the single-cell transcriptome generation device extracts a conditional latent representation (450) containing biological context information from the bulk transcriptome using a pre-trained single-cell bulk interpreter (200); a step in which the single-cell transcriptome generation device generates random Gaussian noise (411) and inputs it into a pre-trained denoiser (300); a step in which the single-cell transcriptome generation device generates, using the denoiser, a denoised latent vector (430) from the random Gaussian noise by using the conditional latent representation as condition information; and a step in which the single-cell transcriptome generation device generates, using a pre-trained decoder (140), a plurality of single-cell transcriptomes (455) in which the cellular heterogeneity of the bulk transcriptome is reconstructed from the generated latent vector.
- The single-cell transcriptome generation method of claim 10, wherein the single-cell bulk interpreter is trained by receiving, as input, a pseudo-bulk transcriptome (202) generated by mean pooling of training single-cell transcriptomes (110), and by performing (i) a self-supervised learning process that minimizes the reconstruction error between the input pseudo-bulk transcriptome and the pseudo-bulk transcriptome (271) restored through the data restoration unit (270) based on feature information extracted from the pseudo-bulk transcriptome; and (ii) a classification learning process that determines the tissue type of the data through the tissue classifier (290) from the extracted feature information, whereby the parameters of the single-cell bulk interpreter are determined so as to generate the conditional latent representation reflecting the biological context information from bulk transcriptome data input to the single-cell bulk interpreter.
- The single-cell transcriptome generation method of claim 10, wherein the denoiser is trained using, as input data, a noise-mixed latent vector (310) obtained by adding noise (301) to a latent vector (130) into which a training single-cell transcriptome (110) is encoded by a pre-trained encoder (120), the training comprising (i) a process of receiving the noise-mixed latent vector and the conditional latent representation derived from the single-cell bulk interpreter; and (ii) a learning process in which the denoiser uses the conditional latent representation as guide information to predict the noise component included in the noise-mixed latent vector and minimizes the error between the predicted noise (351) and the added noise (301), whereby the parameters of the denoiser are determined so as to derive, from input random Gaussian noise, the latent vector (430) corresponding to the biological characteristics of the bulk transcriptome.
- The single-cell transcriptome generation method of claim 10, wherein the decoder (140) is trained as a component of a variational autoencoder (VAE)-based single-cell compressor (100) that takes a training single-cell transcriptome (110) as input, the training comprising (i) a process of minimizing the reconstruction error between the training single-cell transcriptome (110) and the single-cell transcriptome (155) restored by the decoder (140) from the latent vector (130) compressed by the encoder (120); (ii) a process of performing principal component analysis (PCA) on the training single-cell transcriptome (110) and the restored single-cell transcriptome (155) and minimizing the perceptual loss between the extracted feature values; and (iii) a process of improving the data fidelity of the restored single-cell transcriptome (155) through adversarial learning using a discriminator (160), whereby the parameters of the decoder are determined so as to generate the plurality of single-cell transcriptomes (455) from the latent vector generated by the denoiser.
- The single-cell transcriptome generation method of claim 10, wherein, in the step of extracting the conditional latent representation (450), the single-cell transcriptome generation device generates a gene expression token embedding the gene expression pattern of the bulk transcriptome and a meta token embedding tissue information included in the metadata, and generates the conditional latent representation (450) by calculating the interaction between the gene expression token and the meta token through a multi-head attention mechanism.
- The single-cell transcriptome generation method of claim 14, wherein the single-cell transcriptome generation device uses, as a query, a learnable latent array having a preset size and parameters updated during the learning process to extract biological features from the gene expression token and the meta token.
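The training-side operations recited in the claims (mean pooling into a pseudo-bulk input, adding noise to a latent vector and scoring the predicted noise, and attention with the noisy latent as query and the conditional representation as key and value) can be illustrated with a small sketch. The Python code below is only an illustration under stated assumptions, not the claimed implementation: the function names, the `alpha_bar` mixing factor, and the use of mean squared error as the "error" are hypothetical choices, and the attention is reduced to a single head without learned projections, whereas the claims recite a multi-head unit.

```python
import math
import random
from statistics import fmean


def pseudo_bulk(cells):
    """Mean-pool single-cell expression vectors (one list per cell) gene by gene."""
    if not cells:
        raise ValueError("need at least one cell")
    n_genes = len(cells[0])
    if any(len(cell) != n_genes for cell in cells):
        raise ValueError("all cells must cover the same genes")
    return [fmean([cell[g] for cell in cells]) for g in range(n_genes)]


def add_noise(latent, alpha_bar, rng):
    """Mix Gaussian noise into a latent vector.

    alpha_bar in (0, 1] is a hypothetical signal-retention factor (assumption);
    the claims only state that noise is added to the latent vector.
    """
    if not 0.0 < alpha_bar <= 1.0:
        raise ValueError("alpha_bar must be in (0, 1]")
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noisy = [signal_scale * z + noise_scale * e for z, e in zip(latent, noise)]
    return noisy, noise


def noise_prediction_loss(predicted_noise, added_noise):
    """Mean squared error between predicted and added noise (assumed error measure)."""
    return fmean([(p - a) ** 2 for p, a in zip(predicted_noise, added_noise)])


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (simplified, no projections).

    queries: vectors derived from the noise-mixed latent vector;
    keys/values: vectors derived from the conditional latent representation.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs
```

For example, `pseudo_bulk([[1.0, 0.0], [3.0, 4.0]])` averages two toy cells into one bulk-like vector, and a denoiser's output can be scored with `noise_prediction_loss` against the `noise` returned by `add_noise`.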
Description
Apparatus and method for AI-based single-cell transcriptome generation using bulk transcriptome and metadata
The present invention relates to a gene expression analysis technology based on bioinformatics and artificial intelligence, and more specifically, to an apparatus and method that use a generative AI model to generate single-cell RNA-seq data, which is high-resolution information at the individual cell level, from relatively easy-to-obtain bulk RNA-seq data, and thereby restore cellular heterogeneity within a tissue.
Bulk transcriptome analysis measures gene expression averaged across all cells within a tissue, which limits its ability to identify heterogeneity among individual cell populations, such as cancer cells or immune cells. Analyzing such cellular heterogeneity, however, is essential for understanding disease mechanisms and discovering novel biomarkers. While single-cell transcriptome experiments offer high resolution, they entail significant cost and time compared to bulk analysis, which restricts their clinical application. Consequently, there is a need for a technology that can cost-effectively acquire single-cell-level information to address these limitations.
As a technology related to the present invention, there is U.S. Patent Publication No. US 2025-0125014 (April 17, 2025), “Method and system for deconvolution of bulk rna-sequencing data”.
FIG. 1 is a diagram illustrating the learning process of a single-cell compressor based on the VAE-WGAN-GP architecture according to one embodiment of the present invention.
FIG. 2 is a diagram showing the structure of a single-cell bulk interpreter that learns biological context from pseudo-bulk data according to one embodiment of the present invention.
FIG. 3 is a diagram showing the learning step of a denoiser that uses a conditional latent representation as a guide according to one embodiment of the present invention.
FIG. 4 is a diagram illustrating an inference process for deriving single-cell transcriptomes from an actual bulk transcriptome according to one embodiment of the present invention.
FIG. 5 is a UMAP visualization illustrating how a single point of bulk data is expanded into a cluster at the single-cell level according to one embodiment of the present invention.
FIG. 6 is a flowchart showing each step of a method for generating single-cell transcriptomes from a bulk transcriptome according to one embodiment of the present invention.
FIG. 7 is a block diagram showing the internal hardware configuration of a single-cell transcriptome generation device according to one embodiment of the present invention.
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. However, the present invention is not limited to the embodiments described herein and may be implemented in various other forms. The terms used in this specification are intended to aid in understanding the embodiments and are not intended to limit the scope of the present invention. Furthermore, singular forms used below include plural forms unless the context clearly indicates otherwise.
FIG. 1 shows the data processing flow of a single-cell compressor according to one embodiment of the present invention. The single-cell compressor (100) according to one embodiment of the present invention is a functional module configured to compress high-dimensional single-cell transcriptome data into a low-dimensional latent space to extract significant biological features, and to precisely restore the original transcriptome data from the extracted latent vector.
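As an illustration of how a VAE-style compressor of this kind can form its latent vector, the following Python sketch draws a latent vector from a per-dimension mean and standard deviation, which the specification describes the encoder as producing (mean (131), standard deviation (132)). The sampling rule used here, z = mean + std × ε with ε drawn from a standard normal distribution, is the usual reparameterization step for variational autoencoders and is an assumption about this embodiment; the function name `sample_latent` and the toy values are hypothetical.

```python
import random


def sample_latent(mean, std, rng):
    """Draw one latent vector z = mean + std * eps, with eps ~ N(0, 1) per dimension.

    mean and std stand in for encoder outputs; in a real model they would be
    computed by a neural network from a single-cell expression vector.
    """
    if len(mean) != len(std):
        raise ValueError("mean and std must have the same length")
    if any(s < 0 for s in std):
        raise ValueError("standard deviations must be non-negative")
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


# Toy usage: a 3-dimensional latent vector for one hypothetical cell.
rng = random.Random(42)
z = sample_latent([0.2, -1.3, 0.7], [0.1, 0.5, 0.3], rng)
```

When every standard deviation is zero the sample collapses to the mean, so the spread of sampled latent vectors is controlled entirely by the standard deviation output.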
The single-cell compressor (100) may be implemented based on a VAE-WGAN-GP architecture, in which a variational autoencoder (VAE) is combined with a Wasserstein generative adversarial network (WGAN) with gradient penalty (GP); however, the scope of the present invention is not limited thereto, and various modified compression algorithms capable of efficiently compressing and restoring data may be applied. A high-dimensional single-cell transcriptome (110) (scRNA-seq) input from the outside passes through an encoder (120), which extracts statistical features; in this process, the mean (131) and standard deviation (132), which are parameters of the data distribution, can be calculated. Based on these, a low-dimensional latent vector (130) summarizing the key gene expression characteristics of individual cells can be generated. The decoder (140) can receive the latent vector (130) (z) as input and, through a restoration process, output a “generated single-cell transcriptome” (155) (scRNA-seq (output)) whose distribution is statistically similar to that of actual data. The PCA analysis unit (190) can extract the macroscopic structure of high-dimens