CN-122029605-A - Gene expression matrix optimization method, electronic device and storage medium

Abstract

The method comprises: obtaining cell data of a plurality of cells; determining, according to the cell data, a preset number of nearest neighbor cells of a target cell among the plurality of cells; determining a gene expression distance between the target cell and each nearest neighbor cell; determining a distance parameter according to a target non-zero value in the gene expression distance; determining a weight parameter corresponding to each nearest neighbor cell based on the distance parameter; and smoothing the initial gene expression matrix of the target cell, based on the weight parameter and the initial gene expression matrix of each nearest neighbor cell, to obtain a target gene expression matrix. The method can reduce noise in the gene expression matrix and thereby improve the accuracy of downstream analysis based on that matrix.
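As a rough illustration of the final smoothing step described above, the following Python sketch combines the expression vectors of the nearest neighbor cells using weights that are already known. Several points are assumptions made for illustration and are not confirmed by this text: each cell's expression is treated as a 1-D vector of per-gene values, the weights are supplied by the caller (the weight formula is not reproduced here), and the smoothed result is taken to be a weight-normalized average. The function name `smooth_expression` is hypothetical.

```python
import numpy as np

def smooth_expression(neighbor_exprs, weights):
    """Weight-normalized average of neighbor expression vectors (assumed form)."""
    neighbor_exprs = np.asarray(neighbor_exprs, dtype=float)  # shape (k, n_genes)
    weights = np.asarray(weights, dtype=float)                # shape (k,)
    if neighbor_exprs.shape[0] != weights.shape[0]:
        raise ValueError("one weight is required per neighbor")
    total = weights.sum()
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumption: divide by the weight sum so the result stays on the input scale.
    return weights @ neighbor_exprs / total

# Two neighbors, two genes; the second neighbor counts three times as much.
smoothed = smooth_expression([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0])
```

Whether the target cell's own expression also enters the sum, and whether the weights are normalized at all, would need to be checked against the formula in the full application.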

Inventors

  • LV TONGXUAN
  • ZHANG YING
  • KANG QIANG
  • LI MEI
  • ZHANG YONG
  • XU XUN

Assignees

  • 深圳华大生命科学研究院
  • 深圳华大基因科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2023-10-13

Claims (10)

  1. A method of optimizing a gene expression matrix, the method comprising: obtaining cell data for a plurality of cells, the cell data comprising an initial gene expression matrix for each cell in the plurality of cells; determining a preset number of nearest neighbor cells of a target cell of the plurality of cells based on the cell data; determining a gene expression distance between the target cell and each nearest neighbor cell, determining a distance parameter according to a target non-zero value in the gene expression distance, and determining a weight parameter corresponding to each nearest neighbor cell based on the distance parameter; and obtaining a target gene expression matrix by smoothing the initial gene expression matrix of the target cell based on the weight parameter corresponding to each nearest neighbor cell and the initial gene expression matrix of each nearest neighbor cell.
  2. The gene expression matrix optimization method of claim 1, further comprising, before determining the preset number of nearest neighbor cells of the target cell of the plurality of cells from the cell data, preprocessing the cell data, the preprocessing comprising: normalizing the cell data, and performing dimension reduction on the normalized cell data.
  3. The gene expression matrix optimization method of claim 1, wherein the cell data further comprises spatial location coordinates of each of the plurality of cells, and the determining a preset number of nearest neighbor cells of the target cell from the cell data comprises: determining, based on the spatial location coordinates of each cell, a spatial location distance between the target cell and each of the other cells of the plurality of cells; and sorting the spatial location distances in ascending order, and selecting the cells corresponding to the first preset number of spatial location distances in the sorted sequence as the nearest neighbor cells of the target cell.
  4. The method of optimizing a gene expression matrix according to claim 1, wherein the determining the gene expression distance between the target cell and each nearest neighbor cell comprises: calculating the Euclidean distance between the initial gene expression matrix of the target cell and the initial gene expression matrix of each nearest neighbor cell, and taking the Euclidean distance as the gene expression distance.
  5. The method for optimizing a gene expression matrix according to claim 1, wherein determining the target non-zero value comprises: constructing a non-negative distance matrix from the non-zero values in the gene expression distance, wherein the non-negative distance matrix is a one-dimensional matrix in which the value of each element is less than or equal to the value of the element that follows it; and extracting, based on a preset percentage, the percentile at the corresponding position in the non-negative distance matrix, and taking that percentile as the target non-zero value.
  6. The gene expression matrix optimization method according to claim 1, wherein the distance parameter is determined from the target non-zero value in the gene expression distance using a formula in which c represents the distance parameter, DDT represents the target non-zero value, gs represents a preset average value of a weight formula of the smoothing process, and a and b represent preset hyperparameters.
  7. The gene expression matrix optimization method according to claim 1, wherein the weight parameter corresponding to each nearest neighbor cell is determined based on the distance parameter using a formula in which w_i represents the weight parameter corresponding to nearest neighbor cell i, R(x, i) represents the spatial position distance between target cell x and nearest neighbor cell i, i ∈ N_x(k), N_x(k) represents the set of k nearest neighbor cells of target cell x, c represents the distance parameter, and a and b represent preset hyperparameters.
  8. The method for optimizing a gene expression matrix according to claim 1, wherein the target gene expression matrix is obtained by smoothing the initial gene expression matrix of the target cell, based on the weight parameter corresponding to each nearest neighbor cell and the initial gene expression matrix of each nearest neighbor cell, using a formula in which exp'_x represents the target gene expression matrix of target cell x, w_i represents the weight parameter corresponding to nearest neighbor cell i, N_x(k) represents the set of k nearest neighbor cells of target cell x, and exp_i represents the initial gene expression matrix of nearest neighbor cell i.
  9. An electronic device comprising a processor and a memory, wherein the processor is configured to implement the gene expression matrix optimization method according to any one of claims 1 to 8 when executing a computer program stored in the memory.
  10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the gene expression matrix optimization method according to any one of claims 1 to 8.
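
The distance and percentile steps of claims 4 and 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: each cell's expression is treated as a 1-D vector, the function names `gene_expression_distances` and `target_nonzero_value` and the default `percent=50.0` are hypothetical, and `numpy.percentile` is used with its default linear interpolation, which the claims do not specify.

```python
import numpy as np

def gene_expression_distances(target_expr, neighbor_exprs):
    """Euclidean distance from the target cell's expression to each neighbor's."""
    target_expr = np.asarray(target_expr, dtype=float)        # shape (n_genes,)
    neighbor_exprs = np.asarray(neighbor_exprs, dtype=float)  # shape (k, n_genes)
    return np.linalg.norm(neighbor_exprs - target_expr, axis=1)

def target_nonzero_value(distances, percent=50.0):
    """Sort the non-zero distances ascending and take the value at a percentile."""
    distances = np.asarray(distances, dtype=float)
    # One-dimensional, ascending array built from the non-zero values only.
    nonzero_sorted = np.sort(distances[distances > 0])
    if nonzero_sorted.size == 0:
        raise ValueError("no non-zero distances to take a percentile of")
    # Assumption: linear interpolation between ranks (NumPy's default).
    return float(np.percentile(nonzero_sorted, percent))

# Hypothetical data: one neighbor is identical to the target (distance 0).
dists = gene_expression_distances([0.0, 0.0], [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])
ddt = target_nonzero_value(dists, percent=50.0)
```

Dropping zero distances before taking the percentile keeps neighbors with identical expression from pulling the target non-zero value toward zero.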

Description

Gene expression matrix optimization method, electronic device and storage medium

Technical Field

The application relates to the technical field of biomedicine, and in particular to a gene expression matrix optimization method, an electronic device and a storage medium.

Background

Studying how genes respond to external factors in different cell states, or how genes interact with one another, requires a cell gene expression matrix. Owing to technical limitations (for example, gene diffusion between cells is not considered, and only gene expression information within the range of the cell nucleus is considered), the obtained gene expression matrix is highly sparse and contains much noise. This affects the accuracy of downstream feature extraction and biological information mining performed on cells according to the gene expression matrix.

Disclosure of Invention

In view of the above, it is necessary to propose a gene expression matrix optimization method, an electronic device, and a storage medium capable of solving the problem that the obtained gene expression matrix is highly sparse and contains much noise.
An embodiment of the application provides a gene expression matrix optimization method, comprising: obtaining cell data of a plurality of cells; determining, according to the cell data, a preset number of nearest neighbor cells of a target cell among the plurality of cells; determining a gene expression distance between the target cell and each nearest neighbor cell; determining a distance parameter according to a target non-zero value in the gene expression distance; determining a weight parameter corresponding to each nearest neighbor cell based on the distance parameter; and obtaining a target gene expression matrix by smoothing the initial gene expression matrix of the target cell based on the weight parameter corresponding to each nearest neighbor cell and the initial gene expression matrix of each nearest neighbor cell.
In one embodiment, the method further comprises preprocessing the cell data before determining the preset number of nearest neighbor cells of the target cell from the cell data, the preprocessing including normalizing the cell data and performing dimension reduction on the normalized cell data.
In one embodiment, the cell data further comprises spatial position coordinates of each cell in the plurality of cells, and determining the preset number of nearest neighbor cells of the target cell according to the cell data comprises: determining, based on the spatial position coordinates of each cell, the spatial position distance between the target cell and each of the other cells in the plurality of cells; sorting the spatial position distances in ascending order; and selecting the cells corresponding to the first preset number of spatial position distances in the sorted sequence as the nearest neighbor cells of the target cell.
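
The spatial neighbor selection described in the embodiment above can be sketched in Python as follows. This is only an illustrative sketch: the function name `nearest_neighbors_by_position` and its parameters are hypothetical, spatial distance is assumed to be Euclidean (the text does not name the metric), and ties are broken by original cell order.

```python
import numpy as np

def nearest_neighbors_by_position(coords, target_index, k):
    """Return indices and distances of the k spatially closest other cells."""
    coords = np.asarray(coords, dtype=float)  # shape (n_cells, n_dims)
    # Every cell except the target is a candidate neighbor.
    others = np.delete(np.arange(coords.shape[0]), target_index)
    # Assumption: Euclidean distance between spatial coordinates.
    dists = np.linalg.norm(coords[others] - coords[target_index], axis=1)
    # Sort ascending and keep the first k entries of the sorted sequence.
    order = np.argsort(dists, kind="stable")[:k]
    return others[order], dists[order]

# Hypothetical 2-D coordinates for five cells; cell 0 is the target.
coords = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0], [0.5, 0.0]]
neighbor_idx, neighbor_dist = nearest_neighbors_by_position(coords, 0, k=2)
```

For large cell counts a spatial index would avoid computing every pairwise distance, but the brute-force version mirrors the sort-and-select wording of the embodiment most directly.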
In one embodiment, determining the gene expression distance between the target cell and each nearest neighbor cell includes calculating the Euclidean distance between the initial gene expression matrix of the target cell and the initial gene expression matrix of each nearest neighbor cell, and taking the Euclidean distance as the gene expression distance.
In one embodiment, determining the target non-zero value comprises: constructing a non-negative distance matrix from the non-zero values in the gene expression distance, wherein the non-negative distance matrix is a one-dimensional matrix in which the value of each element is less than or equal to the value of the element that follows it; and extracting, based on a preset percentage, the percentile at the corresponding position in the non-negative distance matrix, and taking that percentile as the target non-zero value.
In one embodiment, the distance parameter is determined from the target non-zero value in the gene expression distance using a formula in which c represents the distance parameter, DDT represents the target non-zero value, gs represents a preset average value of a weight formula of the smoothing process, and a and b represent preset hyperparameters.
In one embodiment, the weight parameter for each nearest neighbor cell is determined based on the distance parameter using a formula in which w_i represents the weight parameter corresponding to nearest neighbor cell i, R(x, i) represents the spatial position distance between target cell x and nearest neighbor cell i, i ∈ N_x(k), N_x(k) represents the set of k nearest neighbor cells of target cell x, c represents the distance parameter, and a and b represent preset hyperparameters.
In one embodiment, the obtaining the target gene expression matrix after smoothing the initial gene expression matrix of the target cell based on the weight