CN-120032757-B - Cross-modal searching method and system for compound structure
Abstract
The invention provides a compound structure cross-modal searching method and system. Compound structure data in text form or image form is vectorized with a CLIP model, mapping both modalities into the same semantic space, so that cross-modal searches over compound structures can be run against a preset compound vector database. The database achieves efficient cross-modal search through a joint index and a cross-modal similarity search algorithm: the joint index supports storage and retrieval of image and text vectors simultaneously, improving search speed and efficiency, while the similarity algorithm considers image and text information jointly, improving search accuracy. By storing the vectors in shards (slices), the vector data is distributed across storage nodes, which flexibly absorbs later growth in data volume and enhances the scalability of the system.
Inventors
- ZHANG JINGLE
- WANG LEI
- ZHAO XIAOYONG
- ZHANG LI
- FANG ZHIJUN
Assignees
- Beijing Information Science and Technology University (北京信息科技大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-01-07
Claims (7)
- 1. A compound structure cross-modal searching method, characterized by comprising the following steps: vectorizing target compound structure data in text form or image form with a pre-trained CLIP model to obtain a query vector, wherein the CLIP model is trained on positive sample pairs of the same compound and negative sample pairs of different compounds in text form and image form, and the text form comprises a natural-language description form and a preset character-string form; and searching the query vector by similarity measurement against a preset compound vector database and outputting a set number of search results with the highest similarity, wherein the compound vector database vectorizes the existing compound structure data of a plurality of preset data sources through the CLIP model and stores the vectors in sharded (sliced) form. The pre-training of the CLIP model comprises: obtaining a training sample set comprising a plurality of positive sample pairs based on the same compound and a plurality of negative sample pairs based on different compounds, a positive sample pair consisting of text-form and image-form molecular structure data of the same compound, and a negative sample pair consisting of text-form and image-form molecular structure data of different compounds; and training an initial CLIP model on the training sample set, the initial CLIP model comprising an image encoder for vectorizing the image-form molecular structure data of the sample pairs and a text encoder for vectorizing the text-form molecular structure data of the sample pairs. The loss function adopts the InfoNCE loss, computed as follows: the similarity of the vectorized image-form and text-form molecular structure data over all positive and negative sample pairs is the cosine similarity $s(i,j)=\frac{I_i\cdot T_j}{\|I_i\|\,\|T_j\|}$, where $I_i$ is the image vector of the $i$-th image-form molecular structure data and $T_j$ is the text vector of the $j$-th text-form molecular structure data; a temperature parameter $\tau$ is introduced for adjustment, scaling each similarity to $\exp\big(s(i,j)/\tau\big)$; the loss over the image vectors is $\mathcal{L}_{\text{image}}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(s(i,i)/\tau)}{\sum_{j=1}^{N}\exp(s(i,j)/\tau)}$; the loss over the text vectors is $\mathcal{L}_{\text{text}}=-\frac{1}{N}\sum_{j=1}^{N}\log\frac{\exp(s(j,j)/\tau)}{\sum_{i=1}^{N}\exp(s(i,j)/\tau)}$; and the total loss is $\mathcal{L}=\tfrac{1}{2}\big(\mathcal{L}_{\text{image}}+\mathcal{L}_{\text{text}}\big)$. The method further comprises adopting a hash-based slicing scheme: taking the vectorized existing compound structure data as input, computing a hash value with the MD5 hash algorithm, and storing each vector in the shard determined by that hash value.
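The symmetric InfoNCE loss named in claim 1 can be sketched in NumPy as follows. This is an illustrative sketch only; the function and variable names (`infonce_loss`, `image_vecs`, `text_vecs`, `tau`) are ours, not the patent's, and a real CLIP implementation would compute this on encoder outputs inside a training framework.

```python
import numpy as np

def _logsumexp(a, axis):
    # Numerically stable log-sum-exp along the given axis.
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))

def infonce_loss(image_vecs, text_vecs, tau=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text vectors.

    Row k of image_vecs and row k of text_vecs form a positive pair;
    every other cross pairing in the batch serves as a negative.
    """
    # Unit-normalize so the dot product is the cosine similarity s(i, j).
    img = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    txt = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    sim = img @ txt.T / tau  # temperature-scaled similarities s(i, j)/tau

    # Image-side loss: each image should pick out its own text (the diagonal).
    loss_img = -np.mean(np.diag(sim - _logsumexp(sim, axis=1)))
    # Text-side loss: softmax taken over the image axis instead.
    loss_txt = -np.mean(np.diag(sim - _logsumexp(sim, axis=0)))
    return (loss_img + loss_txt) / 2
```

With aligned pairs (each image vector equal to its text vector) the loss is small; shuffling the pairing drives it up, which is the signal that trains the encoders.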
- 2. The method of claim 1, wherein the image encoder uses a Vision Transformer model and the text encoder uses a BERT model.
- 3. The compound structure cross-modal searching method of claim 1, wherein deploying the compound vector database comprises: acquiring existing compound structure data from a plurality of preset data sources, the data comprising a text form and an image form, wherein the text form comprises a natural-language description form and a preset character-string form, and the image form comprises a compound molecular structure diagram obtained directly or converted by a preset chemoinformatics tool; cleaning the data by deleting records with missing values or filling the missing values, processing outliers by deleting, scaling, or replacing them, integrating the data acquired from each preset data source into combined storage, unifying formats, and performing data reduction to remove redundant data and reduce data volume; converting the text-form structure data into chemical-diagram format with the preset chemoinformatics tool RDKit; and vectorizing the existing compound structure data in both text and image forms with the CLIP model and storing the vectors to build the compound vector database, wherein the compound vector database adopts Faiss, Milvus, or HNSWlib.
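As a rough sketch of the database-building step above, the stand-in below mimics the role that Faiss, Milvus, or HNSWlib would play: normalize each embedding, store it under a compound identifier, and answer top-k queries by cosine similarity. The `ToyVectorDB` class and the hand-written example vectors are our own illustration; in the patent's pipeline the vectors would come from the CLIP encoders, with RDKit rendering the text-form data to structure diagrams.

```python
import numpy as np

class ToyVectorDB:
    """Brute-force in-memory stand-in for a vector database such as Faiss."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.ids = []

    def add(self, compound_id, vec):
        # Store unit vectors so a dot product equals cosine similarity.
        vec = np.asarray(vec, dtype=np.float32).reshape(1, self.dim)
        vec /= np.linalg.norm(vec)
        self.vectors = np.vstack([self.vectors, vec])
        self.ids.append(compound_id)

    def search(self, query, k=3):
        # Return the k stored compounds most similar to the query vector.
        q = np.asarray(query, dtype=np.float32)
        q /= np.linalg.norm(q)
        sims = self.vectors @ q
        top = np.argsort(-sims)[:k]
        return [(self.ids[i], float(sims[i])) for i in top]
```

A production deployment would swap `ToyVectorDB` for a Faiss index, which adds approximate-nearest-neighbour structures so search stays fast as the collection grows.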
- 4. The method of claim 1, wherein searching the query vector based on the similarity measurement using the preset compound vector database comprises searching with a combination of cosine similarity, Euclidean distance, and/or Jaccard similarity.
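One way to combine the three measures listed in claim 4 into a single ranking score is a weighted sum, sketched below. The weights, the distance-to-similarity mapping for the Euclidean term, and the reading of Jaccard similarity over sets of above-threshold dimensions are all our assumptions; the claim itself does not fix a combination formula.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_sim(a, b):
    # Map Euclidean distance [0, inf) into a similarity in (0, 1].
    return 1.0 / (1.0 + float(np.linalg.norm(a - b)))

def jaccard(a, b, thresh=0.0):
    # Jaccard similarity over the sets of "active" (above-threshold) dimensions.
    sa, sb = set(np.where(a > thresh)[0]), set(np.where(b > thresh)[0])
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def combined_score(a, b, w=(0.5, 0.3, 0.2)):
    # Weighted blend; identical vectors score exactly 1.0 since the weights sum to 1.
    return w[0] * cosine(a, b) + w[1] * euclidean_sim(a, b) + w[2] * jaccard(a, b)
```

In practice the weights would be tuned on retrieval benchmarks; the sketch only shows how the three signals can be fused into one score for ranking.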
- 5. A compound structure cross-modal search system, comprising: a multi-modal neural network pre-training module for pre-training a CLIP model that vectorizes compound structure data in text form and image form, wherein the CLIP model is trained on positive sample pairs of the same compound and negative sample pairs of different compounds in text form and image form; a search library construction module for vectorizing the existing compound structure data of a plurality of preset data sources through the CLIP model and storing the vectors in sharded form; and a search module for receiving the target compound structure data to be queried submitted by a user, executing the compound structure cross-modal searching method of any one of claims 1 to 4, and outputting the search results.
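The sharded ("fragment") storage used by the search library construction module follows the MD5 scheme of claim 1: hash a stable key for each vectorized compound and let the hash value select the shard. A minimal sketch, where the shard count and the SMILES-string keys are our own example values:

```python
import hashlib

def shard_for(compound_key: str, num_shards: int) -> int:
    """Deterministically map a compound key to a shard index via MD5."""
    digest = hashlib.md5(compound_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Distribute a few example compounds (SMILES strings as keys) over 4 shards.
shards = {i: [] for i in range(4)}
for key in ["CCO", "c1ccccc1", "CC(=O)O", "O=C=O"]:
    shards[shard_for(key, 4)].append(key)
```

Because MD5 is deterministic, the same compound always lands on the same shard, so lookups route to a single node; because the digest is roughly uniform, the shards stay balanced as the collection grows.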
- 6. A computer-readable storage medium having stored thereon a computer program or instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 4.
- 7. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 4.
Description
Cross-modal searching method and system for compound structure

Technical Field

The invention relates to the technical field of chemical information processing, in particular to a compound structure cross-modal searching method and system.

Background

Current searches over compound structure data rely mainly on single-modality information, such as structure-based similarity matching (e.g., over SMILES strings or molecular diagrams) or text keyword search. Although this meets basic needs to some extent, it has a notable shortcoming: relying only on chemical structures or text descriptions ignores the complementary information carried by images and other modalities, and it is difficult to capture high-level semantics such as a compound's functions and applications. In addition, ambiguous text expressions, inconsistent terminology, and the complex relationship between structure and function further limit the comprehensiveness and accuracy of searching, making it hard to meet users' individualized and diverse requirements in drug development.

Disclosure of Invention

In view of this, embodiments of the invention provide a compound structure cross-modal searching method and system, so as to eliminate or mitigate one or more defects of the prior art and solve the problem that the prior art cannot perform cross-modal searching of compound structure data.
One aspect of the present invention provides a compound structure cross-modal searching method comprising the following steps: vectorizing target compound structure data in text form or image form with a pre-trained CLIP model to obtain a query vector, wherein the CLIP model is trained on positive sample pairs of the same compound and negative sample pairs of different compounds in text form and image form, and the text form comprises a natural-language description form and a preset character-string form; and searching the query vector by similarity measurement against a preset compound vector database and outputting a set number of search results with the highest similarity, wherein the compound vector database vectorizes the existing compound structure data of a plurality of preset data sources through the CLIP model, stores the vectors in sharded form, and performs the search through an HNSW index.
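The HNSW index mentioned above is, at its core, a greedy nearest-neighbour walk over a layered proximity graph. The single-layer sketch below conveys only that core idea; function names are ours, and a real deployment would use Faiss's `IndexHNSWFlat` or the hnswlib library rather than this toy.

```python
import numpy as np

def build_knn_graph(vectors, m=4):
    """Connect each point to its m nearest neighbours (brute force build)."""
    n = len(vectors)
    graph = {}
    for i in range(n):
        d = np.linalg.norm(vectors - vectors[i], axis=1)
        graph[i] = list(np.argsort(d)[1:m + 1])  # skip self at position 0
    return graph

def greedy_search(vectors, graph, query, entry=0):
    """Walk the graph, always moving to a closer neighbour, until stuck."""
    cur = entry
    cur_d = np.linalg.norm(vectors[cur] - query)
    improved = True
    while improved:
        improved = False
        for nb in graph[cur]:
            d = np.linalg.norm(vectors[nb] - query)
            if d < cur_d:
                cur, cur_d, improved = nb, d, True
    return cur, float(cur_d)
```

Real HNSW layers several such graphs, entering at a sparse top layer and descending, which is what keeps lookups logarithmic; the greedy walk within each layer is the step shown here.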
In some embodiments, the pre-training step of the CLIP model comprises: obtaining a training sample set comprising a plurality of positive sample pairs based on the same compound, each consisting of text-form and image-form molecular structure data of that compound, and a plurality of negative sample pairs based on different compounds, each consisting of text-form and image-form molecular structure data of different compounds; and training an initial CLIP model on the training sample set, wherein the initial CLIP model comprises an image encoder that vectorizes the image-form molecular structure data of the positive and negative sample pairs and a text encoder that vectorizes the text-form molecular structure data of those pairs; after vectorization, the similarity of the image-form and text-form molecular structure data in each pair is calculated, a loss function is constructed by maximizing the similarity of the positive sample pairs and minimizing the similarity of the negative sample pairs, and the parameters of the initial CLIP model are updated accordingly to obtain the CLIP model.
In some embodiments, the loss function employs the InfoNCE loss, computed as follows. The similarity of the vectorized image-form and text-form molecular structure data over all positive and negative sample pairs is the cosine similarity

$s(i,j)=\frac{I_i\cdot T_j}{\|I_i\|\,\|T_j\|}$

where $I_i$ is the image vector of the $i$-th image-form molecular structure data and $T_j$ is the text vector of the $j$-th text-form molecular structure data. A temperature parameter $\tau$ is introduced for adjustment, scaling each similarity to $\exp\big(s(i,j)/\tau\big)$. The loss over the image vectors is

$\mathcal{L}_{\text{image}}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(s(i,i)/\tau)}{\sum_{j=1}^{N}\exp(s(i,j)/\tau)}$

the loss over the text vectors is

$\mathcal{L}_{\text{text}}=-\frac{1}{N}\sum_{j=1}^{N}\log\frac{\exp(s(j,j)/\tau)}{\sum_{i=1}^{N}\exp(s(i,j)/\tau)}$

and the total loss is $\mathcal{L}=\tfrac{1}{2}\big(\mathcal{L}_{\text{image}}+\mathcal{L}_{\text{text}}\big)$.

In some embodiments, the image encoder employs a Vision Transformer model and the text encoder employs a BERT model.

In some embodiments, the deploying step of the compound vector database comprises: acquiring existing compound structure data from a plurality of preset data sources, the data comprising a text form and an image form, wherein the text form comprises a natural-language description form and a preset character-string form, and the image form comprises a compound molecular structure diagram