CN-115909328-B - Small molecular chemical structure image recognition method based on transformation neural network
Abstract
The invention discloses a small molecular chemical structure image recognition method based on a transformation neural network, comprising the following steps: S1, obtaining a small molecular chemical structure image and preprocessing it; S2, taking the preprocessed small molecular chemical structure image as the input of a MobileViT_2 network and extracting the feature vector of the small molecular chemical structure image; S3, taking an original SMILES sequence, together with the feature vector of the small molecular chemical structure image, as the input of the decoding part of a Conditional DETR network to obtain a SELFIES sequence, converting the SELFIES sequence into a new SMILES sequence through the selfies program package, and outputting the new SMILES sequence as the recognition result. The method combines the MobileViT_2 network and the Conditional DETR network, and addresses problems such as low prediction accuracy for complex sequences, slow model convergence, and unstable learned weights.
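The last step described above, assembling the decoder's predicted symbols into a SELFIES sequence, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names `split_selfies_symbols` and `join_symbols` and the `<eos>`/`<pad>` tokens are hypothetical, and the regular expression assumes every SELFIES symbol is a bracketed token. The SELFIES-to-SMILES conversion itself is left to the `selfies` package and is not called here, so the sketch has no third-party dependencies.

```python
import re

# Assumption: every SELFIES symbol is a bracketed token such as "[C]" or "[=O]".
_SYMBOL = re.compile(r"\[[^\[\]]+\]")


def split_selfies_symbols(selfies_string):
    """Split a SELFIES string into its bracketed symbols (hypothetical helper)."""
    symbols = _SYMBOL.findall(selfies_string)
    # Reject input that contains anything outside the bracketed symbols.
    if "".join(symbols) != selfies_string:
        raise ValueError("not a well-formed SELFIES string: %r" % selfies_string)
    return symbols


def join_symbols(predicted_symbols, end_token="<eos>"):
    """Concatenate decoder outputs up to an assumed end-of-sequence token."""
    kept = []
    for symbol in predicted_symbols:
        if symbol == end_token:
            break
        kept.append(symbol)
    return "".join(kept)


# Ethanol written as SELFIES symbols, followed by an end token and padding.
predicted = ["[C]", "[C]", "[O]", "<eos>", "<pad>"]
selfies_string = join_symbols(predicted)
symbols = split_selfies_symbols(selfies_string)
```

With the `selfies` package installed, passing `selfies_string` to its `decoder` function would then produce the corresponding SMILES string.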
Inventors
- JIANG WENBO
- LIU XUEMEI
- XUE ZIJIA
Assignees
- Xihua University (西华大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2022-11-29
Claims (3)
- 1. A small molecular chemical structure image recognition method based on a transformation neural network, characterized by comprising the following steps: S1, acquiring a small molecular chemical structure image and preprocessing the small molecular chemical structure image; S2, taking the preprocessed small molecular chemical structure image as the input of a MobileViT_2 network and extracting the feature vector of the small molecular chemical structure image; in step S2, the MobileViT_2 network includes a first convolution module, a first MobileV2 block module, a second MobileV2 block module, a third MobileV2 block module, a fourth MobileV2 block module, a fifth MobileV2 block module, a first MobileViT_2 block module, a sixth MobileV2 block module, a second MobileViT_2 block module, a seventh MobileV2 block module, a third MobileViT_2 block module, a second convolution module, and an average pooling module, which are sequentially connected; the first MobileV2 block module, the second MobileV2 block module, the third MobileV2 block module, the fourth MobileV2 block module and the fifth MobileV2 block module are used for extracting feature information of the small molecular chemical structure image, wherein the second MobileV2 block module and the fifth MobileV2 block module are also used for downsampling the small molecular chemical structure image; the first MobileViT_2 block module, the second MobileViT_2 block module and the third MobileViT_2 block module have the same structure and each comprise a first convolution layer, a second convolution layer, an unfolding layer, a first group of normalization layers, mobileV layers, a second group of normalization layers, a folding layer, a third convolution layer and a fourth convolution layer which are sequentially connected; S3, acquiring an original SMILES sequence, taking the original SMILES sequence as a label, and taking the label together with the feature vector of the small molecular chemical structure image as the input of the decoding part of a Conditional DETR network to obtain a SELFIES sequence, converting the SELFIES sequence into a new SMILES sequence through the selfies program package, and taking the new SMILES sequence as the recognition result, thereby completing small molecular chemical structure image recognition; the decoding part of the Conditional DETR network comprises a first sublayer connecting module, a second sublayer connecting module and a third sublayer connecting module; the first sublayer connecting module and the second sublayer connecting module are used for performing normalization, feature learning and concatenation on the original SMILES sequence, and the third sublayer connecting module is used for decoding the output of the second sublayer connecting module to generate SELFIES-format characters, thereby obtaining the SELFIES sequence, which is converted into a new SMILES sequence through the selfies program package; the first sublayer connecting module comprises a first layer normalization layer, a first multi-head self-attention sublayer and a first residual layer which are connected, the second sublayer connecting module comprises a second layer normalization layer, a second multi-head attention sublayer and a second residual layer which are connected, and the third sublayer connecting module comprises a third layer normalization layer, a feedforward fully connected sublayer and a third residual layer which are connected.
- 2. The small molecular chemical structure image recognition method based on the transformation neural network according to claim 1, wherein in step S1 the small molecular chemical structure image is preprocessed by converting it into an RGB image and randomly rotating the RGB image.
- 3. The small molecular chemical structure image recognition method based on the transformation neural network according to claim 1, wherein the first MobileV2 block module, the second MobileV2 block module, the third MobileV2 block module, the fourth MobileV2 block module, the fifth MobileV2 block module, the sixth MobileV2 block module and the seventh MobileV2 block module have the same structure and each include a convolution layer, a SiLU activation layer, a normalization layer and a deformable convolution block which are sequentially connected.
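The module order recited in claim 1 can be written down as a plain ordered list, which makes the module counts and the downsampling positions easy to check. This is only a bookkeeping sketch under stated assumptions: the tuple layout and the names `BACKBONE`, `count_kind` and `downsampling_blocks` are illustrative and do not come from the patent, and no actual network layers are built.

```python
# Each entry: (kind, ordinal within its kind, stated to downsample?).
# The order follows claim 1. The downsampling flags follow the claim's
# statement that the second and fifth MobileV2 blocks also downsample.
# Claim 1 does not say whether other modules change resolution, so they
# are marked False here as an assumption.
BACKBONE = [
    ("conv", 1, False),
    ("mobilev2", 1, False),
    ("mobilev2", 2, True),
    ("mobilev2", 3, False),
    ("mobilev2", 4, False),
    ("mobilev2", 5, True),
    ("mobilevit_2", 1, False),
    ("mobilev2", 6, False),
    ("mobilevit_2", 2, False),
    ("mobilev2", 7, False),
    ("mobilevit_2", 3, False),
    ("conv", 2, False),
    ("avg_pool", 1, False),
]


def count_kind(kind):
    """Return how many modules of the given kind appear in the backbone."""
    return sum(1 for entry in BACKBONE if entry[0] == kind)


def downsampling_blocks():
    """Return (kind, ordinal) for every module flagged as downsampling."""
    return [(kind, ordinal) for kind, ordinal, down in BACKBONE if down]
```

Keeping the layout as data rather than code makes it straightforward to compare against the claim text before wiring up real layers in a deep learning framework.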
Description
Small molecular chemical structure image recognition method based on transformation neural network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a small molecular chemical structure image recognition method based on a transformation neural network.
Background
Small-molecule drugs have a very wide range of applications. They are used to treat tumors, nervous system diseases, infections, metabolic and cardiovascular diseases, immune or allergic diseases, skin diseases, digestive system diseases, bone diseases, and the like, and serve as non-central analgesic, antipyretic, and anti-inflammatory agents. It is estimated that small-molecule drugs may account for 98% of commonly used drugs. In the field of data management for the life sciences, extracting chemical structures from published sources such as journal papers and patents has long been difficult and time-consuming. In recent years, with the rapid development of computer vision and natural language processing technologies based on deep learning, extracting valuable information from images with deep learning has become increasingly common. A deep neural network can extract features automatically and offers good robustness and generalization on chemical structure images. Because the structural details of small-molecule compounds in much of the related literature are presented as JPEG, PNG, GIF or BMP images, their original chemical meaning is lost. Automatically analyzing chemical structure images and converting them into a computer-readable format, such as the SMILES (Simplified Molecular Input Line Entry System) representation, is therefore of practical value for analyzing and discovering small-molecule drugs.
Researchers at home and abroad have done a great deal of work and made considerable progress in small molecular chemical structure image recognition, but the following problems remain:
First, prediction accuracy for complex sequences is low. The quality of small molecular chemical structure image recognition is measured by subjective and objective evaluation indexes. Current algorithms that combine an image feature extraction network with a Transformer neural network predict simple sequences, those with few character types and few characters, well. However, small-molecule databases also contain many more complex sequences with more character types and more characters. Because the feature information carried by a chemical structure image is scarce and sparse, its feature expression capability is weak, and the features that can be extracted are few or incomplete, which leads to character and chemical bond recognition errors in the subsequent prediction. Thus, although the algorithms in current use predict simple sequences well, their performance on complex sequences is mediocre on both subjective and objective evaluation indexes.
Second, model convergence is slow. In research on chemical structure image recognition algorithms, many scholars have improved the accuracy of the predicted sequence with the classical encoder-decoder Transformer neural network model and obtained good results. However, because the classical Transformer model has high complexity and a large number of parameters, it requires a large amount of training data. Although the overall quality of the final sequence prediction improves to some extent, the model converges slowly and its practical value is limited.
Third, model learning fluctuates and the learned weights are unstable. Because the deep learning models currently in use are deeper and larger, and the variance of the data distribution within a mini-batch is particularly large during training, model learning fluctuates severely and the learned weights are unstable.
Disclosure of Invention
To solve the above problems, the invention provides a small molecular chemical structure image recognition method based on a transformation neural network.
The technical scheme of the invention is a small molecular chemical structure image recognition method based on a transformation neural network, comprising the following steps: S1, acquiring a small molecular chemical structure image and preprocessing the small molecular chemical structure image; S2, taking the preprocessed small molecular chemical structure image as the input of a MobileViT_2 network and extracting the feature vector of the small molecular chemical structure image; and S3, acquiring an original SMILES sequence, taking the original SMILES sequence as a label and taking the label and a feature vector of a small molecular che