CN-122023913-A - Double-mask multi-modal tongue image analysis method and system fusing pulse-condition priors
Abstract
The invention discloses a double-mask multi-modal tongue image analysis method and system that fuse pulse-condition priors. The method first constructs an interactive dual-stream backbone based on a convolutional neural network and a Transformer to extract multi-scale visual features. Second, it designs a double-mask feature decoupling module: an inner mask filters the background to purify internal textures, while an outer mask obtained by large-scale dilation, combined with a Scharr operator, forms an edge-enhancement branch that accurately captures high-frequency features such as tooth marks and cracks. The method then constructs a pulse-condition guidance module, introducing a numerical pulse-condition vector through a FiLM mechanism to dynamically modulate the visual features at the channel level. Finally, the model is optimized with a two-stage deeply supervised curriculum-learning strategy and cosine annealing with warm restarts. The invention effectively resolves the conflict between edge-feature and internal-texture extraction, overcomes the limitation of a single visual modality, and improves the recognition accuracy of fine tongue morphological features.
Inventors
- ZHANG WENBIN
- YAN YUHE
- GAO XINYU
- SHA XIAOPENG
- LV XIAOYONG
Assignees
- Northeastern University at Qinhuangdao (东北大学秦皇岛分校)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-02
Claims (9)
- 1. A double-mask multi-modal tongue image analysis method fusing pulse-condition priors, characterized by comprising the following steps: step 1, acquiring a tongue image of an object to be detected and a corresponding numerical pulse-condition feature vector; step 2, constructing a multi-modal fusion backbone network, namely extracting shallow features (S1) and middle-layer features (S2) of the tongue image with a convolutional neural network branch, and extracting global semantic features with a Transformer branch based on a feature injection mechanism; step 3, constructing a double-mask generation module, namely generating an inner mask and an outer mask from the segmentation probability map; step 4, constructing an edge-enhancement and feature-fusion module, namely extracting the gradient magnitude of the shallow features (S1) with a Scharr operator, generating edge-enhanced features in combination with the outer mask, and concatenating them along the channel dimension with the middle-layer features (S2) filtered by the inner mask to generate visual fusion features; step 5, constructing a pulse-condition-guided feature modulation module, namely generating affine transformation parameters from the numerical pulse-condition feature vector and applying channel-level dynamic correction to the visual fusion features through a feature-wise linear modulation (FiLM) mechanism; and step 6, outputting results, namely feeding the multi-modal fusion features into a pre-constructed classification and decoding network and outputting a tongue body segmentation mask and tongue image feature classification results.
- 2. The double-mask multi-modal tongue image analysis method fusing pulse-condition priors according to claim 1, characterized in that step 1 comprises: step 1-1, acquiring an original tongue image of the object to be detected under natural illumination using the image acquisition module of a portable intelligent terminal; step 1-2, performing region-of-interest cropping and resolution normalization on the original tongue image to generate a fixed-size tongue image to be detected; step 1-3, acquiring a pulse-condition description text of the object to be detected and extracting keywords describing the morphology and properties of the pulse; step 1-4, constructing a 5-dimensional orthogonal pulse-condition feature space covering position, number, shape, potential and force; and step 1-5, mapping the keywords into this feature space to generate a 5-dimensional numerical pulse-condition feature vector covering the dimensions of deficiency-fineness, phlegm-dampness, qi stagnation, excess heat and interior cold.
- 3. The double-mask multi-modal tongue image analysis method fusing pulse-condition priors according to claim 1, characterized in that step 2 specifically comprises: the convolutional neural network branch adopts a hierarchical downsampling structure and outputs 4x-downsampled shallow features (S1), 16x-downsampled middle-layer features (S2) and 32x-downsampled deep features (S3); the feature injection mechanism specifically comprises adjusting the channel number of the middle-layer features (S2) through a convolutional layer, flattening them into a serialized token sequence, embedding and injecting the sequence into the Transformer branch as input, and performing global context modeling with a multi-head self-attention mechanism.
- 4. The double-mask multi-modal tongue image analysis method fusing pulse-condition priors according to claim 1, characterized in that step 3 specifically comprises: generating the inner mask, namely binarizing the segmentation probability map and performing a morphological erosion operation with a structuring element to generate the inner mask, which is used to shield the oral-cavity background outside the tongue contour during feature extraction; and generating the outer mask, namely performing a morphological dilation operation on the segmentation probability map to generate the outer mask, whose coverage extends substantially outward from the tongue body and is used to retain the concave features at the tongue edge.
- 5. The double-mask multi-modal tongue image analysis method fusing pulse-condition priors according to claim 1, characterized in that step 4 comprises: performing Scharr convolutions on the shallow features (S1) with a horizontal convolution kernel and a vertical convolution kernel respectively, and computing a gradient magnitude map; multiplying the gradient magnitude map element-wise with the outer mask to filter background noise and obtain edge-enhanced features; multiplying the inner mask element-wise with the middle-layer features (S2) to obtain purified internal features; and concatenating the purified internal features and the edge-enhanced features along the channel dimension to generate the visual fusion features.
- 6. The double-mask multi-modal tongue image analysis method fusing pulse-condition priors according to claim 1, characterized in that step 5 comprises: step 5-1, nonlinearly encoding the numerical pulse-condition feature vector with a multi-layer perceptron and outputting a scaling coefficient γ and an offset coefficient β; and step 5-2, performing a channel-wise transformation on the visual fusion feature F according to the formula F' = γ ⊙ F + β to obtain the multi-modal fusion feature F', wherein the transformation is used to enhance or suppress specific visual feature channel responses based on the pulse-condition prior.
- 7. A double-mask multi-modal tongue image analysis system fusing pulse-condition priors, characterized by comprising a data acquisition module, a feature extraction module, a double-mask generation module, an edge-enhancement module, a multi-modal fusion module, an output module and a control module, wherein the data acquisition module is used for acquiring tongue images and numerical pulse-condition feature vectors, and wherein the system is configured to perform the method of any one of claims 1 to 6.
- 8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 6 when executing the program.
- 9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method of any one of claims 1 to 6.
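The channel-wise FiLM modulation of claim 6 can be illustrated with a minimal NumPy sketch. The layer sizes, the tanh nonlinearity, the identity-centered scaling, and all names below (`FiLMModulator`, `pulse_dim`, `hidden`) are illustrative assumptions; the claims fix only the overall form F' = γ ⊙ F + β with (γ, β) produced by a multi-layer perceptron from the 5-D pulse vector.

```python
import numpy as np

rng = np.random.default_rng(0)

class FiLMModulator:
    """Sketch of the pulse-condition-guided FiLM step: a small MLP maps the
    5-D numerical pulse-condition vector to a per-channel scale (gamma) and
    offset (beta), applied as F' = gamma * F + beta."""

    def __init__(self, pulse_dim=5, channels=64, hidden=32):
        # Random weights stand in for trained MLP parameters.
        self.w1 = rng.standard_normal((pulse_dim, hidden)) * 0.1
        self.w2 = rng.standard_normal((hidden, 2 * channels)) * 0.1
        self.channels = channels

    def __call__(self, pulse_vec, feat):
        # pulse_vec: (pulse_dim,); feat: (channels, H, W) visual fusion feature
        h = np.tanh(pulse_vec @ self.w1)             # nonlinear encoding (MLP)
        gamma_beta = h @ self.w2
        gamma = 1.0 + gamma_beta[:self.channels]     # scale, centered at identity
        beta = gamma_beta[self.channels:]
        # Channel-wise affine transform: enhances or suppresses channel responses.
        return gamma[:, None, None] * feat + beta[:, None, None]

mod = FiLMModulator()
feat = rng.standard_normal((64, 8, 8))
out = mod(np.array([0.2, 0.0, 0.7, 0.1, 0.0]), feat)  # shape (64, 8, 8)
```

Centering gamma at 1 means a zero pulse encoding leaves the visual features unchanged, which is a common stabilizing choice for FiLM-style conditioning rather than something the claims prescribe.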
Description
Double-mask multi-modal tongue image analysis method and system fusing pulse-condition priors

Technical Field

The invention relates to the technical field of image processing, and in particular to a tongue image segmentation and classification method based on a hybrid convolutional neural network and Transformer architecture.

Background

Limitations of traditional Chinese medicine diagnosis: tongue manifestation is a core element of inspection diagnosis in traditional Chinese medicine and an important basis for judging the abundance or insufficiency of qi and blood in the viscera. For a long time, tongue diagnosis has mainly depended on the visual observation and clinical experience of physicians. However, this mode is subjective and uncertain: (1) Lack of standardization: different physicians may describe the same tongue with cognitive deviations, for example "light red" versus "dark red". (2) Environment dependence: changes in light intensity and color temperature can affect a physician's judgment of tongue color and coating color. (3) Difficulty of inheritance: identification of subtle abnormal morphological features often depends on accumulated physician experience, which is difficult to quantify and pass on. State of the art: computer-aided tongue diagnosis is an important research direction for objectifying traditional Chinese medicine diagnosis. Currently, mainstream automatic tongue image methods are mainly based on deep convolutional neural network (CNN) or vision Transformer (ViT) architectures and perform semantic segmentation and classification on acquired tongue images. However, the prior art still has significant drawbacks in handling complex abnormal morphological features. First, there is a spatial-constraint conflict between tongue segmentation and feature extraction.
The tooth-mark feature of the tongue appears as concavities at the tongue edge and belongs to the high-frequency edge features. Traditional semantic segmentation networks tend to generate a smooth mask closely fitted to the tongue body, so that tooth-mark features in the edge concavities are filtered out as background; yet if the mask range is simply enlarged to retain the tooth marks, background noise such as lips, teeth and oral shadows is introduced into the feature extraction region, interfering with the subsequent classification of tongue color and coating color. Second, the downsampling mechanism of deep neural networks causes loss of micro-textures. To obtain global semantic information, existing models typically downsample feature maps multiple times. Although this process preserves the macro structure of the image, high-frequency spatial details such as microcracks and prickles can be filtered out, reducing the model's detection sensitivity to small pathological targets. Furthermore, a single visual modality can hardly distinguish patterns that are visually similar but pathologically different in essence. In traditional Chinese medicine diagnosis there are complex syndromes such as "true cold with false heat", which are easily misjudged when relying on image features alone. Existing multi-modal fusion methods usually perform only simple score fusion at the decision layer; they lack a mechanism for dynamically weighting and correcting visual feature channels with non-visual data (such as pulse conditions) during the feature extraction stage, and thus cannot effectively exploit physiological prior knowledge to improve classification accuracy.
Disclosure of Invention

To solve the above technical problems, the invention provides a double-mask multi-modal tongue image analysis method fusing pulse-condition priors, which mainly comprises the following steps. First, data acquisition and vectorization: a tongue image of the object to be detected and a corresponding numerical pulse-condition feature vector are acquired; the pulse-condition feature vector is obtained by extracting keywords from the pulse-condition description text and mapping them into a preset multidimensional orthogonal feature space. Second, dual-stream feature extraction: a multi-modal fusion backbone network comprising a convolutional neural network branch and a Transformer branch is constructed; the convolutional branch extracts shallow high-resolution features and middle-layer semantic features of the tongue image, while the Transformer branch extracts global context features based on a feature injection mechanism. Third, double-mask feature decoupling: based on the segmentation probability map output by the network, an inner mask and an outer mask are constructed respectively. Specifically, the probability map is binarized and then subjected to a morphological erosion operation
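The double-mask construction and Scharr edge enhancement described in the claims can be sketched as follows. The threshold, erosion/dilation iteration counts, the use of SciPy morphology, and the function names (`make_masks`, `edge_enhanced`) are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation, convolve

def make_masks(prob_map, thresh=0.5, erode_iter=2, dilate_iter=6):
    """Inner mask: binarize the segmentation probability map, then erode,
    shielding the oral-cavity background outside the tongue contour.
    Outer mask: binarize then dilate, extending outward from the tongue body
    to retain edge concavities such as tooth marks."""
    binary = prob_map > thresh
    inner = binary_erosion(binary, iterations=erode_iter)
    outer = binary_dilation(binary, iterations=dilate_iter)
    return inner, outer

# Scharr kernels for horizontal and vertical gradients
SCHARR_X = np.array([[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]], dtype=float)
SCHARR_Y = SCHARR_X.T

def edge_enhanced(feature, outer_mask):
    """Gradient magnitude of a single-channel feature map, multiplied
    element-wise by the outer mask so background noise outside it is zeroed."""
    gx = convolve(feature, SCHARR_X, mode="nearest")
    gy = convolve(feature, SCHARR_Y, mode="nearest")
    return np.hypot(gx, gy) * outer_mask
```

Following claim 5, the purified internal features would then be `feature * inner`, concatenated along the channel dimension with the edge-enhanced map to form the visual fusion features.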