CN-116467675-B - Viscera attribute coding method and system integrating multi-modal characteristics
Abstract
The invention discloses a method and a system for encoding visceral-organ attributes by fusing multi-modal features. The system comprises a data acquisition module, a data processing module, a multi-modal fusion feature encoding model, and a model construction and training module. Tongue images and patient sounds are acquired and labeled with the corresponding visceral-organ attribute tags, and each modality is preprocessed separately. A deep neural network model extracts the individual features of the tongue-image modality and of the sound modality; these features are fused with the consistency and complementarity of the representation spaces as constraints, and supervised learning with the organ category tags and organ attribute tags embeds prior guiding knowledge of the organ attributes, yielding the multi-modal fusion feature encoding model. Newly acquired tongue images and patient sounds are processed and fed to this model to obtain the corresponding organ attribute tags, improving the accuracy and objectivity of visceral-organ attribute encoding.
Inventors
- CHEN JIAWEI
- WEN GUIHUA
Assignees
- South China University of Technology (华南理工大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2023-04-17
Claims (9)
- 1. A method for encoding visceral-organ attributes by fusing multi-modal features, characterized by comprising the following steps: S1, collecting tongue images and patient sounds, and labeling them to obtain the visceral-organ attribute tags corresponding to the tongue images and sounds, wherein the tags comprise visceral-organ category tags and the organ attribute tags corresponding to each organ; S2, performing data processing on the tongue-image data and the patient-sound data respectively to obtain processed batches of tongue images and sound spectrograms; S3, taking the processed batch of tongue images as input image data and the converted batch of spectrograms as input sound data, and extracting the individual features of the tongue-image modality and of the sound modality respectively with a deep neural network model; S4, taking the consistency and complementarity of the representation spaces as constraints, performing multi-modal feature fusion on the individual features of the tongue-image modality and of the sound modality, and performing supervised learning with the organ category tags and organ attribute tags so as to embed prior guiding knowledge of the organ attributes and obtain a multi-modal fusion feature coding model of the organ attributes with embedded prior knowledge; the specific content of step S4 is as follows: S41, computing the Euclidean distance between the individual features of the tongue-image modality and of the sound modality in Euclidean space, computing the hyperbolic distance between them in hyperbolic space, and computing the cosine similarity between the Euclidean distance and the hyperbolic distance, which represents the consistency of the representation spaces; S42, fusing the individual features of the tongue-image modality and of the sound modality with a cross-modal bridging fusion strategy, and outputting the organ attribute features through a Sigmoid activation function; S43, mapping the output organ attribute features to Euclidean space and hyperbolic space respectively, and computing the cross-entropy loss in each representation space, which represents the complementarity of the representation spaces; S44, combining the consistency and complementarity of the representation spaces to obtain the loss function of the organ attributes, and updating the model parameters against this loss function over repeated training iterations to obtain the multi-modal fusion feature coding model.
- 2. The method for encoding visceral-organ attributes by fusing multi-modal features according to claim 1, further comprising: S5, collecting tongue images and patient sounds, performing data processing on the tongue-image data and the patient-sound data respectively to obtain processed batches of tongue images and sound spectrograms, and inputting them into the multi-modal fusion feature coding model of the organ attributes with embedded prior knowledge to obtain the visceral-organ attribute tags corresponding to the tongue images and sounds.
- 3. The method for encoding visceral-organ attributes by fusing multi-modal features according to claim 1, wherein the specific content of step S2 includes: performing tongue-coating target detection and target-region cropping on the tongue-image data with a target detection model, enlarging the image by bilinear interpolation, randomly cropping it back to the original size to obtain an output image copy of the same size, horizontally flipping the output image copy, and normalizing the red, green and blue color channels of the image respectively; and denoising the audio with a speech denoising model, removing silent frames with an audio and music signal processing tool, randomly intercepting a sound segment, applying pre-emphasis, framing and windowing, and converting the time-domain sound signal into a spectrogram through a time-frequency transform.
- 4. The method for encoding visceral-organ attributes by fusing multi-modal features according to claim 1, wherein the deep neural network model in S3 includes a model combining a convolutional neural network with an MLP-like network and a model combining a recurrent neural network with an MLP-like network; the processed batch of tongue images is taken as input image data, and the individual features of the image data are extracted with the model combining a convolutional neural network with an MLP-like network; the batch of spectrograms is taken as input sound data, and the individual features of the sound data are extracted with the model combining a recurrent neural network with an MLP-like network.
- 5. The method for encoding visceral-organ attributes by fusing multi-modal features according to claim 4, wherein the model combining a convolutional neural network with an MLP-like network comprises a plurality of convolutional layers, normalization layers, downsampling layers and fully connected layers; the model combining a recurrent neural network with an MLP-like network comprises a plurality of recurrent units, normalization layers, downsampling layers and fully connected layers.
- 6. The method for encoding visceral-organ attributes by fusing multi-modal features according to claim 1, wherein the specific content of the representation-space consistency in S41 is: let $f_E^t$ and $f_E^s$ be the mappings of the individual features of the tongue-image modality and of the sound modality in Euclidean space, and $f_H^t$ and $f_H^s$ their mappings in hyperbolic space; the Euclidean distance metric is $d_E = \lVert f_E^t - f_E^s \rVert_2$; the hyperbolic distance metric is $d_H = \frac{1}{\sqrt{c}}\,\operatorname{arcosh}\!\Bigl(1 + \frac{2c\,\lVert f_H^t - f_H^s \rVert^2}{(1 - c\lVert f_H^t\rVert^2)(1 - c\lVert f_H^s\rVert^2)}\Bigr)$, wherein $c > 0$ is the constant space curvature; the consistency is measured by the cosine distance loss $L_{con} = 1 - \cos(d_E, d_H)$.
- 7. The method for encoding visceral-organ attributes by fusing multi-modal features according to claim 6, wherein the specific content of the representation-space complementarity is: let $\hat{y}_E$ and $y_E$ be the mappings of the organ attribute prediction and of the organ attribute label in Euclidean space, and $\hat{y}_H$ and $y_H$ their mappings in hyperbolic space; the Euclidean-space distance similarity is measured by the cross-entropy loss $L_{CE} = -\sum_i y_{E,i}\log \hat{y}_{E,i}$; the hyperbolic-space structural similarity is measured by the hyperbolic distance $L_H = \frac{1}{\sqrt{c}}\,\operatorname{arcosh}\!\Bigl(1 + \frac{2c\,\lVert \hat{y}_H - y_H \rVert^2}{(1 - c\lVert \hat{y}_H\rVert^2)(1 - c\lVert y_H\rVert^2)}\Bigr)$, wherein $c > 0$ is the constant space curvature.
- 8. The method for encoding visceral-organ attributes by fusing multi-modal features according to claim 7, wherein the specific content of S44 is: the consistency constraint of the modal individual features across Euclidean space and hyperbolic space is $L_{con}$; the complementarity constraint across Euclidean space and hyperbolic space is $L_{comp} = L_{CE} + L_H$; the loss function of the multi-modal fusion feature coding model is $L = w_1 L_{con} + w_2 L_{CE} + w_3 L_H$, wherein $w_1$, $w_2$ and $w_3$ are the weights of the respective sub-terms.
- 9. A system for encoding visceral-organ attributes by fusing multi-modal features, based on the method for encoding visceral-organ attributes by fusing multi-modal features according to any one of claims 1-8, characterized by comprising a data acquisition module, a data processing module, a multi-modal fusion feature coding model and a model construction and training module, the model construction and training module comprising a labeling unit, a feature extraction unit, a modal feature fusion unit and a supervised learning unit; the data acquisition module is configured to acquire tongue images and patient sounds; the data processing module is configured to perform data processing on the tongue-image data and the patient-sound data respectively to obtain processed batches of tongue images and sound spectrograms; the multi-modal fusion feature coding model is configured to obtain the visceral-organ attribute tags corresponding to the tongue images and sounds from the processed batches of tongue images and sound spectrograms; the model construction and training module is configured to construct and train the multi-modal fusion feature coding model; the labeling unit is configured to label and obtain the visceral-organ attribute tags corresponding to the tongue images and sounds, wherein the tags comprise visceral-organ category tags and the organ attribute tags corresponding to each organ; the feature extraction unit is configured to take the processed batch of tongue images as input image data and the converted batch of spectrograms as input sound data, and to extract the individual features of the tongue-image modality and of the sound modality respectively with the deep neural network model; the modal feature fusion unit is configured to fuse the individual features of the tongue-image modality and of the sound modality with the consistency and complementarity of the representation spaces as constraints; and the supervised learning unit performs supervised learning with the organ category tags and organ attribute tags so as to embed prior guiding knowledge of the organ attributes and obtain the multi-modal fusion feature coding model of the organ attributes with embedded prior knowledge.
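The sound branch of claim 3 (pre-emphasis, framing, windowing, time-frequency transform) can be sketched in plain NumPy. The frame length, hop size and pre-emphasis coefficient below are common defaults, not values taken from the patent, and the denoising and silent-frame-removal steps are assumed to have been done upstream:

```python
import numpy as np

def sound_to_spectrogram(signal, frame_len=400, hop=160, alpha=0.97):
    """Sketch of the claim-3 sound pipeline: pre-emphasis, framing,
    Hamming windowing, then a magnitude spectrogram via the FFT.
    Parameter values are illustrative defaults, not from the patent."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames of frame_len samples, hop apart
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Time-frequency transform: one real FFT per windowed frame
    return np.abs(np.fft.rfft(frames, axis=1))
```

For one second of 16 kHz audio this yields a (frames x frequency-bins) array that can be batched as the "converted spectrograms" fed to the recurrent-network branch of claim 4.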
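The loss construction of claims 6-8 can likewise be sketched numerically. The exact formulas appear only as images in the source, so the sketch below assumes standard forms: the usual Poincare-ball distance for the hyperbolic metric, a cosine distance over batches of Euclidean and hyperbolic distances for the consistency term (S41), binary cross-entropy per representation space for the complementarity term (per S43), and illustrative weights for the combined loss of S44:

```python
import numpy as np

def euclidean_dist(u, v):
    """Euclidean distance between feature vectors (claim 6 metric)."""
    return np.linalg.norm(u - v, axis=-1)

def poincare_dist(u, v, c=1.0):
    """Assumed standard Poincare-ball distance with curvature c;
    inputs must lie inside the ball (c * ||x||^2 < 1)."""
    sq = lambda x: np.sum(x * x, axis=-1)
    num = 2.0 * c * sq(u - v)
    den = (1.0 - c * sq(u)) * (1.0 - c * sq(v))
    return np.arccosh(1.0 + num / den) / np.sqrt(c)

def cosine_consistency_loss(d_e, d_h):
    """S41 consistency: cosine distance between the batch of Euclidean
    distances and the batch of hyperbolic distances (range [0, 2])."""
    cos = np.dot(d_e, d_h) / (np.linalg.norm(d_e) * np.linalg.norm(d_h) + 1e-12)
    return 1.0 - cos

def cross_entropy(p, y, eps=1e-12):
    """S43 complementarity: binary cross-entropy between Sigmoid
    attribute predictions p and attribute labels y in one space."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def total_loss(d_e, d_h, p_e, p_h, y, w=(1.0, 1.0, 1.0)):
    """S44 combined loss: weighted sum of the consistency term and the
    per-space complementarity terms. Weights are illustrative."""
    return (w[0] * cosine_consistency_loss(d_e, d_h)
            + w[1] * cross_entropy(p_e, y)
            + w[2] * cross_entropy(p_h, y))
```

In training, `d_e`/`d_h` would come from the tongue-image and sound individual features in each space, and `p_e`/`p_h` from the fused attribute features mapped into each space; the model parameters are updated against `total_loss` over repeated iterations as in S44.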
Description
Viscera attribute coding method and system integrating multi-modal characteristics

Technical Field

The invention relates to the technical field of machine learning, and in particular to a method and a system for encoding visceral-organ attributes by fusing multi-modal features.

Background

Artificial intelligence has developed rapidly in recent years; in particular, deep-neural-network machine learning built on big data has advanced greatly and is widely applied. Western medicine has embraced artificial intelligence and achieved a series of breakthroughs, such as deep-learning-based image diagnosis and automated medical-record processing, which have greatly improved the efficiency and accuracy of medical diagnosis; applying deep learning to medical diagnosis has become a trend. The application of traditional Chinese medicine (TCM) in artificial intelligence is also attracting increasing attention. TCM diagnosis rests on a holistic concept and dialectical thinking, and treatment is determined dialectically through the four diagnostic methods of inspection, listening and smelling, inquiry, and palpation: the patient's condition is judged comprehensively from the face, eyes, tongue, voice, pulse and other signs. Tongue images and voice belong to the inspection and the listening examinations among these methods, and also serve as the basis for determining the internal organs and their attributes.
However, deep-learning research on tongue-image and voice multi-modal organ-attribute coding still has several shortcomings. On the one hand, current diagnosis models encode the organs and their attributes with general-purpose machine learning, and these algorithms cannot account for the inherent attributes of TCM features: TCM classifies and diagnoses the internal organs differently from Western medicine, so data and algorithms need appropriate adjustment and optimization, for example to handle tongue morphology, color and coating, and the pitch, tone, and time-domain and frequency-domain characteristics of the voice, which traditional machine learning algorithms cannot process. On the other hand, according to TCM thinking and dialectical treatment, combining multiple examinations is a diagnosis method specific to TCM, yet current organ-attribute diagnosis does not consider combining multiple modalities; although existing single-modality coding and diagnosis models perform well, they exhibit a degree of subjectivity and inconsistency, and comprehensive diagnosis from different angles is required. Therefore, from the viewpoint of multi-modal diagnosis, combining tongue-image data and voice data to encode visceral-organ attributes is a problem that those skilled in the art need to solve.

Disclosure of Invention

In view of the above, the present invention provides a method and a system for encoding visceral-organ attributes by fusing multi-modal features to solve the problems mentioned in the background art.
To achieve the above purpose, the present invention adopts the following technical scheme: a method for encoding visceral-organ attributes by fusing multi-modal features, comprising the following steps: S1, collecting tongue images and patient sounds, and labeling them to obtain the visceral-organ attribute tags corresponding to the tongue images and sounds, wherein the tags comprise visceral-organ category tags and the organ attribute tags corresponding to each organ; S2, performing data processing on the tongue-image data and the patient-sound data respectively to obtain processed batches of tongue images and sound spectrograms; S3, taking the processed batch of tongue images as input image data and the converted batch of spectrograms as input sound data, and extracting the individual features of the tongue-image modality and of the sound modality respectively with a deep neural network model; S4, taking the consistency and complementarity of the representation spaces as constraints, fusing the individual features of the tongue-image data and of the sound data, and performing supervised learning with the organ category tags and organ attribute tags so as to embed prior guiding knowledge of the organ attributes and obtain a multi-modal fusion feature coding model of the internal