CN-121982396-A - Temporomandibular joint image recognition method based on triple decoupling report learning
Abstract
The invention discloses a temporomandibular joint dual-position image recognition method based on three-level decoupled report learning. An attention-based SAF module derives a global visual representation that captures fine dual-position dynamics from the positional change between closed-mouth and open-mouth features. An LLM-based decomposition module, LID, is first prompted with task-related prior knowledge and then, guided by structured queries, decomposes the original report into three-level components. A globally guided three-level alignment (GTA) pre-training strategy uses the three-level representations from SAF and LID to enforce hierarchical visual-language alignment between corresponding levels. Operating in a supervised manner, the GTA strategy not only aligns features precisely with the semantics of the ground-truth label classes, but also combines local alignment with global context through cross-level interaction weights. The invention supports report-assisted training at the learning stage and report-free inference at the deployment stage.
Inventors
- WANG YAN
- XU FEI
- YANG WENBIN
- JIANG NAN
- WANG PENG
Assignees
- Sichuan University (四川大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-22
Claims (10)
- 1. A temporomandibular joint dual-position image recognition method based on three-level decoupled report learning, characterized by comprising the following steps: for the image branch, acquiring temporomandibular joint image data comprising an open-mouth image and a closed-mouth image, feeding them into an image encoder to generate position-specific features F_c and F_o respectively, modeling inter-position spatial dynamics through an SAF fusion module to generate a global visual feature F_g, and applying global average pooling to F_c, F_o and F_g to generate compact embedding vectors f_c, f_o and f_g; meanwhile, for the text branch, processing the original diagnostic report T_raw through an information decomposition module LID based on a large language model (LLM): after pre-loading TDD-related prior knowledge, the LID uses a query template to guide the LLM to decompose T_raw into three components, namely a closed-mouth description T_c, an open-mouth description T_o and a global description T_g, which are then combined with a learnable CLS token and an attention-based projection layer in a text encoder to generate the corresponding text embedding vectors t_c, t_o and t_g; finally, based on the three-level embeddings constructed from the two branches, pre-training with a globally guided three-level alignment strategy, in which the globally guided three-level alignment (GTA) module executes a supervised visual-language pairing process and combines local alignment with global context through cross-level interaction weights; at the fine-tuning and inference stages, the three-level visual embeddings are fed into a joint classifier to generate a prediction P supervised by the label Y.
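As an illustrative sketch of the image-branch data flow in claim 1 (the encoder, SAF fusion, and pooling below are simple stand-ins; all function names and tensor shapes are assumptions, not the patent's implementation):

```python
import numpy as np

def image_encoder(volume_name):
    # Stand-in encoder: returns a (C, D, H, W) feature map for one MRI volume.
    rng = np.random.default_rng(abs(hash(volume_name)) % (2**32))
    return rng.standard_normal((8, 4, 4, 4))

def saf_fuse(F_c, F_o):
    # Placeholder for the SAF module: here it simply averages the two maps.
    return 0.5 * (F_c + F_o)

def gap(F):
    # Global average pooling over the spatial dimensions -> compact embedding.
    return F.mean(axis=(1, 2, 3))

# Image branch: closed- and open-mouth volumes -> three-level visual embeddings.
F_c = image_encoder("closed.nii")
F_o = image_encoder("open.nii")
F_g = saf_fuse(F_c, F_o)
f_c, f_o, f_g = gap(F_c), gap(F_o), gap(F_g)

print(f_c.shape, f_o.shape, f_g.shape)
```

The three pooled vectors f_c, f_o and f_g then pair with the text embeddings t_c, t_o and t_g during alignment pre-training.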
- 2. The temporomandibular joint dual-position image recognition method based on three-level decoupled report learning according to claim 1, wherein a sample is obtained for each patient, denoted as (I_o, I_c, T_raw, Y), wherein I_o and I_c represent the open-mouth and closed-mouth MRI images of the target temporomandibular joint respectively, T_raw is a single-side raw diagnostic report containing descriptions of the differences and common features between I_o and I_c, and Y corresponds to one of three recognition categories, namely normal disc position (NDP), disc displacement with reduction (DDWR) or disc displacement without reduction (DDWoR).
- 3. The temporomandibular joint dual-position image recognition method based on three-level decoupled report learning according to claim 1, wherein the SAF fusion module comprises a displacement-sensing part and an adaptive fusion part.
- 4. The temporomandibular joint dual-position image recognition method based on three-level decoupled report learning according to claim 3, wherein the displacement-sensing part comprises: to encode the spatial displacement between the two positions and capture fine-grained displacement information, computing two complementary displacement representations from F_o and F_c, one based on the element-wise product, F_mul = F_o ⊙ F_c, for modeling structural overlap correlations, and the other based on the element-wise difference, F_diff = F_o − F_c, for modeling positional displacement; concatenating the two representations along the channel dimension to form a global displacement feature, which is then further refined by two 3D convolution layers with 1×1×1 filters, ultimately forming a shared displacement-aware query: Q_SA = Conv_{1×1×1}(Conv_{1×1×1}([F_mul; F_diff])); next, mapping F_c and F_o into key and value matrices by linear projection, yielding K_c and V_c for the closed-mouth state and K_o and V_o for the open-mouth state; subsequently, modeling the interactions between the displacement-aware query Q_SA and the per-position K/V matrices using multi-head cross-attention (MHCA), generating features F_c' and F_o' that fuse position-specific details with the shared cross-position displacement context.
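A minimal numpy sketch of the displacement-aware query construction in claim 4, assuming a (C, D, H, W) feature layout; treating each 1×1×1 3D convolution as a per-voxel linear map over channels (the cross-attention step and all weight shapes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, H, W = 8, 4, 4, 4
F_c = rng.standard_normal((C, D, H, W))   # closed-mouth features
F_o = rng.standard_normal((C, D, H, W))   # open-mouth features

# Two complementary displacement representations.
F_mul = F_o * F_c                          # element-wise product: structural overlap
F_diff = F_o - F_c                         # element-wise difference: positional shift
F_disp = np.concatenate([F_mul, F_diff], axis=0)   # channel concat -> (2C, D, H, W)

def conv1x1x1(x, weight):
    # A 1x1x1 3D convolution is a per-voxel linear map over the channel axis.
    return np.einsum('oc,cdhw->odhw', weight, x)

W1 = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C)
W2 = rng.standard_normal((C, C)) / np.sqrt(C)
Q_SA = conv1x1x1(conv1x1x1(F_disp, W1), W2)   # shared displacement-aware query

print(Q_SA.shape)
```

In the full module, Q_SA would serve as the query in multi-head cross-attention against the (K_c, V_c) and (K_o, V_o) projections of each position.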
- 5. The temporomandibular joint dual-position image recognition method based on three-level decoupled report learning according to claim 4, wherein the adaptive fusion part comprises: fusing the dual-position features F_c' and F_o' with weighted channel attention, in which the attention weights are computed from the corresponding per-position volume features, channels contributing more to global context perception receive higher weights, and the weights are applied by element-wise multiplication, so as to generate feature maps with finer and richer global characterization; then, concatenating the two weighted feature maps and processing them with two convolution layers to reduce the channel dimension, finally forming a unified representation F_g: F_g = Conv(Conv([w_c ⊙ F_c'; w_o ⊙ F_o'])), wherein w_o and w_c are learnable weights.
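A hedged numpy sketch of the adaptive fusion in claims 5 and 6: SE-style channel descriptors produce per-channel logits for each position, a Softmax across the two positions balances their contributions, and a channel-reducing convolution yields F_g (all weight shapes and the reduction to a single linear map are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
C, D, H, W = 8, 4, 4, 4
F_c2 = rng.standard_normal((C, D, H, W))  # F_c' after cross-attention
F_o2 = rng.standard_normal((C, D, H, W))  # F_o' after cross-attention

def se_logits(F, W_down, W_up):
    # Squeeze-and-Excitation style descriptor: GAP -> two FC layers with ReLU.
    z = F.mean(axis=(1, 2, 3))                 # squeeze: (C,)
    h = np.maximum(W_down @ z, 0.0)            # excitation
    return W_up @ h                            # per-channel logits (C,)

W_down = rng.standard_normal((C // 2, C))
W_up = rng.standard_normal((C, C // 2))
logits = np.stack([se_logits(F_c2, W_down, W_up),
                   se_logits(F_o2, W_down, W_up)])       # (2, C)

# Softmax across the two positions balances their per-channel contributions.
w = np.exp(logits - logits.max(axis=0))
w = w / w.sum(axis=0)
w_c, w_o = w[0][:, None, None, None], w[1][:, None, None, None]

fused = np.concatenate([w_c * F_c2, w_o * F_o2], axis=0)   # (2C, D, H, W)
W_reduce = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C)
F_g = np.einsum('oc,cdhw->odhw', W_reduce, fused)          # unified (C, D, H, W)
print(F_g.shape)
```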
- 6. The temporomandibular joint dual-position image recognition method based on three-level decoupled report learning according to claim 5, wherein the customized weighted channel attention used to fuse the dual-position features F_c' and F_o' comprises the following steps: implicitly modeling internal channel dependencies using the weighting module W_SE(·) of SE-Net; and readjusting the weights with a Softmax function to balance the contributions of the two positions.
- 7. The temporomandibular joint dual-position image recognition method based on three-level decoupled report learning according to claim 1, wherein the information decomposition module LID based on a large language model operates by having a general-purpose large language model grasp two classes of TDD-related prior knowledge in advance so as to improve its understanding of the task; guided by the configured prior P, the pre-trained large language model decouples the original report, via the query instruction Q, into a prior-informed structured report: (T_c, T_o, T_g) = LLM(T_raw | P, Q); wherein T_o, T_c and T_g represent the open-mouth, closed-mouth and overall-impression descriptions respectively; T_o, T_c and T_g are fed into a frozen text encoder, the token sequence of each level being concatenated with a learnable CLS token, and then passed through a shared projector comprising two learnable self-attention layers, the refined output CLS tokens being the final text embeddings of each level, denoted t_o, t_c and t_g respectively.
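As an illustrative sketch of the LID decomposition step: a hypothetical query template and a parser that splits a structured LLM reply into the three components (T_c, T_o, T_g). The prompt wording, section markers, and the simulated reply are assumptions; no LLM client is called here:

```python
# Hypothetical query template for the LID module; the real prompt wording and
# LLM interface are not specified in the patent.
QUERY_TEMPLATE = (
    "Using the TMJ prior knowledge provided, decompose the report below into\n"
    "three sections, each on its own line:\n"
    "CLOSED: <closed-mouth findings>\n"
    "OPEN: <open-mouth findings>\n"
    "GLOBAL: <overall impression shared across positions>\n"
    "Report: {report}"
)

def parse_llm_output(text):
    """Split a structured LLM reply into (T_c, T_o, T_g)."""
    parts = {"CLOSED": "", "OPEN": "", "GLOBAL": ""}
    for line in text.splitlines():
        for key in parts:
            if line.startswith(key + ":"):
                parts[key] = line.split(":", 1)[1].strip()
    return parts["CLOSED"], parts["OPEN"], parts["GLOBAL"]

# Simulated LLM reply (illustrative content only).
reply = (
    "CLOSED: Disc lies anterior to the condyle in the closed-mouth position.\n"
    "OPEN: Disc recaptures to a normal position on mouth opening.\n"
    "GLOBAL: Findings consistent with disc displacement with reduction."
)
T_c, T_o, T_g = parse_llm_output(reply)
print(T_g)
```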
- 8. The temporomandibular joint dual-position image recognition method based on three-level decoupled report learning according to claim 7, wherein the first class of knowledge covers priors related to the TDD task, including task priors (the inputs and outputs of the model and the definition of TDLIP), disease priors (domain expertise on temporomandibular joint disease), anatomical priors (anatomical morphology and positional information) and imaging priors (the MRI visual appearance of the various temporomandibular joint structures); the second class provides output references and constraints specific to the LID, including a wording specification tailored to the three TDD diagnostic classes as a standardized vocabulary reference for the LLM, and output constraints specifying the format of the structured report, which requires the content to be organized into three separate descriptions: the closed-mouth state, the open-mouth state, and a global impression covering position-shared and position-independent information.
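One way to organize the two classes of prior knowledge described in claim 8 is as a configuration object pre-loaded into the LID prompt. The field names and summary strings below are assumptions for illustration, not the patent's wording:

```python
# Illustrative structure for the two classes of prior knowledge pre-loaded
# into the LID module; keys and values are hypothetical.
LID_PRIORS = {
    "task_priors": {
        "task": "TDD: classify TMJ dual-position MRI into NDP / DDWR / DDWoR",
        "disease": "domain expertise on temporomandibular joint disorders",
        "anatomy": "morphology and relative position of condyle and disc",
        "imaging": "MRI appearance of TMJ structures in closed/open states",
    },
    "output_priors": {
        "wording": "standardized vocabulary for the three TDD diagnostic classes",
        "constraints": "organize content into closed, open and global sections",
    },
}
print(sorted(LID_PRIORS))
```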
- 9. The temporomandibular joint dual-position image recognition method based on three-level decoupled report learning according to claim 1, wherein real label information is used to treat off-diagonal pairs as additional positive samples, thereby ensuring that features from the same class are pulled together within a batch while features from different classes are pushed apart; after the visual-text similarity based on the inner product, Sim_vt(·,·), is computed, a position-global supervised contrastive loss is constructed: L_global = −(1/B) Σ_{i=1}^{B} (1 / Σ_j 1[y_i = y_j]) Σ_{j=1}^{B} 1[y_i = y_j] · log( exp(Sim_vt(f_g^i, t_g^j)) / Σ_{k=1}^{B} exp(Sim_vt(f_g^i, t_g^k)) ); wherein the indicator function 1[y_i = y_j] takes the value 1 or 0 depending on whether the i-th and j-th samples belong to the same class, B is the total number of samples in the batch, the logarithmic term is the cross-entropy over the similarities, Sim_vt(·,·) computes the similarity, f_g^i is the visual embedding of the i-th sample, and t_g^j is the text embedding of the j-th sample.
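A minimal numpy sketch of such a supervised visual-text contrastive loss, assuming Sim_vt is a temperature-scaled inner product over L2-normalized embeddings (the temperature and embedding dimension are assumptions):

```python
import numpy as np

def global_supervised_contrastive(f_g, t_g, labels, tau=0.1):
    """Supervised visual-text contrastive loss over a batch.

    f_g, t_g: (B, d) L2-normalized visual/text embeddings; labels: (B,).
    Off-diagonal pairs sharing a label count as additional positives.
    Returns the per-anchor loss vector (B,).
    """
    sim = f_g @ t_g.T / tau                      # Sim_vt as scaled inner product
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]).astype(float)  # 1[y_i = y_j]
    # Average negative log-likelihood over each anchor's positives.
    return -(pos * log_prob).sum(axis=1) / pos.sum(axis=1)

rng = np.random.default_rng(2)
f = rng.standard_normal((4, 8)); f /= np.linalg.norm(f, axis=1, keepdims=True)
t = rng.standard_normal((4, 8)); t /= np.linalg.norm(t, axis=1, keepdims=True)
y = np.array([0, 0, 1, 2])
loss = global_supervised_contrastive(f, t, y).mean()
print(float(loss))
```

The diagonal (i = j) is always a positive, so the per-anchor denominator is never zero.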
- 10. The temporomandibular joint dual-position image recognition method based on three-level decoupled report learning according to claim 9, wherein a globally guided local alignment mechanism is introduced, defining a weight w_ij that characterizes the global semantic consistency between samples i and j from both the visual and the textual side at the two local positions: w_ij = Sim_v(f_g^i, f_g^j) · Sim_c(t_g^i, t_g^j); wherein Sim_v(·,·) is an L2-distance-based similarity over the visual embeddings and Sim_c(·,·) is the cosine similarity between the text embeddings; the global semantic constraint is thereby propagated into the local alignment process, ensuring that local-level matching results conform to global clinical logic and do not deviate from the overall context; the weights w_ij are then used to construct a position-specific loss for each local level, yielding L_c and L_o; finally, the total pre-training GTA loss L_GTA integrates the global term with the two position terms, L_GTA = L_global + λ(L_c + L_o), wherein λ is a hyper-parameter balancing these terms; at the fine-tuning stage the total loss is L_total = L_cls + β·L_GTA, wherein L_cls is the classification loss, L_GTA is the alignment loss, and β is a hyper-parameter.
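A hedged sketch of the cross-level weighting and total loss in claim 10; the Gaussian kernel used for the L2-distance-based visual similarity and all numeric values are assumptions:

```python
import numpy as np

def consistency_weights(f_g, t_g):
    """Hypothetical cross-level weights w_ij: an L2-distance-based visual
    similarity multiplied by cosine similarity of the text embeddings."""
    d2 = ((f_g[:, None, :] - f_g[None, :, :]) ** 2).sum(-1)   # squared L2 distance
    sim_v = np.exp(-d2)                                        # kernel in (0, 1]
    t_n = t_g / np.linalg.norm(t_g, axis=1, keepdims=True)
    sim_c = t_n @ t_n.T                                        # cosine similarity
    return sim_v * sim_c

def gta_total(loss_global, loss_c, loss_o, lam=0.5):
    # L_GTA = L_global + lambda * (L_c + L_o): global term plus the two
    # position-specific terms, balanced by the hyper-parameter lambda.
    return loss_global + lam * (loss_c + loss_o)

rng = np.random.default_rng(3)
w = consistency_weights(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
print(w.shape, gta_total(1.0, 0.4, 0.6, lam=0.5))
```

Each w_ij could then rescale the corresponding pair term inside the position-specific contrastive losses L_c and L_o.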
Description
Temporomandibular joint image recognition method based on triple decoupling report learning

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a temporomandibular joint dual-position image recognition method based on three-level decoupled report learning.

Background

Temporomandibular joint (TMJ) disease is a common orofacial condition that affects chewing function and oral health and burdens patients' daily lives. The task of identifying articular disc displacement, temporomandibular joint disorder diagnosis (TDD), essentially relies on evaluating dual-state (closed-mouth and open-mouth) MRI images of one side of the mandibular joint. Manual TDD requires the radiologist to compare the spatial relationship of the mandibular condyle and the articular disc across these scans. This process is not only time-consuming but also prone to subjective bias, highlighting the need for efficient automated diagnostic tools. Early automated TDD generally relied on hand-designed features, but because such methods cannot capture complex anatomy, their applicability to MRI scans is generally poor. Benefiting from advances in deep learning (DL), medical imaging diagnostic models have made significant progress. For the TDD task, existing DL-based models generally follow two approaches. The first classifies single-position images individually; although straightforward, its performance and clinical applicability are limited because it deviates from the standard diagnostic procedure.
The second approach classifies the dual-position images with 3D pixel-level labels (e.g., masks of the mandibular condyle or articular disc) to emphasize joint structure. While quite effective, this is neither feasible nor efficient in the clinical TDD workflow, because that workflow focuses on rapid assessment of native MRI scans without deep manual labeling. For practical TDD deployment, it is crucial to develop a framework that can utilize dual-position MRI images, achieving both compliance with standard clinical procedures and accurate classification without complex additional labeling. However, existing solutions toward this goal remain deficient: traditional TDD approaches perform well but rely on costly pixel-level labeling, and two key obstacles make feasible classification difficult. (1) Dual-position images are characterized by inherent position-change dynamics rather than appearance differences, which makes existing multi-view fusion paradigms ill-suited to modeling such fine spatial correlations. (2) The TDD workflow is often accompanied by diagnostic reports containing comprehensive diagnostic insight, but mainstream visual-language approaches either fail to align accurately with the dual-position dynamics or rely heavily on such reports during training, thereby compromising report-free inference capability in real-world scenarios.

Disclosure of Invention

To solve the above problems, the invention provides a temporomandibular joint dual-position image recognition method based on three-level decoupled report learning, which supports not only hierarchical pre-training in a report-assisted manner but also classification without report assistance.
In order to achieve this purpose, the technical scheme adopted by the invention is a temporomandibular joint dual-position image recognition method based on three-level decoupled report learning, comprising the following steps: for the image branch, acquiring temporomandibular joint image data comprising an open-mouth image and a closed-mouth image, feeding them into an image encoder to generate position-specific features F_c and F_o respectively, modeling inter-position spatial dynamics through an SAF fusion module to generate a global visual feature F_g, and applying global average pooling to F_c, F_o and F_g to generate compact embedding vectors f_c, f_o and f_g; meanwhile, for the text branch, processing the original diagnostic report T_raw through an information decomposition module LID based on a large language model: after pre-loading TDD-related prior knowledge, the LID uses a query template to guide the LLM to decompose T_raw into three components, namely a closed-mouth description T_c, an open-mouth description T_o and a global description T_g, which are then combined with a learnable CLS token and an attention-based projection layer in a text encoder to generate the corresponding text embedding vectors t_c, t_o and t_g; finally, based on the three-level embeddings constructed from the two branches, a globally guided three-level alignment strategy is adopted for pre-training, and a supervised visual-language pairing process is executed through a globally guided three-level