CN-121482861-B - Sarcopenia detection method and device based on chain prompting and a multimodal large model

Abstract

The invention discloses a method and a device for detecting sarcopenia based on chain prompting and a multimodal large model, belonging to the technical field of video processing. The method comprises: collecting a video of the patient's movements, dividing each frame into a plurality of patches, and mapping the patches into a feature vector sequence; fusing information within each frame and across different moments to output global visual feature vectors with spatial and temporal context; segmenting the continuous feature sequence into a plurality of semantically coherent action phases; retrieving the corresponding segment prompt texts from a predefined mapping table based on the test type; generating, in a single pass, a natural-language description containing quantitative indicators and action features; outputting a sarcopenia diagnosis together with its detailed reasons; taking the probability value as the confidence of the diagnosis; and outputting, with one click, a complete diagnostic report containing quantitative and qualitative analysis. The method and device greatly improve generality and robustness across scenes and devices and markedly improve the interpretability of model decisions.
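
The patch-splitting step named in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the nested-list frame format, and the patch size are hypothetical, and the learned linear projection that turns flattened patches into embeddings is left out. The BGR-to-RGB reordering and the ImageNet channel statistics follow the preprocessing described in claim 2.

```python
# ImageNet channel statistics (RGB order) commonly used with pre-trained models.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(bgr):
    """Convert one BGR pixel (0-255 ints) to a standardized RGB triple."""
    b, g, r = bgr
    rgb = (r / 255.0, g / 255.0, b / 255.0)
    return [(v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]


def frame_to_patches(frame_bgr, patch_size):
    """Split an H x W frame of BGR pixels into flattened, non-overlapping patches.

    Returns patch vectors in row-major patch order; each vector holds
    patch_size * patch_size * 3 values. The frame is assumed to be already
    cropped and resized so that H and W are multiples of patch_size.
    """
    height, width = len(frame_bgr), len(frame_bgr[0])
    if height % patch_size or width % patch_size:
        raise ValueError("frame size must be a multiple of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            vec = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    vec.extend(normalize_pixel(frame_bgr[y][x]))
            patches.append(vec)
    return patches


# Usage: a 4 x 4 mid-gray frame split into 2 x 2 patches.
frame = [[(128, 128, 128)] * 4 for _ in range(4)]
patches = frame_to_patches(frame, patch_size=2)
print(len(patches), len(patches[0]))  # 4 12
```

In a full pipeline each flattened patch would then be multiplied by a trainable projection matrix, and a classification token would be placed before the patch embeddings of each frame.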

Inventors

XIAO JING
JI YUXUAN
LIU QINGJIE
CAO BINGYAN

Assignees

中国中医科学院西苑医院 (Xiyuan Hospital, China Academy of Chinese Medical Sciences)

Dates

Publication Date: 20260512
Application Date: 20251023

Claims (9)

1. A sarcopenia detection method based on chain prompting and a multimodal large model, characterized by comprising the following steps:
Step S1: collecting a video of the patient's movements under a fixed viewing angle and fixed shooting parameters, decoding the video into frames, cropping each frame, unifying its size, and normalizing its pixels, then dividing each frame into a plurality of patches and mapping the patches into a feature vector sequence;
Step S2: inputting the feature vector sequence into a multi-layer alternating encoding network, fusing information within each frame and across different moments, and outputting global visual feature vectors with spatial and temporal context for subsequent action segmentation;
Step S3: automatically detecting action transition points based on the differences between the global visual feature vectors of adjacent frames, segmenting the continuous feature sequence into a plurality of semantically coherent action phases, and obtaining the index of each sub-phase, thereby providing accurate temporal segments for subsequent type discrimination and description generation;
Step S4: determining a global prompt template set according to the test type to which the current video belongs, then, for the index of each sub-phase, retrieving the corresponding segment prompt text from the global prompt template set based on the test type, and mapping the segment prompt text through a text encoder into a prompt embedding with the same dimension as the global visual feature vector, thereby providing differentiated semantic input for subsequent description generation;
Step S5: for each action segment, concatenating the mean of the segment's global visual feature vectors with the corresponding prompt embedding, inputting the result into a vision-text large model, and generating in a single pass a natural-language description containing quantitative indicators and action features, so as to obtain a description of each segment and provide a complete, interpretable semantic context for subsequent few-shot reasoning;
Step S6: concatenating a small number of labeled examples with the segment descriptions to form a chain-of-thought prompt, inputting it into the vision-text large model, completing multi-step logical reasoning under the guidance of the examples, and directly outputting the sarcopenia diagnosis and its detailed reasons;
wherein step S3 comprises:
Step S31: calculating the Euclidean distance between the global visual feature vectors of each pair of adjacent frames;
Step S32: applying moving-average smoothing to the resulting distance sequence to obtain a smoothed difference sequence;
Step S33: computing the mean and the standard deviation of the difference distribution over the whole video, and using an adaptive threshold based on these statistics and a preset coefficient as the criterion for action switching, so as to segment the continuous frame features into several action phases.
2. The sarcopenia detection method based on chain prompting and a multimodal large model according to claim 1, wherein in step S1 the action video includes an action video corresponding to a sit-to-stand test, a single-leg stance test, or a walking test; and/or step S1 comprises: rearranging the pixel-normalized image from BGR channel order to RGB channel order, then standardizing the pixel values of each channel using the channel means and standard deviations of the ImageNet pre-trained model, so that the data follow a zero-mean, unit-variance distribution; and/or step S1 comprises: flattening each patch and projecting it through a learnable linear layer into an embedding space of preset dimension to obtain the patch embedding of each patch; and inserting a trainable classification token before the patch embeddings of each frame to obtain the patch token sequence corresponding to the action video, which is the feature vector sequence.
3. The sarcopenia detection method based on chain prompting and a multimodal large model according to claim 2, wherein step S2 comprises: Step S21, arranging the patch embeddings and classification tokens of all frames in temporal order to form an initial sequence; Step S22, in the encoding process of each layer, first applying layer normalization to the output of the previous layer to eliminate magnitude differences among dimensions, then computing association weights among the tokens through multi-head self-attention while retaining the original input information, to obtain an intermediate vector; Step S23, applying layer normalization to the intermediate vector again to match the input distribution of the subsequent feed-forward network, then passing it through a position-wise feed-forward network composed of two fully connected layers with a nonlinear activation and fusing the result with the input through a residual connection, to generate the output of the layer; and Step S24, repeating steps S22 to S23 until all encoding layers have been computed, obtaining the final output, and extracting from it the classification token corresponding to each frame as the global visual feature vector.
4. The sarcopenia detection method based on chain prompting and a multimodal large model according to claim 1, wherein in step S33 all frames satisfying […] or […] are regarded as phase boundaries and, together with the first and last frames, constitute the complete set of segment boundary indices.
5. The sarcopenia detection method based on chain prompting and a multimodal large model according to claim 4, wherein in step S33, if the number of frames between two adjacent phase boundaries is less than a preset lower limit, the two corresponding segments are merged.
6. The sarcopenia detection method based on chain prompting and a multimodal large model according to claim 2, wherein in step S4: for the sit-to-stand test, the corresponding segment prompt texts in the global prompt template set include descriptions relating to "sit-to-stand speed", "postural stability", and "upper-limb assistance"; for the single-leg stance test, the corresponding segment prompt texts include descriptions relating to "preparatory movement speed", "upper-limb stability", "balance time", "dynamic change of the center of gravity", and "body sway amplitude"; for the walking test, the corresponding segment prompt texts include descriptions relating to "gait features", "trunk stability", "upper-limb swing", and "stride frequency and stride length".
7. The sarcopenia detection method based on chain prompting and a multimodal large model according to any one of claims 1-6, further comprising: Step S7, applying exponential normalization (softmax) to the model's raw scores for the two classes, normal and sarcopenia, to obtain a probability distribution, selecting the class with the higher probability as the final diagnosis, and directly using that probability value as the confidence of the diagnosis, thereby providing a quantified confidence level for the report.
8. The sarcopenia detection method based on chain prompting and a multimodal large model according to claim 7, further comprising: Step S8, integrating the segment descriptions from step S5, the reasoning steps and diagnosis result from step S6, and the confidence from step S7 into a structured report, and finally outputting, with one click, a complete diagnostic report containing quantitative and qualitative analysis.
9. A sarcopenia detection device based on chain prompting and a multimodal large model, comprising:
a video acquisition and preprocessing module, configured to collect a video of the patient's movements under a fixed viewing angle and fixed shooting parameters, decode the video into frames, crop each frame, unify its size, and normalize its pixels, then divide each frame into a plurality of patches and map the patches into a feature vector sequence;
a visual feature encoding module, configured to input the feature vector sequence into a multi-layer alternating encoding network, fuse information within each frame and across different moments, and output global visual feature vectors with spatial and temporal context for subsequent action segmentation;
an implicit action segmentation module, configured to automatically detect action transition points based on the differences between the global visual feature vectors of adjacent frames, segment the continuous feature sequence into a plurality of semantically coherent action phases, and obtain the index of each sub-phase, thereby providing accurate temporal segments for subsequent type discrimination and description generation;
a test type and segment prompt mapping module, configured to determine a global prompt template set according to the test type to which the current video belongs, then, for the index of each sub-phase, retrieve the corresponding segment prompt text from the global prompt template set based on the test type, and map the segment prompt text through a text encoder into a prompt embedding with the same dimension as the global visual feature vector, thereby providing differentiated semantic input for subsequent description generation;
a segment description generation module, configured to concatenate, for each action segment, the mean of the segment's global visual feature vectors with the corresponding prompt embedding, input the result into a vision-text large model, and generate in a single pass a natural-language description containing quantitative indicators and action features, so as to obtain a description of each segment and provide a complete, interpretable semantic context for subsequent few-shot reasoning;
a few-shot reasoning module, configured to concatenate a small number of labeled examples with the segment descriptions to form a chain-of-thought prompt, input it into the vision-text large model, complete multi-step logical reasoning under the guidance of the examples, and directly output the sarcopenia diagnosis and its detailed reasons;
wherein the implicit action segmentation module is further configured to: calculate the Euclidean distance between the global visual feature vectors of each pair of adjacent frames; apply moving-average smoothing to the resulting distance sequence to obtain a smoothed difference sequence; and compute the mean and the standard deviation of the difference distribution over the whole video and use an adaptive threshold based on these statistics and a preset coefficient as the criterion for action switching, so as to segment the continuous frame features into several action phases.
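
The segmentation logic of steps S31 to S33, together with the short-segment merge of claim 5, can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation. The threshold formula is not legible in this text, so the form tau = mu + k * sigma is an assumption. The exact boundary conditions of claim 4 are likewise not recoverable here, so the sketch simply treats exceeding the threshold as a transition and collapses each run of consecutive exceedances to its middle index. All function and parameter names (`segment_boundaries`, `k`, `window`, `min_len`) are hypothetical.

```python
import math
import statistics


def adjacent_distances(features):
    """Euclidean distance between each pair of consecutive frame vectors (S31)."""
    return [math.dist(a, b) for a, b in zip(features, features[1:])]


def moving_average(values, window):
    """Centered moving average; the window shrinks at the sequence edges (S32)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def segment_boundaries(features, k=1.0, window=3, min_len=2):
    """Return sorted frame indices that delimit the action phases (S33)."""
    if len(features) < 2:
        raise ValueError("need at least two frames")
    smoothed = moving_average(adjacent_distances(features), window)
    mu = statistics.fmean(smoothed)
    sigma = statistics.pstdev(smoothed)
    tau = mu + k * sigma  # ASSUMED form of the adaptive threshold
    above = [i for i, d in enumerate(smoothed) if d > tau]

    # Collapse each run of consecutive above-threshold indices to its middle,
    # because the centered average spreads one jump over `window` entries.
    runs = []
    for i in above:
        if runs and i == runs[-1][-1] + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
    # smoothed[i] compares frames i and i + 1, so the new phase starts at i + 1.
    candidates = [run[len(run) // 2] + 1 for run in runs]

    last = len(features) - 1
    boundaries = [0]
    for idx in candidates:
        # Skipping a boundary merges a too-short phase into its neighbor.
        if idx - boundaries[-1] >= min_len and last - idx >= min_len:
            boundaries.append(idx)
    boundaries.append(last)
    return boundaries


# Usage: eight 1-D frame features with one abrupt change between frames 3 and 4.
features = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
print(segment_boundaries(features))  # [0, 4, 7]
```

Raising `min_len` above the length of the first phase removes the interior boundary, which is how the sketch models the merge rule of claim 5.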

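The confidence computation of step S7 (claim 7) amounts to a softmax over the two raw class scores. The sketch below is a minimal illustration under that reading; the dictionary interface, the label strings, and the example scores are hypothetical and do not come from the patent.

```python
import math


def diagnosis_with_confidence(raw_scores):
    """Softmax over raw class scores; returns (label, probability)."""
    peak = max(raw_scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in raw_scores.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    label = max(probs, key=probs.get)  # class with the higher probability
    return label, probs[label]


# Usage with made-up scores for the two classes.
label, confidence = diagnosis_with_confidence({"normal": 1.0, "sarcopenia": 3.0})
print(label, f"{confidence:.3f}")  # sarcopenia 0.881
```

With only two classes the result equals a logistic function of the score difference, so the reported confidence always lies between 0.5 and 1.
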
Description

Sarcopenia detection method and device based on chain prompting and a multimodal large model

Technical Field

The invention relates to the technical field of video processing, in particular to a sarcopenia detection method and device based on chain prompting and a multimodal large model.

Background

As video analysis technology is applied ever more deeply in medical rehabilitation, existing video-based sarcopenia detection methods still rely mainly on traditional vision modules or shallow learning algorithms, such as human skeleton keypoint detection and hand-crafted gait feature extraction, and have the following main drawbacks:

1. Poor generality and robustness. Traditional skeleton keypoint detection algorithms are often optimized for specific shooting angles, lighting conditions, or body postures; once the shooting environment, the capture device, or the patient's posture changes, the data must be re-annotated and the parameters re-tuned, which makes these methods hard to adapt to clinical use across multiple scenes and populations.

2. Insufficient interpretability. The output of existing methods is usually only a "sarcopenia probability" or a "healthy/at-risk" classification; quantitative and logical explanations of the model's decision process are lacking, quantitative movement indicators and reasoning-chain reports cannot be produced, and clinicians and patients find it difficult to place sufficient trust in the results.

3. Dependence on large-scale annotation. Traditional algorithms overfit easily when only a few labeled samples are available; fine-tuning a large pre-trained model can improve performance, but its annotation, computation, and deployment costs are extremely high, so it cannot realistically be extended to primary care or home rehabilitation monitoring.

4. Lack of a complete diagnostic workflow. The prior art cannot chain together action temporal segmentation, feature quantification, multi-step logical reasoning, and visual reporting, and cannot generate with one click a comprehensive diagnostic report comprising temporal segment analysis, key indicators, and chain reasoning, which limits practical deployment in scenarios such as telemedicine and intelligent monitoring.

The Chinese patent with publication number CN113488163B discloses a machine-vision-based sarcopenia identification method, device, equipment, and medium. That technique identifies sarcopenia with a machine vision scheme, and its technical process has obvious limitations. The scheme comprises, in sequence, acquiring a video stream, obtaining an original frame image sequence set, processing the images with a human pose recognition algorithm to extract human keypoints and gait silhouette images, applying affine transformations to the keypoints to obtain multi-angle data, and calculating gait data from the multi-angle data, and finally builds and trains a gait recognition network to output the recognition result. Every link must cooperate closely, and an error in any link affects the final recognition result; the scheme places extremely high demands on the operator's proficiency and on system stability, is difficult to debug and maintain, and the whole process is difficult to implement.

The Chinese patent application with publication number CN117333932A discloses a machine-vision-based method, apparatus, equipment, and medium for identifying sarcopenia. It has certain advantages in addressing existing diagnostic problems, but also has the following drawbacks:

First, the scheme relies on a single judgment criterion: sarcopenia is identified mainly from chair-stand time, which is relatively limited. Although studies have shown that the standard chair-stand time can serve as a sensitive marker of sarcopenia, and experiments have shown that chair-stand time correlates with walking pace, verifying its effectiveness as a substitute for gait speed, the pathogenesis of sarcopenia is complex and involves changes in muscle mass, strength, function, and other aspects, so a single chair-stand time index can hardly cover these factors.

Second, the scheme requires explicit modeling of the human skeleton, which increases the complexity and cost of implementation. Acquiring whole-body keypoints at different angles and computing skeleton feature maps involves complex operations such as affine transformation and spatial coordinate calculation, which place high demands on the device's computing power and the algorithm's accuracy. Not only are higher-performance hardware devices needed to support the computation, but also co