CN-121662295-B - Intelligent evaluation method for surgical skills based on endoscope video
Abstract
The invention provides an intelligent evaluation method for surgical skills based on endoscope video, relating to the technical field of artificial intelligence. The invention provides a complete intelligent evaluation pipeline from the original endoscopic surgery video to a quantitative skill score, which comprises: performing standardized preprocessing and automatically removing irrelevant frames; applying self-supervised domain fine-tuning to obtain frame-level feature representations that adapt to different surgeries and equipment conditions; on that basis, realizing frame-level recognition of multiple anomaly types such as bleeding, smoke and poor surgical-field exposure by means of a multi-scale feature pyramid and channel-spatial attention; temporally encoding the anomaly process with a temporal coding network based on a selective state space model to extract event-level temporal features; and finally, jointly modeling these features with structured metadata such as the surgery type, and outputting a normalized skill score and the corresponding grade. Through a unified quantification framework and an adaptive fusion mechanism, the invention greatly improves the accuracy, generalization and interpretability of intelligent surgical skill evaluation.
Inventors
- DING SHUAI
- ZHU YUANBO
- WANG HAO
- XU RUI
- YANG YUXUAN
- KE SHUIZHOU
Assignees
- 合肥工业大学 (Hefei University of Technology)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-05
Claims (10)
- 1. An intelligent evaluation method for surgical skills based on endoscope video, characterized by comprising the following steps: acquiring endoscopic surgery video and applying standardized preprocessing, and identifying a valid frame sequence by combining a discrimination model with temporal window smoothing; performing self-supervised domain fine-tuning to convergence on a feature extraction network based on each valid frame and its timestamp, wherein the feature extraction network is a deep neural network with multi-level feature representation capability; taking each valid frame as the input of the converged feature extraction network, extracting a plurality of feature maps with different resolutions from different layers of the network, and taking the feature maps as the input of a multi-channel discrimination head combining a multi-scale feature pyramid with a channel-spatial attention mechanism, to obtain the multi-scale comprehensive features of each valid frame and frame-level anomaly probabilities for a plurality of preset anomaly categories; calculating first-order differences of the frame-level anomaly probabilities of adjacent valid frames for the same anomaly category to obtain a frame-level anomaly probability difference of each valid frame for each anomaly category; splicing the multi-scale comprehensive features and normalized time of each valid frame with the frame-level anomaly probability and anomaly probability difference of each anomaly category to obtain fusion vectors in one-to-one correspondence with the anomaly categories; taking the fusion vectors as the input of a temporal coding network based on a selective state space model, and extracting event-level temporal features within each anomaly interval after context correction; acquiring anomaly statistics based on the event-level temporal features of each anomaly category and constructing a comprehensive feature vector in combination with structured metadata; and taking the comprehensive feature vector as the input of each base learner, and performing weighted fusion on the outputs of the base learners through a gating network taking the surgery type as input to obtain a normalized skill score and a corresponding skill level.
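The final scoring step of claim 1 (weighted fusion of base-learner outputs through a surgery-type gating network) can be sketched as follows. This is a minimal illustration under assumed shapes, not the patented implementation; the gate parameters, base learners and grade thresholds are all hypothetical.

```python
import numpy as np

def gated_fusion(feature_vec, surgery_type_onehot, base_learners, gate_W, gate_b):
    """Softmax gate over base learners, conditioned on the surgery type."""
    scores = np.array([learner(feature_vec) for learner in base_learners])
    logits = gate_W @ surgery_type_onehot + gate_b
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return float(weights @ scores)          # normalized skill score

def score_to_grade(score):
    """Map a normalized score in [0, 1] to an illustrative skill grade."""
    for thresh, grade in [(0.85, "expert"), (0.6, "proficient"), (0.35, "competent")]:
        if score >= thresh:
            return grade
    return "novice"
```

With an all-zero gate the base learners are averaged uniformly; a strongly biased gate routes the score toward one learner, which is the intended effect of conditioning on surgery type.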
- 2. The intelligent surgical skill assessment method according to claim 1, wherein identifying the valid frame sequence by combining the discrimination model with temporal window smoothing comprises: converting each preprocessed standardized frame into a luminance map, and computing a global brightness mean, a brightness variance, a gradient-intensity mean and an edge-pixel proportion from the luminance map to construct a frame-level appearance feature vector; taking the frame-level appearance feature vector of each standardized frame as the input of the discrimination model to predict the probability that the frame is an irrelevant frame; performing temporal window smoothing on the predicted irrelevant-frame probabilities, and using the smoothed probabilities to define corresponding validity masks; and, based on the validity masks, obtaining a set of valid frame indices and ordering it chronologically to define the valid frame sequence.
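Claim 2's frame-level appearance features and window smoothing can be sketched in a few lines. The edge threshold, window length and validity threshold below are assumed values not given in the patent.

```python
import numpy as np

def appearance_features(luma):
    """Frame-level appearance feature vector from a luminance map:
    brightness mean, brightness variance, gradient-intensity mean and
    edge-pixel proportion (the edge threshold of 25 is an assumption)."""
    gy, gx = np.gradient(luma.astype(float))
    grad = np.hypot(gx, gy)
    return np.array([luma.mean(), luma.var(), grad.mean(), (grad > 25.0).mean()])

def validity_mask(irrelevant_probs, window=5, thresh=0.5):
    """Moving-average smoothing of irrelevant-frame probabilities,
    then thresholding: True marks a valid frame."""
    smoothed = np.convolve(irrelevant_probs, np.ones(window) / window, mode="same")
    return smoothed < thresh
```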
- 3. The intelligent surgical skill assessment method according to claim 1, wherein performing self-supervised domain fine-tuning to convergence on the feature extraction network based on each valid frame and its timestamp comprises: taking the feature extraction network to be fine-tuned as a student network and constructing a teacher network of identical structure; for each valid frame, generating two different augmented views as the respective inputs of the student network and the teacher network to obtain corresponding embedded representations; constructing time weights from the timestamps of the valid frames, constructing a distribution-consistency loss that treats the embedded representation output by the teacher network as a pseudo-label, constructing a temporal-smoothness regularization term based on the variation magnitude of the student network's embedded representations across adjacent frames, constructing an alignment loss based on the activity criterion of each valid frame, and combining the distribution-consistency loss, the temporal-smoothness regularization term and the alignment loss into a total self-supervised training loss; and performing self-supervised domain fine-tuning of the student network to convergence using the total loss.
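The student-teacher loss of claim 3 might look like the sketch below. The claim does not give the exact time-weighting or alignment-loss form, so this uses an assumed exponential time weight, assumed temperatures, and omits the alignment term entirely.

```python
import numpy as np

def softmax(z, temp):
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_supervised_loss(student_emb, teacher_emb, timestamps, tau=1.0, lam=0.1):
    """Distribution-consistency loss (cross-entropy against the teacher's
    output taken as a pseudo-label) plus a time-weighted temporal-smoothness
    regularizer. Temperatures, tau and lam are assumed hyperparameters."""
    p_teacher = softmax(teacher_emb, 0.04)           # sharpened pseudo-labels
    log_p_student = np.log(softmax(student_emb, 0.1) + 1e-12)
    consistency = -(p_teacher * log_p_student).sum(axis=-1).mean()
    dt = np.diff(timestamps)
    time_w = np.exp(-dt / tau)                       # nearby frames constrained harder
    jumps = (np.diff(student_emb, axis=0) ** 2).sum(axis=-1)
    smoothness = (time_w * jumps).mean()
    return consistency + lam * smoothness
```

In practice only the student receives gradients and the teacher is an exponential moving average of the student, as in DINO-style training.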
- 4. The intelligent surgical skill assessment method according to claim 1, wherein taking the feature maps as the input of the multi-channel discrimination head combining a multi-scale feature pyramid with a channel-spatial attention mechanism to obtain the multi-scale comprehensive features of each valid frame and the frame-level anomaly probabilities for a plurality of preset anomaly categories comprises: constructing a multi-scale feature pyramid from the extracted feature maps of different resolutions, and fusing low-resolution features step by step into high-resolution features to obtain a feature map at each scale; applying a channel-spatial attention mechanism to each scale's feature map to obtain a corresponding scale feature vector, and concatenating all scale feature vectors along the feature dimension to obtain the multi-scale comprehensive feature representation of each valid frame; and constructing a multi-class anomaly detection head on the multi-scale comprehensive feature representation, obtaining the frame-level anomaly probabilities of each valid frame for the different categories through linear mapping followed by a Sigmoid activation, wherein during training the multi-class anomaly detection head is optimized with a weighted binary cross-entropy loss and a total-variation regularization term.
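The top-down pyramid fusion and channel-spatial attention of claim 4 can be sketched with plain arrays. This is a structural illustration only: a real FPN uses learned 1x1 lateral convolutions (omitted here, so channel counts must already match), and the attention gates below are parameter-free stand-ins for learned CBAM-style gates.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(pyramid):
    """Fuse low-resolution maps step by step into higher-resolution ones.
    `pyramid` is ordered finest to coarsest; fusion starts at the coarsest."""
    fused = [pyramid[-1]]
    for fmap in reversed(pyramid[:-1]):
        fused.append(fmap + upsample2x(fused[-1]))
    return list(reversed(fused))

def channel_spatial_attention(x):
    """CBAM-flavoured sketch: a sigmoid channel gate from global average
    pooling, then a spatial gate from the channel-mean map."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    x = x * sig(x.mean(axis=(1, 2)))[:, None, None]   # channel attention
    return x * sig(x.mean(axis=0))[None]              # spatial attention
```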
- 5. The intelligent surgical skill assessment method according to claim 1, wherein obtaining the temporal representations of all valid frames for each anomaly category by taking the fusion vectors as the input of the temporal coding network based on a selective state space model comprises: constructing, for each anomaly category, a corresponding discrete-time state space model; computing a gating vector from the fusion vector of the current frame, and updating the parameters of the discrete-time state space model based on the gating vector; mixing the hidden state with the fusion vector of the current valid frame using the updated model, and adaptively updating the hidden state to obtain the temporal representation for the next valid frame; and traversing the valid frame sequence to obtain the temporal representations of all valid frames for each anomaly category.
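The gated recurrence of claim 5 resembles a diagonal selective state space scan (Mamba-flavoured). The sketch below is a simplified, assumed form: each fusion vector produces a scalar step size that discretizes the continuous decay before the hidden-state update. All parameter names and shapes are illustrative, not the patent's.

```python
import numpy as np

def selective_scan(xs, A_log, B, C, Wd, bd):
    """Diagonal selective SSM scan for one anomaly category.
    xs: (T, D) fusion vectors; A_log: (N,) log decay rates; B: (N, D);
    C: (K, N); Wd, bd: gating parameters producing the step size."""
    h = np.zeros(A_log.shape[0])                 # hidden state
    out = []
    for x in xs:
        d = np.log1p(np.exp(Wd @ x + bd))        # input-dependent step size (softplus)
        Ad = np.exp(-np.exp(A_log) * d)          # discretized decay, in (0, 1)
        h = Ad * h + d * (B @ x)                 # selective state update
        out.append(C @ h)                        # per-frame temporal representation
    return np.stack(out)
```

Because the step size depends on the input, salient frames can write strongly into the state while uninformative ones mostly let the state decay, which is the "selective" behaviour the claim relies on.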
- 6. The intelligent surgical skill assessment method according to claim 1, wherein extracting the event-level temporal features within each anomaly interval after context correction comprises: computing context-corrected anomaly intensities from the temporal representations output by the temporal coding network for each anomaly category; performing a binary decision on each valid frame based on the anomaly intensity to screen out a plurality of anomaly intervals; computing interval-level temporal features within each anomaly interval from the anomaly intensity, the timestamps and the temporal representations, and average-pooling the temporal representations to obtain an event-level comprehensive representation, wherein the interval-level temporal features are duration, intensity integral, evolution-trend slope, peak intensity and peak relative position; and concatenating the interval-level temporal features with the event-level comprehensive representation along the feature dimension to obtain the event-level temporal features.
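The five interval-level features named in claim 6 are directly computable; a minimal sketch, assuming a trapezoidal integral and a least-squares trend slope (the patent does not fix these choices):

```python
import numpy as np

def interval_features(intensity, timestamps, reps):
    """Interval-level temporal features for one anomaly interval: duration,
    intensity integral, evolution-trend slope, peak intensity and peak
    relative position, concatenated with the average-pooled representation."""
    duration = timestamps[-1] - timestamps[0]
    integral = (((intensity[1:] + intensity[:-1]) / 2) * np.diff(timestamps)).sum()
    slope = np.polyfit(timestamps, intensity, 1)[0]      # evolution-trend slope
    peak = intensity.max()
    peak_rel = (timestamps[intensity.argmax()] - timestamps[0]) / max(duration, 1e-9)
    event_rep = reps.mean(axis=0)                        # average pooling over interval
    return np.concatenate(([duration, integral, slope, peak, peak_rel], event_rep))
```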
- 7. The intelligent surgical skill assessment method according to claim 6, wherein the structured metadata further includes the total duration of the surgery and the surgeon's qualification, and wherein acquiring the anomaly statistics based on the event-level temporal features of each anomaly category in combination with the structured metadata comprises: computing, from the duration and intensity integral of each anomaly category over all its anomaly intervals, the anomaly-time ratio and the anomaly-intensity-integral ratio with respect to the total surgery duration; computing, from the evolution-trend slope, peak intensity and peak relative position of each anomaly category over all its anomaly intervals, the average evolution trend, the maximum peak intensity and the average peak relative position; recording, for each anomaly category, the number of events, and computing the average duration over all its anomaly intervals; comparing the pairwise overlap durations of the anomaly intervals of different anomaly categories, and computing an anomaly overlap ratio with respect to the total surgery duration; average-pooling the event-level comprehensive representations of each anomaly category to obtain category-level representations; vectorizing the surgery type, the total surgery duration and the operator qualification, respectively; and assembling the anomaly-time ratio, anomaly-intensity-integral ratio, average evolution trend, maximum peak intensity, average peak relative position, event count, average duration, anomaly overlap ratio and category-level representations, together with the vectorized surgery type, total surgery duration and operator qualification, into the comprehensive feature vector.
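The per-category statistics of claim 7 are simple aggregations over the interval-level features; a sketch under assumed tuple layouts (the function and field names are illustrative):

```python
import numpy as np

def anomaly_statistics(interval_feats, total_duration):
    """Per-category statistics from interval-level features; each row is
    (duration, intensity integral, trend slope, peak, peak relative pos.)."""
    arr = np.asarray(interval_feats, dtype=float)
    dur, integ, slope, peak, peak_rel = arr.T
    return {
        "time_ratio": dur.sum() / total_duration,
        "intensity_integral_ratio": integ.sum() / total_duration,
        "mean_trend": slope.mean(),
        "max_peak": peak.max(),
        "mean_peak_rel": peak_rel.mean(),
        "event_count": len(arr),
        "mean_duration": dur.mean(),
    }

def overlap_ratio(spans_a, spans_b, total_duration):
    """Pairwise overlap duration of two categories' (start, end) intervals,
    normalized by the total surgery duration."""
    overlap = sum(max(0.0, min(e1, e2) - max(s1, s2))
                  for s1, e1 in spans_a for s2, e2 in spans_b)
    return overlap / total_duration
```

The resulting scalars, concatenated with the category-level representations and the vectorized metadata, form the comprehensive feature vector fed to the base learners.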
- 8. An endoscope video-based intelligent surgical skill assessment system, comprising: an acquisition and preprocessing module, for acquiring endoscopic surgery video, applying standardized preprocessing, and identifying a valid frame sequence by combining a discrimination model with temporal window smoothing; a fine-tuning module, for performing self-supervised domain fine-tuning to convergence on a feature extraction network based on each valid frame and its timestamp, wherein the feature extraction network is a deep neural network with multi-level feature representation capability; a recognition module, for taking each valid frame as the input of the converged feature extraction network, extracting a plurality of feature maps with different resolutions from different layers of the network, and taking the feature maps as the input of a multi-channel discrimination head combining a multi-scale feature pyramid with a channel-spatial attention mechanism, to obtain the multi-scale comprehensive features of each valid frame and the frame-level anomaly probabilities for a plurality of preset anomaly categories; an analysis module, for calculating first-order differences of the frame-level anomaly probabilities of adjacent valid frames for the same anomaly category to obtain the frame-level anomaly probability difference of each valid frame for each anomaly category, splicing the multi-scale comprehensive features, normalized time, frame-level anomaly probabilities and anomaly probability differences into fusion vectors in one-to-one correspondence with the anomaly categories, taking the fusion vectors as the input of a temporal coding network based on a selective state space model, and extracting event-level temporal features within each anomaly interval after context correction; and an evaluation module, for acquiring anomaly statistics based on the event-level temporal features of each anomaly category, constructing a comprehensive feature vector in combination with structured metadata, taking the comprehensive feature vector as the input of each base learner, and performing weighted fusion on the outputs of the base learners through a gating network taking the surgery type as input to obtain a normalized skill score and a corresponding skill level.
- 9. A storage medium storing a computer program, wherein the computer program causes a computer to execute the intelligent surgical skill assessment method according to any one of claims 1 to 7.
- 10. An electronic device, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for executing the intelligent surgical skill assessment method according to any one of claims 1 to 7.
Description
Intelligent evaluation method for surgical skills based on endoscope video

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to an intelligent evaluation method for surgical skills based on endoscope video.

Background

Evaluation of surgical skills is critical in medical training and quality control. With the popularization of minimally invasive surgery, endoscope video has become an important carrier for recording the surgical process, providing a data basis for objectively and accurately quantifying surgical skills. For the overall skill-scoring step, existing schemes often build regression or classification models around a single type of feature (e.g., bleeding-related features or tool-movement features). For example, Chinese patent application CN119317972A discloses a video-based surgical skill assessment method using tool tracking. In operation, the method receives a tool motion trajectory comprising a sequence of detected tool motions of a surgeon performing a surgical procedure with a surgical tool. The method then generates a sequence of multi-channel feature matrices to mathematically represent the tool motion trajectory. Next, the method performs a one-dimensional (1D) convolution operation on the sequence of multi-channel feature matrices to generate a sequence of context-aware multi-channel feature representations of the tool motion trajectory, which are then processed by a Transformer model to generate the skill classification. In the related art, therefore, the modeling of skill scores over multi-source information and surgery-type differences is simplistic: such information is introduced piecemeal as additional variables, and no unified quantification and fusion framework has yet been formed.
Disclosure of Invention

(I) Technical problem to be solved

Aiming at the defects of the prior art, the invention provides an intelligent evaluation method for surgical skills based on endoscope video, which solves the technical problem that the modeling of skill scores over multi-source information and surgery differences is simplistic.

(II) Technical scheme

To achieve the above purpose, the invention is realized by the following technical scheme. An intelligent evaluation method for surgical skills based on endoscope video comprises the following steps: acquiring endoscopic surgery video and applying standardized preprocessing, and identifying a valid frame sequence by combining a discrimination model with temporal window smoothing; performing self-supervised domain fine-tuning to convergence on a feature extraction network based on each valid frame and its timestamp, wherein the feature extraction network is a deep neural network with multi-level feature representation capability; taking each valid frame as the input of the converged feature extraction network, extracting a plurality of feature maps with different resolutions from different layers of the network, and taking the feature maps as the input of a multi-channel discrimination head combining a multi-scale feature pyramid with a channel-spatial attention mechanism, to obtain the corresponding multi-scale comprehensive features and frame-level anomaly probabilities; splicing the multi-scale comprehensive features, frame-level anomaly probabilities, probability differences and normalized time of each valid frame to obtain fusion vectors in one-to-one correspondence with the anomaly categories; taking the fusion vectors as the input of a temporal coding network based on a selective state space model, and extracting event-level temporal features within each anomaly interval after context correction; and taking the comprehensive feature vector as the input of each base learner, and performing weighted fusion on the outputs of the base learners through a gating network taking the surgery type as input to obtain a normalized skill score and a corresponding skill level.

Preferably, identifying the valid frame sequence by combining the discrimination model with temporal window smoothing includes: converting each preprocessed standardized frame into a luminance map, and computing a global brightness mean, a brightness variance, a gradient-intensity mean and an edge-pixel proportion from the luminance map to construct a frame-level appearance feature vector; taking the frame-level appearance feature vector of each standardized frame as the input of the discrimination model to predict the probability that the frame is an irrelevant frame; performing temporal window smoothing on the predicted irrelevant-frame probabilities, and using the smoothed probabilities to define corresponding validity masks; and, based on the validity masks, obtaining a set of valid frame indices