
CN-121982602-A - Dynamic decision-making method of surgical robot

CN121982602A

Abstract

The invention provides a dynamic decision-making method for a surgical robot. The method comprises: acquiring surgical video data and a corresponding surgical gesture label sequence during operation of the surgical robot, and generating a structured text description from the gesture label sequence; extracting a visual feature sequence from the surgical video data and a text feature sequence from the structured text description; performing cross-modal semantic alignment on the visual feature sequence and the text feature sequence; performing action segmentation on the resulting fused features and dynamically integrating multi-scale spatio-temporal features through an adaptive weight learning mechanism to obtain a predicted surgical action sequence; and inputting the predicted surgical action sequence into preset causal rules for a causal-temporal check, correcting the prediction when a causal rule is violated, and giving causal explanations of detected surgical state transition points and abnormal events. The invention jointly resolves the ambiguity of semantic understanding and the adaptivity of multi-modal fusion, and introduces a causal checking mechanism, providing a reliable cognitive basis for autonomous decision-making by the surgical robot.

Inventors

  • GUO NA
  • LIU YANG
  • TAN MINGHUI
  • CUI XINNAN

Assignees

  • University of Science and Technology Beijing (北京科技大学)

Dates

Publication Date
2026-05-05
Application Date
2025-12-31

Claims (10)

  1. A dynamic decision-making method for a surgical robot, comprising: step S100, acquiring surgical video data and a corresponding surgical gesture label sequence during operation of the surgical robot, preprocessing the surgical video data to obtain a key frame sequence, and generating a structured text description for the surgical gesture label sequence through semantic enhancement and prompt engineering; step S200, extracting visual features frame by frame from the key frame sequence through a pre-trained surgical action sequence prediction model, encoding the structured text description to obtain a text feature sequence, performing cross-modal semantic alignment on the obtained visual feature sequence and text feature sequence, constructing from the aligned features a fusion sequence containing a surgical action counting token and a temporal token, fusing the visual features and text features in the fusion sequence to obtain a semantically enhanced visual feature sequence, performing action segmentation on the semantically enhanced visual feature sequence, and dynamically integrating the surgical details and the multi-scale spatio-temporal features of the surgical workflow through an adaptive weight learning mechanism to obtain a preliminarily predicted surgical action sequence; step S300, inputting the preliminarily predicted surgical action sequence into a preset causal rule engine for a causal-temporal check, correcting the preliminarily predicted surgical action sequence when the check result violates a causal rule to obtain a corrected surgical action sequence, and giving causal explanations of the detected surgical state transition points and abnormal events; wherein the preset causal rule engine is constructed according to the following steps: step S310, decomposing the surgical action sequence into standard operations and emergencies, encoding them respectively into standard operation states and random event states that jointly form a phase variable, and encoding the operations executable during a standard operation or after an emergency occurs into executable operation states that form an action variable, the phase variable and the action variable together forming a structured data representation; step S320, modeling the temporal relation between the historical phase variables and the current action variable with a vector autoregression (VAR) model to obtain a VAR-based phase-action model, so as to capture the persistent effect of historical states on the current operation of the surgical robot; step S330, performing a Granger causality test on the variables in the VAR-based phase-action model and confirming the causal dependencies among the variables, thereby constructing a causal relationship network of the surgical process and forming the causal rule engine.
  2. The surgical robot dynamic decision-making method of claim 1, wherein preprocessing the surgical video data includes image resizing, normalization, and key frame sampling.
  3. The surgical robot dynamic decision-making method of claim 1, wherein generating the structured text description for the surgical gesture label sequence through semantic enhancement and prompt engineering comprises: establishing a dynamic template library containing 14 basic text templates, each template targeting a different angle of linguistic expression, with { } as a placeholder to be filled with a specific surgical action description; establishing a surgical terminology mapping dictionary that maps surgical gesture labels to surgical technical terms, the mapping converting surgical gesture labels into professional descriptions containing instrument, action and direction information based on the clinical semantics of surgical operations; and, using the dynamic template library and the surgical terminology mapping dictionary, dynamically generating for each key frame sequence and its corresponding gesture label sequence four types of complementary text descriptions: a sequence-count description, an action-by-action description, a complete-sequence description, and a location-information description.
  4. The surgical robot dynamic decision-making method of claim 1, wherein the surgical action sequence prediction model comprises a multi-modal feature extraction network, a cross-modal semantic alignment and fusion network, and a temporal action recognition network connected in sequence; the multi-modal feature extraction network adopts a vision-language dual encoder adapted to the surgical scene: the key frame sequence is input into the visual encoder to obtain the visual feature sequence, and the structured text description is input into the language encoder to obtain the text feature sequence; the cross-modal semantic alignment and fusion network takes the fusion sequence as input, so that the extracted visual feature sequence and text feature sequence mutually enhance each other's understanding at multiple semantic levels, and passes the resulting semantically enhanced visual feature sequence to the temporal action recognition network; the temporal action recognition network takes MS-TCN++ as its base temporal modeling network and introduces a dynamic weighted fusion module that learns feature importance weights with an attention mechanism to obtain a fused feature sequence; the fused feature sequence is passed through the subsequent stages of MS-TCN++ to generate frame-by-frame initial action category prediction scores, and the output layer of MS-TCN++ normalizes these scores to obtain the probability that each frame belongs to each action category; the surgical action sequence prediction model is trained in two stages: in the first stage, the multi-modal feature extraction network and the cross-modal semantic alignment and fusion network are jointly trained while the temporal action recognition network does not participate in training; in the second stage, the parameters of the networks trained in the first stage are frozen and only the temporal action recognition network is trained.
  5. The surgical robot dynamic decision-making method of claim 4, wherein the fusion sequence $Z$ is constructed as $Z = \mathrm{Concat}(\text{[CNT]}, F_{text}, \text{[SEP]}, \text{[TMP]}, F_{vis})$, where $\mathrm{Concat}(\cdot)$ denotes the feature concatenation operation; [CNT], [TMP] and [SEP] are three predefined learnable tokens: [CNT] captures and fuses counting and global semantic information at the level of the entire fusion sequence, [TMP] injects the temporal order and temporal position information of the video frames on the time axis, and [SEP] is a separator distinguishing the text guidance information from the visual feature sequence; $F_{text}$ is the text feature sequence and $F_{vis}$ is the visual feature sequence; after processing the fusion sequence, the cross-modal semantic alignment and fusion network splits it into two parts: a sequence counting feature sequence, corresponding to the output at the [CNT] position and containing global information on the action types and counts of the key frame sequence; and the semantically enhanced visual feature sequence, corresponding to the outputs at the original visual feature positions; the dynamic weighted fusion module comprises a global average pooling layer, a 1×1 convolution layer and a Sigmoid activation function connected in sequence (see the illustrative sketch following the claims).
  6. The surgical robot dynamic decision-making method of claim 4, wherein a multi-level contrastive learning strategy is adopted in the first training stage, with the multi-level contrastive learning loss function defined as $\mathcal{L}_{con} = \lambda_{g}\mathcal{L}_{global} + \lambda_{a}\mathcal{L}_{action} + \lambda_{s}\mathcal{L}_{seq}$, where $\mathcal{L}_{global}$ is the global semantic alignment loss, $\mathcal{L}_{action}$ is the action-level alignment loss, $\mathcal{L}_{seq}$ is the sequence-level alignment loss, and $\lambda_{g}$, $\lambda_{a}$, $\lambda_{s}$ are the weights of the respective alignment losses (a loss sketch follows the claims); and wherein the temporal action recognition network is optimized with a classification loss function in the second training stage.
  7. The surgical robot dynamic decision-making method of claim 1, wherein in step S310, the state variables at time $t$ are described as $S_t = (P_t, A_t)$, where $P_t$ is the phase variable at time $t$, composed of the standard operation states $o_{i,t}$ and the random event states $e_{j,t}$, and $A_t$ is the action variable at time $t$, composed of the executable operation states $a_{k,t}$; at time $t$: $o_{i,t}$ is the $i$-th standard operation state, indicating whether the $i$-th standard operation is being executed, $o_{i,t} \in \{0,1\}$; $o_{i,t} = 1$ indicates that the robot is executing the $i$-th standard operation at time $t$, and $o_{i,t} = 0$ indicates that it is not; $e_{j,t}$ is the $j$-th random event state, indicating whether the $j$-th random event has occurred, $e_{j,t} \in \{0,1\}$; $e_{j,t} = 1$ indicates that the $j$-th random event occurs at time $t$, and $e_{j,t} = 0$ indicates that it does not; $a_{k,t}$ is the $k$-th executable operation state, indicating whether the $k$-th executable operation is triggered, $a_{k,t} \in \{0,1\}$; $a_{k,t} = 1$ indicates that the $k$-th executable operation is triggered at time $t$ and the robot will perform it, and $a_{k,t} = 0$ indicates that it is not triggered.
  8. The surgical robot dynamic decision-making method of claim 7, wherein in step S320, the VAR-based phase-action model is expressed as $A_t = f(P_{t-1}, \dots, P_{t-p}; \Theta) + \varepsilon_t$, where $(P_{t-1}, \dots, P_{t-p})$ is the historical phase variable sequence consisting of the past $p$ phase variables, $f(\cdot)$ is a linear function, $\Theta$ is the parameter matrix to be estimated, and $\varepsilon_t$ is the random disturbance term at time $t$; each executable operation state $a_{k,t}$ at time $t$ is expressed as $a_{k,t} = \sum_{q=1}^{p} \sum_{i=1}^{M} \beta_{k,i,q}\, o_{i,t-q} + \sum_{q=1}^{p} \sum_{j=1}^{N} \gamma_{k,j,q}\, e_{j,t-q} + \epsilon_{k,t}$, where $a_{k,t}$ denotes the $k$-th executable operation state at time $t$, $p$ is the total lag order and $q$ is the lag order; $\beta_{k,i,q}$ is the coefficient of the effect of the $i$-th standard operation state at past time $t-q$ on the executable operation state $a_{k,t}$; $o_{i,t-q}$ is the $i$-th of the $M$ standard operation states at past time $t-q$, $i = 1, \dots, M$; $\gamma_{k,j,q}$ is the coefficient of the effect of the $j$-th random event state at past time $t-q$ on the executable operation state $a_{k,t}$; $e_{j,t-q}$ is the $j$-th of the $N$ random event states at past time $t-q$, $j = 1, \dots, N$; and $\epsilon_{k,t}$ is the random noise term of the $k$-th executable operation state at time $t$.
  9. The method according to claim 8, wherein in step S330, performing the Granger causality test on the variables in the VAR-based phase-action model and confirming the causal relationships between variables comprises: setting a time dimension $n$, a total lag order $p$ and a significance level $\alpha$, and constructing time-series data from the historical phase variable sequence and the action variable; constructing respectively a constrained model without lag terms of the causal variable and an unconstrained model with lag terms of the causal variable, the constrained model being expressed as $a_{k,t} = c_0 + \sum_{q=1}^{p} \phi_q\, a_{k,t-q} + u_t$, where $c_0$ is the constant term of the constrained model and $\phi_q$ is the coefficient of the regression of $a_{k,t}$ on its own lag-$q$ term $a_{k,t-q}$; the unconstrained model being expressed as $a_{k,t} = c_1 + \sum_{q=1}^{p} \phi'_q\, a_{k,t-q} + \sum_{q=1}^{p} \theta_q\, x_{t-q} + v_t$, where $c_1$ is the constant term of the unconstrained model, $\phi'_q$ is the coefficient of the regression of $a_{k,t}$ on its own lag-$q$ term, and $x_{t-q}$ is the lag-$q$ term of the candidate causal variable; and calculating an F statistic, based on the residual sums of squares, to evaluate whether the improvement in predictive power of the unconstrained model over the constrained model is statistically significant: $F = \dfrac{(\mathrm{RSS}_r - \mathrm{RSS}_u)/Q}{\mathrm{RSS}_u/(n - K_u)}$, where $\mathrm{RSS}_r$ and $\mathrm{RSS}_u$ are the residual sums of squares of the constrained and unconstrained models respectively, $Q$ is the number of constrained causal-variable lag terms, $Q = p$, and $K_u$ is the number of parameters in the unconstrained model; if the calculated F statistic is greater than the F-distribution critical value $F_\alpha(Q, n - K_u)$ corresponding to the significance level $\alpha$, the historical phase variable sequence constitutes a Granger cause of the action variable at time $t$; otherwise, the historical phase variable sequence is not a causal driver of the action variable at time $t$ (an illustrative test sketch is given after the claims).
  10. The surgical robot dynamic decision-making method of claim 1, wherein in step S330, constructing the causal relationship network of the robot's surgical process from the results of the Granger causality test to form the causal rule engine comprises: extracting significant causal relationship pairs from the results of the Granger causality test; representing all extracted causal relationship pairs as edges of a directed graph, with each variable as a node, to construct a causal graph; checking whether cyclic or redundant causal relationships exist in the causal graph, to ensure logical consistency; and converting the paths and relations of the logically consistent causal graph into rule form to obtain the causal rules (see the consistency-check sketch following the claims).
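For concreteness, the dynamic weighted fusion module recited in claim 5 (global average pooling, a 1×1 convolution and a Sigmoid, connected in sequence) can be read as a channel-attention gate over the fused features. Below is a minimal PyTorch sketch under that reading; the (batch, channels, time) layout, the channel count and the toy input are illustrative assumptions, not specified by the patent.

```python
import torch
import torch.nn as nn

class DynamicWeightedFusion(nn.Module):
    """Sketch of claim 5's module: global average pooling, a 1x1
    convolution and a Sigmoid produce per-channel importance weights
    that rescale the fused feature sequence. Sizes are assumptions."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)           # global average pooling over time
        self.conv = nn.Conv1d(channels, channels, 1)  # 1x1 convolution
        self.gate = nn.Sigmoid()                      # importance weights in (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) fused visual/text features
        w = self.gate(self.conv(self.pool(x)))        # (batch, channels, 1)
        return x * w                                  # dynamically re-weighted features

fused = torch.randn(2, 256, 128)                      # toy batch of fused features
out = DynamicWeightedFusion(256)(fused)               # same shape, re-weighted
```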
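Claim 6 defines the multi-level contrastive loss only as a weighted sum of three alignment terms. The sketch below assumes each term is a symmetric InfoNCE-style contrastive loss over paired visual/text embeddings; that form, the temperature and the default weights are assumptions for illustration, not taken from the patent.

```python
import torch
import torch.nn.functional as F

def info_nce(v: torch.Tensor, t: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Assumed symmetric InfoNCE over a batch of paired embeddings."""
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    logits = v @ t.T / tau                 # pairwise similarities
    labels = torch.arange(len(v))          # i-th visual matches i-th text
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

def multilevel_contrastive_loss(global_pair, action_pair, seq_pair,
                                lambdas=(1.0, 1.0, 1.0)) -> torch.Tensor:
    # L_con = lambda_g * L_global + lambda_a * L_action + lambda_s * L_seq
    terms = [info_nce(v, t) for (v, t) in (global_pair, action_pair, seq_pair)]
    return sum(lam * term for lam, term in zip(lambdas, terms))
```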
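The Granger causality test of claim 9 reduces to fitting a constrained autoregression (own lags only) and an unconstrained one (own lags plus lags of a candidate phase variable) and comparing residual sums of squares with an F statistic. A NumPy/SciPy sketch of that recipe, assuming ordinary least squares on the binary state series; the function and variable names are illustrative.

```python
import numpy as np
from scipy import stats

def granger_f_test(a: np.ndarray, x: np.ndarray, p: int, alpha: float = 0.05):
    """Does phase series x Granger-cause action series a? Claim 9's recipe:
    compare constrained (own lags) and unconstrained (own lags + lags of x)
    OLS models via the F statistic on residual sums of squares."""
    n = len(a) - p                          # usable sample size
    y = a[p:]
    own_lags = np.column_stack([a[p - q : len(a) - q] for q in range(1, p + 1)])
    x_lags = np.column_stack([x[p - q : len(x) - q] for q in range(1, p + 1)])
    ones = np.ones((n, 1))                  # constant term

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        return float(resid @ resid)

    rss_r = rss(np.hstack([ones, own_lags]))       # constrained model
    X_u = np.hstack([ones, own_lags, x_lags])
    rss_u = rss(X_u)                               # unconstrained model
    Q, K_u = p, X_u.shape[1]                       # restrictions, parameters
    F = ((rss_r - rss_u) / Q) / (rss_u / (n - K_u))
    critical = stats.f.ppf(1 - alpha, Q, n - K_u)  # F-distribution critical value
    return F, critical, F > critical               # True -> Granger-causal
```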
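Claim 10's consistency check amounts to assembling the significant (cause, effect) pairs into a directed graph and rejecting cycles before emitting rules. A short networkx sketch; the example variable names are hypothetical.

```python
import networkx as nx

def build_causal_rules(significant_pairs):
    """Claim 10 sketch: causal pairs become directed edges; a cycle check
    enforces logical consistency before the edges are emitted as rules."""
    g = nx.DiGraph(significant_pairs)              # (cause, effect) edges
    if not nx.is_directed_acyclic_graph(g):
        raise ValueError(f"cyclic causal relations: {nx.find_cycle(g)}")
    # each edge becomes a rule "cause -> effect"; paths give chained rules
    return [f"{u} -> {v}" for u, v in g.edges]

# hypothetical variables: a bleeding event should drive a hemostasis operation
rules = build_causal_rules([("bleeding_event", "hemostasis_op"),
                            ("needle_drop", "needle_pickup_op")])
```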

Description

Dynamic decision-making method of surgical robot

Technical Field

The invention belongs to the technical field of surgical robots, and particularly relates to a surgical robot dynamic decision-making method based on multi-modal semantic alignment perception and causal temporal inference, mainly applied to autonomous cognition, operational decision-making and surgical quality evaluation systems of surgical robots.

Background

Advanced autonomous operation by surgical robots requires accurate perception of the surgical scene state as a basic support. Clinical practice shows that the decision quality of a surgical robot is closely related to the accuracy of intraoperative gesture recognition; accurate state perception can effectively reduce operative risk and improve overall surgical quality. Traditionally, assessment of the operative state has relied on the surgeon's manual interpretation of real-time video, which is not only inefficient but also inherently subjective and unstandardized. In recent years, artificial intelligence has gradually been applied to surgical state perception, with the main technical routes including single-modality visual perception, multi-modal fusion and temporal workflow modeling. However, the prior art still exhibits the following significant bottlenecks when handling the unstructured features of real surgical scenes.

First, a single visual modality lacks deep semantic understanding. Existing methods mainly map continuous video frames to discrete action labels and struggle to distinguish operations that are visually similar but clinically distinct in intent. For example, "positioning the tip" of an instrument (a preparatory action) is visually highly similar to "pushing the needle to penetrate" (an executing action), and a purely visual model easily confuses them. Moreover, when sudden events such as instrument drops or bleeding occur, a purely visual model often cannot understand their clinical semantics and therefore fails to trigger corrective decisions such as needle retrieval or hemostasis.

Second, existing multi-modal fusion methods suffer from poor domain adaptation and rigid fusion mechanisms. While introducing text information can aid visual understanding, there is a large semantic gap between the specialized terminology of the surgical field (e.g., "tissue penetration depth") and generic vision-language models. Meanwhile, existing fusion mostly adopts fixed-weight feature concatenation and cannot dynamically adjust its focus according to the surgical phase (such as needle-tip details during suturing versus the global view during hemostasis), so the effectiveness of feature fusion is insufficient.

Third, existing temporal modeling techniques generally lack causal logic verification and adapt poorly to non-sequential tasks. Current mainstream robot skill learning and state recognition methods are mostly based on Markov process assumptions, i.e., the current state is considered to depend only on a limited number of preceding states.
This assumption has two major defects. First, it cannot distinguish statistical correlation from real causality, so a model easily learns spurious correlations in the data (such as misjudging environmental noise as an operation signal). Second, it cannot capture long-range causal dependencies: in real surgery, dynamic step adjustment is usually caused by anatomical differences among patients or by sudden anomalies (such as a dropped needle), the influences of which often have long-range lag effects, and traditional models cannot establish, as a human expert would, the causal chain of abnormal event, preceding operation and corrective action, resulting in poor decision robustness. Finally, existing architectures typically treat semantic understanding, temporal modeling and logical reasoning as independent modules. This decoupled design ignores the synergy between tasks: semantic descriptions should guide temporal logic, while temporal context should in turn feed back into semantic understanding. The lack of a unified perception and causal inference framework makes it difficult for robots to achieve human-like cognition and decision-making in complex dynamic environments.

Disclosure of Invention

The present invention aims to solve, at least to some extent, one of the technical problems existing in the related art. Therefore, the surgical robot dynamic decision-making method based on multi-modal semantic alignment perception and Granger causal temporal inference provided by the invention can jointly resolve the ambiguity of semantic understanding and the adaptivity of multi-modal fusion, and introduces a strict causal checking mechanism to break through the logical blind spots of traditional probabilistic models.