EP-4736178-A1 - SYSTEMS AND METHODS FOR UNIVERSAL PHASE RECOGNITION FOR INTRAOPERATIVE AND POSTOPERATIVE APPLICATIONS
Abstract
Systems and methods for universal phase recognition for intraoperative and postoperative applications are provided. The system receives a video stream that captures a procedure over a time interval with a robotic medical system. The system generates, from the video stream, sets of consecutive frames. A first set of consecutive frames can include a first temporal resolution, and a second set of consecutive frames can include a second temporal resolution. The system determines, via the sets of consecutive frames input into a first model trained with machine learning, phases of the procedure on a moment-to-moment basis over the time interval. The system inputs the phases of the procedure into a second model to generate a phase segment over the time interval. The second model can be trained with machine learning based on historical workflows. The system provides an action based on a metric of the phase segment.
Inventors
- WANG, Ziheng
- BERNIKER, Samuel Max
- FULMER, Sarah Ivey
- JARC, Anthony M.
- LIU, Xi
- PERREAULT, Conor
- SONG, Alfred
- TROXLER, Casey
Assignees
- Intuitive Surgical Operations, Inc.
Dates
- Publication Date: 2026-05-06
- Application Date: 2024-06-27
Claims (20)
- 1. A system, comprising: one or more processors, coupled with memory, to: receive a video stream that captures a procedure over a time interval with a robotic medical system; generate, from the video stream, a plurality of sets of consecutive frames comprising a first set of consecutive frames with a first temporal resolution and a second set of consecutive frames with a second temporal resolution; determine, via the plurality of sets of consecutive frames input into a first model trained with machine learning, phases of the procedure on a moment-to-moment basis over the time interval; input the phases of the procedure determined on the moment-to-moment basis into a second model, trained with machine learning based on historical workflows, to generate at least one phase segment over the time interval; and provide an action based on a metric of the at least one phase segment.
- 2. The system of claim 1, comprising the one or more processors to: execute, prior to generation of the plurality of sets of consecutive frames, one or more pre-processing functions on the video stream, the one or more pre-processing functions comprising at least one of a central crop transform, frame resizing, filtering of non-surgical frames, or filtering of noisy frames.
- 3. The system of claims 1 or 2, comprising the one or more processors to: generate the first set of consecutive frames with the first temporal resolution that is greater than or equal to a first threshold; and generate the second set of consecutive frames with the second temporal resolution that is less than or equal to a second threshold that is less than the first threshold.
- 4. The system of any one of claims 1-3, comprising the one or more processors to: generate a third set of consecutive frames with a varying temporal threshold that varies based at least in part on a function.
- 5. The system of any one of claims 1-4, wherein the phases of the procedure comprise at least one of exposure, dissection, transection, extraction, or reconstruction.
- 6. The system of any one of claims 1-5, comprising the one or more processors to: determine the phases of the procedure on the moment-to-moment basis via the first model configured with a multi-pathway spatial-temporal decoding unit comprising an attention-based deep learning model.
- 7. The system of any one of claims 1-6, comprising the one or more processors to: input the plurality of sets of consecutive frames into a corresponding plurality of parallel streams to generate corresponding numerical feature outputs; and fuse the numerical feature outputs to generate the phases of the procedure on the moment-to-moment basis.
- 8. The system of any one of claims 1-7, comprising the one or more processors to: determine a variance in the phases of the procedure determined on the moment-to-moment basis; generate, based on the variance and a phase transition map configured with a plurality of prior probabilities, a plurality of uncertainty-aware phase boundaries throughout the time interval; and generate the at least one phase segment that corresponds to a phase boundary of the plurality of uncertainty-aware phase boundaries.
- 9. The system of any one of claims 1-8, comprising the one or more processors to: generate the at least one phase segment based at least in part on one or more of an average duration of phases, an ordering of phases, a type of the procedure, or a site location of the procedure.
- 10. The system of any one of claims 1-9, comprising: the one or more processors to determine the phases of the procedure on the moment-to-moment basis based on a combination of the video stream and at least one of a stream of system events data or a stream of kinematics data.
- 11. The system of any one of claims 1-10, comprising: the one or more processors to provide the action indicating a level of performance of the procedure during a phase segment of the at least one phase segment determined over the time interval.
- 12. The system of any one of claims 1-11, comprising the one or more processors to: determine a phase segment of the at least one phase segment at a current time; identify a tool used in the phase segment based on a stream of system events; determine that the tool does not match any of a predetermined set of tools configured for the phase segment; and provide an alert during the phase segment responsive to the determination that the tool does not match any of the predetermined set of tools configured for the phase segment.
- 13. A non-transitory computer-readable medium storing processor executable instructions that, when executed by one or more processors, cause the one or more processors to: receive a video stream that captures a procedure over a time interval with a robotic medical system; generate, from the video stream, a plurality of sets of consecutive frames comprising a first set of consecutive frames with a first temporal resolution and a second set of consecutive frames with a second temporal resolution; determine, via the plurality of sets of consecutive frames input into a first model trained with machine learning, phases of the procedure on a moment-to-moment basis over the time interval; input the phases of the procedure determined on the moment-to-moment basis into a second model, trained with machine learning based on historical workflows, to generate at least one phase segment over the time interval; and provide an action based on a metric of the at least one phase segment.
- 14. The non-transitory computer-readable medium of claim 13, wherein the instructions further include instructions to: execute, prior to generation of the plurality of sets of consecutive frames, one or more pre-processing functions on the video stream, the one or more pre-processing functions comprising at least one of a central crop transform, frame resizing, filtering of non-surgical frames, or filtering of noisy frames.
- 15. The non-transitory computer-readable medium of claims 13 or 14, wherein the instructions further include instructions to: generate the first set of consecutive frames with the first temporal resolution that is greater than or equal to a first threshold; and generate the second set of consecutive frames with the second temporal resolution that is less than or equal to a second threshold that is less than the first threshold.
- 16. The non-transitory computer-readable medium of any one of claims 13-15, wherein the instructions further include instructions to: generate a third set of consecutive frames with a varying temporal threshold that varies based at least in part on a function.
- 17. The non-transitory computer-readable medium of any one of claims 13-16, wherein the phases of the procedure comprise at least one of exposure, dissection, transection, extraction, or reconstruction.
- 18. The non-transitory computer-readable medium of any one of claims 13-17, wherein the instructions further include instructions to: determine the phases of the procedure on the moment-to-moment basis via the first model configured with a multi-pathway spatial-temporal decoding unit comprising an attention-based deep learning model.
- 19. A method, comprising: receiving, by one or more processors coupled with memory, a video stream that captures a procedure over a time interval with a robotic medical system; generating, by the one or more processors from the video stream, a plurality of sets of consecutive frames comprising a first set of consecutive frames with a first temporal resolution and a second set of consecutive frames with a second temporal resolution; determining, by the one or more processors, via the plurality of sets of consecutive frames input into a first model trained with machine learning, phases of the procedure on a moment-to-moment basis over the time interval; inputting, by the one or more processors, the phases of the procedure determined on the moment-to-moment basis into a second model, trained with machine learning based on historical workflows, to generate at least one phase segment over the time interval; and providing, by the one or more processors, an action based on a metric of the at least one phase segment.
- 20. The method of claim 19, comprising: executing, by the one or more processors, prior to generating the plurality of sets of consecutive frames, one or more pre-processing functions on the video stream, the one or more pre-processing functions comprising at least one of a central crop transform, frame resizing, filtering of non-surgical frames, or filtering of noisy frames.
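
The pre-processing functions recited in claims 2, 14, and 20 can be illustrated with a short, hypothetical sketch (OpenCV assumed). The crop ratio, the red-dominance test for non-surgical frames, and the blur-variance test for noisy frames are placeholder heuristics, not the claimed implementations.

```python
# Hypothetical sketch of the pre-processing in claims 2, 14, and 20: central crop,
# frame resizing, and filtering of non-surgical or noisy frames (OpenCV assumed).
import cv2
import numpy as np

def central_crop(frame: np.ndarray, ratio: float = 0.8) -> np.ndarray:
    """Keep the central `ratio` portion of the frame (ratio is an assumption)."""
    h, w = frame.shape[:2]
    dh, dw = int(h * (1 - ratio) / 2), int(w * (1 - ratio) / 2)
    return frame[dh:h - dh, dw:w - dw]

def looks_surgical(frame: np.ndarray) -> bool:
    """Placeholder heuristic: endoscopic scenes tend to be red-dominant."""
    b, g, r = cv2.split(frame.astype(np.float32))
    return float(r.mean()) > float(g.mean()) and float(r.mean()) > float(b.mean())

def is_noisy(frame: np.ndarray, blur_thresh: float = 20.0) -> bool:
    """Placeholder heuristic: low Laplacian variance flags a blurred/noisy frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < blur_thresh

def preprocess(frames, size=(224, 224)):
    """Crop, resize, and drop frames that fail either filter."""
    for frame in frames:
        frame = cv2.resize(central_crop(frame), size)
        if looks_surgical(frame) and not is_noisy(frame):
            yield frame
```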
Description
SYSTEMS AND METHODS FOR UNIVERSAL PHASE RECOGNITION FOR INTRAOPERATIVE AND POSTOPERATIVE APPLICATIONS

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] This application claims the benefit of, and priority under 35 U.S.C. § 119 to, U.S. Provisional Patent Application No. 63/511,592, filed June 30, 2023, which is hereby incorporated by reference herein in its entirety.

BACKGROUND

[0002] Surgical procedures can involve capturing imagery such as video feeds from a variety of viewpoints. For example, in some instances, at least part of the surgical procedure can be performed with a computer-assisted robotic medical system. A medical tool, such as an imaging device, can be used in the robotic medical system to provide imagery. Data sources such as cameras, sensors, etc. can be located at various viewpoints in the surgical facility to capture and provide imagery of various aspects of the surgical procedure. The captured imagery from the surgical procedure can be processed in various ways.

SUMMARY

[0003] This technical solution is generally related to systems and methods for universal phase recognition for intraoperative and postoperative applications. The technology can automatically recognize full-length surgical phases that take place at any moment in a procedure of robot-assisted surgery. The phases can refer to high-level, universal activities that can occur in different types of procedures, and can include: exposure, dissection, transection, extraction, and reconstruction.

[0004] This technical solution can recognize surgical phases using one or more approaches. For example, the technology can perform phase recognition using machine learning based on surgical videos, system events, or kinematics data. In another example, this technical solution can perform phase recognition using low-level surgical task annotations.

[0005] To perform phase recognition using machine learning based on surgical videos, the technology can generate multiple sets of consecutive frames (e.g., 3) with different temporal resolutions (e.g., short, long, and varying). The technology can use a machine learning model to generate moment-to-moment phase predictions using the multiple sets of consecutive frames. For example, the technology can use a multi-pathway spatial-temporal decoding unit configured with an attention-based deep learning model. The model can be based on or utilize a vision transformer architecture with self-attention over space and time to allow for joint spatial and temporal feature learning from the video stream. The technology can use multiple parallel streams to process the multiple sets of consecutive frames in a simultaneous or overlapping manner in order to extract features. The technology can perform feature-level fusion by aggregating numerical feature outputs of the multiple parallel streams to make moment-wise phase predictions based on an aggregated feature vector.
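
As a rough illustration of the multi-pathway decoding described in paragraph [0005], the following sketch (PyTorch assumed) samples clips at two temporal resolutions, encodes them in parallel pathways, and fuses the resulting features into a moment-wise phase prediction. The toy per-frame encoder, stride values, and dimensions are illustrative assumptions, not the architecture disclosed here.

```python
# Minimal sketch: multi-temporal-resolution sampling feeding parallel pathways
# whose features are fused for a moment-wise phase prediction (PyTorch assumed).
import torch
import torch.nn as nn

PHASES = ["exposure", "dissection", "transection", "extraction", "reconstruction"]

def sample_clip(video: torch.Tensor, center: int, length: int, stride: int) -> torch.Tensor:
    """Take `length` consecutive frames around `center` at a given temporal stride."""
    t = video.shape[0]
    idx = center + stride * (torch.arange(length) - length // 2)
    return video[idx.clamp(0, t - 1)]               # (length, C, H, W), edge-padded

class Pathway(nn.Module):
    """One parallel stream: encodes a clip into a single numerical feature vector."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(                # toy stand-in for a video transformer
            nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU()
        )
    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        per_frame = self.encoder(clip)               # (length, feat_dim)
        return per_frame.mean(dim=0)                 # temporal pooling -> (feat_dim,)

class MultiPathwayPhaseModel(nn.Module):
    """Fuses features from pathways running at different temporal resolutions."""
    def __init__(self, strides=(1, 8), feat_dim: int = 64):
        super().__init__()
        self.strides = strides
        self.pathways = nn.ModuleList([Pathway(feat_dim) for _ in strides])
        self.classifier = nn.Linear(feat_dim * len(strides), len(PHASES))
    def forward(self, video: torch.Tensor, center: int, length: int = 16) -> torch.Tensor:
        feats = [p(sample_clip(video, center, length, s))
                 for p, s in zip(self.pathways, self.strides)]
        return self.classifier(torch.cat(feats))     # feature-level fusion -> logits

if __name__ == "__main__":
    video = torch.rand(600, 3, 64, 64)               # a toy 600-frame stream
    logits = MultiPathwayPhaseModel()(video, center=300)
    print(PHASES[int(logits.argmax())])
```

A third pathway with a varying stride, driven by a sampling function as in the varying temporal resolution mentioned above, could be appended to the list of pathways in the same way.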
[0006] The technology can then recognize full-length phases from the moment-to-moment phase predictions using a second machine learning model that is trained with surgical workflows. The technology can utilize a phase transition map with priors and probability to transition to each phase label in order to recognize the full-length phases from the moment-to-moment phase predictions. For example, the technology can identify or find boundaries of each surgical phase and generate full-length phase recognition for a procedure from moment-to-moment phase predictions. In some cases, the technology can quantify the uncertainty in the boundaries based on the variance in the moment-to-moment predictions. For example, the technology can use a long-range temporal module based on analysis of surgical workflows to model a distribution over phases at each moment in time. The model can combine information about both the average duration of phases and the temporal ordering of phases in order to define a prior belief of the likelihood of staying within the same phase, or transitioning to each of the other phases. The technology can model the likelihood of each phase label for a given timestamp to generate uncertainty-aware phase boundaries throughout an entire case in order to provide real-time phase predictions. Thus, the technology can transition from moment-to-moment to full-length phase predictions that can incorporate a variety of different information sources in different ways, including, for example: averaged predictions; transition probabilities refined by additional information about procedure type or hospital site; unique events that identify boundaries of certain phases, such as installation and unmounting of a needle driver; decision-tree methods; or multi-modal approaches.

[0007] The technology can adapt this approach to an event stream (e.g., information about tool installation or uninstallation) or a kinematics stream (e.g., for task detection), as well as a combination of the video stream, event stream, and kinematics stream.

[0008] To perform phase recognition based
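
To make the transition-prior approach of paragraph [0006] concrete, the following sketch (NumPy assumed) decodes moment-to-moment phase probabilities into full-length segments using a phase transition map, then scores each detected boundary by the local variance of the predictions. The duration-derived self-transition prior, the mostly-sequential ordering, and the variance heuristic are illustrative assumptions, not the disclosed workflow model.

```python
# Minimal sketch: smooth moment-to-moment phase probabilities into full-length
# segments with a phase transition map, then flag uncertain boundaries (NumPy assumed).
import numpy as np

PHASES = ["exposure", "dissection", "transection", "extraction", "reconstruction"]

def transition_map(avg_duration: int = 200, n: int = len(PHASES)) -> np.ndarray:
    """Prior: stay in a phase ~avg_duration moments; otherwise favor the next phase."""
    stay = 1.0 - 1.0 / avg_duration
    T = np.full((n, n), (1.0 - stay) / (10 * (n - 1)))   # rare out-of-order jumps
    for i in range(n):
        T[i, i] = stay
        T[i, (i + 1) % n] = (1.0 - stay) * 0.9           # mostly sequential workflow
    return T / T.sum(axis=1, keepdims=True)

def viterbi(probs: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Most likely full-length phase sequence given (n_t, n) moment-wise probabilities."""
    n_t, n = probs.shape
    logp, logT = np.log(probs + 1e-9), np.log(T)
    score, back = np.zeros((n_t, n)), np.zeros((n_t, n), dtype=int)
    score[0] = logp[0]
    for t in range(1, n_t):
        cand = score[t - 1][:, None] + logT              # (from-phase, to-phase)
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + logp[t]
    path = np.zeros(n_t, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n_t - 2, -1, -1):                     # backtrack the best path
        path[t] = back[t + 1, path[t + 1]]
    return path

def boundary_uncertainty(probs: np.ndarray, path: np.ndarray, win: int = 5) -> list:
    """Crude proxy: variance of moment-wise labels in a window around each boundary."""
    bounds = np.flatnonzero(np.diff(path)) + 1
    return [(int(b), float(probs[max(0, b - win): b + win].argmax(axis=1).var()))
            for b in bounds]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(len(PHASES)), size=1000)  # stand-in model outputs
    path = viterbi(probs, transition_map())
    print(boundary_uncertainty(probs, path))
```

Here `probs` would be the softmaxed moment-to-moment outputs of the first model; boundaries with high local variance could then receive the uncertainty-aware handling described in paragraph [0006].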