CN-122029576-A - Attention-based learning from video of a region of interest of a patient

CN 122029576 A

Abstract

In some embodiments of the present disclosure, video frame data across one or more videos is processed by a pre-trained encoder, and the resulting frame-based embedding is concatenated for downstream attention-based deep learning network analysis. In various embodiments, different types of self-supervised learning (SSL) pre-training techniques and different types of attention-based deep learning networks are utilized. In a particular example, embodiments of the present disclosure are applied to deriving inferences about pulmonary arterial hypertension based on video data from a transthoracic echocardiogram of a patient. In another example, embodiments of the present disclosure are applied to deriving inferences about anatomical bowel segments in endoscopic video. Details of various examples are further described herein.

Inventors

  • P. Damaseno
  • C. Pama
  • P. Mobadalsani
  • K. Chaitanya
  • Kevin Standish

Assignees

  • Janssen Research & Development, LLC

Dates

Publication Date
2026-05-12
Application Date
2024-08-16
Priority Date
2023-08-16

Claims (20)

  1. A method of analyzing one or more medical videos corresponding to one or more patient examinations using one or more computers, the method comprising: preprocessing video data to obtain a plurality of video frames corresponding to the one or more medical videos; encoding the plurality of video frames using a pre-trained encoder to obtain a plurality of frame embeddings corresponding to respective frames of the plurality of video frames, wherein the pre-trained encoder has been pre-trained using self-supervised learning; concatenating the plurality of frame embeddings; processing the plurality of frame embeddings in an attention-based deep learning network to obtain a summary vector generated by an attention process attending to each frame embedding of the plurality of frame embeddings; and submitting the summary vector to a classification network to obtain one or more calculated classifications corresponding to the one or more medical videos of the patient examination.
  2. The method of claim 1, wherein the one or more medical videos comprise a plurality of videos such that the plurality of frame embeddings concatenated for further processing comprises a frame embedding from each of the plurality of videos.
  3. The method of any one of claims 1-2, wherein the patient examination is a transthoracic echocardiogram.
  4. The method of any one of claims 1-3, wherein the pre-trained encoder comprises a vision transformer (ViT) encoder.
  5. The method of any one of claims 1-3, wherein the pre-trained encoder comprises a convolutional neural network.
  6. The method of any one of claims 1-5, wherein the pre-trained encoder is trained using SimCLR.
  7. The method of any one of claims 1-5, wherein the pre-trained encoder is trained using DINO.
  8. The method of any one of claims 1-5, wherein the pre-trained encoder is trained using DINOv2.
  9. The method of any one of claims 1-8, wherein the attention-based deep learning network comprises a transformer multi-headed attention encoder.
  10. The method of claim 9, wherein the transformer multi-headed attention encoder comprises a set transformer encoder.
  11. The method of any one of claims 1-8, wherein the attention-based deep learning network comprises a convolutional neural network encoder.
  12. The method of any one of claims 1-11, wherein the attention-based deep learning network comprises an attention pooling layer.
  13. The method of claim 12, wherein the attention pooling layer comprises an attention block and an aggregator.
  14. The method of claim 13, wherein the attention block associates respective learnable parameters with respective frame embedding outputs of an encoder of the attention-based deep learning network.
  15. The method of claim 14, wherein the aggregator applies the respective learnable parameters to the respective frame embedding outputs and sums the resulting attention-scored frame embedding outputs to obtain the summary vector.
  16. The method of any one of claims 1-15, wherein the one or more calculated classifications relate to pulmonary arterial hypertension.
  17. A method of analyzing one or more medical videos corresponding to one or more patient examinations using one or more computers, the method comprising: preprocessing video data to obtain a plurality of video frames corresponding to the one or more medical videos; encoding the plurality of video frames using a pre-trained encoder to obtain a first plurality of frame embeddings corresponding to respective ones of the plurality of video frames, wherein the pre-trained encoder has been pre-trained using self-supervised learning; processing the first plurality of frame embeddings in an attention-based deep learning network trained using supervised learning to obtain a second plurality of frame embeddings corresponding to respective ones of the first plurality of frame embeddings; and deriving, for a frame embedding of the second plurality of frame embeddings, a calculated classification for patient tissue corresponding to the frame embedding.
  18. The method of claim 17, wherein the patient examination is an endoscopic examination.
  19. The method of claim 18, wherein the calculated classification is as to which of a plurality of bowel segments corresponds to the frame embedding.
  20. A computer program product stored in a non-transitory tangible medium and comprising instructions executable on one or more processors of one or more computers to implement a process for analyzing one or more medical videos corresponding to a patient examination, the process comprising performing the method of any one of claims 1-19.
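Claims 12-15 describe an attention pooling layer built from an attention block (which associates learnable parameters with each frame embedding) and an aggregator (which applies those parameters and sums the attention-scored embeddings into the summary vector of claim 1). The following is a minimal NumPy sketch of that mechanism; the shapes, the tanh scoring function, and the parameter names `w` and `v` are illustrative assumptions, not specified by the claims:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of attention scores."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(frame_embeddings, w, v):
    """Sketch of the attention block + aggregator of claims 13-15.

    frame_embeddings: (n_frames, d) outputs of the network's encoder
    w: (d, h), v: (h,) -- hypothetical learnable parameters of the
    attention block (one scalar score per frame embedding).
    Returns the (d,)-dim summary vector.
    """
    scores = np.tanh(frame_embeddings @ w) @ v   # one score per frame
    alpha = softmax(scores)                      # attention weights, sum to 1
    return alpha @ frame_embeddings              # weighted sum of embeddings

# Toy usage: 6 frame embeddings of dimension 8.
rng = np.random.default_rng(0)
E = rng.normal(size=(6, 8))
w = rng.normal(size=(8, 4))
v = rng.normal(size=(4,))
summary = attention_pool(E, w, v)
print(summary.shape)  # (8,)
```

In a trained system, `summary` would then be submitted to the classification network of claim 1; here the parameters are random because only the pooling structure is being illustrated.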

Description

Attention-based learning from video of a region of interest of a patient

Cross Reference to Related Applications

This application claims the benefit of U.S. provisional application 63/533,101, filed August 16, 2023, U.S. provisional application 63/599,994, filed November 16, 2023, and U.S. provisional application 63/555,883, filed February 20, 2024. The contents of these applications are incorporated herein by reference.

Background

The present disclosure relates generally to computerized techniques for making inferences from one or more videos of a region of interest of a patient in the medical field.

Disclosure of Invention

In some medical image analysis contexts, the "best" view is typically selected for algorithm-based inference from patient data. For example, in the context of transthoracic echocardiography (TTE), many TTE videos are generated during a patient examination, but only a subset (or even only one) of those videos is selected (e.g., via computerized analysis) for more detailed analysis by a computerized system. However, video data in the unselected views/videos may provide information useful for computerized analysis. At the same time, video data is inherently bulky, and computerized analysis of video data can be resource intensive. Attention-based deep learning methods also require significant processing resources, and the level of resource consumption depends on the granularity at which the attention mechanism is applied. Thus, effective attention-based learning benefits from a carefully considered choice of the level of data aggregation for attention-based analysis of medical video data. Furthermore, there is a need for methods that can utilize information available across multiple videos, whether or not a particular video is determined to be the "best" view.
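The multi-video aggregation just described — embedding frames from every available video, not only the "best" view, and concatenating the frame embeddings into one sequence — can be sketched in a few lines. The linear "encoder" below is a toy stand-in for the pre-trained SSL encoder (a ViT or CNN in the disclosure), and all shapes and variable names are illustrative assumptions:

```python
import numpy as np

def encode_frames(frames, proj):
    """Toy stand-in for the pre-trained frame encoder: flattens each
    preprocessed frame and projects it to an embedding. In the disclosure
    this role is played by an SSL-pre-trained ViT or CNN encoder."""
    return frames.reshape(frames.shape[0], -1) @ proj

# Three videos from one examination, of different lengths, with toy
# 16x16 single-channel frames after preprocessing.
rng = np.random.default_rng(1)
videos = [rng.normal(size=(n, 16, 16)) for n in (30, 45, 25)]
proj = rng.normal(size=(16 * 16, 32))  # hypothetical 256 -> 32 projection

# Embed each video's frames, then concatenate all frame embeddings into a
# single sequence for the downstream attention-based network.
all_embeddings = np.concatenate(
    [encode_frames(v, proj) for v in videos], axis=0)
print(all_embeddings.shape)  # (100, 32)
```

The resulting `(100, 32)` sequence treats frames from all three videos uniformly, which is what lets the attention mechanism weigh information from unselected views rather than discarding them.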
In some embodiments of the present disclosure, video frame data across one or more videos is processed by a pre-trained encoder, and the resulting frame-based embeddings are concatenated for downstream analysis by an attention-based deep learning network. In various embodiments, different types of self-supervised learning (SSL) pre-training techniques and different types of attention-based deep learning networks are utilized. In one example, embodiments of the present disclosure are applied to deriving inferences about pulmonary arterial hypertension (PH) based on video data from a transthoracic echocardiogram of a patient. In another example, embodiments of the present disclosure are applied to identifying an anatomical segment of bowel corresponding to a frame of endoscopic video. Details of various examples are further described herein.

Drawings

Fig. 1 illustrates a medical video processing system according to an embodiment of the present disclosure. Fig. 2 illustrates a medical video processing system according to another embodiment of the present disclosure. Fig. 3 illustrates a self-supervised learning (SSL) pre-training method according to an embodiment of the present disclosure. Fig. 4 illustrates a method for processing medical videos corresponding to a patient examination through a computerized deep learning network to make patient-related inferences based on the medical videos, in accordance with the present disclosure. Figs. 5a-5d illustrate frame-by-frame attention representations obtained from a medical examination video dataset at various levels of granularity. Fig. 6 illustrates an example of a computer system, one or more of which may be used to implement one or more of the devices, systems, and methods described herein. Although embodiments of the present disclosure have been described with reference to the above figures, the figures are intended to be illustrative. Other embodiments are consistent with the spirit and scope of the present disclosure.
Detailed Description

Various embodiments will now be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof and which show, by way of illustration, specific examples in which embodiments may be practiced. This disclosure may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. The present disclosure may be embodied, inter alia, in methods or apparatus. Thus, any of the various embodiments herein may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. The following description is, therefore, not to be taken in a limiting sense. Fig. 1 illustrates a medical video processing system 1000 according to an embodiment of the present disclosure. This and other embodiments will be described with reference to video from a transthoracic echoc