CN-122023647-A - Three-dimensional reconstruction method, device, equipment and medium based on endoscope
Abstract
The invention discloses an endoscope-based three-dimensional reconstruction method, device, equipment and medium, belonging to the technical field of image processing. The method comprises: acquiring a video frame sequence and a pose sequence; inputting the video frame sequence and the pose sequence into a neural network model for processing to obtain a three-dimensional point cloud corresponding to the video frame sequence, the neural network model being used for fusing the video frame data of each time point according to the pose data of the endoscope at each time point; and generating a three-dimensional reconstruction model of a target area based on the three-dimensional point cloud corresponding to the video frame sequence. The scheme improves the accuracy of three-dimensional reconstruction from the two-dimensional images acquired by the endoscope during rapid movement.
Inventors
- YAO ZHIYUAN
- WANG QIUGE
Assignees
- 江苏圆和医疗科技有限公司 (Jiangsu Yuanhe Medical Technology Co., Ltd.)
Dates
- Publication Date
- 20260512
- Application Date
- 20260115
Claims (10)
- 1. An endoscope-based three-dimensional reconstruction method, the method comprising: acquiring a video frame sequence and a pose sequence, wherein the video frame sequence comprises video frame data acquired at each time point by a visual camera of an endoscope in a target area, and the pose sequence comprises pose data of the endoscope at each time point, the pose data being acquired by a pose sensor of the endoscope; inputting the video frame sequence and the pose sequence into a neural network model for processing to obtain a three-dimensional point cloud corresponding to the video frame sequence, wherein the neural network model is used for fusing the video frame data of each time point according to the pose data of the endoscope at each time point; and generating a three-dimensional reconstruction model of the target area based on the three-dimensional point cloud corresponding to the video frame sequence.
- 2. The method of claim 1, wherein the neural network model comprises a first image coding network, a first pose coding network and a multi-modal feature fusion module, wherein the first image coding network is used for performing feature extraction on the video frame data to obtain video feature maps; the first pose coding network is used for performing feature extraction on the pose data to obtain pose vectors; and the multi-modal feature fusion module is used for performing alignment fusion on the video feature maps of different time points according to the relation between the pose vectors of different time points.
- 3. The method of claim 1, wherein the neural network model comprises a second image coding network and a second pose coding network, and inputting the video frame sequence and the pose sequence into a neural network model for processing to obtain a three-dimensional point cloud corresponding to the video frame sequence comprises: inputting the video frame sequence into the second image coding network for feature extraction to obtain video feature maps of each time point; inputting the pose sequence into the second pose coding network for feature extraction to obtain pose feature maps of each time point; concatenating, by time point, the video feature maps and the pose feature maps of each time point to obtain fused feature maps of each time point; and decoding based on the fused feature map of each time point to obtain the three-dimensional point cloud corresponding to the video frame sequence.
- 4. The method of claim 1, wherein the neural network model comprises a third image coding network and a third pose coding network, and inputting the video frame sequence and the pose sequence into a neural network model for processing to obtain a three-dimensional point cloud corresponding to the video frame sequence comprises: inputting the video frame sequence into the third image coding network for feature extraction to obtain video feature maps of each time point; inputting the pose sequence into the third pose coding network for feature extraction to obtain pose feature maps of each time point; obtaining a video weight and a pose weight for each time point according to the video feature maps and the pose feature maps of each time point, wherein the video weight is used for indicating the information content and the credibility of the image at each time point; performing weighted fusion of the video feature map and the pose feature map of each time point according to the video weight and the pose weight of that time point to obtain a fused feature map of each time point; and decoding based on the fused feature map of each time point to obtain the three-dimensional point cloud corresponding to the video frame sequence.
- 5. The method according to claim 4, wherein obtaining the video weight and the pose weight of each time point according to the video feature map and the pose feature map of each time point comprises: for each time point, obtaining the video weight according to at least one of an image sharpness weight, a key-frame priority weight, a focus perception weight and a texture richness weight corresponding to the video feature map; and, for each time point, obtaining the pose weight according to at least one of a pose-to-target direction distance weight, a motion amplitude weight, a view coverage weight and a pose stability weight.
- 6. The method according to claim 4, wherein obtaining the video weight and the pose weight of each time point according to the video feature map and the pose feature map of each time point comprises: for each time point, inputting the video feature map and the pose feature map into a gating weight generation module to obtain the video weight and the pose weight.
- 7. The method according to any one of claims 1 to 6, wherein inputting the video frame sequence and the pose sequence into a neural network model for processing to obtain a three-dimensional point cloud corresponding to the video frame sequence comprises: inputting the video frame sequence and the pose sequence into a neural network model for processing to obtain a three-dimensional point cloud corresponding to the video frame sequence and a predicted pose for each time point; and the method further comprises: correcting the real-time pose of the endoscope according to the predicted pose and the pose data of each time point.
- 8. An endoscope-based three-dimensional reconstruction device, the device comprising: a sequence acquisition module, used for acquiring a video frame sequence and a pose sequence, wherein the video frame sequence comprises video frame data acquired at each time point by a visual camera of the endoscope in a target area, and the pose sequence comprises pose data of the endoscope at each time point, the pose data being acquired by a pose sensor of the endoscope; a three-dimensional point cloud acquisition module, used for inputting the video frame sequence and the pose sequence into a neural network model for processing to obtain a three-dimensional point cloud corresponding to the video frame sequence, wherein the neural network model is used for fusing the video frame data of each time point according to the pose data of the endoscope at each time point; and a three-dimensional reconstruction module, used for generating a three-dimensional reconstruction model of the target area based on the three-dimensional point cloud corresponding to the video frame sequence.
- 9. An electronic device, comprising: a memory and a processor communicatively connected to each other, the memory storing computer instructions, and the processor executing the computer instructions to perform the endoscope-based three-dimensional reconstruction method of any one of claims 1 to 7.
- 10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the endoscope-based three-dimensional reconstruction method of any one of claims 1 to 7.
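Read together, claims 3, 4 and 6 describe a per-time-point pipeline: encode each video frame and each pose, derive a video weight and a pose weight (e.g., via a gating weight generation module), fuse the two feature maps by weighted combination, and decode the fused features into a three-dimensional point cloud. The following is a minimal, shape-level sketch of that pipeline, not the patented model: the encoders, the gating rule, all dimensions and the random "decoder" projection are illustrative assumptions.

```python
# Toy sketch of the claimed fusion pipeline (all shapes and layers assumed).
import numpy as np

rng = np.random.default_rng(0)
T, D = 5, 64                         # time points, feature dimension (assumed)

def encode_frames(frames):           # stand-in for the image coding network
    return frames @ (rng.standard_normal((frames.shape[-1], D)) * 0.01)

def encode_poses(poses):             # stand-in for the pose coding network
    return poses @ (rng.standard_normal((poses.shape[-1], D)) * 0.01)

def gate(v_feat, p_feat):
    # Gating weight module (cf. claim 6): one scalar per modality per time
    # point, normalised with a softmax so the two weights sum to 1.
    scores = np.stack([v_feat.mean(-1), p_feat.mean(-1)], axis=-1)   # (T, 2)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)                              # (T, 2)

frames = rng.standard_normal((T, 128))   # flattened video frames (toy)
poses  = rng.standard_normal((T, 7))     # e.g. position + quaternion per frame

v = encode_frames(frames)                # (T, D) video feature maps
p = encode_poses(poses)                  # (T, D) pose feature maps
w = gate(v, p)                           # (T, 2) video/pose weights
fused = w[:, :1] * v + w[:, 1:] * p      # weighted fusion per time point

# "Decoder": project each fused feature to N points in R^3.
N = 10
points = (fused @ rng.standard_normal((D, N * 3))).reshape(T, N, 3)
print(points.shape)   # (5, 10, 3) — one small point cloud per time point
```

In a real system the encoders would be convolutional/recurrent networks and the decoder a learned point-cloud head; the sketch only fixes the data flow and the weight normalisation.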
Description
Three-dimensional reconstruction method, device, equipment and medium based on endoscope

Technical Field

The present invention relates to the field of image processing technologies, and in particular to an endoscope-based three-dimensional reconstruction method, apparatus, device, and medium.

Background

Minimally invasive endoscopes (e.g., flexible urologic endoscopes, bronchoscopes, cystoscopes) are widely used in minimally invasive procedures. However, because the endoscope has a narrow field of view and provides only two-dimensional images, depth information is lacking, and operators are prone to spatial disorientation and distance-perception errors during operation. To provide navigation to the operator during endoscope operation, the prior art attempts three-dimensional reconstruction based on endoscope images: for example, a binocular stereo endoscope acquires left and right images, and the three-dimensional coordinates of image pixels are computed through camera calibration, stereo rectification and stereo matching to reconstruct the surface morphology of the target area; alternatively, the binocular images are fed to a deep neural network that directly predicts a disparity map from which depth information is computed, enabling fast three-dimensional reconstruction. However, when the endoscope moves rapidly during surgery, so that its viewing angle changes rapidly, the correlation between the images it successively acquires weakens, and the accuracy of three-dimensional reconstruction from the acquired two-dimensional images drops significantly.
Disclosure of Invention

In view of the foregoing, it is desirable to provide an endoscope-based three-dimensional reconstruction method, apparatus, device, and medium that improve the accuracy of three-dimensional reconstruction from the two-dimensional images acquired while the endoscope moves rapidly.

The present application provides an endoscope-based three-dimensional reconstruction method comprising the following steps: acquiring a video frame sequence and a pose sequence, wherein the video frame sequence comprises video frame data acquired at each time point by a visual camera of an endoscope in a target area, and the pose sequence comprises pose data of the endoscope at each time point, the pose data being acquired by a pose sensor of the endoscope; inputting the video frame sequence and the pose sequence into a neural network model for processing to obtain a three-dimensional point cloud corresponding to the video frame sequence, wherein the neural network model is used for fusing the video frame data of each time point according to the pose data of the endoscope at each time point; and generating a three-dimensional reconstruction model of the target area based on the three-dimensional point cloud corresponding to the video frame sequence.

In an optional implementation, the neural network model comprises a first image coding network, a first pose coding network and a multi-modal feature fusion module, wherein the first image coding network is used for performing feature extraction on the video frame data to obtain video feature maps, and the multi-modal feature fusion module is used for performing alignment fusion on the video feature maps of different time points according to the relation between the pose vectors of different time points.
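The multi-modal feature fusion module described above fuses video feature maps of different time points according to the relation between their pose vectors. One simple way such a relation could drive fusion, offered here purely as an illustrative assumption and not as the patent's actual module, is to turn pose-vector distances into attention weights, so that frames whose poses lie close to a chosen reference pose contribute more to the fused feature:

```python
# Assumed sketch: pose-distance attention as a stand-in for "alignment
# fusion according to the relation between pose vectors".
import numpy as np

def pose_attention_fuse(features, poses, ref_idx=0, tau=1.0):
    """features: (T, D) per-frame feature maps; poses: (T, P) pose vectors.

    Returns the fused (D,) feature and the (T,) attention weights.
    """
    d = np.linalg.norm(poses - poses[ref_idx], axis=-1)  # distance to ref pose
    w = np.exp(-d / tau)                                 # closer pose -> larger
    w = w / w.sum()                                      # normalise to sum to 1
    return (w[:, None] * features).sum(axis=0), w

feats = np.arange(12, dtype=float).reshape(4, 3)         # 4 frames, D = 3 (toy)
poses = np.array([[0., 0.], [0., 1.], [5., 5.], [9., 9.]])
fused, w = pose_attention_fuse(feats, poses)
# The reference frame (index 0) receives the largest weight; frames whose
# poses are far from the reference are down-weighted.
print(w.argmax())   # 0
```

A learned module would replace the fixed distance kernel with trained parameters, but the core idea, weighting cross-frame contributions by pose similarity, is the same.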
In an alternative embodiment, the neural network model comprises a second image coding network and a second pose coding network, and inputting the video frame sequence and the pose sequence into a neural network model for processing to obtain a three-dimensional point cloud corresponding to the video frame sequence comprises: inputting the video frame sequence into the second image coding network for feature extraction to obtain video feature maps of each time point; inputting the pose sequence into the second pose coding network for feature extraction to obtain pose feature maps of each time point; concatenating, by time point, the video feature maps and the pose feature maps of each time point to obtain fused feature maps of each time point; and decoding based on the fused feature map of each time point to obtain the three-dimensional point cloud corresponding to the video frame sequence.

In an alternative embodiment, the neural network model comprises a third image coding network and a third pose coding network; inputting the video frame sequence and the pose sequence into a neural network model for processing to obtain a three-dimensional point cloud corresponding to the video frame sequence, wherein the three-dimensional point c