CN-122023497-A - Ultra-long video depth consistency estimation system based on time sequence association
Abstract
The invention discloses an ultra-long video depth consistency estimation system based on time-sequence association. The system comprises an input preprocessing module, a multi-scale feature extraction module, a time-sequence association modeling module, and a dual-mode depth estimation head module. The input preprocessing module converts an original video into a frame sequence the model can process. The multi-scale feature extraction module, built on a restructured Vision Transformer, extracts multi-scale semantic features from the preprocessed frames. The time-sequence association modeling module models the temporal association of cross-frame features and supports ultra-long video and streaming processing through dynamic buffering. The dual-mode depth estimation head module performs both relative depth estimation and metric depth estimation; its loss function is $L_{total} = L_{depth} + \lambda L_{temporal}$, where $L_{depth}$ is the depth estimation loss and $L_{temporal}$ is the temporal consistency loss. The system effectively realizes depth-consistent estimation over ultra-long videos, supports streaming, covers both relative and metric depth estimation, improves accuracy and temporal stability, and suits video depth analysis scenarios.
Inventors
- JIANG KELEI
- ZHAO TIANCHENG
- ZHANG HUI
- YING YILE
Assignees
- 杭州联汇科技股份有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-22
Claims (7)
- 1. An ultra-long video depth consistency estimation system based on time-sequence association, characterized by comprising: an input preprocessing module for converting the original video into a frame sequence the model can process; a multi-scale feature extraction module, implemented on a restructured Vision Transformer, for extracting multi-scale semantic features from the preprocessed frames produced by the input preprocessing module; a time-sequence association modeling module for modeling the temporal association of cross-frame features and supporting ultra-long video and streaming processing through dynamic buffering; and a dual-mode depth estimation head module for performing relative depth estimation and metric depth estimation; wherein the loss function of the dual-mode depth estimation head module consists of a depth estimation loss and a temporal consistency loss: $L_{total} = L_{depth} + \lambda L_{temporal}$, where $L_{depth}$ is the depth estimation loss and $L_{temporal}$ is the temporal consistency loss.
- 2. The ultra-long video depth consistency estimation system based on time-sequence association as set forth in claim 1, wherein the input preprocessing module converts the original video into a frame sequence the model can process as follows. First, frame sampling is performed according to the maximum frame count $N_{max}$: the sampled frame count is $N = N_0$ when $N_{max} = -1$ and $N = \min(N_{max}, N_0)$ otherwise, where $N_0$ is the original video frame count. Resolution scaling is then performed based on the target input size $S$ and the maximum side length $L_{max}$: the frame is scaled so that its width/height does not exceed $L_{max}$, with scaling factor $s = \min(1, L_{max}/\max(W, H))$ and scaled size $(W', H') = (\lfloor sW \rfloor, \lfloor sH \rfloor)$, where $W, H$ are the original frame width/height. Finally, format conversion is performed: the RGB frame is converted to the model input format and pixel values are normalized to $[0, 1]$. In these steps, $S$ defaults to 518 (the input patch-grid size of the ViT encoder), $L_{max}$ defaults to 1280, and $N_{max}$ defaults to −1 (all frames).
- 3. The ultra-long video depth consistency estimation system based on time-sequence association according to claim 1 or 2, wherein the multi-scale feature extraction module extracts multi-scale semantic features as follows. Patch segmentation is first performed: the frame is partitioned into non-overlapping patches of size $p \times p$, giving a patch sequence of length $N_p = (H'/p) \cdot (W'/p)$. A ViT encoder then applies a linear projection $E$ to map each patch to a high-dimensional feature, and an $L$-layer encoder extracts multi-scale features; the $l$-th layer outputs features $F_l \in \mathbb{R}^{N_p \times C_l}$ ($l = 1, \dots, L$), where $C_l$ is the channel count of layer $l$. Finally, multi-scale feature fusion is performed: the features of different layers $\{F_l\}$ are passed through 1×1 convolutions to unify the channel count, upsampled to a common resolution, and combined into the fused feature $F_{fused}$.
- 4. The ultra-long video depth consistency estimation system based on time-sequence association according to claim 1 or 2, wherein the time-sequence association modeling module comprises: a time-sequence attention sub-module, which captures the feature association between the current frame and historical frames through time-sequence window attention; and a dynamic hidden-state buffer sub-module, which supports ultra-long video and streaming processing by caching the time-sequence attention hidden state $h_t$ of each frame, where $h_t$ is the hidden state of the $t$-th frame and has the same dimension as the encoder output features.
- 5. The ultra-long video depth consistency estimation system based on time-sequence association as set forth in claim 4, wherein the time-sequence attention sub-module captures the feature association between the current frame and historical frames as follows. First, query/key/value projections are applied to the current frame's fused feature $F_t$ and the features of the previous $w$ frames: $Q = W_Q F_t$, $K = W_K [F_{t-w}; \dots; F_{t-1}]$, $V = W_V [F_{t-w}; \dots; F_{t-1}]$, where $W_Q$, $W_K$, $W_V$ are projection matrices. Time-sequence attention is then computed: $A = \mathrm{softmax}(QK^{\top}/\sqrt{d})$, $O = AV$, where $A$ is the time-sequence attention weight and $O$ is the time-sequence attention output. Finally, feature fusion combines the current frame feature with the attention output to obtain the time-sequence-enhanced feature $\tilde{F}_t = \alpha F_t + \beta O$, where $\alpha$, $\beta$ are fusion parameters that keep the output dimension consistent with $F_t$.
- 6. The ultra-long video depth consistency estimation system based on time-sequence association according to claim 4, wherein the dynamic hidden-state buffer sub-module supports ultra-long video and streaming processing as follows. First, the cache is initialized: when streaming begins, the cache $B$ is empty, and the hidden states of the first $w$ frames are stored directly in the cache. The cache is then updated: for the $t$-th frame ($t > w$), the hidden state $h_t$ is written into $B$ after computation, so that $B = \{h_{t-w+1}, \dots, h_t\}$. Finally, dynamic cleanup is performed: when the number of cached frames exceeds the capacity $w$ (whose default is determined by the input shape), only the hidden states of the most recent $w$ frames are retained and older states are deleted.
- 7. The ultra-long video depth consistency estimation system based on time-sequence association according to claim 1 or 2, wherein the dual-mode depth estimation head module performs relative depth estimation and metric depth estimation as follows: the time-sequence-enhanced feature $\tilde{F}_t$ is reduced in dimension through a convolution layer and upsampled to the original frame resolution to obtain a relative depth map; on the basis of the relative depth, a metric calibration sub-module converts the relative depth into physically meaningful metric depth, with calibration parameters obtained by pre-training on the Virtual KITTI and IRS datasets.
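For concreteness, the following is a minimal PyTorch-style sketch of the two-term loss in claim 1. The specific forms used here for $L_{depth}$ (a scale-invariant log loss) and $L_{temporal}$ (a frame-to-frame depth-difference penalty), and the value of $\lambda$, are illustrative assumptions; the patent fixes only the overall form $L_{total} = L_{depth} + \lambda L_{temporal}$.

```python
import torch

def depth_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Illustrative L_depth: scale-invariant log loss (an assumption;
    the patent does not specify the exact form)."""
    d = torch.log(pred.clamp(min=1e-6)) - torch.log(gt.clamp(min=1e-6))
    return (d ** 2).mean() - 0.5 * d.mean() ** 2

def temporal_loss(depths: torch.Tensor) -> torch.Tensor:
    """Illustrative L_temporal: penalize depth change between
    consecutive frames. depths: (T, H, W)."""
    return (depths[1:] - depths[:-1]).abs().mean()

def total_loss(pred: torch.Tensor, gt: torch.Tensor,
               lam: float = 0.1) -> torch.Tensor:
    """L_total = L_depth + lambda * L_temporal (form fixed by claim 1;
    the lambda value here is an assumption)."""
    return depth_loss(pred, gt) + lam * temporal_loss(pred)
```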
Description
Ultra-long video depth consistency estimation system based on time sequence association

Technical Field

The invention relates to the field of video depth estimation, and in particular to an ultra-long video depth consistency estimation system based on time-sequence association.

Background

In the field of video depth estimation, the prior art has several pain points. Adaptation to ultra-long video is poor: long videos are difficult to process, so the advantages of video depth estimation cannot be fully exploited in long-video scenarios. Temporal consistency is insufficient: depth information lacks consistency across video frames, affecting the accuracy and reliability of depth estimation. Computational efficiency is low: processing large-scale video data is time-consuming and resource-intensive. Streaming processing is missing: video streams cannot be processed in real time, limiting the use of video depth estimation in real-time applications.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides an ultra-long video depth consistency estimation system based on time-sequence association: a time-sequence association modeling module models the temporal association of cross-frame features and supports ultra-long video and streaming processing, improving temporal consistency; a multi-scale feature extraction module improves feature extraction efficiency; and a dual-mode depth estimation head module covers both relative and metric depth estimation. Together these address poor ultra-long video adaptation, insufficient temporal consistency, low computational efficiency, and missing streaming support.

To this end, the invention provides an ultra-long video depth consistency estimation system based on time-sequence association, comprising: an input preprocessing module for converting the original video into a frame sequence the model can process; a multi-scale feature extraction module, implemented on a restructured Vision Transformer, for extracting multi-scale semantic features from the preprocessed frames produced by the input preprocessing module; a time-sequence association modeling module for modeling the temporal association of cross-frame features and supporting ultra-long video and streaming processing through dynamic buffering; and a dual-mode depth estimation head module for performing relative depth estimation and metric depth estimation. The loss function of the dual-mode depth estimation head module consists of a depth estimation loss and a temporal consistency loss: $L_{total} = L_{depth} + \lambda L_{temporal}$, where $L_{depth}$ is the depth estimation loss and $L_{temporal}$ is the temporal consistency loss.

As a further improvement of the invention, the input preprocessing module converts the original video into a frame sequence the model can process as follows (see the sketch below). First, frame sampling is performed according to the maximum frame count $N_{max}$: the sampled frame count is $N = N_0$ when $N_{max} = -1$ and $N = \min(N_{max}, N_0)$ otherwise, where $N_0$ is the original video frame count. Resolution scaling is then performed based on the target input size $S$ and the maximum side length $L_{max}$: the frame is scaled so that its width/height does not exceed $L_{max}$, with scaling factor $s = \min(1, L_{max}/\max(W, H))$ and scaled size $(W', H') = (\lfloor sW \rfloor, \lfloor sH \rfloor)$, where $W, H$ are the original frame width/height. Finally, format conversion is performed: the RGB frame is converted to the model input format and pixel values are normalized to $[0, 1]$. In these steps, $S$ defaults to 518 (the input patch-grid size of the ViT encoder), $L_{max}$ defaults to 1280, and $N_{max}$ defaults to −1 (all frames).
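A minimal sketch of these preprocessing steps, assuming a ViT patch size of 14 (so that the 518 default input size equals 37 × 14), uniform frame sampling, and floor-to-patch-multiple rounding; the function names and rounding scheme are illustrative, not taken from the patent:

```python
import numpy as np

PATCH = 14        # assumed ViT patch size; 518 default input size = 37 * 14
MAX_SIDE = 1280   # default maximum width/height from the claims
MAX_FRAMES = -1   # default maximum frame count; -1 keeps all frames

def sample_frames(n_orig: int, max_frames: int = MAX_FRAMES) -> np.ndarray:
    """Frame sampling: keep all frames when max_frames == -1,
    otherwise sample max_frames indices uniformly."""
    if max_frames < 0 or n_orig <= max_frames:
        return np.arange(n_orig)
    return np.linspace(0, n_orig - 1, max_frames).round().astype(int)

def scaled_size(w: int, h: int, max_side: int = MAX_SIDE) -> tuple:
    """Resolution scaling: s = min(1, max_side / max(w, h)), then
    round each side down to a multiple of the patch size."""
    s = min(1.0, max_side / max(w, h))
    return (max(PATCH, int(w * s) // PATCH * PATCH),
            max(PATCH, int(h * s) // PATCH * PATCH))

def to_model_input(frame: np.ndarray) -> np.ndarray:
    """Format conversion: uint8 RGB -> float32 in [0, 1]."""
    return frame.astype(np.float32) / 255.0
```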
As a further improvement of the invention, the multi-scale feature extraction module extracts multi-scale semantic features as follows. Patch segmentation is first performed: the frame is partitioned into non-overlapping patches of size $p \times p$, giving a patch sequence of length $N_p = (H'/p) \cdot (W'/p)$. A ViT encoder then applies a linear projection $E$ to map each patch to a high-dimensional feature, and an $L$-layer encoder extracts multi-scale features; the $l$-th layer outputs features $F_l \in \mathbb{R}^{N_p \times C_l}$ ($l = 1, \dots, L$), where $C_l$ is the channel count of layer $l$. Finally, multi-scale feature fusion is performed: the features of different layers $\{F_l\}$ are passed through 1×1 convolutions to unify the channel count, upsampled to a common resolution, and combined into the fused feature $F_{fused}$.

As a further improvement of the invention, the time-sequence association modeling module comprises: a time-sequence attention sub-module, which captures the feature association between the current frame and historical frames through time-sequence window attention; and a dynamic hidden-state buffer sub-module, which supports ultra-long video and streaming processing by caching the time-sequence attention hidden state of each frame. A sketch combining both sub-modules follows.
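Below is a minimal PyTorch sketch of the time-sequence attention sub-module together with the dynamic hidden-state cache, under stated assumptions: a fixed window of w historical frames, a deque as the dynamic cache, and scalar fusion parameters alpha and beta. None of these implementation choices are fixed by the patent text.

```python
import torch
from collections import deque

class TemporalWindowAttention(torch.nn.Module):
    """Sketch of the time-sequence attention sub-module: the current
    frame attends to the hidden states of the last `window` frames."""

    def __init__(self, dim: int, window: int = 8):
        super().__init__()
        self.w_q = torch.nn.Linear(dim, dim)   # query projection W_Q
        self.w_k = torch.nn.Linear(dim, dim)   # key projection W_K
        self.w_v = torch.nn.Linear(dim, dim)   # value projection W_V
        self.alpha = torch.nn.Parameter(torch.ones(1))  # fusion weight for F_t
        self.beta = torch.nn.Parameter(torch.ones(1))   # fusion weight for O
        self.cache = deque(maxlen=window)      # dynamic hidden-state buffer B

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (num_patches, dim), fused feature F_t of the current frame.
        if self.cache:
            hist = torch.cat(list(self.cache), dim=0)          # (w * N, dim)
            q, k, v = self.w_q(feat), self.w_k(hist), self.w_v(hist)
            attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
            out = self.alpha * feat + self.beta * (attn @ v)   # F~_t
        else:
            out = feat                       # first frames: no history yet
        self.cache.append(out.detach())      # deque evicts the oldest state
        return out
```

Because the deque is created with maxlen equal to the window size w, appending the newest hidden state automatically evicts the oldest one, which mirrors the dynamic cleanup step of claim 6: only the most recent w frames' states are retained no matter how long the video is.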