CN-122024242-A - Method for intelligently generating report by identifying key text information in real time through multipath video streams
Abstract
The invention provides a method for intelligently generating a report by recognizing key text information in multi-path video streams in real time. The method achieves pixel-level alignment of text regions between frames through precise frame-level alignment of the text region, multidimensional confidence output, and joint modeling of feature points with affine transformation. It dynamically weights and fuses the character confidence of the OCR result, a semantic consistency score, and a spatial stability scoring matrix, adjusts the confidence parameters according to image sharpness, and finally optimizes the recognition confidence decision through nonlinear mapping and dynamic bias compensation.
Inventors
- Bin Junwei
- Yan Guangwen
- Feng Tiancai
Assignees
- 慧点智科(广东)技术有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251211
Claims (10)
- 1. The method for intelligently generating the report by identifying the key text information in real time by the multipath video stream is characterized by comprising the following steps: S1, carrying out frame-level segmentation on multiple video streams, obtaining a continuous frame sequence containing a target text region, and outputting a video frame set arranged in time order; S2, performing OCR (optical character recognition) on each video frame, generating an initial character sequence and a corresponding original confidence score, and constructing a two-dimensional output matrix containing character content and confidence; S3, calculating a mapping relation between pixel-level displacement and deformation based on affine transformation parameter estimation of text regions between adjacent frames, and establishing a space-time consistency reference for cross-frame pixel alignment; S4, carrying out semantic consistency scoring on the recognition results of adjacent frames through a language model, calculating the edit distance and the similarity of context probability distributions, and generating a semantic consistency scoring sequence; S5, counting the recognition consistency frequency and geometric center offset variance of the same character position in a sliding window, and constructing a spatial stability scoring matrix; S6, dynamically adjusting the weighted fusion proportion of the semantic consistency score and the spatial stability score according to the sharpness grade output by an image quality evaluation submodule, generating a temporal consistency composite score; and S7, fusing the original OCR confidence and the temporal consistency composite score through a preset nonlinear mapping function, and outputting a corrected composite confidence for controlling the decision threshold of a subsequent information extraction module.
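The final fusion step S7 can be sketched as a sigmoid nonlinear mapping with a bias term over a weighted mix of the raw OCR confidence and the temporal consistency composite score. This is a minimal illustrative sketch, not the patent's actual mapping function; the function name and all parameter values (`alpha`, `bias`, `w`) are assumptions for illustration.

```python
import math

def fuse_confidence(ocr_conf, temporal_score, alpha=8.0, bias=-5.0, w=0.5):
    """Sketch of S7: fuse raw OCR confidence with the temporal-consistency
    composite score via a sigmoid nonlinear mapping plus a bias term.
    All parameter values here are illustrative assumptions."""
    mixed = w * ocr_conf + (1.0 - w) * temporal_score
    return 1.0 / (1.0 + math.exp(-(alpha * mixed + bias)))

# A temporally stable character clears a typical 0.5 decision threshold,
# while a flickering one with the same raw OCR confidence is suppressed.
print(fuse_confidence(0.9, 0.95) > 0.5)  # → True
print(fuse_confidence(0.9, 0.10) < 0.5)  # → True
```

The bias term shifts the operating point of the sigmoid, which is one plausible reading of the abstract's "dynamic bias compensation": in a full system it would be adjusted per scene rather than fixed.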
- 2. The method for intelligently generating reports by identifying key text information in real time by using multiple paths of video streams according to claim 1, wherein the step S1 specifically comprises: acquiring a real-time video signal through a multi-path video stream input interface, and performing frame synchronization and timestamp alignment on the signal to obtain a video frame sequence with temporal consistency; performing text region detection on each frame with a target detection model, extracting regions of interest containing potential text content, and generating frame-level text region annotation information; cropping the original video frames according to the frame-level text region annotations, extracting the text-related local image regions of each video frame, and forming a preliminary set of text candidate boxes; performing pixel-level alignment on the preliminary text candidate box set based on optical flow estimation and affine transformation parameters computed between adjacent frames, generating a spatio-temporally consistent sequence of text region frames; and segmenting the spatio-temporally consistent text region frame sequence with a sliding window mechanism to construct a continuous, time-ordered video frame set.
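The pixel-level alignment in claim 2 rests on a 2D affine transform between adjacent frames. As a dependency-free sketch (the helper names are hypothetical; a real pipeline would fit the transform robustly from optical flow or many feature matches rather than exactly three points), an affine map can be recovered from three point correspondences and then used to project text-box coordinates from the current frame into the previous frame's coordinate system:

```python
def _solve3(m, v):
    """Solve a 3x3 linear system m @ x = v by Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
              - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
              + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    out = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = v[r]
        out.append(det(mc) / d)
    return out

def estimate_affine(src, dst):
    """Recover affine parameters ((a, b, tx), (c, d, ty)) mapping the three
    src points onto the three dst points (exact fit, no robustness)."""
    m = [[x, y, 1.0] for x, y in src]
    ab = _solve3(m, [x for x, _ in dst])  # a, b, tx
    cd = _solve3(m, [y for _, y in dst])  # c, d, ty
    return ab, cd

def apply_affine(params, pt):
    (a, b, tx), (c, d, ty) = params
    x, y = pt
    return (a * x + b * y + tx, c * x + d * y + ty)

# A pure translation by (5, -2): the recovered transform maps unseen points exactly.
params = estimate_affine([(0, 0), (1, 0), (0, 1)], [(5, -2), (6, -2), (5, -1)])
print(apply_affine(params, (10, 10)))  # → (15.0, 8.0)
```

In practice one would estimate the six parameters by least squares over all matched points, with outlier rejection as described in claim 6.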
- 3. The method for intelligently generating reports by real-time identification of key text information of multi-channel video streams according to claim 2, wherein the frame synchronization and timestamp alignment adopt the IEEE 1588 protocol with nanosecond-level precision; the time alignment error is controlled within 1 ms, the allowable inter-frame delay threshold for frame synchronization is 5 ms, and the interpolation window width for frame-loss compensation is 1–10 frames.
- 4. The method for intelligently generating reports by identifying key text information in real time by using multiple paths of video streams according to claim 1, wherein the step S2 specifically comprises: executing an OCR recognition model based on a CNN-CTC architecture on each frame in the video frame set, extracting the character sequence information in the text region, and obtaining an original character sequence set; generating a corresponding original confidence score for each character position from the decoder output of the OCR model to obtain an original confidence vector sequence; temporally aligning the original character sequence set with the original confidence vector sequence, and constructing, frame by frame, a two-dimensional output matrix of character content and confidence; performing a preliminary semantic consistency screening of the recognition results based on the character content fields in the matrix, and marking character sequence fragments suspected of misrecognition or semantic incoherence; and grading the recognition result of each frame by the confidence values in the original confidence vector sequence, outputting low, medium, and high confidence labels.
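The per-frame character/confidence matrix of step S2 and the three-tier grading can be sketched as follows. The data layout, the function name, and the tier thresholds (0.5 and 0.85) are illustrative assumptions; the patent does not specify threshold values.

```python
def tier_confidence(conf_rows, low=0.5, high=0.85):
    """Sketch of S2's final step: map each character's raw confidence into
    'low' / 'medium' / 'high' tiers. Thresholds are illustrative."""
    def tier(c):
        return "low" if c < low else ("high" if c >= high else "medium")
    return [[tier(c) for c in row] for row in conf_rows]

# Hypothetical two-dimensional output matrix for two frames of the same
# text region: character content rows paired with confidence rows.
matrix = {
    "chars": [["R", "P", "T"], ["R", "P", "7"]],
    "conf":  [[0.93, 0.88, 0.91], [0.95, 0.40, 0.62]],
}
print(tier_confidence(matrix["conf"]))
# → [['high', 'high', 'high'], ['high', 'low', 'medium']]
```

Downstream steps (S4–S7) would then concentrate correction effort on the low and medium tiers, where a misread such as "7" for "T" is most likely.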
- 5. The method for intelligently generating reports according to claim 4, wherein the OCR recognition model uses a CNN-CTC architecture comprising three layers of 3×3 convolutions and a bidirectional LSTM, the input region is normalized to a fixed size, and the output character sequence length does not exceed a maximum character length N.
- 6. The method for intelligently generating reports by identifying key text information in real time by using multiple paths of video streams according to claim 1, wherein the step S3 specifically comprises: extracting key points and their descriptors from the OCR-recognized text region of each frame with a scale-invariant feature transform algorithm, and outputting a feature point coordinate set and the corresponding feature vector set; matching feature points between the previous frame and the current frame with a nearest-neighbor ratio matching algorithm, and outputting the set of successfully matched feature point pairs; iteratively refining the matching relation with a random sample consensus algorithm, and outputting the set of valid feature point pairs after outlier removal together with a preliminary affine transformation parameter matrix; constructing a local deformation compensation model with thin-plate spline interpolation based on the spatial distribution of the valid feature point pairs, and outputting a pixel-level mapping relation matrix; and applying the joint affine and local deformation transformation to the text recognition region of the current frame based on the pixel-level mapping matrix, outputting a character position map aligned with the previous frame.
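The outlier-rejection stage of step S3 can be illustrated with a minimal RANSAC loop. To keep the sketch short it fits a pure-translation model to matched point pairs, whereas the patent fits a full affine transform with thin-plate-spline refinement; the function name and thresholds are assumptions.

```python
import random

def ransac_translation(pairs, thresh=2.0, iters=100, seed=0):
    """Minimal RANSAC sketch: hypothesize a translation from one random
    match, count pairs consistent within `thresh` pixels, keep the best
    consensus set, then refit on its inliers. Illustrative only."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iters):
        (sx, sy), (dx, dy) = rng.choice(pairs)
        tx, ty = dx - sx, dy - sy
        inliers = [p for p in pairs
                   if abs(p[1][0] - p[0][0] - tx) <= thresh
                   and abs(p[1][1] - p[0][1] - ty) <= thresh]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit the model on the consensus set.
    tx = sum(d[0] - s[0] for s, d in best_inliers) / len(best_inliers)
    ty = sum(d[1] - s[1] for s, d in best_inliers) / len(best_inliers)
    return (tx, ty), best_inliers

# Three matches shifted by (3, 1) plus one gross mismatch: RANSAC keeps
# the consistent three and ignores the outlier.
pairs = [((0, 0), (3, 1)), ((1, 2), (4, 3)), ((5, 5), (8, 6)), ((2, 1), (20, 30))]
print(ransac_translation(pairs)[0])  # → (3.0, 1.0)
```

The same consensus loop generalizes to the affine case by sampling three matches per iteration and measuring reprojection error instead of a simple coordinate difference.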
- 7. The method for intelligently generating reports by identifying key text information in real time by using a multi-channel video stream according to claim 6, wherein the parameter settings of the scale-invariant feature transform algorithm comprise a sampling step of 2 pixels, 4 scale levels, and an initial Gaussian kernel standard deviation of 1.6.
- 8. The method for intelligently generating reports by identifying key text information in real time by using multiple paths of video streams according to claim 1, wherein the step S4 specifically comprises: preprocessing the character sequences in the OCR recognition results of adjacent frames, including removing spaces, normalizing punctuation marks, and unifying letter case, to obtain standardized character sequences; calculating the character-level differences between adjacent-frame recognition results with an edit distance algorithm on the standardized sequences, obtaining a character-level edit distance cost vector; modeling the context of the character sequences in the adjacent-frame OCR results with a lightweight language model, extracting local language features through a sliding window mechanism, and generating context probability distribution vectors; computing the cosine similarity of the context probability distribution vectors to obtain a direction consistency score of the character sequences in semantic space; and fusing the edit distance cost vector and the direction consistency score by weighting to generate a semantic consistency composite score.
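The two signals fused in step S4, character-level edit distance and cosine similarity of context probability vectors, can be sketched in a few lines. The normalization of edit distance to a similarity and the equal default weights are assumptions; claim 9 states the weights w1 and w2 are adjusted dynamically with image sharpness.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def semantic_consistency(seq_a, seq_b, p_a, p_b, w1=0.5, w2=0.5):
    """Sketch of S4's fusion: weighted sum of normalized edit similarity
    and cosine similarity of context probability vectors (illustrative)."""
    ed_sim = 1.0 - edit_distance(seq_a, seq_b) / max(len(seq_a), len(seq_b), 1)
    dot = sum(x * y for x, y in zip(p_a, p_b))
    norm = (sum(x * x for x in p_a) ** 0.5) * (sum(y * y for y in p_b) ** 0.5)
    cos = dot / norm if norm else 0.0
    return w1 * ed_sim + w2 * cos

# One substituted character ("0" for "O") between adjacent frames, with
# identical context distributions, still yields a high consistency score.
print(round(semantic_consistency("REPORT", "REP0RT",
                                 [0.2, 0.5, 0.3], [0.2, 0.5, 0.3]), 3))
```

A score near 1 tells the fusion stage in S6/S7 that the disagreement is likely a transient misread rather than a genuine text change.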
- 9. The method for intelligently generating reports by real-time recognition of key text information through multiple video streams according to claim 8, wherein the step S4 further comprises obtaining the context probability distribution from a lightweight bidirectional recurrent neural network language model (embedding dimension 128) with a window step size of 1 combined with a Softmax temperature coefficient; the normalized edit distance and the cosine similarity are weighted by w1 and w2, and the weights are adjusted dynamically according to image sharpness.
- 10. The method for intelligently generating reports by identifying key text information in real time by using multiple paths of video streams according to claim 1, wherein the step S5 specifically comprises: extracting the character position coordinates obtained by OCR in consecutive video frames, and constructing a character spatial position sequence from the pixel coordinates of each character bounding box; calculating, within a sliding window, the recognition consistency frequency of the same character position across consecutive frames, i.e., the proportion of frames in a set time window in which the character is stably recognized as the same content, and outputting a character recognition consistency frequency vector; computing the geometric center coordinates of the bounding box of the same character within the sliding window, calculating its offset variance along the X and Y axes, and outputting a character geometric center offset variance matrix; fusing the consistency frequency vector and the offset variance matrix by normalized weighting to generate a character spatial stability score vector; and arranging the spatial stability score vectors of all characters in character order to construct the spatial stability scoring matrix.
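Step S5's fusion of recognition consistency frequency and geometric-center offset variance can be sketched as below. The variance-to-score mapping (`1 / (1 + var / var_scale)`), the scale constant, and the equal fusion weights are illustrative assumptions; the claim only specifies that the two signals are fused by normalized weighting.

```python
def spatial_stability(char_obs, var_scale=4.0, w_freq=0.5, w_var=0.5):
    """Sketch of S5: per-character spatial stability score in a sliding
    window. Each entry of char_obs is a list of (char, cx, cy) tuples,
    one per frame, for one character position. Parameters illustrative."""
    scores = []
    for obs in char_obs:
        chars = [c for c, _, _ in obs]
        mode = max(set(chars), key=chars.count)
        freq = chars.count(mode) / len(chars)        # consistency frequency
        mx = sum(cx for _, cx, _ in obs) / len(obs)  # mean center x
        my = sum(cy for _, _, cy in obs) / len(obs)  # mean center y
        var = (sum((cx - mx) ** 2 for _, cx, _ in obs)
             + sum((cy - my) ** 2 for _, _, cy in obs)) / len(obs)
        stability = 1.0 / (1.0 + var / var_scale)    # high variance → low score
        scores.append(w_freq * freq + w_var * stability)
    return scores

# A character recognized identically at a fixed position scores 1.0;
# a jittering, occasionally misread character scores much lower.
steady = [("A", 10, 10)] * 5
jitter = [("B", 10, 10), ("8", 12, 14), ("B", 11, 9), ("B", 10, 11), ("B", 30, 10)]
print(spatial_stability([steady, jitter]))
```

Stacking these per-character vectors in character order, as the claim describes, yields the spatial stability scoring matrix consumed by steps S6 and S7.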
Description
Method for intelligently generating report by identifying key text information in real time through multipath video streams
Technical Field
The invention relates to the technical field of video text recognition and confidence modeling, in particular to a method for intelligently generating reports by real-time recognition of key text information in multiple video streams.
Background
Video character recognition (Video OCR) technology is widely applied in scenes such as security monitoring, news broadcasting, and automatic generation of meeting minutes. In the prior art, mainstream video OCR systems generally perform text detection and recognition frame by frame, then evaluate the confidence of the recognition result to assist subsequent information filtering and automatic report generation. With the application of deep learning to OCR, frame-level recognition accuracy based on models such as CNN-CTC and Transformer has improved remarkably and increasingly meets the real-time processing requirements of lightweight deployment and high-concurrency scenarios. Current development trends include multi-modal fusion of recognition results, end-to-end recognition pipeline optimization, and industry-specific intelligent video stream analysis, but deep research on temporal consistency modeling remains relatively scarce. For the evaluation of text recognition confidence in video streams, existing mainstream practice generally focuses on single-frame image quality and the direct confidence signals output by the OCR model, and lacks a mechanism for modeling the dynamic changes of inter-frame recognition results. For example, some systems improve the robustness of single-frame models to blur and distortion by adjusting the perceptual loss function of the OCR model or introducing image enhancement strategies.
However, in practical complex scenes, the continuous frames of a video stream are disturbed by dynamic factors such as motion blur, compression noise, and camera shake, so the recognition results of adjacent frames fluctuate strongly. This fluctuation not only affects the stable extraction of text in each frame, but also easily causes misjudgment by an adoption strategy based on single-frame confidence, posing a significant challenge to the accuracy and robustness of overall recognition. At present there is no solution that uniformly models the dynamic temporal consistency of OCR recognition results between video frames. Some technical schemes roughly improve overall confidence by simple voting or by counting frequently appearing characters, but they do not fully consider factors such as semantic consistency, spatial stability, and image quality. Such confidence processing is difficult to calibrate dynamically when the recognized content fluctuates rapidly between consecutive frames, and it cannot handle practical scenarios such as short strong disturbances or text appearing and disappearing, so key information or abnormal events in the video stream are difficult to recognize and adopt accurately and in time.
Disclosure of Invention
The invention aims to solve the above technical problems and provides a method for intelligently generating reports by identifying key text information in real time from multiple video streams.
The technical scheme of the invention is realized as follows: the method for intelligently generating the report by identifying the key text information in real time through the multipath video stream comprises the following steps: S1, carrying out frame-level segmentation on multiple video streams, obtaining a continuous frame sequence containing the target text region, and outputting a video frame set arranged in time order; S2, performing OCR (optical character recognition) on each video frame, generating an initial character sequence and a corresponding original confidence score, and constructing a two-dimensional output matrix containing character content and confidence; S3, calculating a mapping relation between pixel-level displacement and deformation based on affine transformation parameter estimation of text regions between adjacent frames, and establishing a space-time consistency reference for cross-frame pixel alignment; S4, carrying out semantic consistency scoring on the recognition results of adjacent frames through a lightweight language model, calculating the edit distance and the similarity of context probability distributions, and generating a semantic consistency scoring sequence; S5, counting the recognition consistency frequency and geometric center offset variance of the same character position in a sliding window, constructing a spatial stability scoring matrix, and quantifying visual stab