CN-121999406-A - Image data mining processing system and method combining large visual model and small visual model

Abstract

The invention relates to the technical field of artificial intelligence image recognition and data mining, and discloses an image data mining processing system and method combining a large visual model and a small visual model. It aims to solve the following problems in the prior art: recall and accuracy of image recognition are difficult to achieve together, computing cost and processing speed are unbalanced, data management efficiency is low, and model generalization capability is insufficient. The system comprises a data input module, a frame extraction and de-duplication module, a picture library construction module, a visual large model judging module, a visual small model judging module, a label preprocessing module, a manual checking module, a training data generation module, a model training module and a data security module. Preliminary screening by the LPM (Detic) large model ensures generalization capability; the YOLO small model performs secondary verification to catch the targets the large model missed; manual verification eliminates false alarms; and each of the two models is optimized by training on its own specific samples. This effectively addresses the missed detections and misjudgments of a single model and achieves target identification that is both accurate and comprehensive.

Inventors

You Qianhe
Li Xuanbian
LI MI
LI KUANGYI

Assignees

厦门浩森威视科技有限公司

Dates

Publication Date: 2026-05-08
Application Date: 2025-12-15

Claims (8)

1. An image data mining processing system combining a large visual model with a small visual model, comprising:
a data input module for receiving original video data, wherein the original video data is a single-channel video file or multiple parallel video files;
a frame extraction and de-duplication module, connected to the data input module, for performing frame extraction on the original video data to obtain consecutive image frames and for de-duplicating the consecutive image frames with a feature hash algorithm so as to remove repeated images;
a picture library construction module, connected to the frame extraction and de-duplication module, for storing the frame-extracted and de-duplicated images to form a structured image data set;
a visual large model judging module, which adopts an LPM model, specifically the Detic open-vocabulary target detection model, is connected to the picture library construction module, and is used for performing target recognition on each image in the image data set and outputting a "contains target" mark or a "does not contain target" mark for the image;
a visual small model judging module, which adopts a YOLO real-time target detection model, is connected to the visual large model judging module, and is used for receiving the images marked "does not contain target" by the visual large model judging module, performing secondary target recognition on them, and outputting a "secondary contains target" mark or an "invalid data" mark;
a label preprocessing module, connected to the visual large model judging module and the visual small model judging module respectively, for performing target region localization and preliminary labeling on the images marked "contains target" and the images marked "secondary contains target";
a manual checking module, connected to the label preprocessing module, for checking the authenticity of targets in the label-preprocessed images and outputting a "verification passed" mark, an "LPM missed detection" mark or a "YOLO false alarm" mark;
a training data generation module, connected to the manual checking module, for generating the corresponding training data according to the different marks, wherein an image marked "contains target" that passes verification is taken as an LPM positive sample, an image marked "does not contain target" and "secondary contains target" that passes verification is taken as an LPM missed-detection sample, an image marked "does not contain target" and "secondary contains target" that fails verification is taken as a YOLO false-alarm sample, and the LPM missed-detection samples, the YOLO false-alarm samples and other recognition error cases are integrated into badcase training data; and
a model training module, connected to the training data generation module, the visual large model judging module and the visual small model judging module respectively, for inputting the LPM positive samples and the LPM missed-detection samples into the LPM model for training so as to improve recall, inputting the YOLO false-alarm samples into the YOLO model for training so as to improve accuracy, and using the badcase training data to optimize the generalization capability of both models.
2. The system of claim 1, wherein the frame extraction frequency of the frame extraction and de-duplication module is adjustable within a range of 1-10 frames/second; frame extraction gives priority to key frames, so that frames whose picture change rate exceeds a preset threshold are extracted preferentially; the feature hash algorithm is a mean hash algorithm, which generates a hash value by comparing each pixel with the mean gray level of the image; and two images are determined to be repeated images when the Hamming distance between their hash values is less than 3.
3. The system of claim 1, wherein the preliminary labeling performed by the label preprocessing module comprises a target class, target coordinates and a confidence level, wherein the target class is determined from the open vocabulary library of the Detic model, the confidence threshold is set to 0.5, and a detection whose model output confidence is below the threshold is marked as "to be manually confirmed".
4. The system of claim 1, further comprising a data security module, connected to the picture library construction module and the training data generation module, for encrypting and storing the image data and the training data with the AES encryption algorithm and for providing an access control mechanism so that only authorized users can perform data operations.
5. An image data mining processing method combining a large visual model with a small visual model, characterized by comprising the following steps:
Step 1, original data input: receiving single-channel or multi-channel original video data through a data input module and establishing a data transmission channel;
Step 2, frame extraction and de-duplication: performing frame extraction on the original video data with the frame extraction frequency set to 1-10 frames/second, preferentially extracting key frames whose picture change rate is greater than or equal to 30%, de-duplicating the extracted image frames with a mean hash algorithm, and removing repeated images whose Hamming distance is less than 3;
Step 3, picture library construction: storing the de-duplicated images in a structured manner according to video source and frame extraction timestamp to construct a traceable image data set;
Step 4, LPM model judgment: identifying the images in the image data set one by one with the Detic open-vocabulary target detection model, with the confidence threshold set to 0.5, and outputting target-containing images with confidence greater than or equal to 0.5 and target-free images with confidence less than 0.5;
Step 5, YOLO model secondary judgment: performing secondary identification on the target-free images output by the LPM model, with the confidence threshold likewise set to 0.5, and outputting secondary target-containing images with confidence greater than or equal to 0.5 and invalid data with confidence less than 0.5;
Step 6, label preprocessing: labeling target boxes on the target-containing images and the secondary target-containing images, and recording the target class, coordinate information and model output confidence to form preliminary labeling data;
Step 7, manual verification: manually checking the preliminary labeling data to confirm the authenticity of each target, and marking each image as one of: ① an LPM positive sample, namely a target-containing image whose target is real; ② an LPM missed-detection sample, namely a target-free image in which YOLO identified a target and the target is real; ③ a YOLO false-alarm sample, namely a target-free image in which YOLO identified a target and the target is false; ④ another error case arising during identification;
Step 8, training data integration: classifying the LPM positive samples, the LPM missed-detection samples, the YOLO false-alarm samples and the error cases respectively to form dedicated training data sets and a comprehensive badcase data set;
Step 9, model optimization training: inputting the LPM positive samples and the LPM missed-detection samples into the LPM model and optimizing recall by adjusting the weights of the model loss function; inputting the YOLO false-alarm samples into the YOLO model and optimizing the target classifier parameters by fine-tuning to reduce the false alarm rate; and using the comprehensive badcase data set for cross-validation and for improving the generalization capability of both models; and
Step 10, iterative update: repeating steps 1-9, continuously generating training samples from newly input video data, and thereby iteratively optimizing model performance.
6. The method according to claim 5, wherein in the frame extraction and de-duplication step, when multiple parallel videos are processed, a distributed frame extraction strategy is adopted: each video is allocated an independent processing thread, and unified de-duplication and merging are performed after frame extraction is completed, so that redundancy across videos is eliminated.
7. The method according to claim 5, wherein in the model optimization training step, the loss function of the LPM model combines cross-entropy loss and focal loss, the loss weight of an LPM missed-detection sample being 2 times that of an ordinary sample; and the YOLO model is optimized with the stochastic gradient descent algorithm, the learning rate being initially set to 0.001 and decayed to 1/10 of its value every 10 training rounds.
8. The method of claim 5, further comprising a data feedback step, wherein the recognition results after model optimization training are compared with the original recognition results, and the improvements in recall, accuracy and processing speed are calculated to form a feedback report for guiding the adjustment of parameters such as the frame extraction frequency and the confidence threshold.
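
As a non-limiting illustration, the routing of images through the two models and the manual check described in claims 1 and 5 can be sketched as follows. The detector and reviewer callables, the names `route` and `build_training_sets`, and the treatment of an LPM detection that fails manual verification as an "other error case" are assumptions made for this sketch; the claims do not specify them. No real Detic or YOLO API is called.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, List, Tuple

CONF_THRESHOLD = 0.5  # confidence threshold stated in claim 5 for both models


class Sample(Enum):
    LPM_POSITIVE = "lpm_positive"          # LPM found a target, reviewer confirms it
    LPM_MISSED = "lpm_missed"              # LPM missed it, YOLO found it, target is real
    YOLO_FALSE_ALARM = "yolo_false_alarm"  # LPM missed it, YOLO found it, target is false
    OTHER_ERROR = "other_error"            # assumption: LPM detection rejected by reviewer
    INVALID = "invalid"                    # neither model found a target; not reviewed


# Hypothetical interfaces: a detector maps an image id to its highest
# detection confidence; a reviewer returns True if the target is real.
Detector = Callable[[str], float]
Reviewer = Callable[[str], bool]


def route(image_id: str, lpm: Detector, yolo: Detector, reviewer: Reviewer) -> Sample:
    """Classify one image following steps 4 to 7 of claim 5."""
    if lpm(image_id) >= CONF_THRESHOLD:
        return Sample.LPM_POSITIVE if reviewer(image_id) else Sample.OTHER_ERROR
    # Only images the large model marked as target-free reach the small model.
    if yolo(image_id) >= CONF_THRESHOLD:
        return Sample.LPM_MISSED if reviewer(image_id) else Sample.YOLO_FALSE_ALARM
    return Sample.INVALID


def build_training_sets(
    image_ids: Iterable[str], lpm: Detector, yolo: Detector, reviewer: Reviewer
) -> Tuple[Dict[Sample, List[str]], List[str]]:
    """Group images by sample type and assemble the badcase set (step 8)."""
    sets: Dict[Sample, List[str]] = {s: [] for s in Sample}
    for image_id in image_ids:
        sets[route(image_id, lpm, yolo, reviewer)].append(image_id)
    badcase = (
        sets[Sample.LPM_MISSED]
        + sets[Sample.YOLO_FALSE_ALARM]
        + sets[Sample.OTHER_ERROR]
    )
    return sets, badcase


# Toy scores standing in for real model outputs and a human reviewer.
lpm_scores = {"a": 0.9, "b": 0.2, "c": 0.1, "d": 0.3, "e": 0.8}
yolo_scores = {"a": 0.0, "b": 0.7, "c": 0.6, "d": 0.1, "e": 0.0}
is_real = {"a": True, "b": True, "c": False, "e": False}

sets, badcase = build_training_sets(
    ["a", "b", "c", "d", "e"],
    lpm_scores.__getitem__,
    yolo_scores.__getitem__,
    is_real.__getitem__,
)
```

In this sketch the small model and the reviewer are consulted only when needed, which mirrors the cost-saving order of the claimed steps: image "d" is discarded as invalid data without manual review.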

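As a further non-limiting illustration, the mean hash de-duplication of claim 2 can be sketched in plain Python. The sketch assumes each frame has already been converted to a small grayscale grid (8x8 is a common choice but is not stated in the claims); the function names and the choice to treat a pixel equal to the mean as a 1 bit are likewise assumptions.

```python
from typing import List, Sequence

HAMMING_LIMIT = 3  # claim 2: a Hamming distance below 3 means "repeated image"

Gray = Sequence[Sequence[int]]  # grayscale grid, e.g. 8x8 values in 0..255


def mean_hash(gray: Gray) -> int:
    """Build a bit string: 1 where a pixel is at or above the mean gray level."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def deduplicate(frames: Sequence[Gray]) -> List[int]:
    """Return the indices of frames kept after removing repeated images."""
    kept_hashes: List[int] = []
    kept_indices: List[int] = []
    for i, frame in enumerate(frames):
        h = mean_hash(frame)
        # Keep the frame only if it differs enough from every frame kept so far.
        if all(hamming(h, k) >= HAMMING_LIMIT for k in kept_hashes):
            kept_hashes.append(h)
            kept_indices.append(i)
    return kept_indices


# Three toy 8x8 frames: a left/right split, a near copy of it, a top/bottom split.
frame_a = [[0] * 4 + [255] * 4 for _ in range(8)]
frame_b = [row[:] for row in frame_a]
frame_b[0][0] = 10  # small brightness change that leaves the hash unchanged
frame_c = [[0] * 8 for _ in range(4)] + [[255] * 8 for _ in range(4)]

kept = deduplicate([frame_a, frame_b, frame_c])
```

Comparing each frame against all kept hashes, rather than only the preceding frame, matches the unified cross-video de-duplication of claim 6 but costs quadratic time; a production system would likely index the hashes.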
Description

Image data mining processing system and method combining large visual model and small visual model

Technical Field

The invention relates to the technical field of artificial intelligence image recognition and data mining, in particular to an image data mining processing system and method combining a large visual model and a small visual model.

Background

With the rapid development of artificial intelligence technology, image data mining processing plays an increasingly important role in fields such as video monitoring, intelligent security and industrial inspection. The core requirement of image data mining processing technology is to identify target objects accurately and efficiently from massive image and video data, and to provide reliable data support for subsequent analysis and decisions. Various technical schemes related to image data mining exist in the prior art. For example, patent CN113470011A discloses a data mining system based on big data and a mining method thereof. The system realizes data processing through data acquisition, feature capture, distributed storage, data cleaning, mining and similar processes, but the scheme is not targeted at image recognition and is not optimized for the particular characteristics of image data, so its recognition precision and efficiency for target objects can hardly meet the requirements of real-time detection.
In summary, existing image data mining processing technology has the following defects. First, recall and accuracy are difficult to achieve together: a single model either misses targets because of insufficient generalization capability or misjudges because of its own characteristics; for example, a traditional small model fails to recognize rare samples, while a single large model is prone to redundant computation. Second, computing cost and processing speed are unbalanced: processing the full data with a high-performance model occupies a large amount of computing resources and raises cost, whereas a lightweight model cannot guarantee processing accuracy. Third, data management efficiency is low: video data contains many repeated and invalid frames, and existing schemes lack an effective preprocessing mechanism, which wastes storage and computing resources. Fourth, the direction of model optimization is narrow: existing schemes mostly rely on sample screening and do not fully use the complementary strengths of different models for training, which limits model generalization capability.

Disclosure of Invention

The invention aims to overcome the defects in the prior art, and provides an image data mining processing system and method combining a large visual model and a small visual model.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
An image data mining processing system combining a large visual model with a small visual model, comprising:
The data input module is used for receiving original video data, wherein the original video data is a single-channel video file or multiple parallel video files;
The frame extraction and de-duplication module is connected with the data input module and is used for performing frame extraction on the original video data to obtain consecutive image frames and for de-duplicating the consecutive image frames with a feature hash algorithm so as to remove repeated images;
The picture library construction module is connected with the frame extraction and de-duplication module and is used for storing the frame-extracted and de-duplicated images to form a structured image data set;
The visual large model judging module adopts an LPM model, specifically the Detic open-vocabulary target detection model, is connected with the picture library construction module, and is used for performing target recognition on each image in the image data set and outputting a "contains target" mark or a "does not contain target" mark for the image;
The visual small model judging module adopts a YOLO real-time target detection model, is connected with the visual large model judging module, and is used for receiving the images marked "does not contain target" by the visual large model judging module, performing secondary target recognition on them, and outputting a "secondary contains target" mark or an "invalid data" mark;
The label preprocessing module is connected with the visual large model judging module and the visual small model judging module respectively and is used for performing target region localization and preliminary labeling on the images marked "contains target" and the images marked "secondary contains target";
The manual checking module is connected with the label preprocessing module and is used for checking the authenticity of targets in the label-preprocessed images and outputting