
CN-122023952-A - Data annotation generation method and device, storage medium and electronic equipment

CN 122023952 A

Abstract

The invention provides a data annotation generation method, a device, a storage medium, and electronic equipment. The method comprises: determining at least one candidate region image from image data to be processed; invoking a description generation model to generate a region description text for each candidate region image; invoking an image-text matching model to determine an image-text similarity score for each candidate region image; determining a gating index value for each candidate region image based on its image-text similarity score; determining a target region image set from the at least one candidate region image based on the gating index values of the candidate region images; generating a target data annotation for each target region image in the target region image set; and adding the target data annotations of the target region images to a target data annotation set. Embodiments of the invention can improve the efficiency of generating data annotations while reducing cost.

Inventors

  • Tong Kailiang
  • Wang Jiemin
  • Pan Chao
  • He Qiaoling
  • Xie Shuai
  • Qi Zukang

Assignees

  • 航天时代低空科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-04-14

Claims (9)

  1. A data annotation generation method, characterized by comprising the following steps: acquiring image data to be processed, and determining at least one candidate region image from the image data to be processed; invoking a description generation model to generate a region description text for each candidate region image in the at least one candidate region image; invoking an image-text matching model to determine an image-text similarity score for each candidate region image, wherein the image-text similarity score of a candidate region image indicates the similarity between that candidate region image and its region description text, and is determined based on the model output matching score generated for that candidate region image by the image-text matching model; determining a gating index value for each candidate region image based on its image-text similarity score, and determining a target region image set from the at least one candidate region image based on the gating index values of the candidate region images; and generating a target data annotation for each target region image in the target region image set, wherein the target data annotation of a region image comprises the region description text of that region image; wherein the method further comprises: for any candidate region image in the at least one candidate region image, performing text parsing on the region description text of that candidate region image to obtain a standardized candidate label set for that candidate region image; and invoking an open-vocabulary verification model to output a confidence for each standardized candidate label in the standardized candidate label set of that candidate region image, wherein the confidence of a standardized candidate label is the confidence of that label with respect to that candidate region image; and wherein determining the gating index value of each candidate region image based on its image-text similarity score comprises: determining the gating index value of that candidate region image based on the image-text similarity score of that candidate region image and the confidences of its standardized candidate labels.
  2. The method of claim 1, wherein the description generation model is a multi-modal large model, and wherein invoking the description generation model to generate a region description text for each candidate region image in the at least one candidate region image comprises: acquiring image description information of the image data to be processed; and invoking the description generation model to generate the region description text of each candidate region image based on the image description information and that candidate region image.
  3. The method of claim 1 or 2, wherein invoking the image-text matching model to determine the image-text similarity score of each candidate region image comprises: invoking the image-text matching model to generate a model output matching score for any candidate region image based on that candidate region image and its region description text; performing optical character recognition on that candidate region image to obtain an optical character recognition result, and determining an optical character recognition score for that candidate region image based on its region description text and the optical character recognition result; and determining the image-text similarity score of that candidate region image based on its model output matching score and its optical character recognition score.
  4. The method of claim 1 or 2, wherein determining the target region image set from the at least one candidate region image based on the gating index values of the candidate region images comprises: judging whether any candidate region image passes gating based on its gating index value; if the candidate region image passes gating, filtering the standardized candidate label set of that candidate region image using the confidences of its standardized candidate labels to obtain a target entity label set, performing consistency cross-verification between the region description text and the target entity label set of that candidate region image, and taking that candidate region image as a target region image in the target region image set when it passes the consistency cross-verification; and if the candidate region image does not pass gating, not taking it as a target region image in the target region image set, wherein a candidate region image that does not pass gating is added to an image data set to be re-reviewed.
  5. The method of claim 1 or 2, further comprising: acquiring a first training region image data set, wherein each item of first training region image data comprises gating judgment indication information of a first training region image; judging, based on the gating judgment indication information of each first training region image in the first training region image data set, whether that image is a region image erroneously judged to pass gating or a region image accurately judged not to pass gating, so as to determine, from the first training region image data set, the number of region images erroneously judged to pass gating and the number of region images accurately judged not to pass gating; and updating a gating threshold parameter based on the number of region images erroneously judged to pass gating and the number of region images accurately judged not to pass gating, wherein the gating threshold parameter is used to judge whether a region image passes gating.
  6. The method of claim 1 or 2, further comprising: acquiring a second training region image data set, wherein each item of second training region image data comprises an image-text similarity score and a training score label of a second training region image; determining an image-text matching loss value based on each item of second training region image data in the second training region image data set; and determining a model loss value based on the image-text matching loss value, and optimizing model parameters of a model to be trained in a direction that reduces the model loss value, wherein the model to be trained comprises at least one of the description generation model and the image-text matching model.
  7. A data annotation generation apparatus, the apparatus comprising: an acquisition unit configured to acquire image data to be processed; and a processing unit configured to determine at least one candidate region image from the image data to be processed; wherein the processing unit is further configured to: invoke a description generation model to generate a region description text for each candidate region image in the at least one candidate region image; invoke an image-text matching model to determine an image-text similarity score for each candidate region image, wherein the image-text similarity score of a candidate region image indicates the similarity between that candidate region image and its region description text, and is determined based on the model output matching score generated for that candidate region image by the image-text matching model; determine a gating index value for each candidate region image based on its image-text similarity score, and determine a target region image set from the at least one candidate region image based on the gating index values of the candidate region images; and generate a target data annotation for each target region image in the target region image set, wherein the target data annotation of a region image comprises the region description text of that region image; wherein the processing unit is further configured to: for any candidate region image in the at least one candidate region image, perform text parsing on the region description text of that candidate region image to obtain a standardized candidate label set for that candidate region image; and invoke an open-vocabulary verification model to output a confidence for each standardized candidate label in the standardized candidate label set of that candidate region image, wherein the confidence of a standardized candidate label is the confidence of that label with respect to that candidate region image; and wherein the processing unit, when determining the gating index value of each candidate region image based on its image-text similarity score, is specifically configured to: determine the gating index value of that candidate region image based on the image-text similarity score of that candidate region image and the confidences of its standardized candidate labels.
  8. An electronic device, comprising: a processor; and a memory in which a program is stored, wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1-6.
  9. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6.
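As an illustrative sketch only (not part of the claims), the gating-index computation of claim 1 could look like the following. The patent states only that the index is determined from the image-text similarity score together with the confidences of the standardized candidate labels; the convex weighting and the `w_sim`/`w_conf` parameters below are assumptions.

```python
def gating_index(similarity_score, label_confidences, w_sim=0.6, w_conf=0.4):
    """Combine a region's image-text similarity score with the confidences
    of its standardized candidate labels into a single gating index value.

    The convex weighting (w_sim + w_conf = 1) and the fallback for an empty
    label set are assumptions; the patent does not fix a formula.
    """
    if not label_confidences:
        return w_sim * similarity_score
    mean_conf = sum(label_confidences) / len(label_confidences)
    return w_sim * similarity_score + w_conf * mean_conf
```

A region with similarity score 0.9 and label confidences [0.8, 0.6] would then receive a gating index of 0.6 * 0.9 + 0.4 * 0.7 = 0.82, to be compared against the gating threshold parameter of claim 5.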
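Claim 3 fuses the matching model's output score with an optical character recognition score. A minimal sketch follows; the token-overlap (Jaccard) OCR score and the linear fusion weight `alpha` are assumed stand-ins, since the patent names neither a scoring function nor a fusion rule.

```python
def ocr_score(description_text, ocr_text):
    """Token-overlap (Jaccard) score between the region description text and
    the OCR result. An assumed stand-in for the unspecified OCR score."""
    desc_tokens = set(description_text.lower().split())
    ocr_tokens = set(ocr_text.lower().split())
    if not desc_tokens or not ocr_tokens:
        return 0.0
    return len(desc_tokens & ocr_tokens) / len(desc_tokens | ocr_tokens)


def image_text_similarity(model_matching_score, description_text, ocr_text,
                          alpha=0.8):
    """Fuse the matching model's score with the OCR score, as in claim 3.
    The linear fusion and the weight `alpha` are assumptions."""
    return (alpha * model_matching_score
            + (1 - alpha) * ocr_score(description_text, ocr_text))
```

For text-bearing regions (e.g. tower identification plates in power inspection), the OCR term rewards descriptions that actually mention the recognized characters.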
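Claim 5 updates a gating threshold parameter from the counts of regions erroneously passed and regions accurately rejected. The additive step rule below is purely hypothetical; the patent only states that both counts drive the update.

```python
def update_gating_threshold(threshold, n_false_pass, n_true_reject,
                            step=0.01, lo=0.0, hi=1.0):
    """Adjust the gating threshold from validation counts, as in claim 5.

    If more regions were erroneously passed than accurately rejected, raise
    the threshold to gate more strictly; otherwise lower it. The step size,
    the comparison rule, and the [lo, hi] clamp are all assumptions.
    """
    if n_false_pass > n_true_reject:
        threshold += step
    elif n_false_pass < n_true_reject:
        threshold -= step
    return min(hi, max(lo, threshold))
```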
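Claim 6 computes an image-text matching loss from pairs of predicted similarity scores and training score labels. Mean squared error is used below as an assumed choice of loss; the patent does not name a specific loss function.

```python
def matching_loss(samples):
    """Mean-squared image-text matching loss over training samples (claim 6).

    Each sample pairs a predicted image-text similarity score with its
    training score label. MSE is an assumption, not stated in the patent.
    """
    if not samples:
        return 0.0
    return sum((score - label) ** 2 for score, label in samples) / len(samples)
```

The resulting loss value would feed the model loss used to optimize the description generation model and/or the image-text matching model.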

Description

Data annotation generation method and device, storage medium and electronic equipment

Technical Field

The present invention relates to the field of data annotation technologies, and in particular to a data annotation generation method, a device, a storage medium, and an electronic apparatus.

Background

At present, generating high-quality, fine-grained data annotations for massive image data faces serious challenges. Targets in low-altitude scenes (such as unmanned aerial vehicle inspection, security monitoring, and agricultural plant protection) in particular exhibit severe scale changes, unusual viewing angles (such as top-down views), complex backgrounds, and a large number of specialized fine-grained concepts; for example, in electric power inspection, parts such as insulator strings, towers, and hardware fittings, as well as their defect states (such as damage and rust), need to be annotated. However, the related art generally constructs data annotations by manual annotation, which results in high cost and low efficiency. There is as yet no good solution for generating data annotations conveniently, so as to improve the efficiency of annotation generation while reducing cost.
Disclosure of Invention

In view of the above, embodiments of the present invention provide a data annotation generation method, an apparatus, a storage medium, and an electronic device, so as to solve the high cost and low efficiency of the manual annotation approach in the related art. That is, embodiments of the present invention can conveniently generate the target data annotation of each target region image by means of a description generation model, an image-text matching model, and the like, so that the efficiency of generating data annotations can be effectively improved while reducing cost; the accuracy of the target data annotations can be effectively improved by means of the gating index value; and the consistency of the data annotations can be effectively ensured by a unified automatic generation approach.

According to an aspect of the present invention, there is provided a data annotation generation method, the method comprising: acquiring image data to be processed, and determining at least one candidate region image from the image data to be processed; invoking a description generation model to generate a region description text for each candidate region image in the at least one candidate region image; invoking an image-text matching model to determine an image-text similarity score for each candidate region image, wherein the image-text similarity score of a candidate region image indicates the similarity between that candidate region image and its region description text, and is determined based on the model output matching score generated for that candidate region image by the image-text matching model; determining a gating index value for each candidate region image based on its image-text similarity score, and determining a target region image set from the at least one candidate region image based on the gating index values of the candidate region images; and generating a target data annotation for each target region image in the target region image set, wherein the target data annotation of a region image comprises the region description text of that region image; wherein the method further comprises: for any candidate region image in the at least one candidate region image, performing text parsing on the region description text of that candidate region image to obtain a standardized candidate label set for that candidate region image; and invoking an open-vocabulary verification model to output a confidence for each standardized candidate label in the standardized candidate label set of that candidate region image, wherein the confidence of a standardized candidate label is the confidence of that label with respect to that candidate region image; and wherein determining the gating index value of each candidate region image based on its image-text similarity score comprises: determining the gating index value of that candidate region image based on the image-text similarity score of that candidate region image and the confidences of its standardized candidate labels.

According to another aspect of the present invention, there is provided a data annotation generation apparatus, the apparatus comprising: An acquisition unit configured to acquire