CN-122019813-A - Picture classification method, device and storage medium
Abstract
The application discloses a picture classification method, picture classification device and a storage medium, and relates to the technical field of image processing. In the method, text is extracted from a picture through OCR and split into sentences, and a pre-trained vector generation model converts the sentences into vectors, so that the semantic capture capability of the pre-trained model fully mines the core semantic information in the text that is associated with category attributes. After word segmentation, word frequency and inverse document frequency are calculated to determine a text weight for each sentence, so that sentences carrying important semantics receive higher weights. The weights and the vector sequence are then combined by weighted averaging to generate a document vector that focuses accurately on the core semantics, improving the effectiveness of the text representation, and accurate classification is finally achieved through a classification model.
Inventors
- LIU ZHI
- CAO YIMING
Assignees
- 深圳市石犀科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251219
Claims (10)
- 1. A picture classification method, characterized in that the picture classification method comprises: acquiring text from a picture through an OCR engine, and splitting the text into sentences to obtain a sentence list; taking the sentence list as input of a pre-trained vector generation model, and converting the sentences in the sentence list into vectors through the vector generation model to obtain a vector sequence; performing word segmentation on each sentence in the sentence list, and determining a text weight for each sentence based on the word frequency and inverse document frequency of each word in the sentence; and performing a weighted average of the text weights and the vector sequence to obtain a document vector, and determining a category label corresponding to the document vector through a pre-trained classification model as the category to which the picture belongs.
- 2. The picture classification method according to claim 1, wherein the step of taking the sentence list as input of a pre-trained vector generation model and converting the sentences in the sentence list into vectors through the vector generation model to obtain a vector sequence comprises: taking the sentence list as input of the vector generation model, and converting each sentence in the sentence list into tokens and an attention mask through an input layer of the vector generation model; taking the context feature of each token as the hidden state of the token through a Transformer layer of the vector generation model, and generating the vector sequence by a weighted average of the attention mask corresponding to each token and the hidden state; and obtaining the vector sequence output by an output layer of the vector generation model.
- 3. The picture classification method according to claim 1, wherein before the step of performing a weighted average of the text weights and the vector sequence to obtain a document vector and determining a category label corresponding to the document vector through a pre-trained classification model as the category to which the picture belongs, the method further comprises: determining the character length of the sentence list, and determining a text density score according to the ratio of the character length to the total pixel area of the picture; extracting a visual feature vector for each pixel region of the picture through a pre-trained image processing model, and determining a predicted text score for each pixel region based on the visual feature vector, wherein the predicted text score represents the likelihood that text is present in the pixel region; determining the difference between the sum of the predicted text scores of the pixel regions and the text density score, and determining a confidence score for the sentence list according to the difference; and when the confidence score is greater than or equal to a preset threshold, executing the step of performing a weighted average of the text weights and the vector sequence to obtain a document vector and determining a category label corresponding to the document vector through a pre-trained classification model as the category to which the picture belongs.
- 4. The picture classification method according to claim 3, wherein after the step of determining the difference between the sum of the predicted text scores of the pixel regions and the text density score and determining a confidence score for the sentence list according to the difference, the method further comprises: when the confidence score is smaller than the preset threshold, determining the region type of each pixel region based on the visual feature vector corresponding to the pixel region; determining a visual weight for the pixel region according to the region type; taking the sum of the visual weight of the pixel region of each sentence in the sentence list and the text weight of the sentence as the classification weight of the sentence; and performing a weighted average of the classification weights and the vector sequence to obtain a document vector, and determining a category label corresponding to the document vector through a pre-trained classification model as the category to which the picture belongs.
- 5. The picture classification method according to claim 4, wherein after the step of performing a weighted average of the text weights and the vector sequence to obtain a document vector and determining a category label corresponding to the document vector through a pre-trained classification model, the method further comprises: acquiring pixel coordinate information, output by the OCR engine, of the text in the picture, and taking the pixel coordinate information as a spatial feature of the text; creating a two-dimensional grid based on a preset heat map resolution; mapping the classification weight of each sentence in the sentence list onto the two-dimensional grid according to the spatial feature of the text to obtain a weight heat map; and upsampling the weight heat map so that its resolution is consistent with that of the picture, and then overlaying the weight heat map on the picture for display.
- 6. The picture classification method according to claim 5, wherein the step of mapping the classification weight of each sentence in the sentence list onto the two-dimensional grid according to the spatial feature of the text to obtain a weight heat map comprises: mapping the attention mask corresponding to each sentence onto the two-dimensional grid according to the spatial feature of the text to determine the heat values of the two-dimensional grid; and adjusting the heat values of the corresponding regions of the two-dimensional grid according to the classification weight of each sentence to obtain the weight heat map, wherein the heat values are directly proportional to the classification weights.
- 7. The picture classification method according to claim 1, wherein after the step of taking the sentence list as input of a pre-trained vector generation model and converting the sentences in the sentence list into vectors through the vector generation model to obtain a vector sequence, the method further comprises: matching target ambiguous words in each sentence based on a preset ambiguous entity dictionary; acquiring other entity words within a preset text window of the target ambiguous word, and determining the domain to which the target ambiguous word belongs based on semantic features of the other entity words; acquiring a knowledge graph of the domain to which the target ambiguous word belongs, and determining a knowledge vector of the target ambiguous word based on the knowledge graph; and splicing the vector corresponding to the target ambiguous word in the vector sequence with the knowledge vector to obtain a new vector sequence.
- 8. The picture classification method according to claim 7, wherein the step of acquiring a knowledge graph of the domain to which the target ambiguous word belongs and determining a knowledge vector of the target ambiguous word based on the knowledge graph comprises: acquiring the knowledge graph of the domain to which the target ambiguous word belongs, and determining a target node corresponding to the target ambiguous word in the knowledge graph; and traversing a neighborhood graph of the target node, and determining the knowledge vector of the target ambiguous word based on the nodes and edges in the neighborhood graph.
- 9. A picture classification apparatus, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the picture classification method according to any one of claims 1 to 8.
- 10. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the picture classification method according to any one of claims 1 to 8.
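As an illustration of the sentence-weighting scheme in claim 1, the following Python sketch is a simplified, hypothetical implementation, not the patented method itself: the function names, the use of each sentence as a "document" for the inverse-document-frequency statistic, the mean-TF-IDF sentence score, and the weight normalization are all assumptions made for clarity.

```python
import math
from collections import Counter

import numpy as np

def sentence_weights(sentences_tokens):
    """Assign each sentence a TF-IDF-based weight (assumption: each
    sentence is treated as one 'document' for the IDF statistic)."""
    n = len(sentences_tokens)
    # Document frequency: how many sentences contain each word.
    df = Counter()
    for tokens in sentences_tokens:
        df.update(set(tokens))
    weights = []
    for tokens in sentences_tokens:
        tf = Counter(tokens)
        total = len(tokens)
        # Sentence score = mean TF-IDF over its distinct words (assumed).
        score = sum((tf[w] / total) * math.log((1 + n) / (1 + df[w]))
                    for w in tf) / max(len(tf), 1)
        weights.append(score)
    w = np.asarray(weights)
    return w / w.sum()  # normalize so the weighted average is well defined

def document_vector(sentence_vectors, weights):
    """Weighted average of sentence vectors -> a single document vector,
    which would then be fed to the classification model."""
    v = np.asarray(sentence_vectors)           # shape (num_sentences, dim)
    return (weights[:, None] * v).sum(axis=0)  # shape (dim,)
```

In this sketch, a sentence whose words are rare across the sentence list receives a higher weight, so its vector dominates the document vector, matching the claim's intent that sentences carrying important semantics obtain higher weights.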
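The consistency check in claim 3 compares the OCR-derived text density with the visually predicted text scores. The claim does not specify how the difference is mapped to a confidence score, so the `1 / (1 + gap)` normalization in the sketch below is purely an illustrative assumption, as are the function names.

```python
def text_density_score(char_length, picture_area_px):
    # Claim 3: text density = character length / total pixel area.
    return char_length / picture_area_px

def confidence_score(char_length, picture_area_px, region_text_scores):
    """Compare the OCR-derived text density with the summed predicted
    text scores of the pixel regions; a smaller gap means the OCR
    result is more consistent with the visual evidence."""
    density = text_density_score(char_length, picture_area_px)
    predicted = float(sum(region_text_scores))  # sum over pixel regions
    gap = abs(predicted - density)
    # Assumed normalization: maps gap in [0, inf) to confidence in (0, 1].
    return 1.0 / (1.0 + gap)
```

A downstream caller would compare this score with the preset threshold: at or above it, the plain text weights are used; below it, claim 4's visual weights are added in.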
Description
Picture classification method, device and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular to a picture classification method, device, and storage medium.
Background
To enable efficient management, retrieval, and use of massive numbers of pictures, picture classification has become a research hotspot in computer vision and pattern recognition. Current picture classification techniques mainly extract visual features of pictures, such as color, texture, and shape, through convolutional neural networks (CNNs) such as ResNet and VGG, and feed these visual features into a classification model to complete the classification decision. However, in many picture scenarios, such as document pictures containing textual descriptions, e-commerce pictures with commodity descriptions, and government-affairs pictures with labeling information, the text carried by the picture is often directly related to its core category attributes. For example, the text of a "financial certificate" picture mostly contains information such as amounts and certificate numbers, while the text of a "commodity description" picture focuses on content such as product parameters and usage instructions. Classification schemes based only on visual features ignore the category-related semantics contained in the text within the picture, and when pictures of different categories have similar visual features but clearly different text, classification accuracy drops. The foregoing is provided merely to facilitate understanding of the technical solutions of the present application and is not an admission that it constitutes prior art.
Disclosure of Invention
The main purpose of the application is to provide a picture classification method, device, and storage medium, aiming to solve the technical problem of how to improve picture classification accuracy. To achieve the above object, the present application provides a picture classification method, which includes: acquiring text from a picture through an OCR engine, and splitting the text into sentences to obtain a sentence list; taking the sentence list as input of a pre-trained vector generation model, and converting the sentences in the sentence list into vectors through the vector generation model to obtain a vector sequence; performing word segmentation on each sentence in the sentence list, and determining a text weight for each sentence based on the word frequency and inverse document frequency of each word in the sentence; and performing a weighted average of the text weights and the vector sequence to obtain a document vector, and determining a category label corresponding to the document vector through a pre-trained classification model as the category to which the picture belongs. In one embodiment, the step of taking the sentence list as input of a pre-trained vector generation model and converting the sentences in the sentence list into vectors through the vector generation model to obtain a vector sequence includes: taking the sentence list as input of the vector generation model, and converting each sentence in the sentence list into tokens and an attention mask through an input layer of the vector generation model; taking the context feature of each token as the hidden state of the token through a Transformer layer of the vector generation model, and generating the vector sequence by a weighted average of the attention mask corresponding to each token and the hidden state; and obtaining the vector sequence output by an output layer of the vector generation model.
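The attention-mask pooling described in the embodiment above can be sketched as generic masked mean pooling in NumPy. This is one plausible reading of "a weighted average of the attention mask and the hidden states", assumed here for illustration; the actual model internals are not specified in the disclosure.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average per-token hidden states into one sentence vector,
    using the attention mask to exclude padding tokens."""
    h = np.asarray(hidden_states, dtype=float)                # (batch, seq_len, dim)
    m = np.asarray(attention_mask, dtype=float)[:, :, None]   # (batch, seq_len, 1)
    summed = (h * m).sum(axis=1)                 # mask-weighted sum over tokens
    counts = np.clip(m.sum(axis=1), 1e-9, None)  # number of real tokens per sentence
    return summed / counts                       # (batch, dim)
```

Padding positions get mask value 0 and so contribute nothing to the average; each sentence vector is the mean of its real tokens' hidden states only.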
In one embodiment, before the step of performing a weighted average of the text weights and the vector sequence to obtain a document vector and determining a category label corresponding to the document vector through a pre-trained classification model as the category to which the picture belongs, the method further includes: determining the character length of the sentence list, and determining a text density score according to the ratio of the character length to the total pixel area of the picture; extracting a visual feature vector for each pixel region of the picture through a pre-trained image processing model, and determining a predicted text score for each pixel region based on the visual feature vector, wherein the predicted text score represents the likelihood that text is present in the pixel region; determining the difference between the sum of the predicted text scores of the pixel regions and the text density score, and determining a confidence score for the sentence list according to the difference; and when the confidence score is greater than or equal to a preset threshold, executing the step of weighted averaging the text weight and the vector