CN-114065829-B - Image recognition method and device, and model training method and device

Abstract

The image recognition method comprises: extracting a plurality of pictures to be recognized from a set to be recognized, and constructing a corresponding graph network for each picture; inputting the graph networks corresponding to the plurality of pictures to be recognized into a feature extraction model to obtain feature vectors of the pictures, wherein the feature extraction model is trained with a metric learning loss function; and clustering the feature vectors of the plurality of pictures to recognize the picture types. Because the graph network of each picture to be recognized is input into a feature extraction model trained with a metric learning loss function, the extracted feature vectors cluster more accurately, so picture types are recognized more quickly and accurately.

Inventors

LUO WEIMENG
WANG YONGPAN
ZHENG QI

Assignees

Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)

Dates

Publication Date: 20260512
Application Date: 20200810

Claims (20)

1. An image recognition method applied to pictures with text content, comprising: extracting a plurality of pictures to be identified from a set to be identified, and constructing a corresponding graph network for each picture, wherein nodes of the graph network are word embeddings of text blocks of the picture, and edges of the graph network are relative position relation features of every two text blocks in the picture; inputting the graph networks corresponding to the plurality of pictures to be identified into a feature extraction model to obtain feature vectors of the pictures, wherein the feature extraction model combines semantic and structural features and is trained through a metric learning loss function on the result of calculating a first distance between feature vectors of an initial sample picture and a negative sample picture and a second distance between feature vectors of the initial sample picture and a positive sample picture in a triplet training sample, the positive sample picture and the initial sample picture are of the same picture type, and the negative sample picture and the initial sample picture are of different picture types; and clustering the feature vectors of the plurality of pictures to identify the picture types.
2. The image recognition method of claim 1, further comprising, prior to extracting the plurality of pictures to be recognized from the set to be recognized: receiving a picture to be identified and adding the picture to be identified to the set to be identified.
3. The image recognition method according to claim 1 or 2, wherein clustering the feature vectors of the plurality of pictures to recognize a picture type comprises: clustering the feature vectors of the pictures by using a density-based clustering algorithm, and determining that the pictures whose vectors are grouped into the same cluster are of the same picture type.
4. The image recognition method according to claim 1 or 2, wherein the feature extraction model is trained by: constructing a plurality of triplet training samples, each consisting of an initial sample graph network, a positive sample graph network and a negative sample graph network corresponding respectively to an initial sample picture, a positive sample picture and a negative sample picture; inputting the plurality of triplet training samples into the feature extraction model to obtain feature vectors of the initial sample picture, the positive sample picture and the negative sample picture in each triplet training sample; and calculating a first distance between the feature vectors of the initial sample picture and the negative sample picture in each triplet training sample and a second distance between the feature vectors of the initial sample picture and the positive sample picture, inputting the first distance and the second distance into a metric learning loss function, and training the feature extraction model according to the output of the loss function until the loss function stabilizes.
5. The image recognition method according to claim 1 or 2, wherein said constructing a corresponding graph network for each picture comprises: performing optical character recognition on each picture to obtain text blocks; and setting the word embeddings of the text blocks as the nodes of the graph network, and setting the relative position relationship of every two text blocks as the edges of the graph network.
6. The image recognition method according to claim 5, wherein inputting the graph networks corresponding to the plurality of pictures to be recognized into a feature extraction model to obtain feature vectors of the pictures comprises: performing graph convolution encoding on the nodes and edges of the graph network to obtain a first dimension vector of the nodes and edges; and performing average pooling on the first dimension vector of the graph network to obtain the feature vector of the picture.
7. The image recognition method according to claim 1 or 2, further comprising: and receiving and carrying out type merging and/or screening on the pictures with the identified types, and carrying out type labeling on the pictures according to the merging and/or screening results.
8. The image recognition method according to claim 1 or 2, further comprising: and storing the pictures with the unidentified picture types into the set to be identified.
9. An image recognition method applied to pictures with text content, comprising: displaying a picture input interface for a user based on a call request of the user; receiving a plurality of pictures to be identified, which are input by the user through the picture input interface, and constructing a corresponding graph network for each picture, wherein nodes of the graph network are word embeddings of text blocks of the picture, and edges of the graph network are relative position relation features of every two text blocks in the picture; inputting the graph networks corresponding to the plurality of pictures to be identified into a feature extraction model to obtain feature vectors of the pictures, wherein the feature extraction model combines semantic and structural features and is trained through a metric learning loss function on the result of calculating a first distance between feature vectors of an initial sample picture and a negative sample picture and a second distance between feature vectors of the initial sample picture and a positive sample picture in a triplet training sample, the positive sample picture and the initial sample picture are of the same picture type, and the negative sample picture and the initial sample picture are of different picture types; and clustering the feature vectors of the pictures, identifying the picture types, and returning the picture types to the user.
10. An image recognition method applied to pictures with text content, comprising: receiving a call request sent by a user, wherein the call request carries a plurality of pictures to be identified; constructing a corresponding graph network for each picture, wherein nodes of the graph network are word embeddings of text blocks of the picture, and edges of the graph network are relative position relation features of every two text blocks in the picture; inputting the graph networks corresponding to the plurality of pictures to be identified into a feature extraction model to obtain feature vectors of the pictures, wherein the feature extraction model combines semantic and structural features and is trained through a metric learning loss function on the result of calculating a first distance between feature vectors of an initial sample picture and a negative sample picture and a second distance between feature vectors of the initial sample picture and a positive sample picture in a triplet training sample, the positive sample picture and the initial sample picture are of the same picture type, and the negative sample picture and the initial sample picture are of different picture types; and clustering the feature vectors of the pictures, identifying the picture types, and returning the picture types to the user.
11. A model training method, comprising: constructing a plurality of triplet training samples, each consisting of an initial sample graph network, a positive sample graph network and a negative sample graph network corresponding respectively to an initial sample picture, a positive sample picture and a negative sample picture, wherein constructing a graph network comprises performing optical character recognition on each picture to obtain text blocks, setting the word embeddings of the text blocks as the nodes of the graph network, and setting the relative position relationship of every two text blocks as the edges of the graph network, so as to obtain the initial sample graph network, the positive sample graph network and the negative sample graph network; inputting the plurality of triplet training samples into a feature extraction model, and combining semantic and structural features to obtain feature vectors of the initial sample picture, the positive sample picture and the negative sample picture in each triplet training sample, wherein the positive sample picture and the initial sample picture are of the same picture type, and the negative sample picture and the initial sample picture are of different picture types; and calculating a first distance between the feature vectors of the initial sample picture and the negative sample picture in each triplet training sample and a second distance between the feature vectors of the initial sample picture and the positive sample picture, inputting the first distance and the second distance into a metric learning loss function, and training the feature extraction model according to the output of the loss function until the loss function stabilizes.
12. An image recognition device applied to pictures with text content, comprising: a first construction module configured to extract a plurality of pictures to be identified from a set to be identified, and construct a corresponding graph network for each picture, wherein nodes of the graph network are word embeddings of text blocks of the picture, and edges of the graph network are relative position relation features of every two text blocks in the picture; a first obtaining module configured to input the graph networks corresponding to the plurality of pictures to be identified into a feature extraction model to obtain feature vectors of the pictures, wherein the feature extraction model combines semantic and structural features and is trained through a metric learning loss function on a first distance between feature vectors of an initial sample picture and a negative sample picture in a triplet training sample and a second distance between feature vectors of the initial sample picture and a positive sample picture, the positive sample picture and the initial sample picture are of the same picture type, and the negative sample picture and the initial sample picture are of different picture types; and a first clustering module configured to cluster the feature vectors of the plurality of pictures and identify the picture types.
13. The image recognition device of claim 12, further comprising: an adding module configured to receive a picture to be identified and add the picture to the set to be identified.
14. The image recognition device of claim 12 or 13, wherein the first clustering module is further configured to: cluster the feature vectors of the pictures by using a density-based clustering algorithm, and determine that the pictures whose vectors are grouped into the same cluster are of the same picture type.
15. The image recognition device according to claim 12 or 13, wherein the feature extraction model is trained by: constructing a plurality of triplet training samples, each consisting of an initial sample graph network, a positive sample graph network and a negative sample graph network corresponding respectively to an initial sample picture, a positive sample picture and a negative sample picture; inputting the plurality of triplet training samples into the feature extraction model, and combining semantic and structural features to obtain feature vectors of the initial sample picture, the positive sample picture and the negative sample picture in each triplet training sample; and calculating a first distance between the feature vectors of the initial sample picture and the negative sample picture in each triplet training sample and a second distance between the feature vectors of the initial sample picture and the positive sample picture, inputting the first distance and the second distance into a metric learning loss function, and training the feature extraction model according to the output of the loss function until the loss function stabilizes.
16. The image recognition device of claim 12 or 13, wherein the first construction module is further configured to: perform optical character recognition on each picture to obtain text blocks; and set the word embeddings of the text blocks as the nodes of the graph network, and set the relative position relationship of every two text blocks as the edges of the graph network.
17. The image recognition device of claim 16, wherein the first obtaining module is further configured to: perform graph convolution encoding on the nodes and edges of the graph network to obtain a first dimension vector of the nodes and edges; and perform average pooling on the first dimension vector of the graph network to obtain the feature vector of the picture.
18. The image recognition device according to claim 12 or 13, further comprising: a type labeling module configured to receive type merging and/or screening of the pictures with the identified types, and to label the picture types according to the merging and/or screening results.
19. The image recognition device according to claim 12 or 13, further comprising: a storage module configured to store the pictures with unidentified picture types into the set to be identified.
20. An image recognition device applied to pictures with text content, comprising: a display module configured to display a picture input interface for a user based on a call request of the user; a second construction module configured to receive a plurality of pictures to be identified, which are input by the user through the picture input interface, and construct a corresponding graph network for each picture, wherein nodes of the graph network are word embeddings of text blocks of the picture, and edges of the graph network are relative position relation features of every two text blocks in the picture; a second obtaining module configured to input the graph networks corresponding to the plurality of pictures to be identified into a feature extraction model to obtain feature vectors of the pictures, wherein the feature extraction model combines semantic and structural features and is trained through a metric learning loss function on a first distance between feature vectors of an initial sample picture and a negative sample picture in a triplet training sample and a second distance between feature vectors of the initial sample picture and a positive sample picture, the positive sample picture and the initial sample picture are of the same picture type, and the negative sample picture and the initial sample picture are of different picture types; and a second clustering module configured to cluster the feature vectors of the pictures, identify the picture types, and return the picture types to the user.

Description

Image recognition method and device, and model training method and device

Technical Field

The present disclosure relates to the field of computer technologies, and in particular to an image recognition method and apparatus, and a model training method and apparatus.

Background

With the development of optical character recognition technology, it is increasingly applied to recognizing picture types, for example in card classification. In practice, picture types are endless. Although each optical character recognition application scenario can return a large amount of data, that data currently sits idle because there is no capability for processing unknown data. Furthermore, existing schemes for discovering new picture types rely primarily on manual work or rules: new picture types that meet a definition must be found in large picture libraries by means such as keyword searching. This approach is time-consuming and labor-intensive, has poor robustness, and must cope with a large number of noise pictures among which valid pictures make up only a small part. There is therefore a need for faster and more accurate operations or processing for identifying picture types.

Disclosure of Invention

In view of this, the embodiments of the present disclosure provide an image recognition method and apparatus. The present disclosure also relates to a model training method and apparatus, a computing device, and a computer readable storage medium, to solve the technical drawbacks of the prior art.
According to a first aspect of embodiments of the present specification, there is provided an image recognition method, including: extracting a plurality of pictures to be identified from a set to be identified, and constructing a corresponding graph network for each picture; inputting the graph networks corresponding to the plurality of pictures to be identified into a feature extraction model to obtain feature vectors of the pictures, wherein the feature extraction model is trained with a metric learning loss function; and clustering the feature vectors of the plurality of pictures to identify the picture types.
Optionally, before extracting the plurality of pictures to be identified from the set to be identified, the method further includes: receiving a picture to be identified and adding the picture to be identified to the set to be identified.
Optionally, clustering the feature vectors of the plurality of pictures to identify a picture type includes: clustering the feature vectors of the pictures by using a density-based clustering algorithm, and determining that the pictures whose vectors are grouped into the same cluster are of the same picture type.
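The density-based clustering step can be illustrated with a minimal sketch. The text names no specific algorithm, so DBSCAN is assumed here as one common density-based choice; the function name `dbscan`, the parameters `eps` and `min_pts`, and the toy two-dimensional "feature vectors" are all hypothetical and chosen only for illustration.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point; -1 marks noise.

    Assumption: DBSCAN stands in for the unspecified density-based algorithm.
    """
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nbrs)
    return labels

# Toy feature vectors: two dense groups and one isolated vector.
vectors = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
labels = dbscan(vectors, eps=0.5, min_pts=2)
```

Pictures that share a label would be treated as the same picture type. Under this sketch, vectors labeled -1 belong to no cluster; treating them as pictures of unidentified type is an interpretation, not something the clustering step itself prescribes.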
Optionally, the feature extraction model is trained through the following steps: constructing a plurality of triplet training samples, each consisting of an initial sample graph network, a positive sample graph network and a negative sample graph network corresponding respectively to an initial sample picture, a positive sample picture and a negative sample picture; inputting the plurality of triplet training samples into the feature extraction model to obtain feature vectors of the initial sample picture, the positive sample picture and the negative sample picture in each triplet training sample; and calculating a first distance between the feature vectors of the initial sample picture and the negative sample picture in each triplet training sample and a second distance between the feature vectors of the initial sample picture and the positive sample picture, inputting the first distance and the second distance into a metric learning loss function, and training the feature extraction model according to the output of the loss function until the loss function stabilizes.
Optionally, constructing a corresponding graph network for each picture includes: performing optical character recognition on each picture to obtain text blocks; and setting the word embeddings of the text blocks as the nodes of the graph network, and setting the relative position relationship of every two text blocks as the edges of the graph network.
Optionally, inputting the graph networks corresponding to the plurality of pictures to be identified into a feature extraction model to obtain the feature vectors of the pictures includes: performing graph convolution encoding on the nodes and edges of the graph network to obtain a first dimension vector of the nodes and edges; and performing average pooling on the first dimension vector of the graph network to obtain the feature vector of the picture.
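The graph construction, encoding, pooling, and loss steps can be sketched together as follows. This is a sketch under stated assumptions, not the patented model: `build_graph`, `encode`, `triplet_loss` and `margin` are hypothetical names, the single distance-weighted message-passing step is an untrained stand-in for the learned graph convolution, and the standard triplet margin loss is assumed because the text only says that the two distances are input into a metric learning loss function.

```python
import math

def build_graph(blocks):
    """blocks: list of (word_embedding, (x, y)) pairs, one per OCR text block.

    Nodes are the word embeddings; each directed edge stores the relative
    offset between two text blocks (an assumed relative-position feature).
    """
    nodes = [emb for emb, _ in blocks]
    edges = {}
    for i, (_, (xi, yi)) in enumerate(blocks):
        for j, (_, (xj, yj)) in enumerate(blocks):
            if i != j:
                edges[(i, j)] = (xj - xi, yj - yi)
    return nodes, edges

def encode(nodes, edges):
    """One untrained message-passing step, then average pooling over nodes."""
    updated = []
    for i, h in enumerate(nodes):
        # Neighbor embeddings are down-weighted by spatial distance (assumption).
        msgs = [
            [v / (1.0 + math.hypot(*edges[(i, j)])) for v in nodes[j]]
            for j in range(len(nodes)) if j != i
        ]
        agg = [sum(col) / len(msgs) for col in zip(*msgs)] if msgs else [0.0] * len(h)
        updated.append([a + b for a, b in zip(h, agg)])
    dim = len(updated[0])
    # Average pooling: one fixed-size feature vector per picture.
    return [sum(n[d] for n in updated) / len(updated) for d in range(dim)]

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Assumed standard triplet margin loss on Euclidean distances."""
    d_neg = math.dist(anchor, negative)  # the "first distance" in the text
    d_pos = math.dist(anchor, positive)  # the "second distance" in the text
    return max(0.0, d_pos - d_neg + margin)

# Toy picture with two text blocks, using 2-dimensional embeddings.
nodes, edges = build_graph([([1.0, 0.0], (0, 0)), ([0.0, 1.0], (3, 4))])
feature = encode(nodes, edges)
```

Because the edges hold only relative offsets, shifting every text block by the same amount leaves the feature vector unchanged. The loss reaches zero once the negative sample is at least `margin` farther from the initial sample than the positive sample is, which is one plausible reading of training "until the loss function stabilizes".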
Optionally, the image recognition method further comprises: receiving and carrying out type merging and/or screening on the pictures with the identified types, and carrying out type labeling on the pictures according to the merging and/or screening results. Optionally, the image recognition method further comprises: storing the pictures with the unre