CN-117173483-B - Object identification method, device, equipment and storage medium

CN 117173483 B

Abstract

The application discloses an object identification method, device, equipment, and storage medium. The method comprises: acquiring a plurality of candidate categories of a text modality corresponding to the picture to be recognized, wherein the candidate categories include the real category of the object in the picture to be recognized; and calculating the similarity between the visual features of the picture to be recognized and the text features of each candidate category, and taking the candidate category with the highest similarity as the target category of the object in the picture to be recognized. By drawing on the strong general knowledge representation capability of a multimodal large model, the application extracts features more accurately and performs object recognition based on the extracted features, improving recognition accuracy and avoiding two problems of conventional object recognition models: low accuracy when training data is insufficient, and low accuracy on pictures captured from unusual viewing angles.
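The core matching step described in the abstract, choosing the candidate category whose text feature is most similar to the picture's visual feature, can be sketched as follows. This is a minimal illustration only: the random vectors stand in for the multimodal model's encoder outputs, and `cosine_similarity` and `classify` are illustrative names, not from the patent.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(visual_feat: np.ndarray, text_feats: dict) -> str:
    # Pick the candidate category whose text feature is most similar
    # to the visual feature of the picture to be recognized.
    scores = {cat: cosine_similarity(visual_feat, f) for cat, f in text_feats.items()}
    return max(scores, key=scores.get)

# Toy stand-ins for encoder outputs (real features would come from the
# multimodal large model's visual and text encoders).
rng = np.random.default_rng(0)
sofa = rng.normal(size=512)
bed = rng.normal(size=512)
visual = sofa + 0.1 * rng.normal(size=512)  # image feature close to "sofa"

print(classify(visual, {"sofa": sofa, "bed": bed}))  # prints "sofa"
```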

Inventors

  • WU JIAJIA
  • ZHANG YUAN
  • LAI JIAJUN
  • YIN BING
  • HU JINSHUI

Assignees

  • iFLYTEK Co., Ltd. (科大讯飞股份有限公司)

Dates

Publication Date
2026-05-05
Application Date
2023-09-15

Claims (10)

  1. An object recognition method, comprising: acquiring a plurality of candidate categories of a text modality corresponding to the picture to be recognized, wherein the candidate categories include the real category of the object in the picture to be recognized; extracting text features of each candidate category using a configured multimodal large model, and extracting visual features of the picture to be recognized using the multimodal large model; calculating a first similarity between the visual features of each frame of the picture to be recognized and the text features of each candidate category corresponding to that frame; feeding the visual features of each frame into a pre-configured sequence model to obtain the hidden-state features of each frame extracted by the sequence model, wherein the sequence model is configured to extract the hidden-state features of each frame in an input sequence of pictures to be recognized; calculating a second similarity between the hidden-state features of each frame and the text features of each candidate category corresponding to that frame; and determining a third similarity from the first similarity and the second similarity of each frame and each candidate category, and taking the candidate category with the highest third similarity as the target category of the object contained in each frame of the picture to be recognized.
  2. The method according to claim 1, wherein extracting text features of each candidate category using the configured multimodal large model and extracting visual features of the picture to be recognized using the multimodal large model comprises: performing text encoding on each candidate category using the text encoder of the multimodal large model to obtain the text features of each candidate category; and performing visual encoding on the picture to be recognized using the visual encoder of the multimodal large model to obtain its visual features.
  3. The method of claim 1, wherein the pictures to be recognized are a plurality of frames in a video stream captured by a robot of its working area; and before calculating the similarity between the visual features of each frame and the text features of each candidate category corresponding to that frame, the method further comprises: feeding the visual features of each frame into a pre-configured sequence model to obtain the hidden-layer features of each frame extracted by the sequence model, and taking the hidden-layer features as the latest visual features of each frame, wherein the sequence model is configured to extract the hidden-layer features of each frame in an input sequence of pictures to be recognized.
  4. The method of claim 1, wherein determining a third similarity from the first similarity and the second similarity of each frame and each candidate category comprises: averaging the first similarity and the second similarity of each frame of the picture to be recognized and each candidate category, and taking the averaged similarity as the third similarity.
  5. The method according to claim 1 or 3, wherein the working area of the robot comprises at least one room, and feeding the visual features of each frame of the picture to be recognized into the pre-configured sequence model comprises: if the current frame is the first picture captured by the robot after entering a room in the working area, resetting the hidden-state features of the sequence model to 0 before feeding the visual features of the current frame into the sequence model.
  6. The method of claim 1, wherein the pictures to be recognized are a number of frames in a video stream captured by the robot of its working area, the method further comprising: acquiring a floor plan of the working area corresponding to the robot; determining the position, in the floor plan, of the object in each captured picture to be recognized; and displaying an object icon of the target category at the position in the floor plan corresponding to the object, according to the target category of the object in each frame.
  7. The method according to any one of claims 1-4 and 6, wherein acquiring a plurality of candidate categories of a text modality corresponding to the picture to be recognized comprises: feeding the picture to be recognized into a pre-configured object recognition model to obtain a plurality of candidate categories output by the object recognition model, wherein the object recognition model is trained on training pictures annotated with object category labels.
  8. An object recognition apparatus, comprising: a candidate category acquisition unit for acquiring a plurality of candidate categories of a text modality corresponding to the picture to be recognized, wherein the candidate categories include the real category of the object in the picture to be recognized; a feature extraction unit for extracting text features of each candidate category using a configured multimodal large model and extracting visual features of the picture to be recognized using the multimodal large model; a similarity calculation unit for calculating a first similarity between the visual features of each frame of the picture to be recognized and the text features of each candidate category corresponding to that frame, feeding the visual features of each frame into a pre-configured sequence model to obtain the hidden-state features of each frame extracted by the sequence model, the sequence model being configured to extract the hidden-state features of each frame in an input picture sequence, calculating a second similarity between the hidden-state features of each frame and the text features of each candidate category, and determining a third similarity from the first similarity and the second similarity; and a target category determination unit for taking the candidate category with the highest third similarity as the target category of the object contained in each frame of the picture to be recognized.
  9. An object recognition device, comprising a memory and a processor, wherein the memory is configured to store a program, and the processor is configured to execute the program to implement the steps of the object recognition method according to any one of claims 1 to 7.
  10. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the object recognition method according to any one of claims 1 to 7.
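Taken together, claims 1, 4, and 5 describe a per-frame pipeline: a first similarity from the frame's visual features, a second similarity from the hidden-state features of a sequence model that is reset to 0 when the robot enters a new room, and a third similarity obtained by averaging the two. The sketch below illustrates that flow under stated assumptions: `SimpleSequenceModel` (a decayed running mix of features) is a hypothetical stand-in for the patent's pre-configured sequence model, which would in practice be a trained recurrent network, and the toy vectors stand in for encoder outputs.

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class SimpleSequenceModel:
    """Hypothetical stand-in for the pre-configured sequence model: it
    carries a hidden state across the frame sequence (here a decayed
    running mix of visual features; a real system might use a GRU)."""
    def __init__(self, dim: int, decay: float = 0.5):
        self.h = np.zeros(dim)
        self.decay = decay

    def reset(self) -> None:
        # Claim 5: reset the hidden state to 0 when the current frame is
        # the first picture captured after the robot enters a new room.
        self.h = np.zeros_like(self.h)

    def step(self, visual_feat: np.ndarray) -> np.ndarray:
        self.h = self.decay * self.h + (1.0 - self.decay) * visual_feat
        return self.h

def classify_frame(visual: np.ndarray, hidden: np.ndarray, text_feats: dict) -> str:
    # Claim 4: the third similarity is the average of the first
    # (visual vs. text) and second (hidden-state vs. text) similarities.
    third = {c: 0.5 * (cos(visual, f) + cos(hidden, f))
             for c, f in text_feats.items()}
    return max(third, key=third.get)

# Toy run over three frames captured in one room.
rng = np.random.default_rng(1)
sofa, bed = rng.normal(size=64), rng.normal(size=64)
text_feats = {"sofa": sofa, "bed": bed}
model = SimpleSequenceModel(dim=64)
model.reset()  # robot has just entered the room
for _ in range(3):
    frame = sofa + 0.2 * rng.normal(size=64)  # frames showing a sofa
    hidden = model.step(frame)
    print(classify_frame(frame, hidden, text_feats))  # prints "sofa"
```

Averaging in the hidden-state similarity lets temporal context accumulated over earlier frames compensate for a single badly framed shot, which is the motivation the description gives for the sequence model.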

Description

Object identification method, device, equipment and storage medium

Technical Field

The present application relates to the field of image processing technologies, and in particular to an object recognition method, apparatus, device, and storage medium.

Background

Object recognition refers to recognizing the categories of objects contained in given pictures. It is widely applied in various scenarios, for example performing object recognition on environment images captured by a robot, so as to help the robot determine the categories of objects in its working area and thereby better construct a floor plan of the working area. As shown in fig. 7, object icons of various categories, such as dining tables, sofas, and cabinets, are displayed on the floor plan of the working area.

The conventional object recognition approach generally trains an object recognition model on training pictures annotated with object category labels, and then uses the model to recognize objects in the picture to be recognized. However, the accuracy of such a model depends heavily on the scale of the training data; when training data is insufficient, the trained model tends to be insufficiently accurate. In particular, for pictures captured from unusual viewing angles, such as pictures captured from a low angle by a sweeping robot, the objects in the pictures are incomplete, and recognition accuracy drops further when a conventional model is used. Fig. 1 shows an example picture captured by a sweeping robot: because the object is only partially captured, a conventional object recognition model cannot reliably distinguish whether it is a sofa or a bed, and may therefore give an incorrect recognition result.
Disclosure of Invention

In view of the above problems, the present application provides an object recognition method, apparatus, device, and storage medium, to address the tendency of conventional object recognition models, limited by the training data scale and by the viewing angle from which the picture to be recognized is captured, to deliver low recognition accuracy. The specific scheme is as follows.

In a first aspect, an object recognition method is provided, comprising: acquiring a plurality of candidate categories of a text modality corresponding to the picture to be recognized, wherein the candidate categories include the real category of the object in the picture to be recognized; extracting text features of each candidate category using a configured multimodal large model, and extracting visual features of the picture to be recognized using the multimodal large model; and calculating the similarity between the visual features of the picture to be recognized and the text features of each candidate category, and taking the candidate category with the highest similarity as the target category of the object in the picture to be recognized.

Preferably, extracting text features of each candidate category using the configured multimodal large model and extracting visual features of the picture to be recognized using the multimodal large model includes: performing text encoding on each candidate category using the text encoder of the multimodal large model to obtain the text features of each candidate category; and performing visual encoding on the picture to be recognized using the visual encoder of the multimodal large model to obtain its visual features.
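The two-branch encoding in the first aspect, a text encoder for the candidate categories and a visual encoder for the picture, only makes the similarity comparison meaningful if both encoders map into one shared feature space, as in CLIP-style multimodal models. The mock below illustrates that interface; the deterministic hash-seeded embeddings and the `MockMultimodalModel` class are placeholders for illustration, not the patent's actual model.

```python
import hashlib
import numpy as np

def _seed(key: str) -> int:
    # Deterministic seed derived from a string (stable across runs).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

class MockMultimodalModel:
    """Hypothetical stand-in for the multimodal large model's two
    branches: both encoders embed into one shared feature space, so a
    text feature and a visual feature can be compared directly."""
    def __init__(self, dim: int = 64):
        self.dim = dim

    def _concept(self, name: str) -> np.ndarray:
        v = np.random.default_rng(_seed(name)).normal(size=self.dim)
        return v / np.linalg.norm(v)

    def encode_text(self, category: str) -> np.ndarray:
        # Text-encoder branch: embed a candidate category string.
        return self._concept(category)

    def encode_image(self, image_label: str, noise: float = 0.1) -> np.ndarray:
        # Visual-encoder branch: in this mock an "image" is just a label
        # plus noise; a real visual encoder would consume pixels.
        rng = np.random.default_rng(_seed("img:" + image_label))
        v = self._concept(image_label) + noise * rng.normal(size=self.dim)
        return v / np.linalg.norm(v)

m = MockMultimodalModel()
cats = ["sofa", "bed", "dining table"]
text_feats = {c: m.encode_text(c) for c in cats}
img = m.encode_image("sofa")
best = max(cats, key=lambda c: float(img @ text_feats[c]))
print(best)  # prints "sofa"
```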
Preferably, the pictures to be recognized are a plurality of frames in a video stream captured by the robot of its working area. Before calculating the similarity between the visual features of each frame and the text features of each candidate category corresponding to that frame, the method further comprises: feeding the visual features of each frame into a pre-configured sequence model to obtain the hidden-layer features of each frame extracted by the sequence model, and taking the hidden-layer features as the latest visual features of each frame, wherein the sequence model is configured to extract the hidden-layer features of each frame in an input sequence of pictures to be recognized.

Preferably, the pictures to be recognized are a plurality of frames in a video stream captured by the robot of its working area, and calculating the similarity between the visual features of the picture to be recognized and the text features of each candidate category comprises: calculating a first similarity between the visual