CN-122019851-A - Cross-modal retrieval method, system and electronic equipment
Abstract
The invention provides a cross-modal retrieval method, system and electronic equipment. The method comprises: identifying, for a retrieval request, the target modality to be retrieved; extracting features of first-modality data to obtain a first semantic vector; mapping the first semantic vector to a unified semantic space through a pre-trained mapping matrix to obtain a second semantic vector; splitting the second semantic vector into first semantic features of different levels through a hierarchical feature coding model; calculating the similarity of the first semantic features and candidate semantic features at each of the different levels; obtaining a comprehensive similarity based on the per-level similarities and weights; and screening N candidate semantic features from the plurality of candidate semantic features corresponding to the target modality according to the comprehensive similarity. The technical scheme of the invention can improve the accuracy of cross-modal retrieval.
Inventors
- WANG CHUAN
- GU JIA
- ZHANG LONGFEI
- DENG XUESHOU
- ZHANG GONGBIN
- ZHANG DINGJUN
Assignees
- CRRC Qingdao Sifang Co., Ltd. (中车青岛四方机车车辆股份有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-13
Claims (20)
- 1. A cross-modal retrieval method, the method comprising: in response to receiving a retrieval request, identifying a target modality that the retrieval request needs to retrieve; extracting features of first-modality data included in the retrieval request to obtain a first semantic vector, wherein the modality category of the first-modality data is different from or the same as that of the target modality; mapping the first semantic vector to a unified semantic space through a pre-trained mapping matrix to obtain a second semantic vector, wherein the mapping matrix is used for semantically aligning the semantic vectors of different modalities; inputting the second semantic vector into a pre-trained hierarchical feature coding model, and splitting the second semantic vector into first semantic features of different levels through the hierarchical feature coding model; indexing candidate semantic features corresponding to the target modality in a pre-built data center; calculating the similarity of the first semantic features and the candidate semantic features at each of the different levels; acquiring the weight corresponding to each of the different levels; obtaining a comprehensive similarity based on the per-level similarities and weights; and screening N candidate semantic features from the candidate semantic features corresponding to the target modality according to the comprehensive similarity, and taking the N original files corresponding to the N candidate semantic features as the retrieval results.
- 2. The method of claim 1, wherein the different levels comprise global, local and attribute levels; and splitting the second semantic vector into first semantic features of different levels by the hierarchical feature coding model comprises: splitting the second semantic vector into first semantic features respectively corresponding to the three levels of global, local and attribute through the hierarchical feature coding model.
- 3. The method of claim 2, wherein calculating the similarity of the first semantic features and the candidate semantic features at each of the different levels comprises: calculating a first similarity of the first semantic features and the candidate semantic features at the global level; calculating a second similarity of the first semantic features and the candidate semantic features at the local level; and calculating a third similarity of the first semantic features and the candidate semantic features at the attribute level.
- 4. The method of claim 2, wherein splitting the second semantic vector into first semantic features respectively corresponding to the three levels of global, local and attribute by the hierarchical feature coding model comprises: in response to identifying that the modality category of the first-modality data is text, invoking a first hierarchical feature coding model, and splitting the second semantic vector through the first hierarchical feature coding model into an overall semantic feature corresponding to the global level, a core viewpoint feature corresponding to the local level and an entity feature corresponding to the attribute level; or, alternatively, in response to identifying that the modality category of the first-modality data is image or video frame, invoking a second hierarchical feature coding model, and splitting the second semantic vector through the second hierarchical feature coding model into an overall scene feature corresponding to the global level, an equipment region feature corresponding to the local level and a first attribute feature corresponding to the attribute level; or, alternatively, in response to identifying that the modality category of the first-modality data is audio, invoking a third hierarchical feature coding model, and splitting the second semantic vector through the third hierarchical feature coding model into an overall sound feature corresponding to the global level, an event feature corresponding to a sound type at the local level and a first attribute feature corresponding to the attribute level, wherein the first attribute feature is used for representing at least one of the frequency and intensity corresponding to the first-modality data.
- 5. The method of claim 4, wherein: the first hierarchical feature coding model comprises a first Transformer module, a second Transformer module and a third Transformer module, the first Transformer module being used for extracting the overall semantic feature from the input second semantic vector, the second Transformer module being used for extracting the core viewpoint feature from the input second semantic vector, and the third Transformer module being used for extracting the entity feature from the input second semantic vector; the second hierarchical feature coding model comprises a first coding module, a second coding module and a third coding module, the first coding module being used for extracting the overall scene feature from the second semantic vector, the second coding module being used for extracting the equipment region feature from the second semantic vector, and the third coding module being used for extracting at least one of a color, shape or state feature from the second semantic vector, wherein the first coding module, the second coding module or the third coding module is a convolutional neural network (CNN) or a Vision Transformer (ViT) based on the Transformer architecture; and the third hierarchical feature coding model comprises a fourth Transformer module, a fifth Transformer module and a sixth Transformer module, the fourth Transformer module being used for extracting the overall sound feature from the input second semantic vector, the fifth Transformer module being used for extracting the event feature corresponding to a sound type from the input second semantic vector, and the sixth Transformer module being used for extracting at least one of frequency and intensity from the input second semantic vector (a sketch of such a three-branch encoder follows the claims).
- 6. The method of any of claims 1-5, wherein before acquiring the weights respectively corresponding to each of the different levels, the method further comprises: identifying the modality category of the first-modality data; and acquiring the weights respectively corresponding to each of the different levels comprises: invoking a pre-trained dynamic weight distribution model, inputting the modality category of the first-modality data and the modality category of the target modality into the dynamic weight distribution model, and outputting the weights respectively corresponding to each of the different levels through the dynamic weight distribution model, wherein the dynamic weight distribution model is a model adopting an attention mechanism and comprises an input layer, an attention layer and an output layer (a sketch follows the claims).
- 7. The method of claim 1, wherein before mapping the first semantic vector to the unified semantic space via the pre-trained mapping matrix to obtain the second semantic vector, the method further comprises: acquiring training samples, wherein the training samples comprise positive sample pairs and negative sample pairs, a positive sample pair comprising a first-modality sample and a second-modality sample that belong to the same scene and are semantically aligned, and a negative sample pair comprising a third-modality sample and a fourth-modality sample that belong to different scenes and are not semantically aligned; and training an initial mapping matrix on the training samples, with the training objectives of minimizing the distance between the two samples in a positive sample pair and maximizing the distance between the two samples in a negative sample pair, to obtain the pre-trained mapping matrix (a sketch follows the claims).
- 8. The method of claim 1, wherein the retrieval request is a retrieval request submitted by a user through a basic search portal; before extracting features of the first-modality data included in the retrieval request, the method further comprises: performing lightweight processing on the first-modality data, wherein the lightweight processing comprises removing function words and retaining keywords for first-modality data whose modality category is text; splitting the second semantic vector into first semantic features of different levels through the hierarchical feature coding model comprises: splitting the second semantic vector into first semantic features respectively corresponding to the global level and the local level through the hierarchical feature coding model; calculating the similarity of the first semantic features and the candidate semantic features at each of the different levels comprises: calculating a first similarity of the first semantic features and the candidate semantic features at the global level and a second similarity at the local level; and acquiring the weights respectively corresponding to each of the different levels comprises: invoking the pre-trained dynamic weight distribution model, inputting the modality category of the first-modality data and the modality category of the target modality into the dynamic weight distribution model, and outputting a first weight corresponding to the global level and a second weight corresponding to the local level through the dynamic weight distribution model.
- 9. The method of claim 1, wherein the retrieval request is a retrieval request submitted by a user through an advanced search portal, the retrieval request including retrieval parameters comprising at least one of a confidence threshold, a maximum number of returns and a time range; and after taking the N original files corresponding to the N candidate semantic features as retrieval results, the method further comprises: selecting, according to the retrieval parameters, M retrieval results whose confidence is greater than or equal to the confidence threshold and/or whose upload time falls within the time range from the N retrieval results, wherein M is less than or equal to the maximum number of returns (a sketch follows the claims).
- 10. The method according to claim 1, wherein after taking the N original files corresponding to the N candidate semantic features as retrieval results, the method further comprises: exporting metadata corresponding to the retrieval results, wherein the metadata comprises at least one of an original file ID, the comprehensive similarity, the name or ID of the data set from which the original file is derived, the upload time, the uploader and a source description, and the original file ID is used for linking to the corresponding original file for downloading or viewing by the user; and/or exporting the N original files in the retrieval results or a compressed package of the N original files.
- 11. The method of claim 1, wherein the candidate semantic features corresponding to the target modality and the original files corresponding to the candidate semantic features are stored in the data center, wherein the data center stores data sets of different categories, the data sets of different categories comprising personal data sets, platform data sets and public data sets.
- 12. The method of claim 1, wherein the candidate semantic features corresponding to the target modality and the original files corresponding to the candidate semantic features are stored in the data center, wherein the data center stores data sets of different categories, the data sets of different categories comprising data sets associated with different semantic tags; before indexing the candidate semantic features corresponding to the target modality in the pre-built data center, the method further comprises: constructing a multi-modal index library using the FAISS tool, and identifying a first semantic tag corresponding to the first-modality data; and indexing the candidate semantic features corresponding to the target modality in the pre-built data center comprises: indexing, according to the first semantic tag, a first target data set associated with the first semantic tag in the multi-modal index library using the FAISS tool, and indexing the candidate semantic features corresponding to the target modality in the first target data set using the FAISS tool (a FAISS sketch follows the claims).
- 13. The method of claim 12, wherein after taking the N original files corresponding to the N candidate semantic features as retrieval results, the method further comprises: receiving a second semantic tag with which the user annotates a retrieval result, recording the association between the second semantic tag and the retrieval result, and storing the association in the data center.
- 14. The method according to claim 1, wherein the method further comprises: counting, from a plurality of retrieval requests, the number of occurrences of each combination of query modality and target modality requested by users; determining a combination whose count exceeds a preset threshold as a high-frequency combination and a combination whose count is less than or equal to the preset threshold as a low-frequency combination; preferentially updating and/or compressing the index information corresponding to high-frequency combinations; and archiving low-frequency combinations.
- 15. The method of claim 1, wherein before extracting features of the first-modality data included in the retrieval request, the method further comprises: in response to the modality category of the first-modality data being text, converting the first-modality data into UTF-8-encoded text data in TXT format; in response to the modality category of the first-modality data being image or video frame, performing frame extraction on the video frames to obtain images, and converting the images into a unified image format with a unified resolution; in response to the modality category of the first-modality data being audio, converting the first-modality data into a unified audio format with a unified sampling rate; preprocessing the image data in the unified image format, the audio data in the unified audio format or the text data in TXT format to obtain preprocessed first-modality data; and extracting basic features from the preprocessed first-modality data to obtain initial features; and extracting features of the first-modality data included in the retrieval request comprises: extracting features from the initial features corresponding to the first-modality data included in the retrieval request.
- 16. The method of claim 15, wherein preprocessing the image data in the unified image format, the audio data in the unified audio format or the text data in TXT format comprises: removing Gaussian noise from the image data in the unified image format through OpenCV; or, alternatively, trimming silent segments from the audio data in the unified audio format through Librosa; or, alternatively, removing specified symbols from the text data in TXT format through regular expressions.
- 17. The method of claim 15, wherein extracting basic features from the preprocessed first-modality data to obtain initial features comprises: calculating a color histogram and edge features of the image data from which Gaussian noise has been removed, to obtain initial image features; or, alternatively, extracting Mel-frequency spectral coefficients from the audio data from which silent segments have been trimmed, to obtain initial audio features; or, alternatively, segmenting the text data from which the specified symbols have been removed, generating a bag-of-words vector, and taking the bag-of-words vector as the initial text features (a preprocessing and feature-extraction sketch follows the claims).
- 18. A cross-modal retrieval system, wherein the system comprises a data center, a search engine and a multi-modal retrieval module; the multi-modal retrieval module is used for identifying, in response to receiving a retrieval request, the target modality that the retrieval request needs to retrieve; the search engine comprises a feature extraction unit, a semantic alignment unit, a hierarchical feature coding model and an indexing unit, wherein the feature extraction unit is used for extracting features of first-modality data included in the retrieval request to obtain a first semantic vector, the modality category of the first-modality data being different from or the same as that of the target modality, and the semantic alignment unit is used for mapping the first semantic vector to a unified semantic space through a pre-trained mapping matrix to obtain a second semantic vector, the mapping matrix being used for semantically aligning the semantic vectors of different modalities; the search engine is used for inputting the second semantic vector into the pre-trained hierarchical feature coding model and splitting the second semantic vector into first semantic features of different levels through the hierarchical feature coding model; the indexing unit is used for indexing candidate semantic features corresponding to the target modality in the pre-constructed data center; the multi-modal retrieval module comprises a similarity calculation unit, which is used for calculating the similarity of the first semantic features and the candidate semantic features at each of the different levels, acquiring the weight corresponding to each of the different levels, and obtaining a comprehensive similarity based on the per-level similarities and weights; and the multi-modal retrieval module is used for screening N candidate semantic features from the plurality of candidate semantic features corresponding to the target modality according to the comprehensive similarity, and taking the N original files corresponding to the N candidate semantic features as retrieval results.
- 19. The system of claim 18, wherein the different levels comprise global, local and attribute levels; and the search engine is specifically configured to split the second semantic vector into first semantic features respectively corresponding to the three levels of global, local and attribute through the hierarchical feature coding model.
- 20. The system according to claim 19, wherein the search engine is specifically configured to: in response to identifying that the modality category of the first-modality data is text, invoke the first hierarchical feature coding model, and split the second semantic vector through the first hierarchical feature coding model into an overall semantic feature corresponding to the global level, a core viewpoint feature corresponding to the local level and an entity feature corresponding to the attribute level; or, alternatively, in response to identifying that the modality category of the first-modality data is image or video frame, invoke the second hierarchical feature coding model, and split the second semantic vector through the second hierarchical feature coding model into an overall scene feature corresponding to the global level, an equipment region feature corresponding to the local level and a first attribute feature corresponding to the attribute level; or, alternatively, in response to identifying that the modality category of the first-modality data is audio, invoke the third hierarchical feature coding model, and split the second semantic vector through the third hierarchical feature coding model into an overall sound feature corresponding to the global level, an event feature corresponding to a sound type at the local level and a first attribute feature corresponding to the attribute level, wherein the first attribute feature is used for representing at least one of the frequency and intensity corresponding to the first-modality data.
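The three-branch split described in claims 2-5 can be read as one Transformer-style encoder per level. Below is a minimal PyTorch sketch, assuming each branch is a one-layer Transformer encoder over the second semantic vector reshaped into a short token sequence; all dimensions, the pooling, and the class name are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class HierarchicalFeatureEncoder(nn.Module):
    """Split a unified semantic vector into global / local / attribute features.

    One Transformer-encoder branch per level, as in claim 5 (the image branch
    could equally be a CNN or ViT). Dimensions here are illustrative.
    """
    def __init__(self, dim: int = 512, n_tokens: int = 8, n_heads: int = 4):
        super().__init__()
        assert dim % n_tokens == 0
        self.n_tokens, self.tok_dim = n_tokens, dim // n_tokens
        def branch():
            layer = nn.TransformerEncoderLayer(
                d_model=self.tok_dim, nhead=n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=1)
        self.global_branch = branch()
        self.local_branch = branch()
        self.attribute_branch = branch()

    def forward(self, x: torch.Tensor) -> dict:
        # Reshape the second semantic vector into a short token sequence.
        tokens = x.view(x.size(0), self.n_tokens, self.tok_dim)
        # Each branch yields one level's first semantic feature (mean-pooled).
        return {
            "global": self.global_branch(tokens).mean(dim=1),
            "local": self.local_branch(tokens).mean(dim=1),
            "attribute": self.attribute_branch(tokens).mean(dim=1),
        }

enc = HierarchicalFeatureEncoder()
feats = enc(torch.randn(2, 512))   # batch of 2 second semantic vectors
print({k: v.shape for k, v in feats.items()})
```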
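Claim 6's dynamic weight distribution model maps the query and target modality categories to per-level weights via an input layer, an attention layer and an output layer. A minimal PyTorch sketch of that structure follows; the embedding size, head count, and the final softmax (so the weights sum to 1) are assumptions.

```python
import torch
import torch.nn as nn

class DynamicWeightModel(nn.Module):
    """Map (query modality, target modality) to per-level weights (claim 6)."""
    MODALITIES = {"text": 0, "image": 1, "audio": 2}

    def __init__(self, embed_dim: int = 32, n_levels: int = 3):
        super().__init__()
        self.embed = nn.Embedding(len(self.MODALITIES), embed_dim)   # input layer
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4,
                                          batch_first=True)          # attention layer
        self.out = nn.Linear(embed_dim, n_levels)                    # output layer

    def forward(self, query_modality: str, target_modality: str) -> torch.Tensor:
        ids = torch.tensor([[self.MODALITIES[query_modality],
                             self.MODALITIES[target_modality]]])
        seq = self.embed(ids)                    # (1, 2, embed_dim)
        attended, _ = self.attn(seq, seq, seq)   # self-attention over the modality pair
        # Softmax so the global / local / attribute weights sum to 1.
        return torch.softmax(self.out(attended.mean(dim=1)), dim=-1).squeeze(0)

weights = DynamicWeightModel()("text", "image")
print(weights)   # tensor of [w_global, w_local, w_attribute], summing to 1
```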
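Claim 7 trains the mapping matrix so that positive (same-scene, semantically aligned) pairs end up close in the unified space and negative pairs end up far apart. Below is a minimal PyTorch sketch that uses a margin hinge as a stand-in for the "maximize negative-pair distance" objective; the function name and all hyperparameters are illustrative.

```python
import torch

def train_mapping_matrix(pos_pairs, neg_pairs, dim=512, steps=200,
                         margin=0.5, lr=1e-3):
    """Train a linear mapping into the unified semantic space (claim 7).

    pos_pairs / neg_pairs are lists of (vec_a, vec_b) tensor pairs:
    semantically aligned same-scene pairs vs. non-aligned different-scene
    pairs. Positives are pulled together; negatives are pushed at least
    `margin` apart (a hinge stand-in for distance maximization).
    """
    W = torch.randn(dim, dim, requires_grad=True)
    opt = torch.optim.Adam([W], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.tensor(0.0)
        for a, b in pos_pairs:   # minimize distance for positive pairs
            loss = loss + torch.dist(a @ W, b @ W)
        for a, b in neg_pairs:   # penalize negatives closer than the margin
            loss = loss + torch.relu(margin - torch.dist(a @ W, b @ W))
        loss.backward()
        opt.step()
    return W.detach()

# Toy usage with random stand-in sample pairs.
pos = [(torch.randn(512), torch.randn(512)) for _ in range(8)]
neg = [(torch.randn(512), torch.randn(512)) for _ in range(8)]
W = train_mapping_matrix(pos, neg)
```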
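Claim 9's advanced-search filtering reduces to a simple pass over the N results. A minimal sketch, assuming each result is a dict with `confidence` and `upload_time` fields (an illustrative schema, not from the patent):

```python
def filter_results(results, confidence_threshold=None, time_range=None,
                   max_returns=None):
    """Apply claim-9 retrieval parameters to the N retrieval results."""
    kept = []
    for r in results:
        # Drop results below the confidence threshold, if one is given.
        if confidence_threshold is not None and r["confidence"] < confidence_threshold:
            continue
        # Drop results whose upload time falls outside the time range.
        if time_range is not None:
            start, end = time_range
            if not (start <= r["upload_time"] <= end):
                continue
        kept.append(r)
    # M results, with M <= the maximum number of returns.
    return kept[:max_returns] if max_returns is not None else kept
```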
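Claim 12 names the FAISS tool for building and querying the multi-modal index library. A minimal sketch follows, assuming an exact inner-product index over L2-normalized vectors (the patent does not specify the index type) and random placeholder data:

```python
import faiss
import numpy as np

dim = 512
# Candidate semantic features for one target modality (placeholder data).
candidates = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(candidates)          # cosine similarity via inner product

index = faiss.IndexFlatIP(dim)          # flat inner-product index
index.add(candidates)                   # build the multi-modal index library

# The second semantic vector derived from the retrieval request.
query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)   # top-10 candidate semantic features
print(ids[0], scores[0])
```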
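Claims 15-17 describe per-modality preprocessing and basic feature extraction. A minimal sketch using OpenCV, Librosa and scikit-learn; the target resolution, kernel size, sampling rate and MFCC count are illustrative assumptions, and MFCCs stand in for the claimed Mel-frequency spectral coefficients.

```python
import re
import cv2                      # OpenCV: denoising, histograms, edges
import librosa                  # audio: silence trimming, spectral features
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def preprocess_and_extract_image(path: str) -> np.ndarray:
    img = cv2.imread(path)
    img = cv2.resize(img, (224, 224))            # unified resolution (assumed)
    img = cv2.GaussianBlur(img, (5, 5), 0)       # remove Gaussian noise (claim 16)
    hist = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8],
                        [0, 256] * 3).flatten()  # color histogram (claim 17)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200).flatten()  # edge features (claim 17)
    return np.concatenate([hist, edges.astype(np.float32)])

def preprocess_and_extract_audio(path: str, sr: int = 16_000) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)             # unified sampling rate
    y, _ = librosa.effects.trim(y)               # trim silent segments (claim 16)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # Mel-spectral coefficients
    return mfcc.mean(axis=1)                     # initial audio features

def preprocess_and_extract_text(text: str) -> np.ndarray:
    text = re.sub(r"[^\w\s]", " ", text)         # strip specified symbols (claim 16)
    bow = CountVectorizer().fit_transform([text])  # bag-of-words vector (claim 17)
    return bow.toarray()[0]                      # initial text features
```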
Description
Cross-modal retrieval method, system and electronic equipment

Technical Field

The invention relates to the technical field of rail transit equipment manufacturing, and provides a cross-modal retrieval method, system and electronic equipment.

Background

Existing data retrieval methods are mostly limited to single-modality retrieval, such as text retrieval and image retrieval, and cannot meet users' requirements for cross-modal data retrieval. Cross-modal retrieval refers to a technology of searching one type of data (the query modality) for another type of data (the target modality); it breaks through the limitation of traditional retrieval that the query modality and the target modality must be of the same type, for example searching for pictures or videos by text, searching for text by pictures, or searching for videos by voice. In some current cross-modal retrieval methods, the returned results sometimes deviate from the user's actual expectation: the returned results are not what the user actually needs, the retrieval precision is low, and user experience suffers. In particular, in production or maintenance scenarios in rail transit equipment manufacturing, some current cross-modal retrieval schemes cannot meet the cross-modal retrieval requirements of those scenarios.

Disclosure of Invention

The embodiments of the invention provide a cross-modal retrieval method, system and electronic equipment, which can improve retrieval precision and meet the cross-modal retrieval requirements of rail transit equipment manufacturing scenarios. An embodiment of the invention provides a cross-modal retrieval method, which comprises: in response to a retrieval request, identifying the target modality that the retrieval request needs to retrieve; extracting features of first-modality data included in the retrieval request to obtain a first semantic vector, the modality category of the first-modality data being different from or the same as that of the target modality; mapping the first semantic vector to a unified semantic space through a pre-trained mapping matrix to obtain a second semantic vector, the mapping matrix being used for semantically aligning the semantic vectors of different modalities; inputting the second semantic vector into a pre-trained hierarchical feature coding model, and splitting the second semantic vector into first semantic features of different levels through the hierarchical feature coding model; indexing candidate semantic features corresponding to the target modality in a pre-built data center; calculating the similarity of the first semantic features and the candidate semantic features at each of the different levels; acquiring the weight corresponding to each of the different levels; obtaining a comprehensive similarity based on the per-level similarities and weights (as sketched below); and screening N candidate semantic features from the plurality of candidate semantic features corresponding to the target modality according to the comprehensive similarity, and taking the N original files corresponding to the N candidate semantic features as the retrieval results.
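As a concrete reading of the weighted scoring and screening steps above, here is a minimal NumPy sketch; the function names and the example weights are illustrative, not taken from the patent.

```python
import numpy as np

def comprehensive_similarity(level_sims: dict, level_weights: dict) -> np.ndarray:
    """Combine per-level similarities into one score per candidate.

    level_sims maps a level name ("global", "local", "attribute") to an
    array of similarities, one entry per candidate semantic feature;
    level_weights maps the same names to the weights produced by the
    dynamic weight distribution model.
    """
    levels = list(level_sims)
    scores = np.zeros_like(level_sims[levels[0]], dtype=np.float64)
    for level in levels:
        scores += level_weights[level] * level_sims[level]
    return scores

def screen_top_n(scores: np.ndarray, n: int) -> np.ndarray:
    """Return the indices of the N candidates with the highest scores."""
    return np.argsort(scores)[::-1][:n]

# Toy usage: 5 candidates, three levels, weights summing to 1.
sims = {
    "global":    np.array([0.91, 0.40, 0.75, 0.22, 0.64]),
    "local":     np.array([0.80, 0.55, 0.70, 0.30, 0.60]),
    "attribute": np.array([0.70, 0.60, 0.95, 0.10, 0.50]),
}
weights = {"global": 0.5, "local": 0.3, "attribute": 0.2}
top = screen_top_n(comprehensive_similarity(sims, weights), n=3)
print(top)   # candidate indices ordered by comprehensive similarity
```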
According to one embodiment of the invention, the different levels comprise global, local and attribute levels, and splitting the second semantic vector into first semantic features of different levels through the hierarchical feature coding model comprises splitting the second semantic vector into first semantic features respectively corresponding to the three levels of global, local and attribute through the hierarchical feature coding model.

According to one embodiment of the invention, calculating the similarity of the first semantic features and the candidate semantic features at each of the different levels comprises calculating a first similarity of the first semantic features and the candidate semantic features at the global level, calculating a second similarity of the first semantic features and the candidate semantic features at the local level, and calculating a third similarity of the first semantic features and the candidate semantic features at the attribute level.

According to one embodiment of the invention, splitting the second semantic vector into first semantic features respectively corresponding to the three levels of global, local and attribute through the hierarchical feature coding model comprises: in response to identifying that the modality category of the first-modality data is text, invoking the first hierarchical feature coding model, and splitting the second semantic vector through the first hierarchical feature coding model into an overall semantic feature corresponding to the global level, a core viewpoint feature corresponding to the local level and an entity feature corresponding to the attribute level; or, in response to identifying that the modality category of the first-modality data is image or video frame, invoking the second hierarchical feature coding model, and splitting the second semantic vector through the second hierarchical feature coding model into an overall scene feature corresponding to the global level, an equipment region feature corresponding to the local level and a first attribute feature corresponding to the attribute level.