CN-122019755-A - Retrievable data asset tag generation and retrieval method and system based on a multimodal large model
Abstract
The invention discloses a method and system for generating and retrieving retrievable data asset tags based on a multimodal large model, and relates to the technical field of data asset tag generation and retrieval. The method comprises: obtaining different types of data assets and preprocessing them to obtain multimodal data; constructing a lightweight multimodal model and training it with a combined knowledge distillation and fine-tuning strategy; inputting the multimodal data into the trained model to obtain different key information; generating corresponding tags from the key information; constructing a data asset association network from the generated tags; and performing three-stage hybrid retrieval over the association network to obtain the final retrieval result. By mining the semantic associations among tags, the method intelligently expands the search results into a diverse and highly relevant set, and greatly improves the discovery accuracy and transaction matching efficiency of data assets while protecting data privacy.
Inventors
- NIU SHAOZHANG
- TU YUFEI
- CUI HAOLIANG
- ZHANG WEN
Assignees
- 东南数字经济发展研究院
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-28
Claims (8)
- 1. A method for generating and retrieving retrievable data asset tags based on a multimodal large model, comprising: acquiring different types of data assets and preprocessing the data assets to obtain multimodal data; constructing a lightweight multimodal model, and training the lightweight multimodal model using a combined knowledge distillation and fine-tuning strategy; inputting the multimodal data into the trained model to obtain different key information; generating corresponding tags according to the different key information, and constructing a data asset association network according to the generated tags; and performing three-stage hybrid retrieval according to the data asset association network to obtain a final retrieval result.
- 2. The method for generating and retrieving retrievable data asset tags based on a multimodal large model according to claim 1, wherein the different types of data assets include text data, image data and image-text mixed data, and the preprocessing comprises: for the text data, cleaning and word segmentation operations that remove invalid information and redundant content; for the image data, size normalization and enhancement operations that improve the effectiveness of image features; and for the image-text mixed data, preprocessing the text and the image separately and then aligning their information; and wherein the multimodal data comprises the preprocessed text data, the preprocessed image data and the preprocessed image-text mixed data.
- 3. The method for generating and retrieving retrievable data asset tags based on a multimodal large model according to claim 1, wherein constructing the lightweight multimodal model comprises: constructing a high-quality dataset consisting of original image-text samples covering various data assets and corresponding manually fine-annotated tags formed from content keywords; dividing the high-quality dataset into a training set and a test set according to a given proportion; selecting a teacher model and a student model; and adopting a combined strategy of distillation followed by fine-tuning, in which the teacher model runs inference on the training-set samples to generate soft labels, and the student model is fine-tuned in a targeted manner using the soft labels together with a low-rank adaptation method to obtain the trained model.
- 4. The method for generating and retrieving retrievable data asset tags based on a multimodal large model according to claim 3, wherein the multimodal data is input into the trained model, a specific system prompt instruction is preset in the trained model, and the key information is generated sequentially under that instruction, the key information comprising a core keyword set and a three-part structured description.
- 5. The method for generating and retrieving retrievable data asset tags based on a multimodal large model according to claim 4, wherein the corresponding tag is formed from the core keyword set and the three-part structured description, and the generated tag is post-processed: the core keywords are extracted and an inverted index is built over them by means of a full-text search engine, and the three-part descriptive text is converted into a high-dimensional feature vector by an embedding model and stored in a vector database.
- 6. The method for generating and retrieving retrievable data asset tags based on a multimodal large model according to claim 5, wherein constructing the data asset association network comprises: constructing a knowledge graph in which each data asset is an independent node; measuring the degree of semantic association between assets by the cosine similarity between the high-dimensional feature vectors in their tags; and, when the degree of semantic association exceeds a preset threshold, creating a "semantically highly related" edge between the two nodes in a graph database, so that a dynamic data asset association network based on semantic similarity is formed as the number of data assets continues to grow.
- 7. The method for generating and retrieving retrievable data asset tags based on a multimodal large model according to claim 6, wherein the three-stage hybrid retrieval over the data asset association network comprises: a first stage of parallel basic retrieval, in which, after a user inputs query content, a semantic similarity search is issued to the vector database and a keyword matching search is issued to the full-text search engine, and the results returned by the two are deduplicated and fused to form a basic result set; a second stage of graph association expansion, in which, taking each data asset node in the basic result set as a starting point, all associated nodes within a preset range are searched in the constructed knowledge graph, and the data assets corresponding to those associated nodes are collected to form an expanded result set; and a third stage of intelligent fusion ranking, in which the basic result set and the expanded result set are merged and a preset comprehensive ranking algorithm is applied to the merged results to obtain the final retrieval result.
- 8. A system for generating and retrieving retrievable data asset tags based on a multimodal large model, comprising: a preprocessing module, configured to acquire different types of data assets and preprocess the data assets to obtain multimodal data; a lightweight multimodal model module, configured to construct a lightweight multimodal model, train the lightweight multimodal model using a combined knowledge distillation and fine-tuning strategy, and input the multimodal data into the trained model to obtain different key information; and a data asset association network module, configured to generate corresponding tags according to the different key information, construct a data asset association network according to the generated tags, and perform three-stage hybrid retrieval according to the data asset association network to obtain a final retrieval result.
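The graph construction of claim 6 and the three-stage retrieval of claim 7 can be illustrated with a minimal in-memory sketch. It is not the claimed implementation: plain dictionaries stand in for the vector database, the full-text search engine and the graph database, and the asset names, the two-dimensional vectors, the 0.75 threshold and the scoring weights (`keyword_bonus`, `expansion_penalty`) are all hypothetical. The claims do not specify the comprehensive ranking algorithm, so the score used here is only one plausible choice. The sketch also assumes every asset has a stored vector.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_association_graph(vectors, threshold):
    """Link two asset nodes when the cosine similarity of their tag vectors exceeds the threshold."""
    graph = {asset_id: set() for asset_id in vectors}
    ids = sorted(vectors)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(vectors[a], vectors[b]) > threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def basic_retrieval(query_vec, query_terms, vectors, inverted_index, top_k):
    """Stage 1: vector search plus keyword search; the set union deduplicates the hits."""
    by_similarity = sorted(vectors, key=lambda i: cosine(query_vec, vectors[i]), reverse=True)
    semantic_hits = set(by_similarity[:top_k])
    keyword_hits = set()
    for term in query_terms:
        keyword_hits |= inverted_index.get(term, set())
    return semantic_hits | keyword_hits, keyword_hits


def expand(basic, graph, max_hops):
    """Stage 2: breadth-first walk from every basic hit, collecting new nodes within max_hops."""
    seen = set(basic)
    queue = deque((node, 0) for node in basic)
    expanded = set()
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                expanded.add(neighbor)
                queue.append((neighbor, depth + 1))
    return expanded


def hybrid_search(query_vec, query_terms, vectors, inverted_index, graph,
                  top_k=2, max_hops=1, keyword_bonus=0.2, expansion_penalty=0.1):
    """Stage 3: merge both result sets and rank them with an illustrative (assumed) score."""
    basic, keyword_hits = basic_retrieval(query_vec, query_terms, vectors, inverted_index, top_k)
    expanded = expand(basic, graph, max_hops)

    def score(asset_id):
        s = cosine(query_vec, vectors[asset_id])
        if asset_id in keyword_hits:
            s += keyword_bonus       # assumed boost for an exact keyword match
        if asset_id in expanded:
            s -= expansion_penalty   # assumed discount for graph-only results
        return s

    return sorted(basic | expanded, key=lambda i: (-score(i), i))


# Hypothetical toy data: tag vectors and a keyword inverted index.
vectors = {
    "traffic_images": [1.0, 0.0],
    "road_sensor_logs": [0.8, 0.6],
    "weather_reports": [0.6, 0.8],
    "retail_receipts": [0.0, 1.0],
}
inverted_index = {"traffic": {"traffic_images"}, "retail": {"retail_receipts"}}

graph = build_association_graph(vectors, threshold=0.75)
results = hybrid_search([1.0, 0.0], ["traffic"], vectors, inverted_index, graph)
print(results)  # ['traffic_images', 'road_sensor_logs', 'weather_reports']
```

In this toy run, `traffic_images` is returned by both the vector search and the keyword search and appears once after fusion, while `weather_reports` is reached only through the graph edge from `road_sensor_logs`, which is the kind of related-asset discovery the second stage is meant to provide.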
Description
Retrievable data asset tag generation and retrieval method and system based on a multimodal large model
Technical Field
The invention relates to the technical field of data asset tag generation and retrieval, and in particular to a method and a system for generating and retrieving retrievable data asset tags based on a multimodal large model.
Background
With the data element market developing rapidly, the data asset tag is the core metadata for unlocking the value of data and connecting supply and demand. An ideal tag system should offer both comprehensibility (so that people can grasp an asset quickly) and retrievability (so that the system can match it accurately). However, current mainstream technical solutions have significant bottlenecks in both tag generation and retrieval adaptation. In terms of tag generation, the traditional approach relies on manual annotation by experts, which is costly, slow and difficult to scale, so tag generation becomes a major bottleneck for data assets. In addition, existing automatic methods are designed for plain text or pure image data, so the generated tags characterize only a single dimension and cannot effectively handle the growing volume of multimodal data assets such as image-text mixtures, tables and reports, leaving tag information one-sided and incomplete. Moreover, descriptions supplied by data providers often carry a subjective marketing slant and may lack an objective assessment of data quality and potential application value (usability). Generating detailed tags may also require previewing the original data, which easily leaks sensitive information in the early stages of a transaction.
In terms of tag retrieval, existing retrieval modes are rigid: transaction systems typically use either keyword matching or vector similarity search alone. Keyword matching does not understand semantics, resulting in low recall, while pure vector search may return a large number of homogeneous results with poor diversity. Neither meets users' complex need for retrieval that is both accurate and broad. In addition, existing retrieval modes lack association mining: the system treats each data asset as an isolated island and fails to build a deep semantic association network between assets. As a result, it cannot provide the intelligent navigation effect of finding one asset and then discovering a series of related assets, which limits chained mining of data value. In view of the foregoing, the industry needs an innovative solution that addresses these pain points end to end, that is, a system that automatically generates objective, comprehensive and structured tags and, based on those tags, provides retrieval that is accurate, intelligent and privacy-preserving.
Disclosure of Invention
In view of the above, the invention provides a method and a system for generating and retrieving retrievable data asset tags based on a multimodal large model. It aims to solve real technical pain points in data asset transaction and management, such as the low efficiency of manual annotation, incomplete single-modality characterization, highly subjective tags, severely homogeneous retrieval results, and the difficulty of reconciling data privacy protection with precise matching of supply and demand. It also aims to protect the privacy of the original data during generation and retrieval, so that the generated data asset tags are both comprehensible and retrievable.
To achieve the above purpose, the invention adopts the following technical scheme: A method for generating and retrieving retrievable data asset tags based on a multimodal large model comprises the following steps: acquiring different types of data assets and preprocessing the data assets to obtain multimodal data; constructing a lightweight multimodal model, and training the lightweight multimodal model using a combined knowledge distillation and fine-tuning strategy; inputting the multimodal data into the trained model to obtain different key information; generating corresponding tags according to the different key information, and constructing a data asset association network according to the generated tags; and performing three-stage hybrid retrieval according to the data asset association network to obtain a final retrieval result. Preferably, the different types of data assets include text data, image data and image-text mixed data, and the preprocessing comprises: for the text data, cleaning and word segmentation operations that remove invalid information and redundant content; for the image data, size normalization and enhancement operations for improving the effectiv