CN-122019744-A - Service-metadata-oriented generative semantic fusion recommendation method

Abstract

The invention provides a service-metadata-oriented generative semantic fusion recommendation method and belongs to the field of recommendation systems. The method comprises: S1, processing an original data set containing metadata pairs and relevance labels to generate a high-order metadata training set containing semantically enhanced data and hard negative samples; S2, performing multi-task training and iterative optimization on a pre-trained embedding model based on the high-order metadata training set to obtain a domain-adaptive semantic understanding model; and S3, performing coarse-grained retrieval for the query metadata using the domain-adaptive semantic understanding model, then using a large language model with a chain-of-thought strategy to perform fine-grained screening and re-ranking of the retrieval results and output the final recommendation results. By combining multidimensional deep semantic field analysis with the large language model's deep understanding of technical terms and complex contexts, the method improves the contextual representation of metadata in vertical domains and improves the ability to distinguish metadata that is similar but irrelevant.

Inventors

SHEN WEI
LI LIN
LIU CHUANBO
LIU CHANGYE
LI TIANHUI
WEI MIN
XIE SONGSONG
ZHAO JUNPENG

Assignees

SAIC-GM-Wuling Automobile Co., Ltd. (上汽通用五菱汽车股份有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2025-12-17

Claims (10)

1. A service-metadata-oriented generative semantic fusion recommendation method, the method comprising: S1, processing an original data set containing metadata pairs and relevance labels to generate a high-order metadata training set containing semantically enhanced data and hard negative samples; S2, performing multi-task training and iterative optimization on a pre-trained embedding model based on the high-order metadata training set to obtain a domain-adaptive semantic understanding model; and S3, performing coarse-grained retrieval for the query metadata using the domain-adaptive semantic understanding model, performing fine-grained screening and re-ranking of the retrieval results using a large language model combined with a chain-of-thought strategy, and outputting a final recommendation result.
2. The service-metadata-oriented generative semantic fusion recommendation method according to claim 1, wherein S1 comprises: S101, inputting the original data set, dividing it into a training set and a validation set, and constructing a recall set consisting of all metadata; S102, performing domain knowledge analysis and semantic expansion on the metadata in the original data set using a large language model (LLM) to generate a semantically enhanced data set; and S103, converting the sample labels into positive and negative samples based on a dynamic threshold division strategy, mining hard negative samples from the recall set based on the pre-trained embedding model and a set similarity threshold interval, and combining these to generate the high-order metadata training set.
3. The service-metadata-oriented generative semantic fusion recommendation method according to claim 2, wherein the dynamic threshold division strategy in S103 comprises: taking the original set of relevance labels, in which adjacent label values are separated by a fixed interval k, the set S being expressed as: $S = \{\, s_i \mid 1 \le i \le n,\ n \ge 2,\ s_i - s_{i-1} = k,\ k \in C \,\}$, where $s_i$ is the i-th label value, k is the interval between adjacent label values, and n is the number of label values; generating candidate thresholds, each candidate threshold being a dynamic threshold expressed as: $t' = t_0 + k\,(i - 1)$, where $t'$ is a candidate threshold, $t_0$ is the initial threshold, $t_0 = \min(s_1, \ldots, s_n)$, and $1 \le i \le n$; dividing the training set using the different candidate thresholds, training a model, evaluating the performance of the model on the validation set, and selecting the candidate threshold with the best overall score as the final division threshold t; and labeling sample pairs as positive or negative samples according to the threshold t.
4. The service-metadata-oriented generative semantic fusion recommendation method according to claim 3, wherein mining the hard negative samples in S103 comprises: for each metadata item $x_i$ in the training set, calculating its cosine similarity to all metadata in the recall set, sorting the recall-set metadata in descending order of similarity, and selecting the metadata whose similarity value falls within a preset threshold interval to form the hard negative sample set for $x_i$.
5. The service-metadata-oriented generative semantic fusion recommendation method according to claim 4, wherein S2 comprises: S201, based on the semantically enhanced data set, performing full-parameter fine-tuning of the pre-trained embedding model on a multi-class classification task using the domain type labels of the metadata; S202, based on the high-order metadata training set, further training the fine-tuned embedding model using a contrastive learning loss function; and S203, selecting the optimal parameter combination on the validation set through adaptive iterative optimization, and iteratively executing the hard negative sample mining and embedding model training until a termination condition is met, to obtain the domain-adaptive semantic understanding model.
6. The service-metadata-oriented generative semantic fusion recommendation method according to claim 5, wherein the full-parameter fine-tuning in S201 comprises: based on the expanded data set D', adding a classification head and fine-tuning with the following multi-class cross-entropy loss function: $\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{C} y_i \log(p_i)$, where C is the total number of categories, $y_i$ is the one-hot encoding of the true label, and $p_i$ is the Softmax probability output by the model.
7. The service-metadata-oriented generative semantic fusion recommendation method according to claim 5 or 6, wherein the contrastive learning loss function in S202 is the InfoNCE loss function, expressed as: $\mathcal{L}_{\mathrm{InfoNCE}} = -\log \dfrac{\exp(\mathrm{sim}(h_i, h_i^{+})/\tau)}{\exp(\mathrm{sim}(h_i, h_i^{+})/\tau) + \sum_{j} \exp(\mathrm{sim}(h_i, h_j^{-})/\tau)}$, where $h_i$ is the embedding representation of the initial (anchor) metadata, $h_i^{+}$ is the embedding representation of the positive metadata, $h_j^{-}$ is the embedding representation of the negative metadata, $\mathrm{sim}(h_i, h_j)$ is the similarity between two embeddings, and $\tau$ is the temperature hyperparameter.
8. The service-metadata-oriented generative semantic fusion recommendation method according to claim 7, wherein the adaptive iterative optimization in S203 comprises: testing different threshold combinations on the validation set by grid search, evaluating macro and micro metrics, setting an early-stopping mechanism to terminate the iteration, and finally retraining the model on the full data set.
9. The service-metadata-oriented generative semantic fusion recommendation method according to claim 8, wherein S3 comprises: S301, using the domain-adaptive semantic understanding model to encode the query metadata into a vector representation, and retrieving from the recall set, through an approximate nearest neighbor search algorithm, a coarse-ranking candidate metadata set whose size is ten times the preset number of recommended metadata; S302, constructing a prompt containing the coarse-ranking candidate metadata information, domain knowledge, and specific task instructions, and inputting it to the large language model (LLM); and S303, using the LLM to perform deep semantic analysis and relevance judgment on the coarse-ranking candidate metadata, and outputting the final recommended metadata after screening and re-ranking.
10. The service-metadata-oriented generative semantic fusion recommendation method according to claim 9, wherein in S303 the final recommended metadata output by the LLM is in a structured format, and the final recommendation result is obtained by parsing the final recommended metadata with a regular expression.
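
The data-preparation and training steps of claims 3, 4 and 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`candidate_thresholds`, `mine_hard_negatives`, `info_nce`), the toy two-dimensional vectors, and the default temperature of 0.05 are assumptions made for illustration, and plain lists stand in for the output of a real embedding model. Vectors are assumed to be non-zero.

```python
import math


def candidate_thresholds(labels):
    """Claim 3: t' = t0 + k*(i-1) for i = 1..n, with t0 = min(labels)."""
    values = sorted(set(labels))
    if len(values) < 2:
        raise ValueError("need at least two distinct label values (n >= 2)")
    k = values[1] - values[0]
    # The claim assumes a fixed interval k between adjacent label values.
    if any(abs((b - a) - k) > 1e-9 for a, b in zip(values, values[1:])):
        raise ValueError("label values must be evenly spaced")
    return [values[0] + k * i for i in range(len(values))]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_negatives(anchor_vec, recall, low, high):
    """Claim 4: rank the recall set by similarity to the anchor (descending)
    and keep the items whose similarity lies inside [low, high]."""
    scored = sorted(
        ((cosine(anchor_vec, vec), name) for name, vec in recall.items()),
        reverse=True,
    )
    return [name for score, name in scored if low <= score <= high]


def info_nce(anchor, positive, negatives, tau=0.05):
    """Claim 7: InfoNCE loss for one anchor, one positive, several negatives.
    tau=0.05 is an illustrative default, not a value from the source."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    # Hypothetical recall set: names and vectors are made up for the demo.
    recall = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0], "d": [0.6, 0.8]}
    print(candidate_thresholds([0, 1, 2, 3]))
    print(mine_hard_negatives([1.0, 0.0], recall, low=0.5, high=0.9))
    print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

In this sketch each candidate threshold would be used in turn to split the labeled pairs, and the one with the best validation score kept. The similarity interval keeps items that are close to the anchor without being near-duplicates, which is the "similar but irrelevant" band the claims describe.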
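
The retrieval and parsing steps of claims 9 and 10 can be sketched in the same way. Two substitutions are made here and should be read as assumptions: an exhaustive cosine scan stands in for the approximate nearest neighbor index, and the reply format `N. id=<identifier>` is hypothetical, since the claims only state that the LLM output is structured. The names `coarse_retrieve` and `parse_llm_reply` are illustrative, and no LLM is actually called.

```python
import math
import re


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def coarse_retrieve(query_vec, recall, final_k):
    """S301: return 10 * final_k candidates ranked by similarity.
    Exhaustive scan used in place of an ANN index (assumption)."""
    ranked = sorted(
        recall, key=lambda name: cosine(query_vec, recall[name]), reverse=True
    )
    return ranked[: 10 * final_k]


# Hypothetical structured line format: "<rank>. id=<identifier>"
LINE_RE = re.compile(r"^\s*\d+\.\s*id=(\S+)\s*$", re.MULTILINE)


def parse_llm_reply(reply, candidates, final_k):
    """Claim 10: extract ranked ids with a regular expression, dropping ids
    that were not among the coarse candidates and any duplicates."""
    allowed = set(candidates)
    picked = []
    for item_id in LINE_RE.findall(reply):
        if item_id in allowed and item_id not in picked:
            picked.append(item_id)
    return picked[:final_k]


if __name__ == "__main__":
    # Toy recall set: similarity to the query falls as the index grows.
    recall = {f"svc{i}": [1.0, float(i)] for i in range(30)}
    candidates = coarse_retrieve([1.0, 0.0], recall, final_k=2)
    # Stand-in for the LLM reply produced in S303.
    reply = "Reasoning: ...\n1. id=svc3\n2. id=svc99\n3. id=svc1\n4. id=svc0\n"
    print(len(candidates), parse_llm_reply(reply, candidates, final_k=2))
```

Filtering the parsed ids against the coarse candidate set is a design choice of this sketch: it guards against identifiers the model may invent, at the cost of silently dropping them.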

Description

Service-metadata-oriented generative semantic fusion recommendation method

Technical Field

The invention relates to the field of recommendation systems, and in particular to a service-metadata-oriented generative semantic fusion recommendation method.

Background

Recommendation engines are widely used in the field of data services as a core technology for information filtering and content distribution. Their core task is to analyze the user's query input and to accurately identify and recommend highly relevant content from massive, heterogeneous service metadata. In recent years, with the progress of natural language processing, recommendation methods based on text semantic understanding have become a research hotspot. Compared with traditional methods based on collaborative filtering or rule matching, semantic recommendation can analyze the meaning of text in depth, can in theory better handle the cold-start problem for new items, and improves the interpretability and accuracy of recommendations.

However, when semantic technology is applied to service metadata recommendation, a core challenge is to accurately understand and quantify the deep semantic relatedness between metadata. Service metadata typically consists of structured or semi-structured text such as interface descriptions, functional specifications, and technical parameters, which is domain-specific, terminology-dense, and context-dependent. Traditional text similarity methods, such as TF-IDF based on word frequency statistics or BM25 based on a probabilistic model, are computationally efficient and simple to implement, but are essentially shallow matching based on the bag-of-words model.
Such methods lack the ability to understand the "semantic field" in which a word is located, and cannot effectively handle polysemy, synonymy, or the specific meaning of specialized terms in a particular context, so recommendation accuracy is significantly limited in complex business contexts.

To overcome the limitations of shallow semantic representation, deep-learning-based text vectorization models (e.g., Word2Vec, GloVe, and more advanced pre-trained language models such as BERT and RoBERTa) have been widely adopted. By pre-training on large-scale corpora, these models can generate context-related word vectors (embeddings) containing rich semantic information, which significantly improves their ability to capture general language semantics.

However, in service metadata recommendation for vertical domains, directly applying these general-purpose models still faces three prominent problems. First, general pre-training corpora can hardly cover the large number of specialized terms and abbreviations with special meanings across the many vertical domains (such as SOA service architecture, Internet of Vehicles protocols, bioinformatics, and financial management), so the model's interpretation deviates from the true intent of the metadata. Second, implicit knowledge such as industry background knowledge, business logic constraints, and compliance requirements is usually not explicitly encoded in the text and is difficult for a general-purpose model to acquire automatically, yet this knowledge is essential for judging the business relevance between metadata. Finally, in real-time recommendation over a massive service library, a balance must be struck between very high recall speed and fine-grained semantic ranking accuracy; complex deep models incur high computational overhead, making it difficult to directly meet the high-concurrency, low-latency requirements of online services.

In practical applications, the prior art has obvious shortcomings in several key areas. At the semantic understanding level, traditional statistical methods and general-purpose neural network models have insufficient capacity to model technical terms and complex business logic, so the semantic representations they generate discriminate poorly. At the model training level, high-quality negative samples are important for advanced training paradigms such as contrastive learning, yet existing methods mostly construct negative samples by random sampling or by strategies based on a fixed similarity threshold. Such strategies struggle to generate "hard negative samples" that are "similar but irrelevant" to the positive samples, so the model cannot learn subtle but critical semantic differences, which harms its final discriminative ability. At the retrieval and ranking level, mainstream methods generally convert the recommendation problem into a vector retrieval problem: the embedding of the query is obtained first, a candidate set is then recalled through approximate nearest neighbor search, and finally the ranking is carried out ac