CN-121997945-A - Semantic understanding model optimization method and system based on search click log
Abstract
The invention discloses a semantic understanding model optimization method and system based on search click logs, relating mainly to the technical field of natural language processing. The method comprises: obtaining a historical search click log set and extracting commodity information and query text to obtain a text pair set; traversing the text pair set and applying dropout enhancement to construct a positive sample pair set and a negative sample pair set; constructing an initial semantic understanding model; determining a base class text pair set and a long-tail refractory (hard-to-distinguish) text pair set; and introducing a staged fine-tuning mechanism that optimizes the initial semantic understanding model with the base class text pair set and the long-tail refractory text pair set to obtain the semantic understanding model. The invention addresses the technical problems in the prior art that semantic understanding depends on manual or weakly supervised labels, distinguishes low-frequency, complex, and long-tail intents poorly, and generalizes insufficiently, and it achieves the technical effect of improving the accuracy of intent understanding and semantic processing.
Inventors
- JIANG YUN
- Du Chuangbo
- QI HENGLIANG
- YIN DEHONG
Assignees
- 广州未来一手网络科技有限公司
- 广州未来一手网络运营有限公司
Dates
- Publication Date: 20260508
- Application Date: 20260128
Claims (10)
- 1. A semantic understanding model optimization method based on search click logs, characterized by comprising the following steps: acquiring a historical search click log set and extracting commodity information and query text from it to obtain a text pair set, wherein each text pair comprises a query text, a query commodity information list, and the commodity information actually clicked by the user; traversing the text pair set and applying dropout enhancement to construct a positive sample pair set and a negative sample pair set; performing unsupervised pre-training with GTE-Large on the positive sample pair set and the negative sample pair set, and performing vertical fine-tuning with the SimCSE loss, to construct an initial semantic understanding model; performing long-tail refractory identification on the text pair set to determine a base class text pair set and a long-tail refractory text pair set; and introducing a staged fine-tuning mechanism and optimizing the initial semantic understanding model with the base class text pair set and the long-tail refractory text pair set to obtain the semantic understanding model.
- 2. The semantic understanding model optimization method based on search click logs according to claim 1, wherein traversing the text pair set and applying dropout enhancement to construct a positive sample pair set and a negative sample pair set comprises: extracting a plurality of text pairs from the text pair set according to a preset batch size to construct a plurality of training batches; inputting each training batch into a semantic coding model, enabling a dropout (random inactivation) mechanism during coding, and performing the forward coding operation on the same text pair at least twice to obtain a plurality of semantic representation sets; defining the different semantic representations obtained from multiple dropout codings of the same text pair in the plurality of semantic representation sets as the positive sample pair set; and, within the same training batch, defining the semantic representations generated by different text pairs as the negative sample pair set.
- 3. The semantic understanding model optimization method based on search click logs according to claim 1, wherein performing unsupervised pre-training with GTE-Large on the positive sample pair set and the negative sample pair set and performing vertical fine-tuning with the SimCSE loss to construct an initial semantic understanding model comprises: semantically coding the texts in the positive sample pair set and the negative sample pair set with the GTE-Large pre-trained text coding model, and mapping each input text to a high-dimensional semantic vector representation, wherein the semantic coding process comprises pooling the hidden states output by the coding model and normalizing the resulting semantic vector; training the semantic vectors with the SimCSE loss, so that the semantic vectors of a positive sample pair remain close in the semantic space and the semantic vectors of a negative sample pair remain separated in the semantic space; and updating the parameters of the GTE-Large pre-trained text coding model through back propagation to construct the initial semantic understanding model.
- 4. The semantic understanding model optimization method based on search click logs according to claim 1, wherein performing long-tail refractory identification on the text pair set to determine the base class text pair set and the long-tail refractory text pair set comprises: based on the text pair set, performing two-dimensional long-tail weighted identification on the query text and on the commodity information actually clicked by the user to determine a first-class long-tail refractory text pair set; based on the text pair set, performing refractory analysis on the query commodity information list to determine a second-class long-tail refractory text pair set; taking the union of the first-class long-tail refractory text pair set and the second-class long-tail refractory text pair set to obtain the long-tail refractory text pair set; and removing the long-tail refractory text pair set from the text pair set and adding the remaining text pairs to the base class text pair set.
- 5. The semantic understanding model optimization method based on search click logs according to claim 4, wherein performing refractory analysis on the query commodity information list based on the text pair set to determine the second-class long-tail refractory text pair set comprises: traversing each text pair in the text pair set, computing the semantic similarity of the commodity information within its query commodity information list, and averaging the results to determine a text pair semantic similarity mean set; traversing the text pair semantic similarity mean set and performing mean shift screening to determine a screened text semantic similarity mean; comparing the screened text semantic similarity mean with a preset semantic similarity mean threshold to determine a refractory screening standard; and adding to the second-class long-tail refractory text pair set those text pairs in the text pair semantic similarity mean set whose semantic similarity mean is greater than or equal to the refractory screening standard.
- 6. The semantic understanding model optimization method based on search click logs according to claim 5, wherein comparing the screened text semantic similarity mean with the preset semantic similarity mean threshold to determine the refractory screening standard comprises: judging whether the screened text semantic similarity mean is smaller than the preset semantic similarity mean threshold; if so, taking the screened text semantic similarity mean as the refractory screening standard; and if not, taking the preset semantic similarity mean threshold as the refractory screening standard.
- 7. The semantic understanding model optimization method based on search click logs according to claim 1, wherein introducing a staged fine-tuning mechanism and optimizing the initial semantic understanding model with the base class text pair set and the long-tail refractory text pair set to obtain the semantic understanding model comprises: introducing a prompt vector mechanism into the semantic coding module of the initial semantic understanding model, and constructing a base class prompt vector and a long-tail refractory prompt vector that are independent of each other; performing base class prompt fine-tuning on the initial semantic understanding model with the base class text pair set, freezing the long-tail refractory prompt vector during fine-tuning and updating only the base class prompt vector and the model parameters involved in the semantic coding process of the base class prompt vector, to obtain an updated base class prompt vector and base class fine-tuned semantic understanding model parameters; constructing a training triplet set from the long-tail refractory text pair set and performing long-tail triplet fine-tuning on the initial semantic understanding model according to the training triplet set, freezing the base class prompt vector during fine-tuning and updating only the long-tail refractory prompt vector and the model parameters involved in the semantic coding process of the long-tail refractory prompt vector, to obtain an updated long-tail refractory prompt vector and long-tail triplet fine-tuned semantic understanding model parameters; and, after the base class prompt fine-tuning and the long-tail triplet fine-tuning are completed, optimizing the initial semantic understanding model according to the updated base class prompt vector and the base class fine-tuned semantic understanding model parameters, together with the updated long-tail refractory prompt vector and the long-tail triplet fine-tuned semantic understanding model parameters, to obtain an optimized semantic understanding model.
- 8. The semantic understanding model optimization method based on search click logs according to claim 7, wherein, after the base class prompt fine-tuning and the long-tail triplet fine-tuning are completed, optimizing the initial semantic understanding model according to the updated base class prompt vector and the base class fine-tuned semantic understanding model parameters, together with the updated long-tail refractory prompt vector and the long-tail triplet fine-tuned semantic understanding model parameters, to obtain an optimized semantic understanding model comprises: taking the updated base class prompt vector and the base class fine-tuned semantic understanding model parameters as a base class parameter state; taking the updated long-tail refractory prompt vector and the long-tail triplet fine-tuned semantic understanding model parameters as a long-tail parameter state; and loading the base class parameter state and the long-tail parameter state together into the initial semantic understanding model so that both take effect, to obtain the optimized semantic understanding model.
- 9. The semantic understanding model optimization method based on search click logs according to claim 8, wherein, for each long-tail refractory text pair in the long-tail refractory text pair set, the query text is used as the anchor sample, the commodity information actually clicked by the user is used as the positive sample, and the commodity information in the query commodity information list other than the commodity information actually clicked by the user is used as the negative sample.
- 10. A semantic understanding model optimization system based on search click logs, characterized in that the system is configured to implement the semantic understanding model optimization method based on search click logs according to any one of claims 1 to 9, the system comprising: a text pair set acquisition module, used for acquiring a historical search click log set and extracting commodity information and query text to obtain a text pair set, wherein each text pair comprises a query text, a query commodity information list, and the commodity information actually clicked by the user; a text pair enhancement module, used for traversing the text pair set and applying dropout enhancement to construct a positive sample pair set and a negative sample pair set; a model initialization module, used for performing unsupervised pre-training with GTE-Large on the positive sample pair set and the negative sample pair set, and performing vertical fine-tuning with the SimCSE loss, to construct an initial semantic understanding model; a long-tail refractory identification module, used for performing long-tail refractory identification on the text pair set to determine a base class text pair set and a long-tail refractory text pair set; and a model optimization module, used for introducing a staged fine-tuning mechanism and optimizing the initial semantic understanding model with the base class text pair set and the long-tail refractory text pair set to obtain the semantic understanding model.
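As a non-limiting illustration of the screening recited in claims 5 and 6, the following Python sketch computes a similarity mean for each text pair and applies the refractory screening standard. It assumes that the commodity information has already been encoded as vectors, that the similarity measure is mean pairwise cosine similarity, and that the mean-shift-screened value is supplied by the caller rather than computed here. The function names, the toy vectors, and the values 0.6 and 0.8 are hypothetical and do not come from the claims.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def list_similarity_mean(item_vectors):
    """Mean pairwise cosine similarity of the items in one query commodity list.

    Assumption: "semantic similarity within the list" is read as pairwise cosine.
    """
    pairs = list(combinations(item_vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def refractory_standard(screened_mean, preset_threshold):
    """Claim 6: use the screened mean if it is below the preset threshold, else the threshold."""
    return screened_mean if screened_mean < preset_threshold else preset_threshold


def second_class_pairs(means_by_pair, standard):
    """Claim 5, final step: keep text pairs whose similarity mean is >= the standard."""
    return [pair_id for pair_id, mean in means_by_pair.items() if mean >= standard]


# Toy data: "q1" returns near-duplicate items, "q2" returns unrelated items.
lists = {
    "q1": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    "q2": [[1.0, 0.0], [0.0, 1.0]],
}
means = {query_id: list_similarity_mean(vectors) for query_id, vectors in lists.items()}

# The screened mean (0.6) stands in for the output of mean shift screening.
standard = refractory_standard(screened_mean=0.6, preset_threshold=0.8)
hard_pairs = second_class_pairs(means, standard)
```

Under this reading, the standard is simply the smaller of the screened mean and the preset threshold, so the preset threshold acts as an upper bound on how strict the screening can become.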
Description
Semantic understanding model optimization method and system based on search click log
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a semantic understanding model optimization method and system based on search click logs.
Background
In modern search engine systems, understanding the user's query intent and accurately matching relevant content are key to improving retrieval efficiency and user experience. Existing search semantic understanding methods generally depend on supervised training with large-scale labeled data, or adopt unsupervised representation learning such as SimCSE and Sentence-BERT. Such methods are based on context sentence augmentation and perform contrastive learning by constructing positive and negative sample pairs, thereby improving semantic representation capability. However, existing methods are limited by the scarcity of publicly labeled data and by the long-tail distribution of online log data: they perform well on mainstream high-frequency semantic scenarios but generalize poorly to low-frequency and long-tail scenarios, which readily causes problems such as missed intents and confusion between similar words in search recall and ranking. In the prior art, semantic understanding relies on manual or weakly supervised labels, distinguishes low-frequency, complex, and long-tail intents poorly, and generalizes insufficiently.
Disclosure of Invention
The application provides a semantic understanding model optimization method and system based on search click logs, which are used to solve the technical problems in the prior art that semantic understanding depends on manual or weakly supervised labels, distinguishes low-frequency, complex, and long-tail intents poorly, and generalizes insufficiently.
In a first aspect of the present application, there is provided a semantic understanding model optimization method based on search click logs, the method comprising: obtaining a historical search click log set and extracting commodity information and query text to obtain a text pair set, wherein each text pair comprises a query text, a query commodity information list, and the commodity information actually clicked by the user; traversing the text pair set and applying dropout enhancement to construct a positive sample pair set and a negative sample pair set; performing unsupervised pre-training with GTE-Large on the positive sample pair set and the negative sample pair set, and performing vertical fine-tuning with the SimCSE loss, to construct an initial semantic understanding model; performing long-tail refractory identification on the text pair set to determine a base class text pair set and a long-tail refractory text pair set; and introducing a staged fine-tuning mechanism and optimizing the initial semantic understanding model with the base class text pair set and the long-tail refractory text pair set to obtain the semantic understanding model.
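By way of illustration only, the dropout-based sample construction and the contrastive objective used for model initialization might be sketched as follows. The sketch assumes an InfoNCE-style in-batch loss over cosine similarities, which is the form commonly associated with SimCSE; the application text itself does not give a formula. No encoder is run here: the two dropout encodings of each text are passed in as precomputed vectors, and the function name `simcse_style_loss`, the `temperature` parameter, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def simcse_style_loss(view_a, view_b, temperature=0.05):
    """In-batch contrastive loss over two encodings of the same batch.

    view_a[i] and view_b[i] are two dropout encodings of text i (a positive pair);
    view_b[j] for j != i serves as an in-batch negative for view_a[i].
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching index i as the target.
        total += log_denominator - logits[i]
    return total / n


# Toy batch of two texts, each encoded twice.
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_view_b = [[0.9, 0.1], [0.1, 0.9]]   # second pass stays close to the first
swapped_view_b = [[0.1, 0.9], [0.9, 0.1]]   # positives and negatives exchanged

loss_aligned = simcse_style_loss(view_a, aligned_view_b)
loss_swapped = simcse_style_loss(view_a, swapped_view_b)
```

The loss is small when each text's two encodings are closer to each other than to the other texts in the batch, and large when they are not, which is the behavior the positive and negative sample pair sets are meant to induce.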
In one embodiment, traversing the text pair set and applying dropout enhancement to construct a positive sample pair set and a negative sample pair set comprises: extracting a plurality of text pairs from the text pair set according to a preset batch size to construct a plurality of training batches; inputting each training batch into a semantic coding model, enabling a dropout (random inactivation) mechanism during coding, and performing the forward coding operation on the same text pair at least twice to obtain a plurality of semantic representation sets; defining the different semantic representations obtained from multiple dropout codings of the same text pair in the plurality of semantic representation sets as the positive sample pair set; and, within the same training batch, defining the semantic representations generated by different text pairs as the negative sample pair set.
In one embodiment, performing unsupervised pre-training with GTE-Large on the positive sample pair set and the negative sample pair set and performing vertical fine-tuning with the SimCSE loss to construct an initial semantic understanding model comprises: semantically coding the texts in the positive sample pair set and the negative sample pair set with the GTE-Large pre-trained text coding model, and mapping each input text to a high-dimensional semantic vector representation, wherein the semantic coding process comprises the steps of pooling hidden states ou