CN-122020146-A - Label generation method and device

Abstract

The application provides a label generation method comprising: obtaining first content and supplementary data; determining, from the supplementary data, second content related to the first content; determining, from a label set, first labels related to the first content and the second content; and inputting the first content, the second content and the first labels into a language model to obtain second labels. The method uses the language model to screen labels related to the content from the label set while balancing label recall and accuracy.

Inventors

ZHU CHENXU
CHEN BO
YANG YANG
GUO HUIFENG
TANG RUIMING
ZHANG WEINAN

Assignees

Huawei Technologies Co., Ltd. (华为技术有限公司)

Dates

Publication Date: 20260512
Application Date: 20241111

Claims (20)

1. A tag generation method, comprising: obtaining first content and supplementary data, wherein the supplementary data is used to supplement the first content with knowledge; determining, from the supplementary data, second content related to the first content; determining, from a tag set, a first tag related to the first content and the second content; and inputting the first content, the second content and the first tag into a language model to obtain a second tag, wherein the second tag is used as the tag of the first content.
2. The method of claim 1, wherein determining the first tag related to the first content and the second content from the tag set comprises: determining, from the tag set, a third tag related to the first content and a fourth tag related to the second content; and determining the first tag according to the third tag and the fourth tag.
3. The method of claim 2, wherein determining the third tag related to the first content from the tag set comprises: retrieving the third tag related to the first content from the tag set according to the type of the first content.
4. The method of claim 2, wherein determining the fourth tag related to the second content from the tag set comprises: retrieving the fourth tag related to the second content from the tag set according to the type of the second content.
5. The method of any one of claims 2-4, wherein determining the first tag according to the third tag and the fourth tag comprises: combining the third tag and the fourth tag to generate the first tag.
6. The method of any one of claims 1-5, wherein inputting the first content, the second content and the first tag into the language model to obtain the second tag comprises: acquiring a prompt template; filling the first content, the second content and the first tag into the prompt template to obtain a corresponding prompt; and inputting the prompt into the language model to obtain the second tag.
7. The method of claim 6, wherein inputting the prompt into the language model to obtain the second tag comprises: inputting the prompt into the language model, and outputting a plurality of fifth tags and a first confidence corresponding to each fifth tag, wherein the fifth tags belong to the first tags; and filtering out each fifth tag whose first confidence is less than a first threshold to obtain the second tag.
8. The method of claim 6, wherein inputting the prompt into the language model to obtain the second tag comprises: inputting the first content, the second content and the first tag into the language model, and outputting a plurality of fifth tags and a first confidence corresponding to each fifth tag, and a plurality of sixth tags and a second confidence corresponding to each sixth tag; and filtering out each fifth tag whose first confidence is less than a first threshold and each sixth tag whose second confidence is less than a second threshold to obtain the second tag.
9. The method of claim 6, wherein inputting the prompt into the language model to obtain the second tag comprises: inputting the first content, the second content and the first tag into the language model, and outputting a plurality of sixth tags and a second confidence corresponding to each sixth tag; and filtering out each sixth tag whose second confidence is less than a second threshold to obtain the second tag.
10. The method of claim 8 or 9, further comprising, after outputting the sixth tags: updating the tag set with the sixth tags.
11. The method of any one of claims 1-10, wherein the first content is one or more of text, pictures, voice and video.
12. A tag generation apparatus, comprising: an acquisition module for acquiring first content and supplementary data; a second content determining module for determining, from the supplementary data, second content related to the first content; a first tag determination module for determining, from a tag set, a first tag related to the first content and the second content; and a second tag generating module for inputting the first content, the second content and the first tag into a language model to obtain a second tag, wherein the second tag is used as the tag of the first content.
13. The apparatus of claim 12, wherein the first tag determination module comprises: a first determining module configured to determine, from the tag set, a third tag related to the first content and a fourth tag related to the second content; and a second determining module configured to determine the first tag according to the third tag and the fourth tag.
14. The apparatus of claim 13, wherein determining the third tag related to the first content from the tag set comprises: retrieving the third tag related to the first content from the tag set according to the type of the first content.
15. The apparatus of claim 13, wherein determining the fourth tag related to the second content from the tag set comprises: retrieving the fourth tag related to the second content from the tag set according to the type of the second content.
16. The apparatus of any one of claims 13-15, wherein determining the first tag according to the third tag and the fourth tag comprises: combining the third tag and the fourth tag to generate the first tag.
17. The apparatus of any one of claims 12-16, wherein inputting the first content, the second content and the first tag into the language model to obtain the second tag comprises: acquiring a prompt template; filling the first content, the second content and the first tag into the prompt template to obtain a corresponding prompt; and inputting the prompt into the language model to obtain the second tag.
18. The apparatus of claim 17, wherein inputting the prompt into the language model to obtain the second tag comprises: inputting the prompt into the language model, and outputting a plurality of fifth tags and a first confidence corresponding to each fifth tag, wherein the fifth tags belong to the first tags; and filtering out each fifth tag whose first confidence is less than a first threshold to obtain the second tag.
19. The apparatus of claim 17, wherein inputting the prompt into the language model to obtain the second tag comprises: inputting the first content, the second content and the first tag into the language model, and outputting a plurality of fifth tags and a first confidence corresponding to each fifth tag, and a plurality of sixth tags and a second confidence corresponding to each sixth tag; and filtering out each fifth tag whose first confidence is less than a first threshold and each sixth tag whose second confidence is less than a second threshold to obtain the second tag.
20. The apparatus of claim 17, wherein inputting the prompt into the language model to obtain the second tag comprises: inputting the first content, the second content and the first tag into the language model, and outputting a plurality of sixth tags and a second confidence corresponding to each sixth tag; and filtering out each sixth tag whose second confidence is less than a second threshold to obtain the second tag.
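
The prompt-filling and confidence-filtering steps recited in claims 6 to 9 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the template wording, the `ScoredTag` type, and the function names `build_prompt` and `select_second_tags` do not come from the application, and the call to the language model itself is left out.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical template wording; the claims do not specify the template text.
PROMPT_TEMPLATE = (
    "Content: {first_content}\n"
    "Related knowledge: {second_content}\n"
    "Candidate tags: {first_tags}\n"
    "Select the tags that apply and give a confidence for each."
)


@dataclass(frozen=True)
class ScoredTag:
    """A tag returned by the language model with its confidence (assumed 0..1)."""
    tag: str
    confidence: float


def build_prompt(first_content: str, second_content: str, first_tags: List[str]) -> str:
    """Fill the first content, second content and first tags into the template."""
    return PROMPT_TEMPLATE.format(
        first_content=first_content,
        second_content=second_content,
        first_tags=", ".join(first_tags),
    )


def select_second_tags(
    fifth_tags: List[ScoredTag],
    sixth_tags: List[ScoredTag],
    first_threshold: float,
    second_threshold: float,
) -> List[str]:
    """Drop tags whose confidence is below the threshold for their group.

    fifth_tags are model outputs drawn from the candidate (first) tags;
    sixth_tags are the other model outputs recited in claims 8 and 9.
    Passing an empty list for one group gives the claim 7 or claim 9 variant.
    """
    kept = [t.tag for t in fifth_tags if t.confidence >= first_threshold]
    kept += [t.tag for t in sixth_tags if t.confidence >= second_threshold]
    return kept
```

Keeping a separate threshold per group follows claim 8, where the fifth and sixth tags are filtered against a first and a second threshold respectively.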

Description

Label generation method and device

Technical Field

The application relates to the field of artificial intelligence, and in particular to a tag generation method and device.

Background

Tags are a common technique in search engines and content recommendation systems. They help the search engine or recommendation system quickly and accurately understand a user's interests, search intent, product characteristics and the like, which simplifies the search or recommendation process. Referring to FIG. 1, a tag system in the prior art uses a tag selector to screen, from a tag set, the tags that correspond to news or video information, and applies those tags to downstream tasks such as search or recommendation. Referring to FIG. 2, in a search scenario for game applications, the tag system generates a corresponding tag "owner" for a newly launched game application, and a user can quickly and accurately find that game application through the keyword "owner".

At present, the tag selector can be implemented by manual search or by a small language model. Manual search results depend on manually entered keywords, so they are relatively subjective and costly. A small language model has limited understanding capability, so its results lack reliability, and it also requires a batch of annotated tag data for training, which is costly. With the rise of language models, how to use a language model to screen suitable tags from a large number of tags while balancing tag recall and accuracy is a problem to be solved.

Disclosure of Invention

The application provides a tag generation method and a tag generation device, which can use a language model to screen tags related to content from a tag set while balancing recall and accuracy.
In a first aspect, the application provides a tag generation method comprising: obtaining first content and supplementary data; determining, from the supplementary data, second content related to the first content; determining, from a tag set, first tags related to the first content and the second content; and inputting the first content, the second content and the first tags into a language model to obtain second tags, which are used as the tags of the first content. In this method, the first tags serve as the input tags of the language model. Introducing the supplementary data ensures the recall of the first tags input to the language model, and determining from the tag set the first tags related to the first content and the second content ensures their accuracy. Together these improve the recall and accuracy of the second tags that the language model generates from the input tags. Further, injecting the second content, which is determined from the supplementary data and related to the first content, alleviates hallucination in the language model and improves the accuracy of the generated second tags. The tag generation method provided by the application can therefore use the language model to screen tags related to the content from the tag set while balancing tag recall and accuracy.
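
As a rough illustration of the four steps of the first aspect, the following Python sketch wires them together. It is a sketch under stated assumptions, not the application's implementation: relevance is approximated by a toy shared-word test, the candidate tags are formed by merging the tags related to the first content with those related to the second content (the option recited in claim 2), the language model is passed in as a plain callable, and the names `related`, `generate_tags` and the sample data are all hypothetical.

```python
from typing import Callable, Iterable, List, Set


def tokens(text: str) -> Set[str]:
    """Lower-cased whitespace tokens; a toy stand-in for real text analysis."""
    return set(text.lower().split())


def related(query: str, candidates: Iterable[str]) -> List[str]:
    """Keep candidates that share at least one word with the query (assumed relevance test)."""
    query_tokens = tokens(query)
    return [c for c in candidates if query_tokens & tokens(c)]


def generate_tags(
    first_content: str,
    supplementary_data: List[str],
    tag_set: List[str],
    language_model: Callable[[str, List[str], List[str]], List[str]],
) -> List[str]:
    # Step 2: second content = supplementary entries related to the first content.
    second_content = related(first_content, supplementary_data)
    # Step 3: candidate (first) tags = tags related to the first content
    # merged with tags related to the second content, without duplicates.
    third_tags = related(first_content, tag_set)
    fourth_tags = related(" ".join(second_content), tag_set)
    first_tags = list(dict.fromkeys(third_tags + fourth_tags))
    # Step 4: the language model picks the final (second) tags.
    return language_model(first_content, second_content, first_tags)


# Usage with a stub model that simply accepts every candidate tag.
result = generate_tags(
    "new card game released",
    ["the card game uses poker rules", "weather report for monday"],
    ["poker", "racing", "card"],
    lambda content, extra, candidates: candidates,
)
print(result)  # ['card', 'poker']
```

In this toy run the tag "poker" shares no word with the first content and becomes a candidate only through the supplementary entry, which is the recall effect the paragraph above attributes to the supplementary data.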
The first tags provide candidate tags for the second tags. The second tags are the final result and can be used in downstream tasks such as a search engine or content recommendation, to mark or update the first content in those tasks. The supplementary data is one or more of external knowledge, manual rules and historical samples. External knowledge is information from outside the tag set, including tags, website resources and the like; a manual rule is a manually specified tag generation rule; and a historical sample consists of first content that was input in the past and the second tags that were output for it. In the application, the supplementary data is used to retrieve the first tags related to the first content from the tag set, and the first tags then guide the language model to generate the second tags related to the first content, which improves the recall and accuracy of the generated second tags. As one possible implementation, determining the first tags related to the first content and the second content from the tag set specifically comprises determining a third tag related to the first content and a fourth lab