CN-121981107-A - Source document evaluation processing method and device for search generation
Abstract
The application relates to the technical field of artificial intelligence and discloses a source document evaluation processing method and device for search generation. The method comprises: analyzing a source document to obtain text information, structured information and format information of the source document; performing multidimensional evaluation processing based on the text information, the structured information and the format information to obtain at least two of a structural integrity evaluation result, a content definition evaluation result, a context independence evaluation result and a format normalization evaluation result of the source document; and weighting at least two of the structural integrity evaluation result, the content definition evaluation result, the context independence evaluation result and the format normalization evaluation result to obtain a comprehensive evaluation result of the source document. With this method, a Word document can be analyzed automatically and evaluated quantitatively along the dimensions of structure, content and format, which helps improve document quality.
Inventors
- YANG HAI
Assignees
- 北京千丁智能技术有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20251224
Claims (10)
- 1. A source document evaluation processing method for search generation, the method comprising: analyzing the source document to obtain text information, structured information and format information of the source document; based on the text information, the structured information and the format information, performing multidimensional evaluation processing to obtain at least two of a structural integrity evaluation result, a content definition evaluation result, a context independence evaluation result and a format normalization evaluation result of the source document; and weighting at least two of the structural integrity evaluation result, the content definition evaluation result, the context independence evaluation result and the format normalization evaluation result to obtain a comprehensive evaluation result of the source document.
- 2. The method of claim 1, wherein performing a multi-dimensional evaluation process based on at least one of the text information, the structured information, and the format information to obtain the structural integrity evaluation result comprises: determining a title style of the source document based on at least the structured information to obtain a title normalization evaluation result; determining list usage in the source document based on at least the structured information to obtain a list use evaluation result; determining the paragraph length in the source document based on at least the text information to obtain a paragraph length evaluation result; and obtaining the structural integrity evaluation result based on the title normalization evaluation result, the list use evaluation result and the paragraph length evaluation result.
- 3. The method of claim 1, wherein performing a multi-dimensional evaluation process based on at least one of the text information, the structured information, and the format information to obtain the content definition evaluation result comprises: performing natural language processing based on at least the text information to identify the use frequency of vague vocabulary, and obtaining a language definition evaluation result; performing technical term identification based on at least the text information to obtain a technical term evaluation result; performing typo and grammar recognition based on at least the text information to obtain a typo and grammar evaluation result; and obtaining the content definition evaluation result based on the language definition evaluation result, the technical term evaluation result and the typo and grammar evaluation result.
- 4. The method of claim 1, wherein performing a multi-dimensional evaluation process based on at least one of the text information, the structured information, and the format information to obtain the context independence evaluation result comprises: identifying ambiguous pronouns based on at least the text information to obtain a reference definition evaluation result; determining the topic of the content corresponding to a title based on at least the text information and the structured information to obtain an information atomization evaluation result; and obtaining the context independence evaluation result based on the reference definition evaluation result and the information atomization evaluation result.
- 5. The method of claim 1, wherein performing a multi-dimensional evaluation process based on at least one of the text information, the structured information, and the format information to obtain the format normalization evaluation result comprises: determining proportion information between the text information and the non-text information based on at least the text information and the format information to obtain a text extraction rate evaluation result; determining whether a picture in the source document has a corresponding text description based on at least the text information and the format information to obtain an image-text separation evaluation result; and obtaining the format normalization evaluation result based on the text extraction rate evaluation result and the image-text separation evaluation result.
- 6. The method according to any one of claims 1-5, further comprising: determining first content to be optimized in the source document based on the comprehensive evaluation result, and generating a document optimization suggestion for the first content to be optimized; and generating a document evaluation report based on the comprehensive evaluation result and the document optimization suggestion, wherein the document evaluation report comprises the comprehensive evaluation result and document rating information, a multidimensional radar chart of the source document, the document optimization suggestion and position prompt information of the first content to be optimized in the source document.
- 7. The method of claim 6, wherein the method further comprises: determining second content to be optimized in the source document based on the comprehensive evaluation result; outputting optimization indication information based on the second content to be optimized; and in response to receiving an optimization instruction for the optimization indication information, performing optimization processing on the source document and generating an optimized document.
- 8. The method of claim 7, wherein the document evaluation report includes a search generation simulator, the method further comprising: in response to receiving a question input by a user, searching the source document based on the question by the search generation simulator to obtain a first answer, and searching the optimized document based on the question by the search generation simulator to obtain a second answer; and generating answer comparison information based on the first answer and the second answer.
- 9. The method according to any one of claims 1-5, 7-8, wherein the method further comprises: integrating the source document evaluation processing method into a document editor in the form of a plug-in, so that the source document can be evaluated while it is being written in the document editor.
- 10. A source document evaluation processing apparatus for search generation, the apparatus comprising: an analysis module, configured to analyze the source document to obtain text information, structured information and format information of the source document; an evaluation module, configured to perform multidimensional evaluation processing based on the text information, the structured information and the format information to obtain at least two of a structural integrity evaluation result, a content definition evaluation result, a context independence evaluation result and a format normalization evaluation result of the source document; and a processing module, configured to weight at least two of the structural integrity evaluation result, the content definition evaluation result, the context independence evaluation result and the format normalization evaluation result to obtain the comprehensive evaluation result of the source document.
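The weighting step recited in claims 1 and 10 can be illustrated with a minimal Python sketch. The claims do not fix any weights, score ranges or names, so the dimension keys, the default weights, the 0-100 scale and the renormalization over the supplied dimensions are all assumptions made here for illustration only.

```python
from typing import Dict, Optional

# Hypothetical weights; the claims do not specify any values.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "structural_integrity": 0.3,
    "content_definition": 0.3,
    "context_independence": 0.2,
    "format_normalization": 0.2,
}


def comprehensive_score(scores: Dict[str, float],
                        weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of the supplied dimension scores (assumed 0-100).

    Only the dimensions present in `scores` take part, and their weights
    are renormalized so that they sum to 1 (an assumption of this sketch).
    """
    weights = DEFAULT_WEIGHTS if weights is None else weights
    unknown = set(scores) - set(weights)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    if len(scores) < 2:
        # The claims require at least two evaluation results.
        raise ValueError("at least two dimension scores are required")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights of the supplied dimensions must be positive")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight
```

With these assumed defaults, a document scoring 80 for structural integrity and 60 for content definition, with no other dimensions evaluated, would receive a comprehensive score of about 70, because the two equal weights are renormalized to one half each.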
Description
Source document evaluation processing method and device for search generation
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a source document evaluation processing method and device for search generation.
Background
With the spread of large language models (LLMs), enterprise-level intelligent question answering and knowledge assistants based on RAG (retrieval-augmented generation) technology have become a core application scenario. RAG improves the accuracy and reliability of LLM answers by retrieving relevant information from an enterprise's internal knowledge base. Word documents are among the most common knowledge carriers in enterprises and form the core foundation of a RAG knowledge base, so their quality directly affects the effectiveness of a RAG system. At present, when an enterprise uses an AI question-answering system based on RAG technology, the quality of the Word documents in the underlying knowledge base is uneven, which leads to poor knowledge retrieval, and effective tools for diagnosing and optimizing documents are lacking.
Disclosure of Invention
Embodiments of the present application aim to solve, at least to some extent, one of the technical problems in the related art. To this end, an embodiment of the application proposes a source document evaluation processing method and apparatus for search generation.
An embodiment of the application provides a source document evaluation processing method for search generation. The method comprises: analyzing a source document to obtain text information, structured information and format information of the source document; performing multidimensional evaluation processing based on the text information, the structured information and the format information to obtain at least two of a structural integrity evaluation result, a content definition evaluation result, a context independence evaluation result and a format normalization evaluation result of the source document; and weighting at least two of the structural integrity evaluation result, the content definition evaluation result, the context independence evaluation result and the format normalization evaluation result to obtain a comprehensive evaluation result of the source document.
In some embodiments, performing multi-dimensional evaluation processing based on at least one of the text information, the structured information and the format information to obtain the structural integrity evaluation result comprises: determining a title style of the source document based on at least the structured information to obtain a title normalization evaluation result; determining list usage in the source document based on at least the structured information to obtain a list use evaluation result; determining a paragraph length in the source document based on at least the text information to obtain a paragraph length evaluation result; and obtaining the structural integrity evaluation result based on the title normalization evaluation result, the list use evaluation result and the paragraph length evaluation result.
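The three structural sub-evaluations described above (title normalization, list use and paragraph length) can be sketched in Python as follows. The application does not disclose concrete rules or thresholds, so everything here is an assumption for illustration: the `Paragraph` class is a hypothetical stand-in for the parser's output, the style names follow Word's built-in English styles, the heading rule (no skipped levels), the manual-list pattern, the 500-character limit and the equal-weight average are all illustrative choices.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Paragraph:
    """Hypothetical parsed paragraph: a Word style name plus its text."""
    style: str  # e.g. "Heading 1", "List Bullet", "Normal"
    text: str


# Typed list markers such as "1." or "-" at the start of a paragraph.
MANUAL_LIST = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def heading_score(paragraphs: List[Paragraph]) -> float:
    """100 when heading levels never skip (e.g. 1 -> 3); 0 without headings."""
    levels = [int(p.style.split()[-1]) for p in paragraphs
              if re.fullmatch(r"Heading \d", p.style)]
    if not levels:
        return 0.0
    skips = sum(1 for prev, cur in zip(levels, levels[1:]) if cur > prev + 1)
    return 100.0 * (1 - skips / len(levels))


def list_score(paragraphs: List[Paragraph]) -> float:
    """Share of list-like paragraphs that use a list style, not typed markers."""
    styled = sum(1 for p in paragraphs if p.style.startswith("List"))
    manual = sum(1 for p in paragraphs
                 if not p.style.startswith("List") and MANUAL_LIST.match(p.text))
    total = styled + manual
    return 100.0 if total == 0 else 100.0 * styled / total


def paragraph_length_score(paragraphs: List[Paragraph],
                           max_chars: int = 500) -> float:
    """Share of non-empty body paragraphs no longer than `max_chars`."""
    body = [p for p in paragraphs if p.style == "Normal" and p.text.strip()]
    if not body:
        return 100.0
    within_limit = sum(1 for p in body if len(p.text) <= max_chars)
    return 100.0 * within_limit / len(body)


def structural_integrity(paragraphs: List[Paragraph]) -> float:
    """Equal-weight mean of the three sub-scores (an assumed combination)."""
    parts = (heading_score(paragraphs),
             list_score(paragraphs),
             paragraph_length_score(paragraphs))
    return sum(parts) / len(parts)
```

Each sub-score is kept as a separate function so that a report could point to the specific weakness (skipped heading levels, hand-typed lists or overlong paragraphs) rather than only the combined value.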
In some embodiments, performing multi-dimensional evaluation processing based on at least one of the text information, the structured information and the format information to obtain the content definition evaluation result comprises: performing natural language processing based on at least the text information to identify the use frequency of vague vocabulary and obtain a language definition evaluation result; performing technical term recognition based on at least the text information to obtain a technical term evaluation result; performing typo and grammar recognition based on at least the text information to obtain a typo and grammar evaluation result; and obtaining the content definition evaluation result based on the language definition evaluation result, the technical term evaluation result and the typo and grammar evaluation result.
In some embodiments, performing multi-dimensional evaluation processing based on at least one of the text information, the structured information and the format information to obtain the context independence evaluation result comprises: identifying ambiguous pronouns based on at least the text information to obtain a reference definition evaluation result; determining the topic of the content corresponding to a title based on at least the text information and the structured information to obtain an information atomization evaluation result; and obtaining the context independence evaluation result based on the reference definition evaluation result and the information atomization evaluation result.
In some embodi