
CN-121997894-A - Text processing method, apparatus, device, storage medium and program product


Abstract

Embodiments of the present application disclose a text processing method, apparatus, device, storage medium, and program product that can complete text format conversion efficiently and accurately, make text information easier to extract, and improve text accessibility. The method comprises: obtaining a text to be processed whose format type is a first format, together with a format mapping rule, wherein the format mapping rule is obtained by a pre-trained language model processing a text sample in the first format and a text sample in a second format, the format mapping rule represents a mapping relation between the format grammar of the first format and the format grammar of the second format, and the first format is more difficult to read and write than the second format; extracting text content features of the text to be processed; extracting text structure features of the text to be processed based on the text to be processed and the text content features; and processing the format mapping rule, the text content features, and the text structure features based on the pre-trained language model to generate a target text whose format type is the second format.
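
The mapping step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not in the source: the second format is taken to be Markdown-like, the format mapping rule is represented as a plain lookup table (in the application the rule is obtained by a pre-trained language model, which this sketch does not reproduce), and every name below (`FormatMappingRule`, `map_heading`, `map_list_item`, `build_target_text`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FormatMappingRule:
    """Hypothetical rule table: first-format constructs -> second-format syntax."""
    # Assumption: Markdown-like heading markers per title level.
    heading_symbols: dict = field(default_factory=lambda: {1: "#", 2: "##", 3: "###"})
    # Assumption: bullet glyphs found in the source text map to "-".
    list_symbols: dict = field(default_factory=lambda: {"•": "-", "◦": "-"})
    # Assumption: paragraphs in the second format are separated by a blank line.
    paragraph_separator: str = "\n\n"


def map_heading(content: str, level: int, rule: FormatMappingRule) -> str:
    """Combine title content with the title level symbol of the second format."""
    return f"{rule.heading_symbols[level]} {content}"


def map_list_item(line: str, rule: FormatMappingRule) -> str:
    """Replace a source list symbol with the target list symbol."""
    stripped = line.lstrip()
    for src, dst in rule.list_symbols.items():
        if stripped.startswith(src):
            return f"{dst} {stripped[len(src):].strip()}"
    return line  # not a recognised list item; leave unchanged


def build_target_text(blocks, rule: FormatMappingRule) -> str:
    """Assemble target text from (kind, payload) pairs in reading order."""
    parts = []
    for kind, payload in blocks:
        if kind == "heading":
            level, content = payload
            parts.append(map_heading(content, level, rule))
        elif kind == "list":
            parts.append("\n".join(map_list_item(item, rule) for item in payload))
        else:  # plain paragraph
            parts.append(payload)
    return rule.paragraph_separator.join(parts)
```

A lookup table is used here only because it makes the notion of a mapping between two format grammars concrete; it says nothing about how the model-derived rule is actually represented.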

Inventors

  • ZHANG YUANG

Assignees

  • Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司)

Dates

Publication Date
2026-05-08
Application Date
2024-11-07

Claims (17)

  1. A text processing method, comprising: obtaining a text to be processed whose format type is a first format, and a format mapping rule, wherein the format mapping rule is obtained by a pre-trained language model processing a text sample in the first format and a text sample in a second format, the format mapping rule represents a mapping relation between the format grammar of the first format and the format grammar of the second format, and the read-write difficulty of the first format is greater than that of the second format; extracting text content features of the text to be processed; extracting text structure features of the text to be processed based on the text to be processed and the text content features; and processing the format mapping rule, the text content features, and the text structure features based on the pre-trained language model to generate a target text whose format type is the second format.
  2. The method of claim 1, wherein processing the format mapping rule, the text content features, and the text structure features based on the pre-trained language model to generate the target text whose format type is the second format comprises: performing, by the pre-trained language model, content mapping on the text content features based on the format mapping rule to obtain target content conforming to the second format; performing, by the pre-trained language model, structure mapping on the text structure features based on the format mapping rule to obtain a target structure conforming to the second format; and generating the target text whose format type is the second format based on the target content and the target structure.
  3. The method of claim 2, wherein the text structure features comprise a text title feature and a text paragraph feature, and performing, by the pre-trained language model, structure mapping on the text structure features based on the format mapping rule to obtain the target structure conforming to the second format comprises: determining a target title based on the text title feature and the format mapping rule; determining a target paragraph based on the text paragraph feature and the format mapping rule; and obtaining the target structure conforming to the second format based on the target title and the target paragraph.
  4. The method of claim 3, wherein determining the target title based on the text title feature and the format mapping rule comprises: determining title content and a title level in the text to be processed based on the text title feature; determining, based on the title level and the format mapping rule, a title level symbol of the second format corresponding to the title level; and determining the target title based on the title content and the title level symbol.
  5. The method of claim 3, wherein determining the target paragraph based on the text paragraph feature and the format mapping rule comprises: determining a paragraph symbol of a text paragraph in the text to be processed based on the text paragraph feature; and changing the paragraph symbol of the text paragraph to a target paragraph symbol corresponding to the second format based on the format mapping rule, so as to determine the target paragraph.
  6. The method of claim 3, wherein the text structure features further comprise a text list feature and a text table feature, and the method further comprises: determining a target list based on the text list feature and the format mapping rule; and determining a target table based on the text table feature and the format mapping rule; wherein obtaining the target structure conforming to the second format based on the target title and the target paragraph comprises: obtaining, by the pre-trained language model, the target structure conforming to the second format based on the target title, the target paragraph, the target list, and the target table.
  7. The method of claim 6, wherein determining the target list based on the text list feature and the format mapping rule comprises: determining a list symbol of a text list in the text to be processed based on the text list feature; and changing the list symbol of the text list to a target list symbol of the second format based on the format mapping rule, so as to obtain the target list.
  8. The method of claim 6, wherein determining the target table based on the text table feature and the format mapping rule comprises: determining a table symbol of a text table in the text to be processed based on the text table feature; and changing the table symbol of the text table to a target table symbol of the second format based on the format mapping rule, so as to obtain the target table.
  9. The method of any one of claims 1 to 8, wherein extracting the text content features of the text to be processed comprises: identifying text content in the text to be processed, the text content comprising text fonts and a plurality of text blocks; performing text feature recognition on the text fonts to obtain text font features; extracting coordinate information of each text block, the coordinate information of each text block representing the position of that text block in the text to be processed; sorting the plurality of text blocks based on the coordinate information of all the text blocks to obtain text position features; and obtaining the text content features of the text to be processed based on the text font features and the text position features.
  10. The method of claim 9, wherein extracting the text structure features of the text to be processed based on the text to be processed and the text content features comprises: determining a text title feature based on the text font features and preset font thresholds for title levels; determining paragraph identifiers between the text blocks based on the text position features to obtain a text paragraph feature; identifying list identifiers in the text to be processed to obtain a text list feature; identifying grid line symbols in the text to be processed to obtain a text table feature; and obtaining the text structure features of the text to be processed based on one or more of the text title feature, the text paragraph feature, the text list feature, and the text table feature.
  11. The method of any one of claims 1 to 10, wherein after processing the format mapping rule, the text content features, and the text structure features based on the pre-trained language model to generate the target text whose format type is the second format, the method further comprises: performing a grammar check for the second format on the target text to obtain target format information in the target text that does not conform to the format grammar of the second format; and processing the target format information based on the format grammar of the second format to obtain an optimized target text.
  12. The method of claim 11, wherein the target format information comprises one or more of blank lines, indentation, non-compliant characters, and erroneous symbols.
  13. The method of claim 11 or 12, wherein after processing the target format information based on the format grammar of the second format to obtain the optimized target text, the method further comprises: performing language detection on the optimized target text using a language detection algorithm to obtain the text language of the optimized target text; and correcting a first symbol in the optimized target text based on a preset symbol comparison table and the text language to obtain a target text conforming to the text language, wherein the first symbol is a text symbol in the optimized target text that does not conform to the text language.
  14. A text processing apparatus, comprising: an acquisition unit, configured to obtain a text to be processed whose format type is a first format, and a format mapping rule, wherein the format mapping rule is obtained by a pre-trained language model processing a text sample in the first format and a text sample in a second format, the format mapping rule represents a mapping relation between the format grammar of the first format and the format grammar of the second format, and the read-write difficulty of the first format is greater than that of the second format; an extraction unit, configured to extract text content features of the text to be processed, the extraction unit being further configured to extract text structure features of the text to be processed based on the text to be processed and the text content features; and a processing unit, configured to process the format mapping rule, the text content features, and the text structure features based on the pre-trained language model to generate a target text whose format type is the second format.
  15. A text processing device, comprising an input/output interface, a processor, and a memory, wherein the memory stores program instructions, and the processor is configured to execute the program instructions stored in the memory to perform the method of any one of claims 1 to 13.
  16. A computer-readable storage medium comprising instructions which, when run on a computer device, cause the computer device to perform the method of any one of claims 1 to 13.
  17. A computer program product comprising instructions which, when run on a computer device, cause the computer device to perform the method of any one of claims 1 to 13.
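
The feature extraction recited in claims 9 and 10 (sorting text blocks by coordinates, then comparing font features against preset per-level thresholds) can be sketched as follows. This is only an illustration under stated assumptions: the claims give no concrete coordinate convention or threshold values, so the top-to-bottom, left-to-right ordering, the point sizes in `HEADING_THRESHOLDS`, and the names `TextBlock`, `reading_order`, `heading_level`, and `structure_features` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    x: float          # left edge of the block (assumed unit: points)
    y: float          # top edge, assumed to grow downward on the page
    font_size: float  # stands in for the "text font feature"


# Hypothetical preset font thresholds per title level, largest first.
HEADING_THRESHOLDS = [(1, 20.0), (2, 16.0), (3, 13.0)]


def reading_order(blocks):
    """Sort blocks top-to-bottom, then left-to-right (text position feature)."""
    return sorted(blocks, key=lambda b: (b.y, b.x))


def heading_level(font_size):
    """Return the title level whose threshold the font size meets, else None."""
    for level, minimum in HEADING_THRESHOLDS:
        if font_size >= minimum:
            return level
    return None  # body text


def structure_features(blocks):
    """Pair each block, in reading order, with its detected title level."""
    return [(heading_level(b.font_size), b.text) for b in reading_order(blocks)]
```

A single-column layout is assumed; multi-column pages would need a grouping step before the sort, which the claims do not detail.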

Description

Text processing method, apparatus, device, storage medium and program product

Technical Field

Embodiments of the present application relate to the field of computer technologies, and in particular to a text processing method, apparatus, device, storage medium, and program product.

Background

Text format conversion refers to converting text from one format to another. Different text formats have different characteristics, and format conversion lets a text be converted into different format types to meet the requirements of different application scenarios. For example, text in a format that supports reading may be converted into text in a format that supports editing.

For texts with complex structures, in which several different text elements such as titles and tables coexist, the text format may support a given text function, but related schemes convert the format using a text conversion tool matched to that text, and the complex structure makes this unfavorable for content extraction, reconstruction, and similar processing. As a result, current approaches cannot accurately and efficiently convert structurally complex text into other formats that are easy to edit and maintain, which makes the text harder to read and understand and greatly degrades the user's experience of obtaining text information.

Therefore, a conversion method is needed that can efficiently convert text from a structurally complex format into a format that is easy to edit and maintain.
Disclosure of Invention

Embodiments of the present application provide a text processing method, apparatus, device, storage medium, and program product for converting text formats efficiently and accurately, so that the converted text is easy to edit and maintain, users can easily extract text information from it, and the accessibility of the text is improved.

In a first aspect, an embodiment of the present application provides a text processing method. The method comprises: obtaining a text to be processed whose format type is a first format, and a format mapping rule, wherein the format mapping rule is obtained by a pre-trained language model processing a text sample in the first format and a text sample in a second format, the format mapping rule represents a mapping relation between the format grammar of the first format and the format grammar of the second format, and the read-write difficulty of the first format is greater than that of the second format; extracting text content features of the text to be processed; extracting text structure features of the text to be processed based on the text to be processed and the text content features; and processing the format mapping rule, the text content features, and the text structure features based on the pre-trained language model to generate a target text whose format type is the second format.

In a second aspect, an embodiment of the present application provides a text processing apparatus. The text processing apparatus comprises an acquisition unit, an extraction unit, and a processing unit.
The acquisition unit is configured to obtain a text to be processed whose format type is a first format, and a format mapping rule, wherein the format mapping rule is obtained by a pre-trained language model processing a text sample in the first format and a text sample in a second format, the format mapping rule represents a mapping relation between the format grammar of the first format and the format grammar of the second format, and the read-write difficulty of the first format is greater than that of the second format. The extraction unit is configured to extract text content features of the text to be processed, and is further configured to extract text structure features of the text to be processed based on the text to be processed and the text content features. The processing unit is configured to process the format mapping rule, the text content features, and the text structure features based on the pre-trained language model to generate the target text whose format type is the second format. In one possible implementation of another aspect of the embodiments of the present application, the processing unit is specifically configured to: perform, through the pre-trained language model, content mapping on the text content features based on the format mapping rule to obtain target content conforming to the second format; perform, through the pre-trained language model, structure mapping on the text structural fe