CN-122024255-A - Text processing method, device, equipment and medium for double-engine collaborative filtering


Abstract

The invention provides a text processing method, device, equipment and medium for double-engine collaborative filtering. The method comprises double-engine feature extraction and preprocessing, collaborative filtering central processing, and output of a structured layout analysis result. The double-engine feature extraction and preprocessing uses an attribute recognition engine and a coordinate analysis engine; the collaborative filtering central processing comprises feature alignment, dynamic weight allocation and confidence calibration; and the structured layout analysis result comprises text content W, a type label L, an attribute set F, accurate coordinates P and comprehensive credibility C. Through double-engine collaborative filtering, the erroneous retention rate of irrelevant text (art words and commodity fonts) is reduced from more than 30% in the prior art to less than 5%, and the text coordinate analysis accuracy is improved from 98% to more than 99%.

Inventors

LIU ZHIHAI
YANG KENGQIANG
TONG ZHEN

Assignees

福建紫讯信息科技有限公司

Dates

Publication Date: 20260512
Application Date: 20251223

Claims (10)

1. A text processing method of double-engine collaborative filtering, characterized by comprising the following steps: step 1, double-engine feature extraction and preprocessing, wherein the double engines comprise an attribute recognition engine and a coordinate analysis engine; step 2, collaborative filtering central processing, including feature alignment, dynamic weight allocation and confidence calibration; and step 3, outputting a structured layout analysis result, wherein the structured layout analysis result comprises text content W, a type label L, an attribute set F, accurate coordinates P and comprehensive credibility C.
2. The text processing method of double-engine collaborative filtering according to claim 1, wherein the attribute recognition engine performs the following operations: performing resolution detection and text density calculation on an input image to generate a scene feature parameter S = (D, T), wherein the resolution D is obtained by converting between the pixel size and the size in inches, and the text density T is the ratio of the text region area to the total image area; segmenting the text regions in the image through connected domain analysis and layering them, performing attribute recognition on each layer of text, and outputting a first feature vector, wherein the first feature vector comprises text content W, a type label L, an attribute set F and an attribute confidence C_attr, the type label L comprises valid text, art words and commodity fonts, the attribute set F comprises font, text color, font size and a bold flag, the type label L is generated through multimodal training of the attribute recognition engine, and the multimodal training fuses text morphological features and image context features; and the coordinate analysis engine performs the following operations: generating feature maps at 3 to 5 scales for the input image using a pyramid scale decomposition algorithm and performing text detection on each scale; correcting inclined text in the detected text bounding boxes through perspective transformation; outputting a second feature vector, wherein the second feature vector comprises text content W, accurate coordinates P and a recognition probability C_ocr, the accurate coordinates P are a four-point coordinate set, and the coordinate precision is calibrated to ≤ 1 pixel through pixel-level comparison; and filtering out second feature vectors whose recognition probability C_ocr < 70.
3. The text processing method of double-engine collaborative filtering according to claim 1, wherein the feature alignment comprises: constructing a two-dimensional association matrix of text content and spatial position, and calculating the spatial overlap IOU of the text regions and the text content similarity; when IOU ≥ 60% and the text content similarity ≥ 80%, establishing an association mapping between a first feature vector and a second feature vector, wherein the text content similarity is calculated based on an edit distance algorithm; and performing completion processing on first feature vectors and second feature vectors that have not been associated, specifically: for a first feature vector not associated with any second feature vector, predicting the accurate coordinates P through a K-nearest-neighbor algorithm based on the coordinate distribution of surrounding associated texts, wherein the K value of the K-nearest-neighbor algorithm is 3; and for a second feature vector that has C_ocr ≥ 80 and is not associated with any first feature vector, calling a lightweight classification model of the attribute recognition engine based on the image context of the coordinate region to supplement the type label L; the dynamic weight allocation comprises: taking the scene feature parameter S as input, and outputting an attribute engine weight alpha and a coordinate engine weight beta through a regression model trained on historical samples, wherein alpha + beta = 1 is satisfied, the regression model is a random forest regression model, alpha = 0.6 and beta = 0.4 when D = 300 DPI and T = 0.2, and alpha = 0.3 and beta = 0.7 when D = 72 DPI and T = 0.8; calculating the comprehensive credibility C = alpha × C_attr + beta × C_ocr; and retaining association results with C ≥ 75, wherein an association result with C < 75 is low-credibility text; and the confidence calibration comprises: establishing a type-coordinate feedback model, dynamically adjusting thresholds for different types of text, filtering out texts with comprehensive credibility C < 75, and outputting the structured layout analysis result, wherein the dynamic threshold adjustment is specifically: when the type label L is valid text and a plurality of overlapping texts exist in the coordinate region, if the IOU of the overlapping texts is ≥ 50%, the attribute confidence screening threshold is adjusted to C_attr ≥ 70; and when the type label L is commodity font and the coordinates are located at the edge of the image, if the coordinates are within 5% of the image width from the image boundary, the attribute confidence screening threshold is adjusted to C_attr ≥ 50.
4. The text processing method of double-engine collaborative filtering according to claim 1, wherein the attribute recognition engine is the doubao_seed_vision large model and the coordinate analysis engine is the PaddleOCR engine.
5. A text processing device of double-engine collaborative filtering, characterized by comprising: an engine processing module, configured to perform double-engine feature extraction and preprocessing, wherein the double engines comprise an attribute recognition engine and a coordinate analysis engine; a filtering processing module, configured to perform collaborative filtering central processing, including feature alignment, dynamic weight allocation and confidence calibration; and a result obtaining module, configured to output a structured layout analysis result, wherein the structured layout analysis result comprises text content W, a type label L, an attribute set F, accurate coordinates P and comprehensive credibility C.
6. The text processing device of double-engine collaborative filtering according to claim 5, wherein the attribute recognition engine performs the following operations: performing resolution detection and text density calculation on an input image to generate a scene feature parameter S = (D, T), wherein the resolution D is obtained by converting between the pixel size and the size in inches, and the text density T is the ratio of the text region area to the total image area; segmenting the text regions in the image through connected domain analysis and layering them, performing attribute recognition on each layer of text, and outputting a first feature vector, wherein the first feature vector comprises text content W, a type label L, an attribute set F and an attribute confidence C_attr, the type label L comprises valid text, art words and commodity fonts, the attribute set F comprises font, text color, font size and a bold flag, the type label L is generated through multimodal training of the attribute recognition engine, and the multimodal training fuses text morphological features and image context features; and the coordinate analysis engine performs the following operations: generating feature maps at 3 to 5 scales for the input image using a pyramid scale decomposition algorithm and performing text detection on each scale; correcting inclined text in the detected text bounding boxes through perspective transformation; outputting a second feature vector, wherein the second feature vector comprises text content W, accurate coordinates P and a recognition probability C_ocr, the accurate coordinates P are a four-point coordinate set, and the coordinate precision is calibrated to ≤ 1 pixel through pixel-level comparison; and filtering out second feature vectors whose recognition probability C_ocr < 70.
7. The text processing device of double-engine collaborative filtering according to claim 5, wherein the feature alignment comprises: constructing a two-dimensional association matrix of text content and spatial position, and calculating the spatial overlap IOU of the text regions and the text content similarity; when IOU ≥ 60% and the text content similarity ≥ 80%, establishing an association mapping between a first feature vector and a second feature vector, wherein the text content similarity is calculated based on an edit distance algorithm; and performing completion processing on first feature vectors and second feature vectors that have not been associated, specifically: for a first feature vector not associated with any second feature vector, predicting the accurate coordinates P through a K-nearest-neighbor algorithm based on the coordinate distribution of surrounding associated texts, wherein the K value of the K-nearest-neighbor algorithm is 3; and for a second feature vector that has C_ocr ≥ 80 and is not associated with any first feature vector, calling a lightweight classification model of the attribute recognition engine based on the image context of the coordinate region to supplement the type label L; the dynamic weight allocation comprises: taking the scene feature parameter S as input, and outputting an attribute engine weight alpha and a coordinate engine weight beta through a regression model trained on historical samples, wherein alpha + beta = 1 is satisfied, the regression model is a random forest regression model, alpha = 0.6 and beta = 0.4 when D = 300 DPI and T = 0.2, and alpha = 0.3 and beta = 0.7 when D = 72 DPI and T = 0.8; calculating the comprehensive credibility C = alpha × C_attr + beta × C_ocr; and retaining association results with C ≥ 75, wherein an association result with C < 75 is low-credibility text; and the confidence calibration comprises: establishing a type-coordinate feedback model, dynamically adjusting thresholds for different types of text, filtering out texts with comprehensive credibility C < 75, and outputting the structured layout analysis result, wherein the dynamic threshold adjustment is specifically: when the type label L is valid text and a plurality of overlapping texts exist in the coordinate region, if the IOU of the overlapping texts is ≥ 50%, the attribute confidence screening threshold is adjusted to C_attr ≥ 70; and when the type label L is commodity font and the coordinates are located at the edge of the image, if the coordinates are within 5% of the image width from the image boundary, the attribute confidence screening threshold is adjusted to C_attr ≥ 50.
8. The text processing device of double-engine collaborative filtering according to claim 5, wherein the attribute recognition engine is the doubao_seed_vision large model and the coordinate analysis engine is the PaddleOCR engine.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 4 when executing the program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1 to 4.
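
As an illustration of the feature alignment recited in claims 3 and 7, the association test (IOU ≥ 60% and text content similarity ≥ 80%) can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names are hypothetical; the IOU is computed on the axis-aligned bounding rectangles of the four-point coordinate sets, which the claims do not specify; and the similarity is assumed to be 1 minus the edit distance divided by the longer string length, which is one common normalization and is likewise not stated in the claims.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Quad = List[Point]  # four-point coordinate set P


def bounding_rect(quad: Quad) -> Tuple[float, float, float, float]:
    """Axis-aligned rectangle (x1, y1, x2, y2) enclosing a four-point set."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return min(xs), min(ys), max(xs), max(ys)


def iou(quad_a: Quad, quad_b: Quad) -> float:
    """Intersection over union of the two enclosing rectangles (assumption)."""
    ax1, ay1, ax2, ay2 = bounding_rect(quad_a)
    bx1, by1, bx2, by2 = bounding_rect(quad_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def content_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; the normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def is_associated(text_a: str, quad_a: Quad, text_b: str, quad_b: Quad,
                  iou_min: float = 0.60, sim_min: float = 0.80) -> bool:
    """Apply the two thresholds stated in the claims."""
    return (iou(quad_a, quad_b) >= iou_min
            and content_similarity(text_a, text_b) >= sim_min)
```

With this sketch, two 10 × 10 boxes offset horizontally by 2 pixels have an IOU of 80/120, and the strings "ABC-1234" and "ABC-1284" differ by one substitution in eight characters (similarity 0.875), so the pair would be associated.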

Description

Text processing method, device, equipment and medium for double-engine collaborative filtering

Technical Field

The invention relates to the technical field of image text processing, and in particular to a text processing method, device, equipment and medium for double-engine collaborative filtering.

Background

Existing automatic layout analysis technology has three core pain points. First, the capabilities of any single algorithm are unbalanced: for example, the doubao_seed_vision large model can accurately filter art words and commodity fonts and identify text attributes, but its coordinate analysis error rate is as high as 15%-20%, whereas PaddleOCR achieves a coordinate accuracy of more than 98% but cannot distinguish text types, so its false recognition rate for irrelevant text exceeds 30%. Second, most traditional fusion schemes are hard splicing: they screen results by text matching only and do not exploit the feature complementarity of the two types of algorithm, so the fused result still suffers from missing attributes or coordinate deviation. Third, a dynamic adaptation mechanism is lacking: for images with different resolutions (such as 300 DPI documents and 72 DPI webpage screenshots) and text densities (such as a specification with 100 words per page and a report with 500 words per page), fixed fusion logic cannot guarantee analysis precision and adapts poorly.

In addition, the prior art does not introduce the idea of collaborative filtering, so the "1+1>2" effect cannot be achieved through feature interaction and dynamic weight adjustment between algorithms.
For example, when PaddleOCR recognizes a piece of text as a "product model" (with no type judgment) and doubao_seed_vision identifies the same text as a "commodity font" (to be filtered), the conventional scheme can only discard the text; it cannot use the coordinate information from PaddleOCR to help correct the filtering threshold that doubao_seed_vision applies to similar text. As a result, valid information is deleted by mistake or irrelevant information is retained, which seriously affects the reliability of layout analysis.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a text processing method, device, equipment and medium for double-engine collaborative filtering, in which the erroneous retention rate of irrelevant text (art words and commodity fonts) is reduced from more than 30% in the prior art to less than 5%, and the text coordinate analysis precision is improved from 98% to more than 99%.

In a first aspect, the invention provides a text processing method of double-engine collaborative filtering, comprising the following steps: step 1, double-engine feature extraction and preprocessing, wherein the double engines comprise an attribute recognition engine and a coordinate analysis engine; step 2, collaborative filtering central processing, including feature alignment, dynamic weight allocation and confidence calibration; and step 3, outputting a structured layout analysis result, wherein the structured layout analysis result comprises text content W, a type label L, an attribute set F, accurate coordinates P and comprehensive credibility C.
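
The weighted fusion in step 2 and the output record in step 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the claimed implementation: `LayoutItem`, `comprehensive_credibility` and `screen` are hypothetical names; the confidences are assumed to lie on a 0-100 scale, consistent with the thresholds 70, 75 and 80 given in the claims; and the weight alpha is passed in directly, whereas the claims obtain it from a random forest regression model trained on historical samples.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LayoutItem:
    """One entry of the structured layout analysis result (hypothetical name)."""
    content: str                        # text content W
    label: str                          # type label L
    attributes: Dict[str, object]       # attribute set F (font, color, size, bold)
    coords: List[Tuple[float, float]]   # accurate coordinates P (four points)
    credibility: float                  # comprehensive credibility C


def comprehensive_credibility(c_attr: float, c_ocr: float, alpha: float) -> float:
    """C = alpha * C_attr + beta * C_ocr, with beta = 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * c_attr + beta * c_ocr


def screen(items: List[LayoutItem], threshold: float = 75.0) -> List[LayoutItem]:
    """Keep items with C >= threshold; the rest count as low-credibility text."""
    return [item for item in items if item.credibility >= threshold]
```

For instance, with the weights stated in the claims for a 300 DPI image with T = 0.2 (alpha = 0.6, beta = 0.4), C_attr = 90 and C_ocr = 70 give C = 0.6 × 90 + 0.4 × 70 = 82, which passes the C ≥ 75 screen.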
In a second aspect, the invention provides a text processing device of double-engine collaborative filtering, comprising: an engine processing module, configured to perform double-engine feature extraction and preprocessing, wherein the double engines comprise an attribute recognition engine and a coordinate analysis engine; a filtering processing module, configured to perform collaborative filtering central processing, including feature alignment, dynamic weight allocation and confidence calibration; and a result obtaining module, configured to output a structured layout analysis result, wherein the structured layout analysis result comprises text content W, a type label L, an attribute set F, accurate coordinates P and comprehensive credibility C.

In a third aspect, the invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.

In a fourth aspect, the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect.

The one or more technical schemes provided by the invention have at least the following technical effects or advantages:

Precision is significantly improved: through double-engine collaborative filtering, the erroneous retention rate of irrelevant text (art words and commodity fonts) is reduced from more than 30% in the prior art to less than 5%, the text coordin