CN-118096940-B - Tibetan text data set generation method and system


Abstract

The application relates to a method and a system for generating a Tibetan text data set, applied in the technical field of data generation. The method comprises: counting the occurrence frequency of Tibetan characters based on preset Tibetan data to obtain high-frequency Tibetan main characters and Tibetan auxiliary characters; preprocessing the Tibetan data to obtain Tibetan processing information, which comprises at least a Tibetan background picture, a text color and a text font size; and generating Tibetan text pictures according to a preset Tibetan distribution mode, the Tibetan auxiliary characters, the high-frequency Tibetan main characters and the Tibetan processing information. The method generates Tibetan text data of high quality, with diverse variants and sufficient volume, without requiring external Tibetan language data. It thereby establishes a highly usable general-purpose Tibetan text data set, improves the training of Tibetan target detection models, meets the needs of various Tibetan application fields, and promotes the development and popularization of the Tibetan language.

Inventors

TIAN HUI
WANG HUAN
GUO YUGANG
ZHANG ZHIXIANG
YANG XI
MA ZEHUA
ZHANG WEIMING
YU NENGHAI

Assignees

Hefei High-Dimensional Data Technology Co., Ltd. (合肥高维数据技术有限公司)
University of Science and Technology of China (中国科学技术大学)

Dates

Publication Date: 20260508
Application Date: 20240104

Claims (7)

1. A method for generating a Tibetan text data set, comprising: counting the occurrence frequency of Tibetan characters based on preset Tibetan data to obtain high-frequency Tibetan main characters and Tibetan auxiliary characters, which comprises: counting all Tibetan text data, extracting the Tibetan main characters and the Tibetan auxiliary characters, and generating a Tibetan corpus; generating a Tibetan main character frequency table from the Tibetan main characters in the Tibetan corpus; and selecting, in descending order of frequency in the Tibetan main character frequency table, a preset number of Tibetan main characters as the high-frequency Tibetan main characters; preprocessing the Tibetan data to obtain Tibetan processing information, wherein the Tibetan processing information comprises at least a Tibetan background picture, a text color and a text font size, which comprises: extracting a background picture from a preset Tibetan background library, and cropping and adjusting the background picture to obtain the Tibetan background picture; setting the character color and the font size of the Tibetan text from a preset Tibetan color library to obtain the text color and the text font size; associating the Tibetan background picture with the text color and the text font size to obtain a text generation scheme; and associating a plurality of text font sizes and a plurality of text colors with the same Tibetan background picture to generate a text enhancement scheme; combining the Tibetan auxiliary characters and the high-frequency Tibetan main characters into complete Tibetan characters according to Tibetan grammar rules; and generating Tibetan text pictures from the Tibetan processing information and the complete Tibetan characters according to a preset Tibetan distribution mode, so as to construct a general-purpose Tibetan text data set.
2. The method for generating a Tibetan text data set according to claim 1, wherein generating the Tibetan text pictures according to the preset Tibetan distribution mode, the Tibetan processing information and the complete Tibetan characters comprises: randomly selecting a first coordinate in the Tibetan background picture and setting it as the position of the first generated high-frequency Tibetan main character; randomly selecting Tibetan auxiliary characters and high-frequency Tibetan main characters from the Tibetan corpus based on Tibetan grammar rules to generate a complete Tibetan character; determining the embedding position of the complete Tibetan character according to a preset row-and-column layout and the first coordinate; and embedding the complete Tibetan character into the Tibetan background picture according to the embedding position and the text generation scheme to generate the Tibetan text picture.
3. The method for generating a Tibetan text data set according to claim 2, wherein, before the complete Tibetan character is embedded into the Tibetan background picture according to the embedding position, the method further comprises: controlling the Tibetan text format according to a preset row-and-column spacing; and setting the edge margin of the Tibetan characters according to a preset standard spacing and controlling the number of Tibetan characters embedded in the Tibetan background picture.
4. The method for generating a Tibetan text data set according to claim 1, wherein generating the Tibetan text pictures according to the preset Tibetan distribution mode, the Tibetan processing information and the complete Tibetan characters comprises: randomly selecting a plurality of non-overlapping areas in the Tibetan background picture as embedding coordinates of the high-frequency Tibetan main characters; randomly selecting Tibetan auxiliary characters and high-frequency Tibetan main characters from the Tibetan corpus based on Tibetan grammar rules to generate a complete Tibetan character; and embedding the complete Tibetan character into the Tibetan background picture according to the embedding coordinates and the text generation scheme to generate the Tibetan text picture.
5. An apparatus for generating a Tibetan text data set, the apparatus comprising: a frequency statistics module, configured to count the occurrence frequency of Tibetan characters based on preset Tibetan data to obtain high-frequency Tibetan main characters and Tibetan auxiliary characters, which comprises: counting all Tibetan text data, extracting the Tibetan main characters and the Tibetan auxiliary characters, and generating a Tibetan corpus; generating a Tibetan main character frequency table from the Tibetan main characters in the Tibetan corpus; and selecting, in descending order of frequency in the Tibetan main character frequency table, a preset number of Tibetan main characters as the high-frequency Tibetan main characters; a Tibetan processing module, configured to preprocess the Tibetan data to obtain Tibetan processing information, wherein the Tibetan processing information comprises at least a Tibetan background picture, a text color and a text font size, which comprises: extracting a background picture from a preset Tibetan background library, and cropping and adjusting the background picture to obtain the Tibetan background picture; setting the character color and the font size of the Tibetan text from a preset Tibetan color library to obtain the text color and the text font size; associating the Tibetan background picture with the text color and the text font size to obtain a text generation scheme; and associating a plurality of text font sizes and a plurality of text colors with the same Tibetan background picture to generate a text enhancement scheme; and an image generation module, configured to combine the Tibetan auxiliary characters and the high-frequency Tibetan main characters into complete Tibetan characters according to Tibetan grammar rules, and to generate Tibetan text pictures from the Tibetan processing information and the complete Tibetan characters according to a preset Tibetan distribution mode, so as to construct a general-purpose Tibetan text data set.
6. A control apparatus, characterized in that the apparatus comprises a memory and a processor, said memory having stored thereon a computer program capable of being loaded by said processor to perform the method according to any of claims 1 to 4.
7. A computer-readable storage medium, characterized in that it stores a computer program which can be loaded by a processor to perform the method according to any of claims 1 to 4.
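
As a rough illustration of the two distribution modes recited in claims 2 to 4, the following Python sketch computes embedding positions only; it does not render any text. It is a sketch under stated assumptions, not the claimed implementation: the function names and the parameters `glyph`, `spacing`, `margin`, `max_chars` and `max_tries` are hypothetical, each character is treated as a square of fixed side length, and the layout of claim 2 is read here as rows and columns filled left to right and top to bottom from the random first coordinate.

```python
import random


def grid_positions(bg_w, bg_h, glyph, spacing, margin, max_chars, rng):
    """Row-and-column mode: random first coordinate, then a regular grid.

    glyph   -- assumed side length of one character cell, in pixels
    spacing -- assumed gap between neighbouring cells (rows and columns)
    margin  -- assumed minimum distance from any cell to the picture edge
    """
    if bg_w < 2 * margin + glyph or bg_h < 2 * margin + glyph:
        return []  # background too small to hold even one character
    # Random first coordinate that still leaves room for one cell.
    x0 = rng.randint(margin, bg_w - margin - glyph)
    y0 = rng.randint(margin, bg_h - margin - glyph)
    positions = []
    y = y0
    while y + glyph <= bg_h - margin and len(positions) < max_chars:
        x = x0
        while x + glyph <= bg_w - margin and len(positions) < max_chars:
            positions.append((x, y))
            x += glyph + spacing
        y += glyph + spacing
    return positions


def random_regions(bg_w, bg_h, glyph, count, rng, max_tries=1000):
    """Random mode: pick up to `count` square regions that do not overlap."""
    if bg_w < glyph or bg_h < glyph:
        return []
    boxes = []
    for _ in range(max_tries):
        if len(boxes) == count:
            break
        x = rng.randint(0, bg_w - glyph)
        y = rng.randint(0, bg_h - glyph)
        # Two equal squares are disjoint if they are at least one side
        # length apart along the x axis or along the y axis.
        if all(abs(x - bx) >= glyph or abs(y - by) >= glyph for bx, by in boxes):
            boxes.append((x, y))
    return boxes


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(grid_positions(640, 480, 32, 8, 16, 20, rng))
    print(random_regions(640, 480, 32, 5, rng))
```

Capping the grid with `max_chars` corresponds to controlling the number of embedded characters, and `margin` to the edge spacing; the rejection sampling in `random_regions` is only one simple way to obtain non-overlapping areas.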

Description

Tibetan text data set generation method and system
Technical Field
The application relates to the technical field of data generation, in particular to a method and a system for generating a Tibetan text data set.
Background
Tibetan is a language of the Sino-Tibetan family, with a unique character set and grammatical structure that make it very different from many other languages. Existing Tibetan text generation methods are mainly based on Natural Language Processing (NLP), which is used to understand and generate Tibetan text. For example, Qinghai Minzu University, in its patent application 'a Tibetan web page abstract automatic generation method and system' (application number CN202011433753.3, application publication number CN 112328946A), provides a method for automatically generating Tibetan web page abstracts. The method comprises: first, using a Tibetan web crawler tool to crawl training and test samples for a Tibetan web page abstract system; second, judging the length of each Tibetan web page and whether its hyperlink is already in a database; third, removing noise from the crawled Tibetan web pages to produce plain Tibetan text, and then automatically segmenting the text into words; and fourth, after ranking the sentences of the Tibetan web page text by weight, setting an abstract extraction threshold and extracting an initial abstract of the Tibetan web page according to that threshold. By combining web crawling with natural language processing, that invention can effectively output Tibetan web page abstracts.
With regard to the related technology above, the inventors consider that, because Tibetan data are limited, building a high-quality Tibetan text database is impractical. As a result, a general-purpose Tibetan text data set is difficult to establish with such a method, and the uneven quality of the crawled data affects the training of Tibetan models.
Disclosure of Invention
To address the problems that building a high-quality Tibetan text database is impractical because Tibetan data are limited, that a general-purpose Tibetan text data set is therefore difficult to establish, and that the uneven quality of crawled data affects the training of a Tibetan target detection model, the application provides a method and a system for generating a Tibetan text data set.
In a first aspect, the method for generating a Tibetan text data set provided by the application adopts the following technical scheme: counting the occurrence frequency of Tibetan characters based on preset Tibetan data to obtain high-frequency Tibetan main characters and Tibetan auxiliary characters; preprocessing the Tibetan data to obtain Tibetan processing information, wherein the Tibetan processing information comprises at least a Tibetan background picture, a text color and a text font size; and generating a Tibetan text picture according to a preset Tibetan distribution mode, the Tibetan auxiliary characters, the high-frequency Tibetan main characters and the Tibetan processing information.
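The first two steps of the scheme above might be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation of the application: the function names are hypothetical, and the application does not say how main and auxiliary characters are told apart. Here main characters are assumed to be the Unicode Tibetan base letters (U+0F40 to U+0F6C), and auxiliary characters the vowel signs and marks (U+0F71 to U+0F84) together with the subjoined letters (U+0F90 to U+0FBC).

```python
from collections import Counter
from itertools import product

# Assumed Unicode ranges; the application does not specify them.
MAIN_RANGE = (0x0F40, 0x0F6C)                      # Tibetan base letters
AUX_RANGES = ((0x0F71, 0x0F84), (0x0F90, 0x0FBC))  # vowel signs/marks, subjoined letters


def is_main(ch):
    return MAIN_RANGE[0] <= ord(ch) <= MAIN_RANGE[1]


def is_aux(ch):
    return any(lo <= ord(ch) <= hi for lo, hi in AUX_RANGES)


def build_frequency_table(texts):
    """Count main characters and collect the auxiliary characters seen."""
    main_counts = Counter()
    aux_chars = set()
    for text in texts:
        for ch in text:
            if is_main(ch):
                main_counts[ch] += 1
            elif is_aux(ch):
                aux_chars.add(ch)
    return main_counts, aux_chars


def high_frequency_main_chars(main_counts, n):
    """Select the n most frequent main characters, in descending order."""
    return [ch for ch, _ in main_counts.most_common(n)]


def enhancement_schemes(background, colors, sizes):
    """Pair one background with every (color, size) combination."""
    return [(background, color, size) for color, size in product(colors, sizes)]


if __name__ == "__main__":
    # Two short sample strings written with escapes: U+0F56 U+0F7C U+0F51
    # and U+0F56 U+0F74 (illustrative input, not data from the application).
    sample = ["\u0F56\u0F7C\u0F51", "\u0F56\u0F74"]
    counts, aux = build_frequency_table(sample)
    print(high_frequency_main_chars(counts, 1))
    print(enhancement_schemes("bg.png", [(0, 0, 0), (255, 255, 255)], [24, 32]))
```

Each tuple returned by `enhancement_schemes` plays the role of one text generation scheme; producing several of them for the same background corresponds to the enhancement described in the scheme. The file name `bg.png` and the RGB tuples are placeholders.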
Optionally, counting the occurrence frequency of Tibetan characters based on the preset Tibetan data to obtain the high-frequency Tibetan main characters comprises: counting all Tibetan text data, extracting the Tibetan main characters and the Tibetan auxiliary characters, and generating a Tibetan corpus; generating a Tibetan main character frequency table from the Tibetan main characters in the Tibetan corpus; and selecting, in descending order of frequency in the Tibetan main character frequency table, a preset number of Tibetan main characters as the high-frequency Tibetan main characters.
Optionally, preprocessing the Tibetan data to obtain the Tibetan processing information, which comprises at least a Tibetan background picture, a text color and a text font size, comprises: extracting a background picture from a preset Tibetan background library, and cropping and adjusting the background picture to obtain the Tibetan background picture; setting the character color and the font size of the Tibetan text from a preset Tibetan color library to obtain the text color and the text font size; and associating the Tibetan background picture with the text color and the text font size to obtain a text generation scheme.
Optionally, after the Tibetan background picture is associated with the text color and the text font size to obtain the text generation scheme, the method further comprises: associating a plurality of text font sizes and a plurality of text colors with the same Tibetan background picture to generate a text enhancement scheme.
Optionally, the generating the Tibetan text