CN-115796195-B - Bilingual corpus alignment method, bilingual corpus alignment system, terminal and medium
Abstract
The invention discloses a bilingual corpus alignment method, system, terminal and medium, and relates to the technical field of translation. The technical scheme is as follows: a first hit rate between an original sentence and each translated sentence is calculated, and when the first hit rate meets a first threshold group, the original sentence and the translated sentence with the highest first hit rate are output according to the sequence number of the original sentence. When the first hit rate meets a second threshold group, the translated sentence with the highest first hit rate is selected and a second hit rate between it and each original sentence is calculated; when the second hit rate meets a third threshold group, the original sentence with the highest second hit rate and that translated sentence are output according to the sequence number of the original sentence. When the second hit rate meets a fourth threshold group, the translated sentence is output together with the original sentence whose words have the highest hit rate against the translated words of the translated sentence, again according to the sequence number of the original sentence. The original text and the translation are thus aligned according to the hit rates between original sentences and translated sentences, combined with the order of the original sentences, which improves alignment accuracy and, in turn, alignment efficiency.
Inventors
- GAO LIKUN
- ZHANG MACHENG
- LIAO FULIN
- LI MING
Assignees
- 成都优译信息技术股份有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2022-11-25
Claims (10)
- 1. A bilingual corpus alignment method, characterized by comprising the following steps: obtaining the sequence numbers and original sentence information of all original sentences in an original sentence list, and the translated sentence information of all translated sentences in a translated sentence list, wherein the original sentence information comprises a translated word root list of the effective words of the original sentence, and the translated sentence information comprises a root list of the effective words of the translated sentence; obtaining a first hit rate between the translated word roots in the translated word root list of the original sentence and the roots in the root list of each translated sentence; if the first hit rate meets a first threshold group, outputting the original sentence and the translated sentence with the highest first hit rate according to the sequence number of the original sentence; if the first hit rate meets a second threshold group, selecting the translated sentence with the highest first hit rate as a target object, and obtaining a second hit rate between the roots in the root list of the target object and the translated word roots in the translated word root list of each original sentence; if the second hit rate meets a third threshold group, outputting the original sentence with the highest second hit rate and the target object according to the sequence number of the original sentence; if the second hit rate meets a fourth threshold group, obtaining a word list of the target object, performing reverse translation on the words in the word list through a preset query interface to obtain a translated word list of the target object, obtaining a third hit rate between the translated words in the translated word list of the target object and the words in each original sentence, and outputting the original sentence with the highest third hit rate and the target object according to the sequence number of the original sentence.
- 2. The bilingual corpus alignment method according to claim 1, wherein the process of obtaining the original sentence list and the translated sentence list comprises: parsing the file to be aligned to obtain the parsed original text and translated text; performing sentence segmentation on the original text and the translated text respectively to obtain the corresponding original sentence list and translated sentence list; the original sentence list comprises the original sentences and their corresponding sequence numbers; the translated sentence list comprises the translated sentences.
- 3. The bilingual corpus alignment method according to claim 1, wherein the process of obtaining the original sentence information comprises: performing lexical analysis on the original sentence with HanLP and removing interference words to obtain an original sentence word list containing the effective words of the original sentence; querying a translated word list for each effective word in the original sentence word list through a preset query interface to obtain a translated word list set corresponding to the original sentence word list; extracting the translated word roots of the translated word lists in the translated word list set with a stemmer algorithm to obtain a translated word root list corresponding to the original sentence word list; the original sentence information comprises the original sentence, the original sentence word list, the translated word list set and the translated word root list.
- 4. The bilingual corpus alignment method according to claim 1 or 3, characterized in that: the preset query interface supports queries by base form, by root, and by singular or plural form.
- 5. The bilingual corpus alignment method according to claim 1, wherein the process of obtaining the translated sentence information comprises: performing lexical analysis on the translated sentence with the Stanford lexical analyzer and removing interference words to obtain a translated sentence word list containing the effective words of the translated sentence; extracting the roots of the effective words in the translated sentence word list with a stemmer algorithm to obtain a root list corresponding to the translated sentence word list; the translated sentence information comprises the translated sentence, the translated sentence word list and the root list.
- 6. The bilingual corpus alignment method according to claim 1, wherein the highest hit rate is selected as follows: the highest hit rate is obtained through binary tree sorting.
- 7. A bilingual corpus alignment system, comprising: a first acquisition module (100) for obtaining the sequence numbers and original sentence information of all original sentences in the original sentence list and the translated sentence information of all translated sentences in the translated sentence list, wherein the original sentence information comprises a translated word root list of the effective words of the original sentence and the translated sentence information comprises a root list of the effective words of the translated sentence; a second acquisition module (110) for obtaining a first hit rate between the translated word roots in the translated word root list of the original sentence and the roots in the root list of each translated sentence; a first output module (120) for outputting the original sentence and the translated sentence with the highest first hit rate according to the sequence number of the original sentence when the first hit rate meets a first threshold group; a third acquisition module (130) for selecting the translated sentence with the highest first hit rate as a target object when the first hit rate meets a second threshold group, and obtaining a second hit rate between the roots in the root list of the target object and the translated word roots in the translated word root list of each original sentence; a second output module (140) for outputting the original sentence with the highest second hit rate and the target object according to the sequence number of the original sentence when the second hit rate meets a third threshold group; and a third output module for obtaining a word list of the target object when the second hit rate meets a fourth threshold group, performing reverse translation on the words in the word list through a preset query interface to obtain a translated word list of the target object, obtaining a third hit rate between the translated words in the translated word list of the target object and the words in each original sentence, and outputting the original sentence with the highest third hit rate and the target object according to the sequence number of the original sentence.
- 8. The bilingual corpus alignment system according to claim 7, wherein: the third output module comprises a first acquisition unit (141), a first processing unit (142), a second acquisition unit (143), a second processing unit (144) and an output unit (145); the first acquisition unit (141) is connected with the third acquisition module (130) and is used for obtaining a word list of the target object when the second hit rate meets a fourth threshold group; the first processing unit (142) is connected with the first acquisition unit (141) and is used for performing reverse translation, through a preset query interface, on the words in the target object word list obtained by the first acquisition unit (141), so as to obtain a translated word list of the target object; the second acquisition unit (143) is connected with the first processing unit (142) and the first acquisition module (100) and is used for obtaining the translated word list of the target object and each original sentence; the second processing unit (144) is connected with the second acquisition unit (143) and is used for calculating a third hit rate between the translated words in the translated word list of the target object and the words in each original sentence; the output unit (145) is connected with the second processing unit (144) and is used for outputting the original sentence with the highest third hit rate and the target object according to the sequence number of the original sentence.
- 9. An electronic terminal, comprising: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, so that the electronic terminal executes the bilingual corpus alignment method according to any one of claims 1-6.
- 10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements a bilingual corpus alignment method as claimed in any of claims 1-6.
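The root lists of claims 3 and 5 and the first hit rate of claim 1 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `TOY_DICTIONARY` stands in for the preset query interface, `toy_stem` for the stemmer algorithm, `STOP_WORDS` for the interference words, and the source words are taken as already segmented (the step the claims assign to HanLP). The claims do not state a formula for the hit rate; the sketch assumes it is the share of effective source words that have at least one translated word root in the translated sentence's root list.

```python
from typing import Dict, List, Set

# Hypothetical stand-ins for the interference-word list and the query interface.
STOP_WORDS: Set[str] = {"the", "a", "an", "is", "are", "of", "to", "on"}
TOY_DICTIONARY: Dict[str, List[str]] = {
    "猫": ["cat", "cats"],
    "坐": ["sit", "sitting"],
    "垫子": ["mat", "mats", "cushion"],
}


def toy_stem(word: str) -> str:
    """Crude suffix stripper for illustration only; not a real stemmer."""
    word = word.lower()
    for suffix in ("ting", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def source_root_groups(words: List[str]) -> List[Set[str]]:
    """One set of translated word roots per effective source word."""
    return [{toy_stem(t) for t in TOY_DICTIONARY.get(w, [])} for w in words]


def target_roots(sentence: str) -> Set[str]:
    """Root set of the effective words of a translated sentence."""
    tokens = sentence.lower().replace(".", " ").split()
    return {toy_stem(t) for t in tokens if t not in STOP_WORDS}


def hit_rate(groups: List[Set[str]], roots: Set[str]) -> float:
    """Assumed definition: share of source words with a root found in `roots`."""
    if not groups:
        return 0.0
    hits = sum(1 for group in groups if group & roots)
    return hits / len(groups)


groups = source_root_groups(["猫", "坐", "垫子"])
print(hit_rate(groups, target_roots("The cat sits on the mat.")))
print(hit_rate(groups, target_roots("The dog barks loudly.")))
```

Grouping the candidate roots per source word keeps a word with many dictionary translations from dominating the rate; a flat root list would be an equally plausible reading of the claims.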
Description
Bilingual corpus alignment method, bilingual corpus alignment system, terminal and medium
Technical Field
The invention relates to the technical field of translation, and in particular to a bilingual corpus alignment method, system, terminal and medium.
Background
Corpus alignment means matching the original text with its translation at a chosen segmentation granularity to form standardized language pairs. Alignment units range in granularity from large to small, such as chapters, paragraphs, sentences and words; the smaller the granularity, the richer the linguistic information provided by the parallel corpus and the greater its application value. In general, if the corpus is aligned by chapter or paragraph, the original text and the translation can simply be aligned in order. However, when the original text and the translation within a paragraph are aligned at sentence or finer granularity, such simple processing no longer works. Because of the source-language style, the target-language style, the translation style, content adjustments and other factors, aligning the original sentences and translated sentences of a paragraph simply in order often produces a large number of mismatches, which must then be checked manually. This is time-consuming, laborious and inefficient.
Disclosure of Invention
The invention aims to provide a bilingual corpus alignment method, system, terminal and medium. A first hit rate between the effective word roots of an original sentence and the effective word roots of each translated sentence is calculated, and when the first hit rate meets a first threshold group, the original sentence and the translated sentence with the highest first hit rate are output according to the sequence number of the original sentence. When the first hit rate meets a second threshold group, the translated sentence with the highest first hit rate is selected, and a second hit rate between the effective word roots of that translated sentence and the word roots of each original sentence is calculated; when the second hit rate meets a third threshold group, the original sentence with the highest second hit rate and that translated sentence are output according to the sequence number of the original sentence. When the second hit rate meets a fourth threshold group, the translated sentence is reverse-translated, and the translated sentence is output together with the original sentence whose words have the highest hit rate against the reverse-translated words, according to the sequence number of the original sentence. The original text and the translation are thus aligned according to the hit rates between original sentences and translated sentences, combined with the order of the original sentences, which improves the accuracy of aligning original sentences with translated sentences and, in turn, the alignment efficiency.
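The three-stage decision described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the patented implementation: the text names four threshold groups without giving values, so `FIRST_PASS` and `SECOND_PASS` are hypothetical boundaries; `rate` is an assumed hit-rate definition; and `back_translate` is a hypothetical callable standing in for reverse translation through the preset query interface.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical boundaries; the text gives no numeric threshold values.
FIRST_PASS = 0.6   # first hit rate >= this is treated as the first threshold group
SECOND_PASS = 0.5  # second hit rate >= this is treated as the third threshold group


def rate(a: Set[str], b: Set[str]) -> float:
    """Assumed hit rate: share of the items in `a` that also occur in `b`."""
    return len(a & b) / len(a) if a else 0.0


def argmax(values: List[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def align_one(
    i: int,
    src_roots: List[Set[str]],   # translated word roots of each original sentence
    tgt_roots: List[Set[str]],   # roots of each translated sentence
    src_words: List[Set[str]],   # effective words of each original sentence
    back_translate: Callable[[int], Set[str]],
) -> Tuple[int, int, str]:
    """Return (original index, translated index, stage that decided the pair)."""
    first = [rate(src_roots[i], t) for t in tgt_roots]
    j = argmax(first)
    if first[j] >= FIRST_PASS:
        return i, j, "first"
    # Otherwise take translated sentence j as the target object and look back.
    second = [rate(tgt_roots[j], s) for s in src_roots]
    k = argmax(second)
    if second[k] >= SECOND_PASS:
        return k, j, "second"
    # Last resort: compare reverse-translated words with the original words.
    back = back_translate(j)
    third = [rate(back, w) for w in src_words]
    return argmax(third), j, "third"


def align(src_roots, tgt_roots, src_words, back_translate):
    """Align every original sentence; output is ordered by sequence number."""
    pairs = [
        align_one(i, src_roots, tgt_roots, src_words, back_translate)
        for i in range(len(src_roots))
    ]
    return sorted(pairs)
```

Here each threshold group is collapsed to a single cut-off for brevity. The same control flow would apply if each group were a range of values.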
The technical aim of the invention is realized by the following technical scheme: A bilingual corpus alignment method comprises: obtaining the sequence numbers and original sentence information of the original sentences in an original sentence list and the translated sentence information of the translated sentences in a translated sentence list, wherein the original sentence information comprises a translated word root list of the effective words of the original sentence and the translated sentence information comprises a root list of the effective words of the translated sentence; obtaining a first hit rate between the translated word roots in the translated word root list of the original sentence and the roots in the root list of each translated sentence; if the first hit rate meets a first threshold group, outputting the original sentence and the translated sentence with the highest first hit rate according to the sequence number of the original sentence; if the first hit rate meets a second threshold group, selecting the translated sentence with the highest first hit rate as a target object, and obtaining a second hit rate between the roots in the root list of the target object and the translated word roots in the translated word root list of each original sentence; if the second hit rate meets a third threshold group, outputting the original sentence with the highest second hit rate and the target object according to the sequence number of the original sentence; and if the second hit rate meets a fourth threshold group, obtaining a word list of the target object, performing reverse translation on the words in the word list through a preset query interface to obtain a translated word list of the target object, obtaining a third hit rate between the translated words in the translated word list of the target object and the words in each original sentence, and outputting the original sentence with the highest third hit rate and the target object according to the sequence number of the original sentence. Further, the process for obtaining the original sen