CN-115409045-B - Document translation method, device, equipment and storage medium
Abstract
The invention provides a document translation method, device, equipment and storage medium. The method comprises: obtaining a source language document to be translated; obtaining document attribute information corresponding to each sentence contained in the source language document, wherein the document attribute information is determined based on the corresponding sentence and/or a context sentence of the corresponding sentence and is used to assist in determining the meaning of the words of the corresponding sentence within the source language document; and translating each sentence contained in the source language document with the assistance of the document attribute information corresponding to that sentence, so as to obtain a target language translation corresponding to the source language document. The document translation method provided by the invention achieves a good document translation effect.
Inventors
- LIN CHAO
Assignees
- 科大讯飞股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20220829
Claims (12)
- 1. A document translation method, comprising: acquiring a source language document to be translated; acquiring document attribute information corresponding to each sentence contained in the source language document, wherein the document attribute information is information that is determined based on the corresponding sentence and/or a context sentence of the corresponding sentence and is used to assist in determining the meaning of the words of the corresponding sentence in the source language document; and translating each sentence contained in the source language document with the assistance of the document attribute information corresponding to that sentence, so as to obtain a target language translation corresponding to the source language document; wherein the document attribute information comprises one or more of temporal attribute information, ambiguity attribute information, domain attribute information and coreference attribute information; the temporal attribute information comprises a probability distribution of the corresponding sentence over each preset tense; the ambiguity attribute information comprises a disambiguation word set composed of words that are obtained from the context of the corresponding sentence and assist in disambiguating the corresponding sentence; the domain attribute information comprises a probability distribution of the corresponding sentence over each preset domain; and the coreference attribute information comprises a coreference word set composed of words that are obtained from the context of the corresponding sentence and have a coreference relationship with the corresponding sentence.
- 2. The document translation method according to claim 1, wherein the document attribute information comprises temporal attribute information and domain attribute information; and translating each sentence contained in the source language document with the assistance of the document attribute information corresponding to that sentence comprises: determining a target feature vector corresponding to each sentence contained in the source language document based on that sentence and on the temporal attribute information and domain attribute information corresponding to that sentence, wherein the target feature vector contains the sentence-level text information, temporal attribute information and domain attribute information of the corresponding sentence; and determining the target language translation corresponding to the source language document based on the target feature vectors corresponding to the sentences contained in the source language document.
- 3. The document translation method according to claim 2, wherein determining the target feature vector corresponding to each sentence contained in the source language document based on that sentence and on the temporal attribute information and domain attribute information corresponding to that sentence comprises, for a target sentence in the source language document whose target feature vector is to be determined: determining a temporal attribute representation vector corresponding to the target sentence based on the temporal attribute information corresponding to the target sentence and a representation vector of each preset tense, and determining a domain attribute representation vector corresponding to the target sentence based on the domain attribute information corresponding to the target sentence and a representation vector of each preset domain; fusing the temporal attribute representation vector corresponding to the target sentence with the domain attribute representation vector corresponding to the target sentence, and taking the fused vector as a control vector corresponding to the target sentence; obtaining a sentence-level text representation vector of the target sentence; and fusing the sentence-level text representation vector of the target sentence with the control vector corresponding to the target sentence, and taking the fused vector as the target feature vector corresponding to the target sentence.
- 4. The document translation method according to claim 3, wherein the document attribute information comprises ambiguity attribute information; and obtaining the sentence-level text representation vector of the target sentence comprises: concatenating, as a sentence, a disambiguation word sequence composed of the disambiguation words in the disambiguation word set corresponding to the target sentence with the target sentence to obtain a concatenated sentence; determining a representation vector of each word contained in the concatenated sentence; and determining the sentence-level text representation vector of the target sentence based on the representation vectors of the words contained in the concatenated sentence.
- 5. The document translation method according to claim 4, wherein determining the representation vector of each word contained in the concatenated sentence comprises, for each word in the concatenated sentence: acquiring a text representation vector of the word, a position representation vector of the word within its sentence, and a position representation vector of the sentence in which the word is located; and fusing the text representation vector of the word, the position representation vector of the word and the position representation vector of the sentence in which the word is located, and taking the fused vector as the representation vector of the word.
- 6. The document translation method according to claim 2, wherein the document attribute information comprises coreference attribute information; and determining the target language translation corresponding to the source language document based on the target feature vectors corresponding to the sentences contained in the source language document comprises: performing first-pass decoding on the target feature vectors corresponding to the sentences contained in the source language document, and caching the coreference word translations corresponding to each sentence contained in the source language document, wherein the coreference word translations corresponding to a sentence are the translations of the coreference words in the coreference word set corresponding to that sentence and are obtained through the first-pass decoding; and, in combination with the cached information, performing second-pass decoding on the target feature vectors corresponding to the sentences contained in the source language document, wherein the decoding result of the second-pass decoding is used as the target language translation corresponding to the source language document.
- 7. The document translation method according to claim 6, wherein performing, in combination with the cached information, second-pass decoding on the target feature vectors corresponding to the sentences contained in the source language document comprises, for each sentence contained in the source language document: acquiring the coreference word translation corresponding to the sentence from the cached information; and, at each decoding step, determining a context vector of the current decoding step according to the target feature vector corresponding to the sentence, and determining a decoding result of the current decoding step according to the coreference word translation corresponding to the sentence, the decoded word sequence and the context vector of the current decoding step; wherein the decoding results of all decoding steps form the decoding result corresponding to the sentence.
- 8. The document translation method according to claim 7, wherein determining the decoding result of the current decoding step according to the coreference word translation corresponding to the sentence, the decoded word sequence and the context vector of the current decoding step comprises: if there are a plurality of coreference word translations corresponding to the sentence, ordering the plurality of coreference word translations according to the order in which the corresponding coreference words occur in the source language document, to obtain a coreference word translation sequence corresponding to the sentence; and determining the decoding result of the current decoding step according to the coreference word translation sequence corresponding to the sentence, the decoded word sequence and the context vector of the current decoding step.
- 9. The document translation method according to any one of claims 1 to 8, wherein translating each sentence contained in the source language document with the assistance of the document attribute information corresponding to that sentence, so as to obtain a target language translation corresponding to the source language document, comprises: constructing a structured document based on the source language document and the document attribute information corresponding to each sentence contained in the source language document, wherein the structured document comprises structured information corresponding to each sentence contained in the source language document, and the structured information comprises the corresponding sentence and the document attribute information corresponding to that sentence; and processing the structured document with a pre-trained document translation model to obtain the target language translation corresponding to the source language document, wherein the document translation model is obtained by training with a structured document, constructed based on a training source language document and the document attribute information corresponding to each sentence contained in the training source language document, and with the real target language translation corresponding to the training source language document.
- 10. A document translation device, characterized by comprising a document acquisition module, a document understanding module and a document translation module; the document acquisition module is configured to acquire a source language document to be translated; the document understanding module is configured to acquire document attribute information corresponding to each sentence contained in the source language document, wherein the document attribute information is information that is determined based on the corresponding sentence and/or a context sentence of the corresponding sentence and is used to assist in determining the meaning of the words of the corresponding sentence in the source language document; the document translation module is configured to translate each sentence contained in the source language document with the assistance of the document attribute information corresponding to that sentence, so as to obtain a target language translation corresponding to the source language document; wherein the document attribute information comprises one or more of temporal attribute information, ambiguity attribute information, domain attribute information and coreference attribute information; the temporal attribute information comprises a probability distribution of the corresponding sentence over each preset tense; the ambiguity attribute information comprises a disambiguation word set composed of words that are obtained from the context of the corresponding sentence and assist in disambiguating the corresponding sentence; the domain attribute information comprises a probability distribution of the corresponding sentence over each preset domain; and the coreference attribute information comprises a coreference word set composed of words that are obtained from the context of the corresponding sentence and have a coreference relationship with the corresponding sentence.
- 11. Document translation equipment, comprising a memory and a processor; the memory is configured to store a program; and the processor is configured to execute the program to implement the steps of the document translation method according to any one of claims 1 to 9.
- 12. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the document translation method according to any one of claims 1 to 9.
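The two-pass decoding recited in claims 6 to 8 can be illustrated with a minimal sketch. The claims do not specify any decoder interface, so everything below is an assumption made for illustration: `first_pass` and `second_pass` are hypothetical callables standing in for the real decoder, words are split on whitespace, and the first pass is assumed to return a word-level alignment from which coreference word translations can be read off.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

# Hypothetical decoder interfaces (assumptions, not taken from the claims).
# first pass:  sentence -> (draft translation, {source word: target word})
FirstPass = Callable[[str], Tuple[str, Dict[str, str]]]
# second pass: (sentence, ordered coreference word translations) -> translation
SecondPass = Callable[[str, Sequence[str]], str]


def two_pass_translate(
    sentences: List[str],
    coref_sets: List[Set[str]],
    first_pass: FirstPass,
    second_pass: SecondPass,
) -> List[str]:
    """coref_sets[i] is the coreference word set of sentences[i]."""
    # First pass: decode every sentence and pool the word-level translations.
    word_translations: Dict[str, str] = {}
    for sentence in sentences:
        _draft, aligned = first_pass(sentence)
        for source_word, target_word in aligned.items():
            word_translations.setdefault(source_word, target_word)

    # Position of each word's first occurrence in the source document
    # (whitespace tokenization is an illustrative simplification).
    first_seen: Dict[str, int] = {}
    for position, word in enumerate(" ".join(sentences).split()):
        first_seen.setdefault(word, position)

    # Cache, per sentence, the translations of its coreference words,
    # ordered by where those words first occur in the source document.
    cache: List[List[str]] = []
    for words in coref_sets:
        known = [w for w in words if w in word_translations]
        known.sort(key=lambda w: first_seen.get(w, len(first_seen)))
        cache.append([word_translations[w] for w in known])

    # Second pass: decode each sentence again with its cached hints.
    return [second_pass(s, hints) for s, hints in zip(sentences, cache)]
```

In this sketch the cache is built once after the first pass and is then read-only, so the second pass for different sentences could run independently of each other.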
Description
Document translation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of translation technologies, and in particular to a document translation method, device, equipment and storage medium.
Background
Document translation is the process of converting a document in one natural language (the source language) into another natural language (the target language). To obtain a better translation effect, context sentences are often introduced during translation. To improve the translation effect substantially, some current document translation schemes introduce rich context (for example, a large number of context sentences of the sentence to be translated), but introducing rich context causes huge resource consumption. To reduce resource consumption, other current schemes restrict the context to one or a few sentences before and after the sentence to be translated; this strategy reduces resource consumption but degrades the translation effect. Current document translation schemes therefore cannot balance the document translation effect against resource consumption, and as a result document translation cannot be truly deployed in practice.
Disclosure of Invention
In view of this, the present invention provides a document translation method, device, equipment and storage medium to solve the problem that current document translation schemes cannot balance the document translation effect against resource consumption. The technical scheme is as follows:
A document translation method, comprising: acquiring a source language document to be translated; acquiring document attribute information corresponding to each sentence contained in the source language document, wherein the document attribute information is information that is determined based on the corresponding sentence and/or a context sentence of the corresponding sentence and is used to assist in determining the meaning of the words of the corresponding sentence in the source language document; and translating each sentence contained in the source language document with the assistance of the document attribute information corresponding to that sentence, so as to obtain a target language translation corresponding to the source language document.
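The three steps of the method (acquire the document, derive per-sentence attribute information, translate with that information) can be sketched as a small pipeline. This is only an illustrative outline under stated assumptions: `understand` and `translate` are hypothetical callables, and the attribute information is left as an opaque type because the method itself does not fix its form at this point.

```python
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")  # opaque type of the per-sentence document attribute information


def translate_document(
    sentences: Sequence[str],
    understand: Callable[[int, Sequence[str]], A],
    translate: Callable[[str, A], str],
) -> List[str]:
    """Translate a document sentence by sentence, assisted by attributes.

    `understand(i, sentences)` (hypothetical) sees the whole document, so it
    can use sentence i and/or its context sentences to derive attributes.
    `translate(sentence, attributes)` (hypothetical) sees only one sentence
    plus its attributes, not the surrounding context.
    """
    # Step 2: derive attribute information for every sentence.
    attributes = [understand(i, sentences) for i in range(len(sentences))]
    # Step 3: translate each sentence with the help of its own attributes.
    return [translate(s, a) for s, a in zip(sentences, attributes)]
```

The point of the split is that only the attribute step looks at context, while the translation step receives a compact summary of it rather than the context sentences themselves.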
Optionally, the document attribute information comprises one or more of temporal attribute information, ambiguity attribute information, domain attribute information and coreference attribute information; the temporal attribute information comprises a probability distribution of the corresponding sentence over each preset tense; the ambiguity attribute information comprises a disambiguation word set composed of words that are obtained from the context of the corresponding sentence and assist in disambiguating the corresponding sentence; the domain attribute information comprises a probability distribution of the corresponding sentence over each preset domain; and the coreference attribute information comprises a coreference word set composed of words that are obtained from the context of the corresponding sentence and have a coreference relationship with the corresponding sentence.
Optionally, the document attribute information comprises temporal attribute information and domain attribute information; and translating each sentence contained in the source language document with the assistance of the document attribute information corresponding to that sentence comprises: determining a target feature vector corresponding to each sentence contained in the source language document based on that sentence and on the temporal attribute information and domain attribute information corresponding to that sentence, wherein the target feature vector contains the sentence-level text information, temporal attribute information and domain attribute information of the corresponding sentence; and determining the target language translation corresponding to the source language document based on the target feature vectors corresponding to the sentences contained in the source language document.
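As a rough illustration of the four attribute types and of a target feature vector that carries text, tense and domain information, consider the sketch below. The text does not say how the vectors are combined, so two choices here are assumptions: each probability distribution is turned into a vector by a probability-weighted sum of hypothetical per-label embeddings, and the parts are combined by concatenation. All names (`SentenceAttributes`, `expected_vector`, `target_feature_vector`) are invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SentenceAttributes:
    """Hypothetical container for one sentence's document attribute information."""
    tense: Dict[str, float]    # probability distribution over preset tenses
    domain: Dict[str, float]   # probability distribution over preset domains
    disambiguation_words: List[str] = field(default_factory=list)
    coreference_words: List[str] = field(default_factory=list)


def expected_vector(
    distribution: Dict[str, float], embeddings: Dict[str, List[float]]
) -> List[float]:
    """Probability-weighted sum of label embeddings (an assumed fusion rule)."""
    dim = len(next(iter(embeddings.values())))
    result = [0.0] * dim
    for label, probability in distribution.items():
        for k, value in enumerate(embeddings[label]):
            result[k] += probability * value
    return result


def target_feature_vector(
    text_vector: List[float],
    attrs: SentenceAttributes,
    tense_embeddings: Dict[str, List[float]],
    domain_embeddings: Dict[str, List[float]],
) -> List[float]:
    """Concatenate text, tense and domain vectors (concatenation is assumed)."""
    control = expected_vector(attrs.tense, tense_embeddings) + expected_vector(
        attrs.domain, domain_embeddings
    )
    return text_vector + control
```

Other fusion rules, such as element-wise addition after projecting to a common dimension, would fit the same description equally well.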
Optionally, determining the target feature vector corresponding to each sentence contained in the source language document based on that sentence and on the temporal attribute information and domain attribute information corresponding to that sentence comprises: Aiming at a target sentence of a corresponding target feature vector