CN-122021670-A - Multimode translation equipment based on scene perception and transparent display
Abstract
The invention discloses a multi-modal translation device based on scene perception and transparent display, relating to the technical field of natural language processing. The device comprises a base, a transparent display unit, image collectors, voice interaction modules and a central processing unit: the transparent display unit is arranged at the top of the base, the two image collectors are symmetrically arranged on the two sides of the base in the length direction, the two voice interaction modules are symmetrically arranged on the two sides of the base in the width direction, and the central processing unit is arranged inside the base. By introducing a visual modality into the translation device and using it to assist semantic understanding, the device improves the contextual adaptability of translation and resolves word ambiguity. Because the display unit is transparent, the user can maintain eye contact with the other party while reading the translated captions, capturing their facial expressions and body language. This greatly improves the naturalness of communication and enhances the practicality and effectiveness of the device.
Inventors
- LI YIZHEN
- WU LI
Assignees
- 武汉城市职业学院
Dates
- Publication Date
- 20260512
- Application Date
- 20260108
Claims (9)
- 1. A multi-modal translation device based on scene perception and transparent display, comprising a base (1), a transparent display unit (2) arranged at the top of the base (1), two image collectors (3) symmetrically arranged on the two sides of the base (1) in the length direction, two voice interaction modules (5) symmetrically arranged on the two sides of the base (1) in the width direction, and a central processing unit (4) arranged in the base (1); the base (1) comprises a bottom shell (11) and a top cover (12) arranged at the top of the bottom shell (11), wherein the top cover (12) comprises a flat plate (121) positioned above the bottom shell (11) and an inclined panel (122) arranged outside the flat plate (121) and connected with the top edge of the bottom shell (11); the transparent display unit (2) is vertically arranged on the top surface of the flat plate (121), is transparent on both sides, allows sight to pass through, and supports text display; the image collectors (3) are arranged on the two sides of the inclined panel (122) in the length direction and are used for collecting environmental image data around the device or for face tracking; the voice interaction modules (5) are used for collecting voice signals and playing audio; the central processing unit (4) is arranged in the base (1) and is electrically connected with the transparent display unit (2), the image collectors (3) and the voice interaction modules (5), and a neural machine translation model is provided in the central processing unit (4); the central processing unit (4) is configured to perform the steps of: receiving the environmental image data transmitted by the image collectors (3), identifying key object features in the image, and determining a domain weight coefficient of the current dialogue scene based on the identification result; receiving a voice signal transmitted by the voice interaction module (5) and converting it into a source text; inputting the source text into the neural machine translation model and introducing the domain weight coefficient as a context constraint to generate a target text; and driving the transparent display unit (2) to display the target text on the side facing the recipient.
- 2. The multi-modal translation device based on scene perception and transparent display according to claim 1, characterized in that, when determining the domain weight coefficient, the central processing unit (4) specifically performs the following steps: matching the identified key object features against a pre-stored industry feature library, wherein the industry feature library comprises at least one of the scene tags of catering, medical treatment, business meetings, transportation and daily social contact; and, if a specific industry's features are matched, increasing the selection probability of that industry's terms in the translation model.
- 3. The multi-modal translation device based on scene perception and transparent display according to claim 1, further comprising a manual correction interactive interface; when the user rejects the current domain weight coefficient by touching the transparent display unit (2) or by a gesture instruction, the central processing unit (4) switches to a general translation mode or locks to another specific domain weight coefficient according to the user's selection.
- 4. The multi-modal translation device based on scene perception and transparent display according to claim 1, wherein the central processing unit (4) distinguishes the sound-source direction of the voice signal through beamforming, automatically determines the identity of the current speaker, and controls the transparent display unit (2) to display the target text on the side facing away from the speaker.
- 5. The multi-modal translation device based on scene perception and transparent display according to claim 1, characterized in that, when the transparent display unit (2) displays the target text to a recipient, the central processing unit (4) is further configured to track the face positions of both parties to the conversation in real time from the image data acquired by the image collectors (3), and to dynamically adjust the display coordinates of the target text on the transparent display unit (2) so that the text display position keeps a preset relative spatial relationship with the speaker's face position.
- 6. The multi-modal translation device based on scene perception and transparent display according to claim 1, characterized in that it further comprises an attitude sensor mounted inside the base (1) and adapted to detect physical attitude changes of the device; when the attitude sensor detects that the base (1) has been overturned or rotated, the central processing unit (4) is configured to automatically adjust the display direction of the text on the transparent display unit (2) according to the sensor's measurement data, so that the text always faces the observer.
- 7. The multi-modal translation device based on scene perception and transparent display according to claim 1, further comprising a physical interaction interface (6), wherein the physical interaction interface (6) comprises a power key (61), a connection port (62), a switching key (63) and a display control key (64); the power key (61) is used for controlling start and stop of the device, the connection port (62) is used for connecting an external device, the switching key (63) is used for controlling mode switching so as to switch the current domain weight coefficient to another specific domain weight coefficient, and the display control key (64) is used for switching the transparent display unit (2) between a transparent state and an opaque state.
- 8. The multi-modal translation device based on scene perception and transparent display according to claim 1, wherein the voice interaction module (5) comprises two microphone arrays (51) symmetrically mounted on the two sides of the inclined panel (122) in the width direction and four speakers (52) symmetrically mounted on the two sides of the bottom shell (11) in the width direction, and the two speakers (52) located on the same side of the bottom shell (11) in the width direction are arranged symmetrically about the transparent display unit (2).
- 9. The multi-modal translation device based on scene perception and transparent display according to claim 8, characterized in that the speakers (52) are used for auxiliary voice interaction and system control, or for playing the translated audio in cooperation with the transparent display unit (2).
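The processing flow recited in claims 1, 2 and 4 can be sketched in Python. Everything below is an illustrative stand-in rather than the patented implementation: the function names, the feature-library contents, the weight values and the one-line beamforming stub are all assumptions introduced for clarity.

```python
# Illustrative sketch of the claimed processing loop (claims 1, 2 and 4).
# All names, values and library contents are hypothetical stand-ins.

INDUSTRY_FEATURES = {               # pre-stored industry feature library (claim 2)
    "catering": {"menu", "plate", "waiter"},
    "medical": {"stethoscope", "gurney"},
    "business_meeting": {"projector", "whiteboard"},
}

def domain_weight(detected_objects):
    """Match key object features to a scene tag and return a weight coefficient."""
    detected = set(detected_objects)
    scores = {tag: len(feats & detected) for tag, feats in INDUSTRY_FEATURES.items()}
    tag = max(scores, key=scores.get)
    # No match -> fall back to a general (neutral) domain, as in claim 3's general mode.
    return (tag, 1.5) if scores[tag] > 0 else ("general", 1.0)

def translate(source_text, domain):
    """Stand-in for the neural MT model taking the domain as a context constraint."""
    tag, weight = domain
    # A real model might prepend a domain token or rescore domain terms by `weight`;
    # here we just tag the output and reverse the text as a dummy 'translation'.
    return f"[{tag}:{weight}] {source_text[::-1]}"

def display_side(mic_delay_samples):
    """Claim 4: show text on the side facing away from the speaker, inferred
    from the sign of the inter-array delay (trivial beamforming stub)."""
    return "side_B" if mic_delay_samples > 0 else "side_A"

# One pass of the device loop:
domain = domain_weight(["menu", "plate"])        # from the image collectors (3)
target = translate("where is my bill", domain)   # source text from the voice module (5)
side = display_side(+3)                          # from the microphone arrays (51)
```

The fallback to a neutral `("general", 1.0)` coefficient mirrors the general translation mode of claim 3 when no industry features match.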
Description
Multimode translation equipment based on scene perception and transparent display

Technical Field

The invention relates to the technical field of natural language processing, in particular to a multi-modal translation device based on scene perception and transparent display.

Background

With the acceleration of global economic integration, cross-language communication has become increasingly frequent. To overcome the language barrier, various auxiliary translation devices (such as smartphone translation apps, handheld translators and translation earphones) have come into wide use. Mainstream translation devices generally follow the technical path of "speech recognition (ASR) + machine translation (MT) + speech synthesis (TTS)": speech is collected through a microphone, converted into source-language text, translated into target-language text, and then played or displayed. Although this prior art meets basic communication needs to some extent, it still has the following significant drawbacks in practical use:

1. Lack of multi-modal environment awareness. A traditional translation device relies on voice input alone and cannot perceive the physical scene in which the dialogue takes place, so ambiguous words are mistranslated across contexts. For example, the English word "check" means "bill" in a restaurant scene, "check-in/check-out" in a hotel scene, and "examination" in a medical scene; the meanings differ significantly between scenes, and voice input alone cannot determine the intended sense.
2. The interaction mode hinders natural communication. Most existing translation devices display results on an opaque screen, so the user must lower their head or shift their gaze to read the translation, blocking eye contact between the two parties and reducing the naturalness of, and trust in, the conversation.

The application therefore provides a bidirectional translation device that fuses environmental visual information into semantic understanding and supports natural eye contact.

Disclosure of Invention

In view of the above, the present invention provides a multi-modal translation device based on scene perception and transparent display that combines natural language processing (NLP) with computer vision (CV), so that the device can use environmental image information to assist semantic understanding and thereby improve the contextual adaptability and reliability of translation.
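The "check" example in the background can be made concrete: when the scene recognizer reports a restaurant, the score of the "bill" sense is boosted before the most probable sense is chosen. The following is a minimal sketch under an assumed sense inventory, base scores and boost factor; it is not the patented model.

```python
# Minimal scene-weighted disambiguation sketch for the ambiguous word "check".
# The sense inventory, base log-probabilities and boost factor are assumptions.

import math

SENSES = {  # base log-probabilities a translation model might assign
    "bill": -1.1,
    "check_in_out": -1.0,
    "examination": -1.3,
}
SCENE_SENSE = {"catering": "bill", "hotel": "check_in_out", "medical": "examination"}

def pick_sense(scene_tag, boost=2.0):
    """Boost the scene's preferred sense by log(boost), then take the argmax."""
    bonus = math.log(boost)
    scored = {s: lp + (bonus if SCENE_SENSE.get(scene_tag) == s else 0.0)
              for s, lp in SENSES.items()}
    return max(scored, key=scored.get)
```

With no recognized scene the base scores decide, while a recognized restaurant scene pulls "bill" ahead of the otherwise more probable "check_in_out" sense; this is one plausible realization of the domain weight coefficient acting as a context constraint.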
To achieve the above purpose, the invention provides a multi-modal translation device based on scene perception and transparent display, comprising a base, a transparent display unit arranged at the top of the base, two image collectors symmetrically arranged on the two sides of the base in the length direction, two voice interaction modules symmetrically arranged on the two sides of the base in the width direction, and a central processing unit arranged in the base. The base comprises a bottom shell and a top cover arranged at the top of the bottom shell, wherein the top cover comprises a flat plate positioned above the bottom shell and an inclined panel arranged outside the flat plate and connected with the top edge of the bottom shell. The transparent display unit is vertically arranged on the top surface of the flat plate; both of its sides are transparent, allowing sight to pass through, and it supports text display. The image collectors are arranged on the two sides of the inclined panel in the length direction and are used for collecting environmental image data around the device or for face tracking. The voice interaction modules are used for collecting voice signals and playing audio. The central processing unit is arranged in the base, is electrically connected with the transparent display unit, the image collectors and the voice interaction modules, and contains a neural machine translation model. The central processing unit is configured to perform the steps of: receiving the environmental image data transmitted by the image collectors, identifying key object features in the image, and determining a domain weight coefficient of the current dialogue scene based on the identification result; receiving a voice signal transmitted by the voice interaction module and converting it into a source text; inputting the source text into the neural machine translation model and introducing the domain weight coefficient as a context constraint to generate a target text; and driving the transparent display unit to display the target text on the side facing the recipient. Further, when determining the domain weight coefficient, the central processing unit specifically performs the following steps: matching the identified k