EP-4736059-A1 - REALTIME AI SIGN LANGUAGE TRANSLATION WITH AVATAR

Abstract

Disclosed herein are method and system aspects for translating between a sign language and a target language and presenting such translations. For example, a method receives input language data and translates the input language data into sign language grammar. The method retrieves phonetic representations that correspond to the sign language grammar from a sign language database and generates coordinates from the phonetic representations using a generative network. The phonetic representations are digital representations of individual signs created through manual input of body configuration information corresponding to the individual signs. Further, the method renders an avatar that moves between the coordinates. In another example, a bidirectional communication system allows for realtime communication between a signing entity and a non-signing entity.
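
The following minimal, runnable Python sketch mirrors the data flow summarized above: input text is reduced to sign-language gloss, each gloss is looked up in a toy database of phonetic representations, and pose coordinates are produced by interpolating between per-sign keyframes as a stand-in for the generative network. Every identifier, field, and value here is a hypothetical placeholder; the disclosure does not specify concrete data formats.

    from dataclasses import dataclass

    @dataclass
    class Phoneme:
        """Toy stand-in for a manually entered phonetic representation."""
        gloss: str
        handshape: str
        keyframe: tuple  # one 3-D hand position per sign

    # Hypothetical sign database keyed by gloss.
    SIGN_DB = {
        "HELP": Phoneme("HELP", "open-B", (0.0, 0.5, 0.2)),
        "YOU": Phoneme("YOU", "index", (0.4, 0.6, 0.5)),
    }

    def to_sign_grammar(text):
        # Placeholder for the full tokenize/lemmatize/transduce pipeline;
        # here we simply keep the words that have a database entry.
        return [w.upper() for w in text.split() if w.upper() in SIGN_DB]

    def generate_coordinates(phonemes, steps=3):
        # Stand-in for the generative network: linearly interpolate between
        # consecutive keyframes so the avatar "moves between" the coordinates.
        frames = []
        for a, b in zip(phonemes, phonemes[1:]):
            for t in range(steps + 1):
                u = t / steps
                frames.append(tuple(x + u * (y - x)
                                    for x, y in zip(a.keyframe, b.keyframe)))
        return frames

    glosses = to_sign_grammar("help you")
    frames = generate_coordinates([SIGN_DB[g] for g in glosses])
    for f in frames:
        print("avatar pose:", f)  # a renderer would drive joints from these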

Inventors

  • KELLY, NIKOLAS ANTHONY
  • WILKINS, NICHOLAS
  • PAYANO, YAMILLET

Assignees

  • Sign-Speak Inc.

Dates

Publication Date
2026-05-06
Application Date
2024-06-28

Claims (20)

  1. A method, comprising: receiving input language data; translating the input language data to sign language grammar; retrieving phonetic representations that correspond to the sign language grammar; generating coordinates from the phonetic representations using a generative network; and rendering an avatar or video that moves between the coordinates.
  2. The method of claim 1, wherein translating the input language data to sign language grammar comprises: tokenizing the input language data into individual words; removing punctuation, determiners, or predetermined vocabulary from the individual words to form a resulting string; reducing the resulting string according to a lemmatization scheme to form a lemmatized string; and performing a transduction on the lemmatized string to produce sign language grammar.
  3. The method of claim 2, wherein performing the transduction comprises performing a tree transduction using synchronous grammar models configured to parse the lemmatized string using parsing algorithms.
  4. The method of claim 1, wherein generating the coordinates comprises training the generative network to generate the coordinates based on artificial coordinates.
  5. The method of claim 1, wherein the phonetic representations are digital representations of individual signs created through manual input of body configuration information corresponding to the individual signs.
  6. The method of claim 1, wherein the body configuration information comprises at least one of hand symmetry, handshape, palm hand orientation, finger vectors, hand position source, hand position destination, global motion of the hand, local motion of the hand, or mouth morpheme.
  7. A system, comprising: a processor; and a memory, wherein the memory contains instructions stored thereon that when executed by the processor cause the processor to: receive input language data; translate the input language data to sign language grammar; retrieve phonetic representations that correspond to the sign language grammar, wherein the phonetic representations are digital representations of individual signs created through manual input of body configuration information corresponding to the individual signs; generate coordinates from the phonetic representations using a generative network; and render an avatar or video that moves between the coordinates.
  8. The system of claim 7, wherein to translate the input language data to sign language grammar, the processor: tokenizes the input language data into individual words; removes punctuation, determiners, or predetermined vocabulary from the individual words to form a resulting string; reduces the resulting string according to a lemmatization scheme to form a lemmatized string; and performs a transduction on the lemmatized string to produce sign language grammar.
  9. The system of claim 8, wherein to perform the transduction, the processor performs a tree transduction using synchronous grammar models configured to parse the lemmatized string using parsing algorithms.
  10. The system of claim 7, wherein to generate the coordinates, the processor trains the generative network to generate the coordinates based on artificial coordinates.
  11. The system of claim 10, wherein to train the generative network, the processor determines an accuracy of the artificial coordinates using a discriminator to measure coordinates detected through pose recognition.
  12. The system of claim 7, wherein the body configuration information comprises at least one of hand symmetry, handshape, palm hand orientation, finger vectors, hand position source, hand position destination, global motion of the hand, local motion of the hand, or mouth morpheme.
  13. A system, comprising: a camera configured to capture an image; a computing device coupled to the camera, the computing device comprising: a display; a processor; and a memory, wherein the memory contains instructions stored thereon that when executed by the processor cause the processor to: translate sign language in the image to a target language output, comprising: capturing the image; detecting pose information from the image; converting the pose information into a feature vector; converting the feature vector into a sign language string; and translating the sign language string into the target language output; and present a sign language translation of a target language input on the display, comprising: receiving the target language input; translating the target language input to sign language grammar; retrieving phonetic representations that correspond to the sign language grammar; generating coordinates from the phonetic representations using a generative network; rendering an avatar or video that moves between the coordinates; and presenting the avatar or video on the display.
  14. The system of claim 13, wherein to translate the target language input to sign language grammar, the processor: tokenizes the target language input into individual words; removes punctuation, determiners, or predetermined vocabulary from the individual words to form a resulting string; reduces the resulting string according to a lemmatization scheme to form a lemmatized string; and performs a transduction on the lemmatized string to produce sign language grammar.
  15. The system of claim 14, wherein the transduction is based on synchronous grammar models configured to parse the lemmatized string using parsing algorithms.
  16. The system of claim 13, wherein the generative network is configured to be trained to generate the coordinates based on artificial coordinates.
  17. The system of claim 16, further comprising a discriminator configured to measure coordinates detected through pose recognition to determine the accuracy of the artificial coordinates.
  18. The system of claim 13, wherein the phonetic representations are digital representations of individual signs created through a manual input of body configuration information corresponding to the individual signs, and wherein the body configuration information comprises at least one of hand symmetry, handshape, palm hand orientation, finger vectors, hand position source, hand position destination, global motion of the hand, local motion of the hand, or mouth morpheme.
  19. The system of claim 13, wherein to convert the feature vector, the processor applies a Convolutional Neural Network configured to output one or more flag values associated with an intrasign region, an intersign region, or a non-signing region, and wherein the one or more flag values correspond to an individual sign.
  20. The system of claim 13, wherein to convert the feature vector, the processor: splits the feature vector into individual regions; and processes the individual regions into a sign language string.
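
The normalization pipeline recited in claims 2, 8, and 14 (tokenize, strip punctuation and determiners, lemmatize, then transduce into sign-language word order) can be pictured with the toy Python sketch below. The suffix-stripping "lemmatizer" and the time-adverb-fronting "transduction" rule are illustrative assumptions; the claims instead describe a tree transduction over synchronous grammar models.

    import re

    DETERMINERS = {"a", "an", "the", "this", "that"}
    TIME_WORDS = {"yesterday", "today", "tomorrow"}

    def tokenize(text):
        # Splitting on non-letters drops punctuation as a side effect.
        return re.findall(r"[a-zA-Z']+", text.lower())

    def lemmatize(word):
        # Crude suffix stripping; a real system would apply a proper
        # lemmatization scheme.
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def transduce(lemmas):
        # Toy reordering rule: front time adverbs, loosely echoing the
        # time-topic-comment order common in ASL.
        times = [w for w in lemmas if w in TIME_WORDS]
        rest = [w for w in lemmas if w not in TIME_WORDS]
        return [w.upper() for w in times + rest]

    def to_sign_grammar(text):
        tokens = [t for t in tokenize(text) if t not in DETERMINERS]
        return transduce([lemmatize(t) for t in tokens])

    print(to_sign_grammar("I washed the car yesterday."))
    # -> ['YESTERDAY', 'I', 'WASH', 'CAR']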
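
Claims 6, 12, and 18 enumerate the body configuration fields that make up a manually entered phonetic representation. One plausible encoding is sketched below; the field names, value types, and example values are assumptions chosen for illustration, since the claims only list the categories.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class PhoneticRepresentation:
        gloss: str
        symmetry: str                  # e.g. "mirrored", "alternating", "one-handed"
        handshape: str                 # e.g. "open-B", "index", "claw-5"
        palm_orientation: str          # e.g. "up", "down", "toward-signer"
        finger_vectors: list = field(default_factory=list)  # per-finger direction vectors
        position_source: tuple = (0.0, 0.0, 0.0)       # where the hand starts
        position_destination: tuple = (0.0, 0.0, 0.0)  # where the hand ends
        global_motion: str = "none"    # whole-hand path, e.g. "arc", "straight"
        local_motion: str = "none"     # within-hand motion, e.g. "wiggle", "twist"
        mouth_morpheme: Optional[str] = None  # e.g. "cha" for large objects

    HELP = PhoneticRepresentation(
        gloss="HELP",
        symmetry="one-handed",
        handshape="open-B",
        palm_orientation="up",
        position_source=(0.0, 0.4, 0.2),
        position_destination=(0.0, 0.6, 0.3),
        global_motion="straight-up",
    )
    print(HELP)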
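
Claims 10 through 11 and 16 through 17 describe training the generative network on artificial coordinates scored by a discriminator against coordinates detected through pose recognition, which is the classic generative-adversarial arrangement. The PyTorch sketch below shows that loop under stated assumptions: the dimensions, network shapes, losses, and random stand-in batches are all invented for illustration.

    import torch
    import torch.nn as nn

    PHONEME_DIM, COORD_DIM = 16, 42  # assumption: 21 hand keypoints * (x, y)

    generator = nn.Sequential(
        nn.Linear(PHONEME_DIM, 64), nn.ReLU(), nn.Linear(64, COORD_DIM))
    discriminator = nn.Sequential(
        nn.Linear(COORD_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(200):
        phonemes = torch.randn(32, PHONEME_DIM)  # stand-in phoneme encodings
        real = torch.randn(32, COORD_DIM)        # stand-in pose-recognition output

        # Discriminator: real pose-recognition coordinates vs. artificial ones.
        fake = generator(phonemes).detach()
        d_loss = (bce(discriminator(real), torch.ones(32, 1))
                  + bce(discriminator(fake), torch.zeros(32, 1)))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # Generator: push artificial coordinates toward being scored as real.
        g_loss = bce(discriminator(generator(phonemes)), torch.ones(32, 1))
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()

At inference time only the trained generator would run, mapping phonetic representations to the coordinates that drive the avatar, consistent with the separate training and inference processes of FIGS. 3 and 4.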
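
For the recognition direction of claims 13, 19, and 20, per-frame feature vectors are flagged as belonging to an intrasign, intersign, or non-signing region, and contiguous intrasign runs are then split out as individual signs. The toy segmentation below illustrates that flow; the motion-energy threshold is a deliberate stand-in for the Convolutional Neural Network of claim 19, and the mapping from regions to glosses is omitted.

    INTRASIGN, INTERSIGN, NONSIGNING = "intra", "inter", "none"

    def flag_frames(motion_energy):
        # Stand-in classifier: high motion inside a sign, low motion in the
        # transition between signs, near-zero when no one is signing.
        flags = []
        for e in motion_energy:
            if e > 0.5:
                flags.append(INTRASIGN)
            elif e > 0.1:
                flags.append(INTERSIGN)
            else:
                flags.append(NONSIGNING)
        return flags

    def split_regions(flags):
        # Claim 20: split the flagged feature sequence into individual
        # (start, end) sign regions that a recognizer would map to glosses.
        regions, start = [], None
        for i, f in enumerate(flags + [NONSIGNING]):  # sentinel closes the last run
            if f == INTRASIGN and start is None:
                start = i
            elif f != INTRASIGN and start is not None:
                regions.append((start, i))
                start = None
        return regions

    energy = [0.0, 0.7, 0.8, 0.2, 0.9, 0.6, 0.05]
    print(split_regions(flag_frames(energy)))  # -> [(1, 3), (4, 6)]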

Description

REALTIME AI SIGN LANGUAGE TRANSLATION WITH AVATAR

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Patent Application No. 18/216,917, filed on June 30, 2023, which is incorporated by reference in its entirety.

FIELD

[0002] Aspects of the present disclosure relate to components, systems, and methods for translation between a spoken or written language and a sign language, and the presentation of such translations.

BACKGROUND

[0003] Deaf individuals typically have little or no functional hearing, and hard-of-hearing (HoH) individuals typically have hearing loss that can be partially mitigated by an auditory device. Deaf and HoH individuals can communicate using a sign language. A sign language is a visual communication system. There may be over 200 different sign languages, such as American Sign Language (ASL), British Sign Language (BSL), or German Sign Language (DGS). For both Deaf and HoH individuals, systems for translating between a sign language and a spoken or written language can improve daily interactions.

BRIEF DESCRIPTION OF THE DRAWINGS

[0004] The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate aspects of the present disclosure and, together with the description, further serve to explain the principles of the disclosure and to enable a person skilled in the pertinent art to make and use the disclosure.

[0005] FIG. 1 is a block diagram of a system for generating and presenting a sign language translation of source text in an input language, according to some aspects of the present disclosure.

[0006] FIG. 2 is an illustration of a user interface used when translating source text in an input language to sign language grammar, according to some aspects of the present disclosure.

[0007] FIG. 3 is a block diagram of a training process for a generative network generating coordinates from phonetic representations, according to some aspects of the present disclosure.

[0008] FIG. 4 is a block diagram of an inference process for a generative network used when generating coordinates from phonetic representations, according to some aspects of the present disclosure.

[0009] FIG. 5 is an illustration of an avatar presenting a sign language translation of source text in an input language, according to some aspects of the present disclosure.

[0010] FIG. 6 is a flowchart of a method for presenting a sign language translation of source text in an input language, according to some aspects of the present disclosure.

[0011] FIGS. 7A and 7B are block diagrams of an example system for detecting and translating a sign language input to an output language, according to some aspects of the present disclosure.

[0012] FIG. 8 is an illustration of a user interface of a bidirectional communication system allowing for realtime communication between a signing entity and a non-signing entity, according to some aspects of the present disclosure.

[0013] FIG. 9 is a block diagram of an example computer system useful for implementing various aspects.

[0014] In the drawings, like reference numbers generally indicate identical or similar elements. Additionally, generally the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.

[0015] Aspects of the present disclosure will be described with reference to the accompanying drawings.
DETAILED DESCRIPTION

[0016] Provided herein are apparatus, device, system, and/or method aspects, and/or combinations and sub-combinations thereof, for translating between a sign language and a spoken or written language, and the presentation of such translations.

[0017] For many Deaf and hard-of-hearing (HoH) individuals, American Sign Language (ASL) is their first language. ASL is a language with significantly different grammatical structures from English. For example, ASL has a distinct grammatical structure whose morphological inflection is not easily captured by annotation mechanisms. Signs for “Help You,” “Help Me,” and “Help Them” are all inflected differently. In another example, ASL has constructs called “classifiers” that have no equivalent structure in spoken language. Classifiers are visual depictions of certain words in three-dimensional space. In yet another example, ASL has no written form, making lexicalization difficult. These sign language characteristics pose challenges when translating from English to ASL, including how the translations are presented to a user. As such, when translating between ASL and a spoken or written language, it is useful to present ASL in a video format. This is also true of translations involving other sign languages, such as British Sign Language, which are within the scope of the present disclosure.

[0018] Systems can use a word-level or a motion-capture method for presenting sign language translations to a user. A word-level method presents a sign language translation through an avatar on a per-word basis. Howev