US-12619815-B2 - Magnitude invariant multimodal agent for efficient image-text interface automation

US12619815B2

Abstract

A system for magnitude-invariant image-text agentic interface automation is disclosed. A bit vectorization logic is configured to convert image patches in a plurality of image patches into magnitude-invariant bit vectors, and generate a plurality of lines of magnitude-invariant bit vectors. A tokenization logic is configured to translate the input text sequence into a sequence of input text tokens, and to translate the successive lines of magnitude-invariant bit vectors interleaved with a newline character into a sequence of input magnitude-invariant bit vector tokens. A linear projection logic is configured to linearly project a single token stream of the sequence of input text tokens and the sequence of input magnitude-invariant bit vector tokens into a decoder-only Transformer logic, wherein the linear projection of the single token stream bypasses any embedding lookup.
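The pipeline the abstract describes (extract patches line by line, convert each patch to a magnitude-invariant ±1 bit vector, interleave a newline marker between image lines, and linearly project the result into the Transformer without any embedding lookup) can be sketched in a few lines. This is only an illustrative sketch, not the patented implementation: the patch size, model width, and the random projection matrix and newline vector below are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def rgb555_bits(pixel):
    """Quantize an (R, G, B) 8-bit pixel to RGB555 and unpack the three
    5-bit values into a 15-entry vector whose entries are +1 or -1."""
    bits = []
    for channel in pixel:
        q = int(channel) >> 3                # keep the top 5 bits (RGB555)
        for i in range(4, -1, -1):           # most significant bit first
            bits.append(1.0 if (q >> i) & 1 else -1.0)
    return np.array(bits)

def image_to_bit_vector_lines(image, patch=2):
    """Extract square patches on a line-by-line basis; each patch becomes one
    magnitude-invariant bit vector of patch * patch * 15 entries."""
    h, w, _ = image.shape
    lines = []
    for y in range(0, h, patch):
        row = []
        for x in range(0, w, patch):
            block = image[y:y + patch, x:x + patch].reshape(-1, 3)
            row.append(np.concatenate([rgb555_bits(px) for px in block]))
        lines.append(row)
    return lines

# Hypothetical model parameters (learned in the real system).
rng = np.random.default_rng(0)
D_MODEL = 8
PATCH_DIM = 2 * 2 * 15
W = rng.standard_normal((PATCH_DIM, D_MODEL)) * 0.02   # linear projection
NEWLINE = rng.standard_normal(D_MODEL) * 0.02          # newline token vector

def to_token_stream(lines):
    """Project each bit vector directly (no embedding table lookup) and
    interleave the newline vector at the end of every image line."""
    stream = []
    for row in lines:
        stream.extend(vec @ W for vec in row)
        stream.append(NEWLINE)
    return np.stack(stream)
```

Because every entry of a bit vector is +1 or −1, all patch vectors share the same Euclidean norm regardless of pixel intensity, which is the "magnitude-invariant" property the claims rely on when the Transformer applies scale-modifying functions such as LayerNorm.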

Inventors

  • Curtis Hawthorne
  • Ulas Kirazci
  • Joe Gershenson
  • Shaya Zarkesh
  • Erich Elsen
  • Augustus Odena
  • Maxwell Nye
  • Arushi Somani
  • Kyle Vigen
  • Rohan Bavishi
  • Sagnak Tasirlar
  • Warut Vijitbenjaronk

Assignees

  • ANTHROPIC, PBC

Dates

Publication Date
2026-05-05
Application Date
2024-10-08

Claims (18)

  1. A system for magnitude-invariant image-text agentic interface automation, comprising: memory storing an input image and an input text sequence; patch extraction logic configured to extract image patches from the input image on a line-by-line basis, and generate a plurality of lines of image patches for the input image; bit vectorization logic configured to convert image patches in the plurality of lines of image patches into magnitude-invariant bit vectors, and generate a plurality of lines of magnitude-invariant bit vectors; newline insertion logic configured to interleave a newline character between successive lines of magnitude-invariant bit vectors in the plurality of lines of magnitude-invariant bit vectors, wherein the newline character specifies an end of a line in the input image; tokenization logic configured to translate the input text sequence into a sequence of input text tokens, and to translate the successive lines of magnitude-invariant bit vectors interleaved with the newline character into a sequence of input magnitude-invariant bit vector tokens; linear projection logic configured to linearly project a single token stream of the sequence of input text tokens and the sequence of input magnitude-invariant bit vector tokens into a decoder-only Transformer logic, wherein the linear projection of the single token stream bypasses any embedding lookup; and the decoder-only Transformer logic configured to process the linearly projected, embedding lookup-bypassed single token stream to generate a sequence of output tokens that are responsive to the input image and the input text sequence.
  2. The system of claim 1, wherein the bit vectorization logic is further configured to apply an RGB555 format compression to convert the image patches in the plurality of lines of image patches into the magnitude-invariant bit vectors, and generate the plurality of lines of magnitude-invariant bit vectors.
  3. The system of claim 2, wherein the RGB555 format compression produces three 5-bit values, one for each of subpixel channels R (red), G (green), and B (blue).
  4. The system of claim 3, wherein the three 5-bit values take either a 1 value or a −1 value.
  5. The system of claim 4, wherein the three 5-bit values are magnitude-invariant to scale modification functions of the decoder-only Transformer logic.
  6. The system of claim 5, wherein a layer normalization (LayerNorm) function is one of the scale modification functions of the decoder-only Transformer logic.
  7. The system of claim 1, wherein the bit vectorization logic is further configured to apply an RGB888 format compression to convert the image patches in the plurality of lines of image patches into the magnitude-invariant bit vectors, and generate the plurality of lines of magnitude-invariant bit vectors.
  8. The system of claim 7, wherein the RGB888 format compression produces three 8-bit values, one for each of subpixel channels R (red), G (green), and B (blue).
  9. The system of claim 8, wherein the three 8-bit values take either a 1 value or a −1 value.
  10. The system of claim 9, wherein the three 8-bit values are magnitude-invariant to scale modification functions of the decoder-only Transformer logic.
  11. The system of claim 10, wherein a layer normalization (LayerNorm) function is one of the scale modification functions of the decoder-only Transformer logic.
  12. The system of claim 1, wherein the bit vectorization logic is further configured to apply an RGB565 format compression to convert the image patches in the plurality of lines of image patches into the magnitude-invariant bit vectors, and generate the plurality of lines of magnitude-invariant bit vectors.
  13. The system of claim 12, wherein the RGB565 format compression produces 5-bit values for the R (red) and B (blue) subpixel channels and a 6-bit value for the G (green) subpixel channel.
  14. The system of claim 13, wherein the 5-bit and the 6-bit values take either a 1 value or a −1 value.
  15. The system of claim 14, wherein the 5-bit and the 6-bit values are magnitude-invariant to scale modification functions of the decoder-only Transformer logic.
  16. The system of claim 15, wherein a layer normalization (LayerNorm) function is one of the scale modification functions of the decoder-only Transformer logic.
  17. A system for magnitude-invariant image-text agentic interface automation, comprising: memory storing an input image; patch extraction logic configured to extract image patches from the input image on a line-by-line basis, and generate a plurality of lines of image patches for the input image; bit vectorization logic configured to convert image patches in the plurality of lines of image patches into magnitude-invariant bit vectors, and generate a plurality of lines of magnitude-invariant bit vectors; newline insertion logic configured to interleave a newline character between successive lines of magnitude-invariant bit vectors in the plurality of lines of magnitude-invariant bit vectors, wherein the newline character specifies an end of a line in the input image; tokenization logic configured to translate the successive lines of magnitude-invariant bit vectors interleaved with the newline character into a sequence of input magnitude-invariant bit vector tokens; linear projection logic configured to linearly project the sequence of input magnitude-invariant bit vector tokens into a decoder-only Transformer logic, wherein the linear projection of the sequence of input magnitude-invariant bit vector tokens bypasses any embedding lookup; and the decoder-only Transformer logic configured to process the linearly projected, embedding lookup-bypassed sequence of input magnitude-invariant bit vector tokens to generate a sequence of output tokens that are responsive to the input image.
  18. A system for magnitude-invariant image-text agentic interface automation, comprising: memory storing an input image; patch extraction logic configured to extract image patches from the input image on a line-by-line basis, and generate a plurality of lines of image patches for the input image; bit vectorization logic configured to convert image patches in the plurality of lines of image patches into magnitude-invariant bit vectors, and generate a plurality of lines of magnitude-invariant bit vectors; tokenization logic configured to translate lines of the plurality of lines of magnitude-invariant bit vectors into a sequence of input magnitude-invariant bit vector tokens; linear projection logic configured to linearly project the sequence of input magnitude-invariant bit vector tokens into a decoder-only Transformer logic; and the decoder-only Transformer logic configured to process the linearly projected sequence of input magnitude-invariant bit vector tokens to generate a sequence of output tokens that are responsive to the input image.
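The dependent claims cover three pixel-format variants: RGB555 (claims 2-6), RGB888 (claims 7-11), and RGB565 (claims 12-16). A minimal sketch (function names hypothetical) shows how each format's channel widths determine the length of the resulting ±1 bit vector: 15, 24, and 16 entries per pixel respectively. Since every entry is +1 or −1, the vector's magnitude depends only on its length, never on pixel intensity.

```python
def pack_channel(value, bits):
    """Quantize an 8-bit channel value to `bits` bits, then unpack the bits
    into a list of +1/-1 entries, most significant bit first."""
    q = value >> (8 - bits)
    return [1 if (q >> i) & 1 else -1 for i in range(bits - 1, -1, -1)]

# (R, G, B) channel bit widths for the three claimed compression formats.
FORMATS = {
    "RGB555": (5, 5, 5),   # 15 bits per pixel (claims 2-6)
    "RGB565": (5, 6, 5),   # 16 bits per pixel (claims 12-16)
    "RGB888": (8, 8, 8),   # 24 bits per pixel (claims 7-11)
}

def pixel_to_bit_vector(pixel, fmt="RGB555"):
    """Convert one (R, G, B) pixel into a magnitude-invariant +/-1 bit vector."""
    return [b
            for channel, width in zip(pixel, FORMATS[fmt])
            for b in pack_channel(channel, width)]
```

The ±1 encoding is what makes the vectors "magnitude-invariant to scale modification functions" such as LayerNorm: scaling or normalizing a vector whose entries all have unit magnitude cannot erase the sign pattern that carries the pixel information.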

Description

PRIORITY DATA

This patent application claims the benefit of and priority to the following eight U.S. Provisional Patent Applications:

  • U.S. Provisional Patent Application No. 63/567,667, titled "Persimmon-8B," filed Mar. 20, 2024;
  • U.S. Provisional Patent Application No. 63/567,681, titled "Adventure of the Errant Hardware," filed Mar. 20, 2024;
  • U.S. Provisional Patent Application No. 63/567,698, titled "Fuyu-8B: A Multimodal Architecture for AI Agents," filed Mar. 20, 2024;
  • U.S. Provisional Patent Application No. 63/567,721, titled "Adept Experiments," filed Mar. 20, 2024;
  • U.S. Provisional Patent Application No. 63/567,714, titled "Adept Fuyu-Heavy: A new multimodal model," filed Mar. 20, 2024;
  • U.S. Provisional Patent Application No. 63/638,613, titled "Adept Recorder," filed Apr. 25, 2024;
  • U.S. Provisional Patent Application No. 63/638,631, titled "Adept Workflow Language (AWL)," filed Apr. 25, 2024; and
  • U.S. Provisional Patent Application No. 63/638,644, titled "Adept Frankenmodel," filed Apr. 25, 2024.

The priority U.S. Provisional Patent Applications are incorporated herein by reference in their entirety and for all purposes as if completely and fully set forth herein.

FIELD OF THE TECHNOLOGY

The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems), including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the technology disclosed relates to automating artificial intelligence-based multimodal agentic workflows, specifically user interface-based multimodal agentic workflows.

BACKGROUND

The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section.
Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology. Deep learning is at the frontier of artificial intelligence and has seen great success in a wide variety of applications, such as natural language processing, speech recognition, medical applications, computer vision, and intelligent transportation systems. Much of this success is attributable to ever-larger models: their hundreds of millions of parameters provide enough degrees of freedom to produce remarkable descriptive capability. However, such large parameter counts require massive amounts of labeled training data, and improving model performance through data annotation faces two crucial challenges. On the one hand, the growth rate of available data lags far behind the growth rate of model parameters, which hinders further development of these models. On the other hand, new tasks emerge far faster than datasets can be updated, and annotating every sample is laborious. To tackle this challenge, new datasets are built by generating synthetic samples, thereby speeding up model iteration and reducing the cost of data annotation. Pre-training methods and transfer learning, such as Transformers, BERT, and GPT, have also been used to address this challenge, achieving impressive results. However, the generated data serves only as base data to initialize the model; obtaining a high-precision, usable model typically still requires labeling and updating task-specific data.
Integrating a priori knowledge into the learning framework is an effective way to deal with sparse data, since the learner does not need to induce that knowledge from the data itself. As special agents, humans possess rich prior knowledge; if machines can learn from human wisdom and knowledge, handling sparse data becomes easier. Human-in-the-loop (HITL) approaches address these issues by incorporating human knowledge into the modeling process. HITL aims to train an accurate prediction model at minimum cost by integrating human knowledge and experience. Humans can provide training data for machine learning applications and, with the help of machine-based approaches, directly accomplish tasks in the pipeline that are hard for computers. At present, there is still a high degree of coupling between deep learning tasks and data, and the performance of deep learning largely depends on the quality of the data. For a new task, obtaining better performance requires a large amount of high-quality labeled data. However, the labeled data requires a larg