US-12620248-B1 - Training-free framework for zero-shot check field detection
Abstract
A first Vision-Language Model (VLM) in a first branch identifies a first set of fields in input data using a visualized first set of bounding boxes (BBs). The first VLM labels and outputs a labeled first set of fields. A first agentic AI in the first branch localizes and outputs an identified field as a desired type of field using a visualized identified BB. A second VLM in a second branch identifies a second set of fields in the input data using a visualized second set of BBs. A Multimodal Large Language Model (MLLM) uses the input data with the second set of BBs to output a set of recognized fields within the second set of BBs. A second agentic AI in the second branch localizes and labels a target field. A training data set is formed by combining the input data, the labeled identified field, and the labeled target field.
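The two-branch labeling pipeline described in the abstract can be sketched as follows. This is a minimal illustration only: the function names, the stub models passed in as callables, and the "signature"/"micr" labels are assumptions for the sketch, not details from the patent.

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    # (x, y, width, height) in pixels, plus a text label once assigned
    x: int
    y: int
    w: int
    h: int
    label: str = ""

def first_branch(image, detect, localize, field_type):
    """Branch 1: a vision model proposes fields; an agentic step localizes
    the one matching the desired field type, which is then labeled."""
    fields = detect(image)            # first set of bounding boxes
    identified = localize(fields)     # pick the desired field
    identified.label = field_type     # labeled identified field
    return identified

def second_branch(image, detect, recognize, label):
    """Branch 2: a vision model proposes fields, a third model recognizes
    their contents, and an agentic step labels the target field."""
    fields = detect(image)                 # second set of bounding boxes
    recognized = recognize(image, fields)  # recognized fields within boxes
    return label(recognized)               # labeled target field

def build_labeled_data(image, identified, target):
    # Combine the input data with both labeled fields into one training record.
    return {"image": image, "fields": [identified, target]}
```

In a real system the `detect`, `recognize`, `localize`, and `label` callables would wrap the VLMs, the MLLM, and the agentic AIs; here they are placeholders so the data flow between the two branches is visible.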
Inventors
- Sourav Halder
- Jinjun Tong
- Xinyu Wu
Assignees
- U.S. BANK NATIONAL ASSOCIATION
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2025-11-07
Claims (20)
- 1 . A computer-implemented method, comprising: receiving input data, wherein the input data includes image data; providing the input data to a first branch comprising a first vision model; processing the input data with the first branch, wherein the processing includes using the first vision model to identify a first set of fields in the input data using a first set of bounding boxes; recognizing at least one field in the first set of fields as an identified field; labeling the recognized at least one field in the first set of fields as a labeled identified field; providing the input data to a second branch comprising a second vision model and a third model; processing the input data with the second branch, wherein the processing includes: using the second vision model to identify a second set of fields in the input data using a second set of bounding boxes; passing the input data with the second set of bounding boxes to the third model; and causing the third model to output a set of recognized fields within the bounding boxes of the second set of bounding boxes; recognizing at least one field in the second set of fields as a target field; labeling the recognized at least one field in the second set of fields as a labeled target field; and combining the input data, the labeled identified field, and the labeled target field to form labeled data.
- 2 . The computer implemented method of claim 1 , further comprising: training, using the labeled data, a model, the training configuring the model to detect a set of fields in a set of specimens.
- 3 . The computer implemented method of claim 2 , further comprising: deploying the model as a field detection model.
- 4 . The computer implemented method of claim 3 , further comprising: outputting from the field detection model an identification of a signature field in production data; and inputting a content of the signature field into a processing system as a part of processing one or more documents.
- 5 . The computer implemented method of claim 4 , further comprising: outputting from the trained field detection model an identification of at least one other field in the production data; and additionally inputting a content of the at least one other field into the processing system.
- 6 . The computer implemented method of claim 5 , wherein the at least one other field is one field selected from the group consisting of (i) a payer field, (ii) a payee field, (iii) a courtesy amount field, and (iv) a legal amount field.
- 7 . The computer implemented method of claim 1 , wherein the target field is a magnetic ink character recognition field; and wherein the method further comprises extending a bounding box bounding the magnetic ink character recognition field to an edge of the input data.
- 8 . The computer implemented method of claim 1 , further comprising: resizing the input data such that a resized image comprising the input data conforms to a specified minimum dimension; and padding the input data.
- 9 . The computer implemented method of claim 1 , wherein recognizing the identified field and the target field includes using one or more agentic artificial intelligences.
- 10 . The computer implemented method of claim 1 , further comprising: converting using a different model, a complex query about the input data into a short prompt; and providing the short prompt as an input to the first vision model.
- 11 . A computer-implemented method, comprising: providing production input data as input into a field detection model, wherein the field detection model was trained by a method comprising: accessing input data, wherein the input data includes image data; providing the input data to a first branch comprising a first vision model; processing the input data with the first branch, wherein the processing includes using the first vision model to identify a first set of fields in the input data using a first set of bounding boxes; recognizing at least one field in the first set of fields as an identified field; labeling the recognized at least one field in the first set of fields as a labeled identified field; providing the input data to a second branch comprising a second vision model and a third model; processing the input data with the second branch, wherein the processing includes: using the second vision model to identify a second set of fields in the input data using a second set of bounding boxes; passing the input data with the second set of bounding boxes to the third model; and causing the third model to output a set of recognized fields within the bounding boxes of the second set of bounding boxes; recognizing at least one field in the second set of fields as a target field; labeling the recognized at least one field in the second set of fields as a labeled target field; and combining the input data, the labeled identified field, and the labeled target field to form labeled data.
- 12 . The computer implemented method of claim 11 , further comprising: outputting from the field detection model an identification of a signature field in production input data.
- 13 . The computer implemented method of claim 12 , further comprising: inputting a content of the signature field into a processing system as a part of processing one or more documents.
- 14 . The computer implemented method of claim 11 , further comprising: outputting from the trained field detection model an identification of at least one other field in the production data; and additionally inputting a content of the at least one other field into the processing system.
- 15 . The computer implemented method of claim 14 , wherein the at least one other field is one field selected from the group consisting of (i) a payer field, (ii) a payee field, (iii) a courtesy amount field, and (iv) a legal amount field.
- 16 . The computer implemented method of claim 11 , wherein the target field is a magnetic ink character recognition field; and wherein the method further comprises extending a bounding box bounding the magnetic ink character recognition field to an edge of the input data.
- 17 . The computer implemented method of claim 11 , further comprising: resizing the input data such that a resized image comprising the input data conforms to a specified minimum dimension; and padding the input data.
- 18 . The computer implemented method of claim 11 , wherein recognizing the identified field and the target field includes using one or more agentic artificial intelligences.
- 19 . A computer system comprising a non-transitory computer readable medium having stored thereon code of a field detection model trained by operations comprising: receiving input data in a first branch comprising a first model, the input data comprising an image of a document, the first model adapted to identify a first set of fields in the input data using a first set of bounding boxes; labeling, by the first model, the first set of fields to output a labeled first set of fields; identifying an identified field as a desired type of field using a corresponding bounding box from the first set of bounding boxes; outputting from the first branch, a localized and labeled identified field; passing the input data to a second branch comprising a second model and a third model, the second model adapted to identify a second set of fields in the input data using a visualized second set of bounding boxes; passing the input data with the second set of bounding boxes to the third model executing in the second branch, the third model outputting a set of recognized fields within bounding boxes of the second set of bounding boxes; identifying at least one recognized field of the second set of fields as a target field; labeling the target field to output from the second branch a labeled target field; combining to form labeled data in a training data set, the input data, the labeled identified field, and the labeled target field; and training the model with the training data set.
- 20 . The computer system of claim 19 , wherein the computer system is configured to: provide production input data as input into the field detection model; and receive, as output from the field detection model, an identification of at least one field in the production input data.
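Claims 7/16 (extending a MICR bounding box to the image edge) and claims 8/17 (resizing to a minimum dimension and padding) describe geometric preprocessing steps. A minimal sketch follows; the `min_dim` and `pad` values are illustrative assumptions, not values from the claims, and the claims extend the box to "an edge" while this sketch stretches it to both horizontal edges for simplicity.

```python
def resize_and_pad(width, height, min_dim=512, pad=16):
    """Scale the image dimensions so the shorter side meets min_dim,
    then add uniform padding on every side (claims 8/17 sketch)."""
    scale = max(1.0, min_dim / min(width, height))
    new_w, new_h = round(width * scale), round(height * scale)
    return new_w + 2 * pad, new_h + 2 * pad

def extend_micr_box(box, image_width):
    """Stretch a MICR-line bounding box to the horizontal edges of the
    image (claims 7/16 sketch); box is (x, y, w, h)."""
    _, y, _, h = box
    return (0, y, image_width, h)
```

The MICR line on a check runs along the full bottom edge, so widening its box to the image boundary captures characters a tight detection box would clip.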
Description
RELATED APPLICATION

The present application is a CONTINUATION of U.S. patent application Ser. No. 19/270,987, titled TRAINING-FREE FRAMEWORK FOR ZERO-SHOT CHECK FIELD DETECTION and filed on Jul. 16, 2025.

BACKGROUND

Zero-shot detection refers to the ability of a model to detect and recognize objects or entities that it has never seen during training, based only on semantic descriptions, such as text labels, attributes, or natural language prompts. “Zero-shot” means the model has not been trained on examples of the specific target class. Detection refers to not just recognizing a class but also locating an object belonging to that class in an image or data, e.g., by drawing a bounding box around an object in an image. Suppose a model has been trained on animals like “dog,” “cat,” and “horse.” In zero-shot detection, when the model is asked to detect a “zebra,” which the model has never seen, the model would use its understanding of what a “zebra” is, such as from word embeddings, text descriptions, or language models. Using that understanding, the model would search the source image for regions that match that semantic concept and output a bounding box around the object the model concludes is a zebra. This type of detection is often implemented by combining a visual backbone such as CLIP (Contrastive Language-Image Pre-training) or ViT (Vision Transformer) with a language model or text embedding that understands concepts.

An object detection model is a type of machine learning model, often a deep learning model, that can identify what objects are present in an image and locate each object by drawing bounding boxes around them. For a given input image, an object detection model outputs the classes of objects it finds, e.g., “dog”, “car”, “person”; the location of each object as a bounding box (x, y, width, height); and often a confidence score for each detection.
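The zero-shot matching just described can be sketched as a similarity search between region embeddings from a visual backbone and the embedding of a text prompt, producing the box-plus-confidence output an object detection model emits. The function names, embedding dimensionality, and threshold below are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_detect(regions, prompt_embedding, threshold=0.5):
    """regions: list of (bbox, region_embedding) pairs from a visual backbone.
    Returns detections whose region embedding aligns with the text prompt's
    embedding, each with a bounding box and a confidence score."""
    detections = []
    for bbox, emb in regions:
        score = cosine(emb, prompt_embedding)
        if score >= threshold:
            detections.append({"box": bbox, "confidence": score})
    return detections
```

The prompt embedding would come from a text encoder (e.g., CLIP's) for a class the detector never saw in training, which is what makes the detection "zero-shot".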
Common object detection models include YOLO (You Only Look Once), which is fast and widely used for real-time detection; SSD (Single Shot MultiBox Detector), which detects multiple objects in a single pass; Faster R-CNN, which is more accurate but slower and uses region proposals and CNNs (Convolutional Neural Networks); and DETR (DEtection TRansformer), a transformer-based model that treats detection as a direct set-prediction problem, much like a language task, with less hand-tuning. Object detection can be distinguished from related tasks such as image classification, which labels the whole image with a class (e.g., “dog”); instance segmentation, which is similar to detection but with pixel-level masks; and semantic segmentation, which labels each pixel by class, not by instance.

A transformer-based architecture is a deep learning model architecture originally designed for sequence modeling tasks, but it is now used across many domains, including vision, audio, and multimodal tasks. The transformer replaces traditional RNNs (Recurrent Neural Networks) or CNNs with a mechanism called self-attention, which allows the model to look at all parts of the input at once, learn relationships between all tokens no matter how far apart they are, and scale better with parallel computation, especially on GPUs (Graphics Processing Units). The core components of a transformer include self-attention, where each token attends to all other tokens to compute its contextual representation; multi-head attention, which runs multiple attention mechanisms in parallel to learn different types of relationships; positional encoding, which adds position information to the input so the model knows the order of tokens; feedforward networks, which pass each token's representation through fully connected layers independently; and layer normalization and residual connections, which help with training stability and gradient flow.
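The self-attention component described above computes, for each token, a softmax over its scaled dot products with every other token and uses those weights to mix the value rows. A single-head, loop-based sketch (real implementations are batched matrix operations):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention: each query row attends to every key
    row, and the softmax weights mix the corresponding value rows."""
    d = len(Q[0])  # key dimension used for the 1/sqrt(d) scaling
    output = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        output.append([sum(w * row[j] for w, row in zip(weights, V))
                       for j in range(len(V[0]))])
    return output
```

Multi-head attention runs several such computations in parallel on learned projections of Q, K, and V and concatenates the results.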
An open vocabulary vision model is a computer vision system that can recognize and detect objects beyond a fixed, predefined set of labels, using natural language or textual descriptions as input instead. Traditional models (like YOLO or Faster R-CNN) are closed vocabulary: they can only detect the classes they were explicitly trained on (e.g., 80 COCO classes like “dog”, “car”, “person”). In contrast, an open vocabulary model can take any text prompt (e.g., “penguin”, “red sports car”, “traffic light with cracks”) and find that object or concept in the image, even if the model was never trained on that specific class. Open vocabulary vision models typically combine a visual encoder (e.g., a CNN or ViT) that converts an image or image regions into embeddings; a text encoder (e.g., from CLIP or a transformer) that converts the text label or prompt into an embedding; an alignment space where the model is trained so that matching images and text have similar vector representations; and similarity matching at test time to find visual regions that match the given text input.

SUMMARY

The present disclosure includes inventive concepts relating generally to zero-shot detection of fields in documents such as checks, such as methods, system
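The open-vocabulary similarity matching described in the background can be sketched as scoring one region embedding against the embeddings of arbitrary text prompts in the shared alignment space. The `best_prompt` helper and the example prompt vectors are illustrative assumptions, not part of any named model.

```python
def best_prompt(region_embedding, prompt_embeddings):
    """Match one image-region embedding against arbitrary text prompts,
    returning the prompt whose embedding scores highest by dot product.
    prompt_embeddings maps prompt text -> embedding vector."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return max(prompt_embeddings,
               key=lambda name: dot(region_embedding, prompt_embeddings[name]))
```

Because the prompt set is just a dictionary supplied at test time, new classes can be "detected" simply by adding a prompt, with no retraining, which is the open-vocabulary property.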