
US-20260127900-A1 - ACTIVE PROMPT TUNING OF VISION-LANGUAGE MODELS FOR HUMAN-CONFIRMABLE DIAGNOSIS AND EXPLANATION OF MICROSCOPY IMAGES

US 20260127900 A1

Abstract

Systems and methods are provided herein for developing and deploying active-prompt-tuned applications for automated diagnosis and categorization of microscopy slide images. Processes of the present disclosure may control and provide for human-confirmable diagnostic classifications of such slide images, based on controlled instructions provided to vision-language models. The controlled instructions may be developed by systems provided herein, which generate system prompts and example prompt sets using active prompt tuning approaches.

Inventors

  • Lawrence Hall
  • Dmitry Goldgof
  • Abhiram Kandiyana
  • Peter Mouton

Assignees

  • UNIVERSITY OF SOUTH FLORIDA

Dates

Publication Date
2026-05-07
Application Date
2025-06-02

Claims (6)

  1. A method for providing active prompt tuned automated microscopy image analysis comprising:
     receiving criteria information for a given class of microscopy study, the criteria information comprising: a set of possible slide classification labels, descriptions of a set of predefined visual features of anatomy anticipated to be found in microscopy slide images for each of the slide classification labels, and slide image preparation information describing: a sample acquisition method used to prepare biological samples appearing in the slides; a stain method used to stain structures within the samples; and a magnification used to acquire images of the samples;
     receiving a set of unclassified slide images prepared using the sample acquisition method, the stain method, and the magnification;
     creating an Active Set of unclassified images by random sampling of the set of unclassified images;
     generating a System Prompt based on the criteria information and input/output characteristics of a diagnostic review platform for microscopy images;
     presenting an Initial Prompt subset of the Active Set of images to a human reviewer, and requiring the human reviewer to select one of the image classification labels and provide an unstructured feature explanation, for each image of the Initial Prompt subset;
     providing images of the Active Set as input to a vision-language model (VLM) with an instruction comprising the System Prompt and a Prompt Set comprising the Initial Prompt subset;
     in an iterative fashion, expanding the Prompt Set by performing (i)-(iii) until the Prompt Set contains a threshold number of image-caption pairs exhibiting each label of the slide classification labels and each feature of the set of visual features: (i) displaying to the human reviewer candidate image-caption pairs comprising images of the Active Set together with associated predicted labels and predicted unstructured feature explanations derived from VLM outputs; (ii) requiring the human reviewer to review the image-caption pairs and confirm, reject, or edit them; and (iii) for each image-caption pair approved or edited by the human reviewer, adding it as an entry to the Prompt Set;
     generating a diagnostic profile specific to the given class of microscopy study based on the System Prompt and Prompt Set; and
     storing the diagnostic profile in a memory associated with the diagnostic review platform for use in automatically generating image classifications and supporting feature explanations for future patient slide images of the given class of microscopy study.
  2. The method of claim 1 further comprising presenting a user interface of the diagnostic review platform to allow a user to select among stored diagnostic profiles for a plurality of classes of microscopy study.
  3. The method of claim 1 further comprising monitoring patient slide images reviewed via the diagnostic review platform for the given class of microscopy study to determine changes in average image attributes relating to the set of visual features, and if such a change is detected, generating a second Active Set from among such patient slide images and requiring a human reviewer to iteratively supplement the Prompt Set.
  4. The method of claim 1, wherein image-caption pairs are not admitted to the Prompt Set until having been approved more than once by a human reviewer.
  5. The method of claim 1, wherein slide images of the Active Set that are rejected five (5) times during iterative expansion of the Prompt Set are removed from the Active Set.
  6. The method of claim 1, wherein the unclassified slide images comprise stacks of image slices of three-dimensional biological samples, and further wherein the stacks are provided as inputs to the VLM as groups, and the System Prompt instructs the VLM on the spatial relationship of the images and instructs the VLM to generate one predicted classification label and one unstructured explanation for each group.
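Read together, claims 1 and 5 describe an iterative, human-in-the-loop expansion of the Prompt Set. The following is a minimal sketch of one possible reading of that loop; `vlm_predict`, `review_pair`, and all parameter values (Active Set size, coverage threshold) are hypothetical placeholders for illustration, not part of the disclosure.

```python
import random
from dataclasses import dataclass

@dataclass
class Caption:
    label: str        # classification label (predicted or reviewer-selected)
    explanation: str  # unstructured feature explanation

def vlm_predict(system_prompt, prompt_set, image) -> Caption:
    """Hypothetical VLM call: returns a predicted label and explanation for one image."""
    raise NotImplementedError

def review_pair(image, candidate):
    """Hypothetical reviewer UI: returns ('confirm' | 'edit' | 'reject', Caption)."""
    raise NotImplementedError

def expand_prompt_set(system_prompt, unclassified, labels, features,
                      active_size=200, initial_size=10, threshold=3,
                      max_rejections=5):
    # Create the Active Set by random sampling of the unclassified images.
    active = random.sample(unclassified, k=min(active_size, len(unclassified)))

    # Seed the Prompt Set: the reviewer labels and explains the Initial Prompt subset.
    prompt_set = []
    for img in active[:initial_size]:
        _, caption = review_pair(img, None)
        prompt_set.append((img, caption))
    active = active[initial_size:]
    rejections = {id(img): 0 for img in active}

    def coverage_met():
        # Stop once `threshold` pairs exhibit each label and each visual feature.
        captions = [c for _, c in prompt_set]
        return (all(sum(c.label == lab for c in captions) >= threshold for lab in labels)
                and all(sum(feat in c.explanation for c in captions) >= threshold
                        for feat in features))

    # Iteratively expand the Prompt Set with reviewer-confirmed VLM predictions.
    while active and not coverage_met():
        for img in list(active):
            candidate = vlm_predict(system_prompt, prompt_set, img)
            verdict, caption = review_pair(img, candidate)
            if verdict in ("confirm", "edit"):
                prompt_set.append((img, caption))
                active.remove(img)                 # approved images need no further review
            else:
                rejections[id(img)] += 1
                if rejections[id(img)] >= max_rejections:
                    active.remove(img)             # per claim 5: drop after five rejections
            if coverage_met():
                break
    return prompt_set
```

Claim 4's double-approval requirement is not shown above; it would replace the single `confirm`/`edit` check with a per-image approval counter.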

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. provisional patent application Nos. 63/654,302, filed May 31, 2024, and 63/715,252, filed Nov. 1, 2024, the entire contents of which are incorporated herein by reference.

STATEMENT OF GOVERNMENT SUPPORT

This invention was made with government support under grant numbers 1513126, 1746511, and 1926990 awarded by the National Science Foundation. The government has certain rights in the invention.

BACKGROUND

Accurate diagnosis of real-world problems based on visual perception and vision-based reasoning, whether from first-person viewing, images, or video, is an important requirement across numerous domains, including medical diagnostics, industrial inspection, environmental monitoring, food and agricultural inspection, and scientific research (to name just a few). Having a human expert personally examine the objects, organisms, etc. to diagnose problems is often not feasible: humans simply cannot see many of the visual traits important to diagnosis (e.g., because they are too small, too numerous, too fast, etc.). Thus, inclusion of computer assistance in visual diagnosis has become a common approach in some fields.

Recent approaches to computer-assisted visual diagnosis have relied on supervised deep learning models, such as convolutional neural networks (CNNs), which require extensive labeled datasets and significant computational resources. These methods often involve time-consuming processes for data annotation, model selection, hyperparameter tuning, and iterative training cycles. Moreover, the need for domain-specific expertise to generate accurate ground truth labels presents a bottleneck in scalability and reproducibility, and can redirect experts' time away from current diagnostic tasks. In medical imaging, for example, rendering diagnoses of disease conditions based on imaging usually requires expert (radiologist, pathologist, etc.) review and annotation of images (e.g., adding annotations to MRI studies; identifying and annotating cellular features in microscopy images of tissue sections, etc.). While CNN-based models have relatively high accuracy in such tasks, they are limited by their dependence on large, annotated datasets and the need for retraining when applied to new imaging modalities, magnifications, or biological targets.

Recent advances in vision-language models (VLMs) could, in theory, offer a promising alternative. These models are capable of interpreting and reasoning over both visual and textual inputs, enabling them to perform image classification tasks. As shown in line 502 of FIG. 2, initial VLMs generally require millions of image-caption pairs for training (meaning they are almost all general-purpose at the start), which itself can be incredibly expensive and resource intensive (e.g., several million dollars and 10+ days to train or retrain certain VLMs). As shown in line 504, adapting the VLM to a given task or subject matter still would require around 50,000 image-caption pairs and significant cost, and still may not reliably provide diagnoses. Due to the absence of readily available public datasets, attempts at fine-tuning VLMs have not been feasible or widespread (in part, given the cost and constant updating of VLM base models). Thus, some attempts at improving behavior and accuracy of general VLMs have involved providing a few representative examples of the type of classification that is desired at inference time, a technique known as few-shot prompting.
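A minimal sketch of this few-shot approach follows, assuming an OpenAI-style chat completions API; the model name, file paths, labels, and explanation strings are illustrative placeholders, not examples drawn from the disclosure. The labeled examples are simply carried along in the message history at inference time, with no change to the model's weights.

```python
import base64
from openai import OpenAI  # assumes the `openai` Python client and an API key in the environment

def image_part(path):
    """Encode a local image as a base64 data URL for the chat API."""
    b64 = base64.b64encode(open(path, "rb").read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

client = OpenAI()
messages = [
    {"role": "system",
     "content": "Classify each microscopy image and explain the visual features "
                "supporting the label."},
    # A few representative labeled examples, supplied at inference time:
    {"role": "user", "content": [{"type": "text", "text": "Classify this slide."},
                                 image_part("example_normal.png")]},
    {"role": "assistant", "content": "Label: normal. Uniform nuclei; intact tissue architecture."},
    {"role": "user", "content": [{"type": "text", "text": "Classify this slide."},
                                 image_part("example_abnormal.png")]},
    {"role": "assistant", "content": "Label: abnormal. Enlarged, irregular nuclei; disrupted architecture."},
    # The unclassified query image:
    {"role": "user", "content": [{"type": "text", "text": "Classify this slide."},
                                 image_part("query_slide.png")]},
]
reply = client.chat.completions.create(model="gpt-4o", messages=messages)
print(reply.choices[0].message.content)
```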
Such approaches can reduce the need for extensive training and domain-specific fine-tuning, but they can produce inconsistent outputs, rely on general VLMs that are subject to modification and retraining by their owners (e.g., Google, OpenAI, etc.), and still require expert development of the few examples. And users generally are not able to interpret results or ascertain why or whether a given output occurred (e.g., if incorrect or unexpected). Accordingly, a need exists for a feasible and reliable approach to leveraging the power of VLMs for highly accurate, domain-specific visual diagnostic assistance. Such an approach should be able to utilize the strength of very high parameter models, but avoid a need for large-scale retraining or fine-tuning. Additionally, the approach should ensure consistent and reliable output, avoiding background changes to model weights or structure implemented by developers/owners of the models. Finally, the approach should be capable of easy refinement, customization, and personalization within a given domain-specific diagnostic task, without needing a large amount of new training data.

SUMMARY

The following presents a simplified summary of the disclosed technology herein in order to provide a basic understanding of some aspects of the disclosed technology. This summary is not an extensive overview of the disclosed technology. It is intended neither to identify key or critical elements of the disclosed technology