CN-122023930-A - Diabetic retinopathy detection system and device based on vision-language model

Abstract

The application belongs to the technical field of computer vision and relates in particular to a system and device for detecting diabetic retinopathy (DR) based on a vision-language model. The system comprises an image acquisition unit, an image processing unit, an adjusting unit and an output unit. A large number of DR fundus images are collected, a corresponding diagnosis report or annotation text is attached to each image, and the data are preprocessed to construct a multimodal data set of fundus images and diagnosis texts. An existing large-scale pre-trained vision-language model is taken as the basis and fine-tuned on this data set using the LoRA low-rank adaptation technique and the Group Relative Policy Optimization (GRPO) algorithm, with indicators such as reasoning accuracy serving as reward signals, so that the model learns the joint image-text characteristics of DR diagnosis. The fine-tuned prediction model is then deployed to automatically predict newly acquired fundus images and to output a prediction result together with corresponding interpretation information, providing auxiliary decision support for doctors.

Inventors

LI FENG
CHEN PENGYU
ZHANG YONG
LI SHAOWEI
HU CHUNHUA
ZHANG HENG

Assignees

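Shandong University (山东大学)
Endocrinology and Metabolic Diseases Hospital Affiliated to Shandong First Medical University (Shandong Institute of Endocrinology and Metabolic Diseases, Shandong Endocrinology and Metabolic Diseases Hospital) (山东第一医科大学附属内分泌与代谢病医院)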

Dates

Publication Date: 20260512
Application Date: 20260211

Claims (8)

1. A diabetic retinopathy detection system based on a vision-language model, characterized by comprising an image acquisition unit, an image processing unit, an adjusting unit and an output unit; the image acquisition unit acquires an image-text combined data set comprising color retinal fundus images of patients and corresponding clinical description information; the image processing unit takes a pre-trained vision-language base model as the prediction model, inserts LoRA low-rank adapters into a language processing layer or an image-language fusion module of the vision-language base model, and updates only the parameters of the adapters; the adjusting unit optimizes the reasoning policy of the prediction model using the Group Relative Policy Optimization (GRPO) algorithm, namely by generating a plurality of candidate prediction outputs for each input and updating the policy through a reward function; and the output unit outputs an inference result, the inference result corresponding to a preset diabetic retinopathy severity grade.
2. The vision-language model-based diabetic retinopathy detection system of claim 1, wherein the data set acquisition comprises: (1) acquiring data related to diabetic retinopathy from real patients, the data comprising at least an RGB color fundus image of the patient and a diagnosis text or clinical examination report corresponding to the image; (2) performing quality screening and preprocessing on the acquired fundus images, and cleaning and standardizing the diagnosis texts; (3) matching the preprocessed fundus images one-to-one with the corresponding diagnosis texts to form image-text multimodal samples; (4) grading each sample into one of diabetic retinopathy classes I-VI according to clinical diagnosis rules and expert consensus, the classes being: no diabetic retinopathy, mild non-proliferative diabetic retinopathy, moderate non-proliferative diabetic retinopathy, severe non-proliferative diabetic retinopathy, proliferative diabetic retinopathy, and post-photocoagulation diabetic retinopathy; (5) dividing the data set at the patient level into a training set and a test set, so that data from the same patient never appear in different data subsets.
3. The vision-language model-based diabetic retinopathy detection system as claimed in claim 1, wherein the processing performed by the image processing unit comprises: inserting LoRA low-rank adapters into the language processing module and the image-language fusion module of the vision-language base model; letting the original weight of a given linear layer be $W_0 \in \mathbb{R}^{d_{out} \times d_{in}}$, LoRA uses two low-rank matrices $A \in \mathbb{R}^{r \times d_{in}}$ and $B \in \mathbb{R}^{d_{out} \times r}$ to represent its correction term $\Delta W = BA$, wherein $r$ represents the low-rank dimension, while a scaling factor $\alpha$ is introduced; the forward computation uses an additive correction: $h = W_0 x + \alpha B A x$; the LoRA matrices $A$ and $B$ are initialized with small random values; and the forward computation flow of the model is as follows: input encoding: the system receives a fundus image $I$ and a clinical text prompt $T$, image features are extracted by a visual encoder, the text is converted into a vector sequence by an Embedding layer, and the image and the text together form the input representation $x$; multi-layer mapping: the input $x$ passes in turn through several Transformer layers, and in the linear mapping layers of each layer the original weight $W_0$ remains frozen, wherein $d_{in}$ represents the input feature dimension of the linear layer and $d_{out}$ represents the output feature dimension of the linear layer; LoRA branch computation: the input $x$ enters the LoRA adapter in parallel, is mapped down and then up in dimension by the low-rank matrices $A$ and $B$, and is multiplied by the scaling factor $\alpha$; and final generation: after all layers have been processed, the decoder layer outputs a token sequence until a complete reasoning result is generated.
4. The vision-language model-based diabetic retinopathy detection method according to claim 3, wherein the processing performed by the adjusting unit comprises: for each fundus image $I$ and clinical text prompt $T$ in the training set that is fed into the system, forming the input $x = (I, T)$; performing $G$ random samplings based on the current policy model $\pi_\theta$ to obtain $G$ candidate output sequences $\{o_1, \dots, o_G\}$, each output containing a reasoning chain and a reasoning conclusion; defining the class prediction of candidate $o_i$ as $\hat{y}_i$ and the correct prediction result as $y$, the reward function adopts a binary design: $r_i = 1$ if $\hat{y}_i = y$, and $r_i = 0$ otherwise; calculating the average reward of the candidate set from the reward values of all candidate diagnostic outputs for the same input sample: $\mu = \frac{1}{G} \sum_{i=1}^{G} r_i$; and, from the average reward $\mu$ and the standard deviation $\sigma$ of the candidate-set rewards, defining the relative advantage of the $i$-th candidate diagnostic output within the group as $\hat{A}_i = \frac{r_i - \mu}{\sigma + \epsilon}$, wherein $\epsilon$ is a very small positive constant.
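5. The vision-language model-based diabetic retinopathy detection method of claim 4, wherein the objective function $J(\theta)$ of the policy update is constructed based on the relative advantages of the candidate set; the model parameters are updated by weighting each generated sequence with the advantage of its answer, combined with a clipping mechanism and a reference policy constraint, and the objective function $J(\theta)$ is: $J(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left( \rho_{i,t} \hat{A}_i,\ \mathrm{clip}(\rho_{i,t}, 1 - \varepsilon_{clip}, 1 + \varepsilon_{clip})\, \hat{A}_i \right) - \beta\, D_{KL}(\pi_\theta \,\|\, \pi_{ref}) \right]$; wherein $\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid x, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid x, o_{i,<t})}$ is the probability ratio for token $t$ of the $i$-th of the $G$ candidate outputs $o_i$ for input $x$, $\hat{A}_i$ is the relative advantage of that candidate, and $\varepsilon_{clip}$ is the clipping range; $\pi_{ref}$ represents the reference policy model used to limit the update amplitude of the current model; $D_{KL}(\pi_\theta \,\|\, \pi_{ref})$ represents the KL divergence, used to measure the distribution difference between the current policy $\pi_\theta$ and the reference policy $\pi_{ref}$; and $\beta$ represents the hyper-parameter coefficient of the KL divergence penalty term.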
6. The method according to claim 1, wherein the output unit presents the reasoning steps enclosed in a tag pair, the reasoning steps comprising observation of the image/text evidence, the corresponding medical judgment and the inference derived therefrom; the final reasoning conclusion is likewise enclosed in a tag pair and is output explicitly using one of the predefined six-level DR category labels, namely no diabetic retinopathy, mild non-proliferative diabetic retinopathy, moderate non-proliferative diabetic retinopathy, severe non-proliferative diabetic retinopathy, proliferative diabetic retinopathy, and post-photocoagulation diabetic retinopathy; and results of multiple categories are not mixed within the same tag.
7. The vision-language model-based diabetic retinopathy detection method according to claim 1, wherein the output unit expresses the reasoning process and the final reasoning result separately and in structured form through a standardized model output format.
8. A vision-language model-based diabetic retinopathy detection device, comprising: A processor; A memory for storing processor-executable instructions; Wherein the processor is configured to implement the method of claims 1-7 when executing the executable instructions.
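As an illustration of the reward and advantage computation recited in claims 4 and 5, the following Python sketch computes binary rewards, group-relative advantages and one clipped surrogate term for a single input. It is a minimal sketch under stated assumptions, not the applicants' implementation: the function names, the use of the population standard deviation, the clipping range of 0.2 and the example grades are all hypothetical, and the KL penalty against the reference policy is omitted for brevity.

```python
import math


def binary_rewards(predicted_grades, true_grade):
    """Reward 1.0 for a candidate whose predicted grade is correct, else 0.0."""
    return [1.0 if p == true_grade else 0.0 for p in predicted_grades]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group.

    Assumption: population standard deviation (divide by the group size).
    `eps` is the small positive constant that avoids division by zero.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_range=0.2):
    """One clipped surrogate term: min(ratio * A, clip(ratio) * A).

    Assumption: `clip_range` of 0.2 is a hypothetical default.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical group of four sampled candidates for one fundus image
# whose correct grade is 2.
candidate_grades = [2, 2, 3, 1]
rewards = binary_rewards(candidate_grades, true_grade=2)
advantages = group_relative_advantages(rewards)

# Hypothetical probability ratios between the current and the old policy.
ratios = [1.5, 0.9, 0.5, 1.1]
surrogate = sum(clipped_term(r, a) for r, a in zip(ratios, advantages)) / len(ratios)
```

When every candidate in a group receives the same reward, all advantages are zero and that input contributes no policy gradient, which is why the small constant in the denominator is needed.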

Description

Diabetic retinopathy detection system and device based on vision-language model

Technical Field

The application belongs to the technical field of computer vision, and particularly relates to a diabetic retinopathy detection system based on a vision-language model.

Background

Diabetic retinopathy (DR) is one of the most common ophthalmic complications of diabetes and one of the main causes of blindness. Existing vision (image) based methods still have several limitations. First, DR diagnosis depends not only on structural lesions in fundus images (such as micro-aneurysms, exudation, hemorrhage, glass membranes, etc.) but is often also related to the patient's clinical information (course of disease, blood sugar control, complications, etc.), so the performance of image-only input may degrade when faced with complex clinical situations or low-quality images. Second, deep models commonly take the form of "black boxes" with insufficient clinical interpretability, which prevents doctors from trusting and adopting the model's conclusions. Third, the high cost of annotation, the subjectivity of grading annotations (inter-expert consistency), and the differences between real-world data distributions and research data sets can all affect the generalization ability of the algorithm.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a diabetic retinopathy detection system based on a vision-language model. The system takes a large vision-language model as its basis, and improves detection accuracy and enhances the interpretability of the prediction results by introducing a multimodal reasoning chain structure and a reinforcement learning optimization strategy.
In order to achieve the above purpose, the technical scheme is as follows: a vision-language model-based diabetic retinopathy detection system, which comprises an image acquisition unit, an image processing unit, an adjusting unit and an output unit. The image acquisition unit acquires an image-text combined data set comprising color retinal fundus images of patients and corresponding clinical description information. The image processing unit takes a pre-trained vision-language base model as the prediction model, inserts LoRA low-rank adapters into a language processing layer or an image-language fusion module of the vision-language base model, and updates only the parameters of the adapters. The adjusting unit optimizes the reasoning policy of the prediction model using the Group Relative Policy Optimization (GRPO) algorithm, namely by generating a plurality of candidate prediction outputs for each input and updating the policy through a reward function. The output unit outputs an inference result, the inference result corresponding to a preset diabetic retinopathy severity grade.
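The adapter idea in the image processing unit (a frozen pre-trained weight plus a small trainable low-rank correction) can be sketched in plain Python as follows. This is only an illustrative sketch: the class name `LoRALinear`, the helper `matvec`, the Gaussian initialization with standard deviation 0.01, and the treatment of the scaling factor as a single multiplier `scale` are assumptions made here and are not taken from the application.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """A linear layer with a frozen weight and a trainable low-rank branch."""

    def __init__(self, frozen_weight, rank, scale, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(frozen_weight), len(frozen_weight[0])
        self.frozen_weight = frozen_weight  # pre-trained weight, never updated
        self.scale = scale                  # assumed: one scalar multiplier
        # Both low-rank matrices start from small random values, following the
        # description in the application (hypothetical std of 0.01).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.frozen_weight, x)          # frozen path
        low_rank = matvec(self.B, matvec(self.A, x))  # project down, then up
        return [b + self.scale * u for b, u in zip(base, low_rank)]

    def trainable_parameter_count(self):
        """Only the entries of A and B would be updated during fine-tuning."""
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


# Hypothetical 3x4 frozen weight and a rank-2 adapter.
frozen = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
layer = LoRALinear(frozen, rank=2, scale=0.5)
x = [1.0, 2.0, 3.0, 4.0]
y = layer.forward(x)
```

In this toy setting the adapter is larger than the frozen weight; the parameter saving only appears when the rank is much smaller than the layer dimensions, as is the case in large vision-language models.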
Preferably, the data set acquisition includes the following: (1) acquiring data related to diabetic retinopathy from real patients, the data comprising at least an RGB color fundus image of the patient and a diagnosis text or clinical examination report corresponding to the image; (2) performing quality screening and preprocessing on the acquired fundus images, and cleaning and standardizing the diagnosis texts; (3) matching the preprocessed fundus images one-to-one with the corresponding diagnosis texts to form image-text multimodal samples; (4) grading each sample into one of diabetic retinopathy classes I-VI according to clinical diagnosis rules and expert consensus, the classes being: no diabetic retinopathy, mild non-proliferative diabetic retinopathy, moderate non-proliferative diabetic retinopathy, severe non-proliferative diabetic retinopathy, proliferative diabetic retinopathy, and post-photocoagulation diabetic retinopathy; (5) dividing the data set at the patient level into a training set and a test set, so that data from the same patient never appear in different data subsets.
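The patient-level division in step (5) can be illustrated with a short Python sketch. It groups samples by patient before splitting, so that both eyes or repeated visits of one patient land in the same subset. The function name, the `patient_id` field, the 25% test fraction and the sample records are hypothetical and serve only as an example.

```python
import random
from collections import defaultdict


def patient_level_split(samples, test_fraction=0.2, seed=0):
    """Split samples into train/test so that no patient appears in both.

    `samples` is a list of dicts with a (hypothetical) "patient_id" key.
    """
    by_patient = defaultdict(list)
    for sample in samples:
        by_patient[sample["patient_id"]].append(sample)

    # Shuffle patients, not individual images, to avoid leakage.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    n_test = max(1, round(len(patient_ids) * test_fraction))
    test = [s for pid in patient_ids[:n_test] for s in by_patient[pid]]
    train = [s for pid in patient_ids[n_test:] for s in by_patient[pid]]
    return train, test


# Hypothetical image-text samples; grades use the six-class scheme above.
samples = [
    {"patient_id": "P001", "image": "p001_left.png", "grade": 2},
    {"patient_id": "P001", "image": "p001_right.png", "grade": 3},
    {"patient_id": "P002", "image": "p002_left.png", "grade": 1},
    {"patient_id": "P003", "image": "p003_left.png", "grade": 5},
    {"patient_id": "P003", "image": "p003_right.png", "grade": 5},
    {"patient_id": "P004", "image": "p004_left.png", "grade": 4},
]
train_set, test_set = patient_level_split(samples, test_fraction=0.25)
```

Splitting by image instead of by patient would let near-duplicate images of the same eye appear in both subsets and inflate the measured test accuracy.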
Preferably, the processing performed by the image processing unit includes: inserting LoRA low-rank adapters into the language processing module and the image-language fusion module of the vision-language base model; letting the original weight of a given linear layer be $W_0 \in \mathbb{R}^{d_{out} \times d_{in}}$, LoRA uses two low-rank matrices $A \in \mathbb{R}^{r \times d_{in}}$ and $B \in \mathbb{R}^{d_{out} \times r}$ to represent its correction term $\Delta W = BA$, wherein $r$ represents the low-rank dimension, while a scaling factor $\alpha$ is introduced; the forward computation uses an additive correction: $h = W_0 x + \alpha B A x$; the LoRA matrices $A$ and $B$ are initialized with small random values; and the forward computation flow of the model is as follows: input encoding: the system receives a fundus image $I$ and a clinical text prompt $T$, image features are extracted by a visual encoder, and the text is converted into a vector