EP-4020331-B1 - METHOD AND SYSTEM FOR ACTIVE LEARNING AND AUTOMATIC ANALYSIS OF DOCUMENTS

Inventors

GUELORGET, Paul
GRILHERES, Bruno
ZAHARIA, Titus

Dates

Publication Date: 20260513
Application Date: 20211214

Claims (10)

Method implemented by an active learning and automatic analysis system, the method executing a learning mode and a production mode in parallel, the production mode including: - responding to requests for the automatic analysis of documents using a machine learning model trained with annotated documents; and the learning mode including: - receiving and storing non-annotated documents; - updating a descriptor, stored in association with the non-annotated documents, with information relating to a prediction, in accordance with the model used in production mode, for the automatic analysis of said non-annotated documents; - sampling the stored non-annotated documents whose descriptor has been updated in accordance with the model used in production mode, and determining an order, in accordance with the information relating to the automatic analysis prediction, of the sampled non-annotated documents for annotation by an oracle; - distributing the annotated documents between documents to be used in training mode for training at least one candidate machine learning model and documents to be used in validation mode for comparing the performance of said at least one candidate machine learning model with the model used in production mode; - training said at least one randomly structured candidate machine learning model, and when the trained candidate machine learning model exhibits better performance in terms of validation than the model used in production mode, replacing the model used in production mode with the trained candidate machine learning model and reiterating by updating the descriptor in accordance with the replacement model, wherein said documents are images and said automatic analysis is recognition of at least one object type in said images, characterized in that the method furthermore comprises, when a document is newly annotated with an annotation A: - computing a first score that takes account of the proximity between a setpoint ratio R and the current ratio for this annotation A, the ratios being representative of the number of annotated documents used in validation mode in comparison with the total number of annotated documents; - computing a second score relating to the lack of representation of the annotation A in the documents used in validation mode; - determining whether the newly annotated document should be used in training mode or in validation mode on the basis of these first and second scores.
Method according to Claim 1, including: - upon replacement of the model used in production mode, interrupting any ongoing operation regarding the updating of the descriptors and the sampling in accordance with the replaced model.
Method according to Claim 1 or 2, including: - performing preprocessing upon receiving the non-annotated documents, which retrieves activation maps that are obtained using a pre-trained portion that is common to the machine learning models and that is applied to the received non-annotated documents, and writes, to the descriptors associated with said non-annotated documents, embeddings resulting from the pre-trained portion.
Method according to any one of Claims 1 to 3, including: - creating multiple training instances that operate in parallel so as to train machine learning models of different random structures.
Method according to any one of Claims 1 to 4, including: - creating multiple automatic analysis instances that operate in parallel with the model used in production mode, so as to distribute operations of updating the descriptors in learning mode and operations of responding to requests for the automatic analysis of documents in production mode.
Method according to any one of Claims 1 to 5, wherein the formation of the order moves forward the more the already ordered documents are considered to be annotated so as to update a diversity contribution in an informativeness and uncertainty score.
Method according to any one of Claims 1 to 6, wherein the automatic analysis is a classification.
Computer program product comprising instructions for implementing the method according to any one of Claims 1 to 7 when said instructions are executed by a processor.
Information storage medium storing a computer program comprising instructions for implementing the method according to any one of Claims 1 to 7 when said instructions are read from the information storage medium and executed by a processor.
Active learning and automatic analysis system including electronic circuitry configured so as to execute a learning mode and a production mode in parallel, the production mode including: - responding to requests for the automatic analysis of documents using a machine learning model trained with annotated documents; and the learning mode including: - receiving and storing non-annotated documents; - updating a descriptor, stored in association with the non-annotated documents, with information relating to a prediction, in accordance with the model used in production mode, for the automatic analysis of said non-annotated documents; - sampling the stored non-annotated documents whose descriptor has been updated in accordance with the model used in production mode, and determining an order, in accordance with the information relating to the automatic analysis prediction, of the sampled non-annotated documents for annotation by an oracle; - distributing the annotated documents between documents to be used for training at least one candidate machine learning model and documents to be used for comparing the performance of said at least one candidate machine learning model with the model used in production mode; - training said at least one randomly structured candidate machine learning model, and when the trained candidate machine learning model exhibits better performance in terms of validation than the model used in production mode, replacing the model used in production mode with the trained candidate machine learning model and reiterating by updating the descriptor in accordance with the replacement model, wherein said documents are images and said automatic analysis is recognition of at least one object type in said images, characterized in that the learning system furthermore comprises electronic circuitry configured, when a document is newly annotated with an annotation A, to: - compute a first score that takes account of the proximity between a setpoint ratio R and the current ratio for this annotation A, the ratios being representative of the number of annotated documents used in validation mode in comparison with the total number of annotated documents; - compute a second score relating to the lack of representation of the annotation A in the documents used in validation mode; - determine whether the newly annotated document should be used in training mode or in validation mode on the basis of these first and second scores.
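
The characterizing part of the independent claims (two scores deciding whether a newly annotated document goes to training or to validation) can be illustrated with a minimal sketch. The claims do not give formulas, so everything below is an assumption made for illustration: the first score is taken as the setpoint ratio R minus the current validation ratio for the annotation, the second as a scarcity term that fades once a hypothetical minimum number of validation documents is reached, and the two are combined by a weighted sum. The names `SplitCounts`, `ratio_score`, `scarcity_score`, `assign_split`, and the default values are likewise hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict


@dataclass
class SplitCounts:
    """Number of documents carrying one annotation in each split."""
    validation: int = 0
    training: int = 0


def ratio_score(counts: SplitCounts, setpoint_ratio: float) -> float:
    """First score (assumed form): setpoint ratio R minus the current
    validation ratio for this annotation. Positive means validation is
    below its target share."""
    total = counts.validation + counts.training
    current_ratio = counts.validation / total if total else 0.0
    return setpoint_ratio - current_ratio


def scarcity_score(counts: SplitCounts, min_validation: int) -> float:
    """Second score (assumed form): 1.0 when no validation document carries
    the annotation, decreasing to 0.0 once `min_validation` is reached."""
    return max(0.0, 1.0 - counts.validation / min_validation)


def assign_split(
    label: str,
    counts_by_label: DefaultDict[str, SplitCounts],
    setpoint_ratio: float = 0.2,    # assumed value for R
    min_validation: int = 5,        # assumed representation target
    scarcity_weight: float = 0.1,   # assumed weighting of the second score
) -> str:
    """Decide where a newly annotated document goes and update the counts."""
    counts = counts_by_label[label]
    combined = ratio_score(counts, setpoint_ratio) + scarcity_weight * scarcity_score(
        counts, min_validation
    )
    if combined > 0:
        counts.validation += 1
        return "validation"
    counts.training += 1
    return "training"


# Ten images are annotated one after another with the same object type.
counts_by_label: DefaultDict[str, SplitCounts] = defaultdict(SplitCounts)
splits = [assign_split("cat", counts_by_label) for _ in range(10)]
```

With these assumed settings, the first document of a new annotation is sent to validation, and the scarcity term lets validation run slightly above R while the annotation is still rare there. A different combination rule (for example a product or a threshold per score) would be equally compatible with the claim wording.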

Description

The present invention relates to the field of active learning and automatic document analysis, such as classification by artificial neural networks.

STATE OF PRIOR ART

In the context of automatic document analysis (a document being any formatted set of data, for example text, images, audio and/or video), such as automatic document classification, active learning is a technique that involves sampling documents whose classification is not known a priori (i.e., unannotated documents) and submitting them to a human expert, known as an "oracle". Not all of the unannotated documents fed into a machine learning model, such as an artificial neural network or a Support Vector Machine (SVM), are submitted to the oracle for manual classification. Only a subset of these documents is automatically selected to be annotated (and therefore classified) by the oracle. This is referred to as "semi-supervised" learning. It allows mostly unannotated documents (also called "unlabeled" documents) to be used for learning, thereby reducing human intervention as well as the cost and time of setting up an effective classifier. To make the best use of the oracle's intervention, the system performs a classification prediction for the unannotated documents and typically determines, for each document, an informativeness and classification uncertainty score. The system then uses this score to rank the unannotated documents, so that the oracle receives the documents whose annotation is theoretically expected to provide the greatest learning benefit.
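
The ranking step described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how the score is computed, so the sketch uses the Shannon entropy of the predicted class distribution as an uncertainty score, and the names `entropy`, `rank_for_oracle`, and the document identifiers are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution.
    Assumed here as the uncertainty score; other measures are possible."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def rank_for_oracle(predictions: Dict[str, Sequence[float]], budget: int) -> List[str]:
    """Return up to `budget` document ids, most uncertain prediction first."""
    ranked = sorted(
        predictions,
        key=lambda doc_id: entropy(predictions[doc_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical class probabilities predicted for three unannotated documents.
predictions = {
    "doc_a": [0.98, 0.01, 0.01],  # confident prediction, little to learn
    "doc_b": [0.40, 0.35, 0.25],  # near-uniform prediction, most uncertain
    "doc_c": [0.70, 0.20, 0.10],
}
queue = rank_for_oracle(predictions, budget=2)
```

Only the documents at the head of `queue` are submitted to the oracle, which is what keeps the annotation effort small compared with labeling every document.
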
Although active learning accelerates the implementation of an effective classifier by minimizing the human resources needed to annotate documents, learning today requires the following steps in succession: (a) training the machine learning model, (b) analyzing the automatic classification of unannotated documents, (c) sampling the unannotated documents, (d) annotating one or more documents, (e) adding new documents as input for training, and repeating steps (a) through (d) until the machine learning model's configuration stabilizes and is validated using a dedicated set of annotated documents. In such a sequential process, step (d) is particularly limiting, as it is time-consuming and dependent on the oracle's availability. Furthermore, building the validation set also involves an annotation cost. The document US 2020/0202171 A1 describes systems and methods for rapidly building, managing, and sharing machine learning models. Besides being time-consuming, machine learning requires AI system providers to work closely with end users to refine the machine learning model's configuration for improved performance and to accommodate new data types (e.g., managing seasonality in photographs, introducing new data patterns for recognition, etc.). Because these systems may be used to process confidential documents, it is desirable to provide a solution that automatically generates and delivers the best available model as the machine learning model is used, without requiring further intervention from the AI provider. This would make it easier to ensure the confidentiality of the processed documents. It is also desirable to provide a solution that is simple and cost-effective to implement and, most importantly, one that can be used by users without expertise in machine learning.

DESCRIPTION OF THE INVENTION

A process is proposed, implemented by an active learning and automatic analysis system.
This process executes a learning mode and a production mode in parallel. The production mode includes responding to automatic document analysis requests using a machine learning model trained on annotated documents. The learning mode includes: receiving and storing unannotated documents; updating a descriptor, stored in association with the unannotated documents, with information relating to a prediction, according to the model used in production mode, of the automatic analysis of said unannotated documents; sampling the stored unannotated documents whose descriptor has been updated according to the model used in production mode, and determining an order, based on the information relating to the automatic analysis prediction, of the sampled unannotated documents for annotation by an oracle; distributing the annotated documents between documents to be used for training at least one candidate machine learning model and documents to be used for validation, in order to compare the performance of said at least one candidate machine learning model with that of the model used in production mode; training said at least one candidate machine learning model of random structure and, when the trained candidate machine learning model shows better performance in validation than the model used in production mode, replacing the model used in production mode with a model corresponding to the trained candidate machine learning model and iterating by updating the descriptor according to the replacement model. Said documents are images and said automatic