EP-4742085-A1 - METHOD FOR TESTING A LARGE LANGUAGE MODEL IMPLEMENTED WITHIN A CONVERSATIONAL AGENT
Abstract
A computer-implemented method for testing the performance of a large language model (LLMt) comprising: ▪ Receiving a dataset (ENS1) from at least one data source (Si) over a time period; ▪ Generation (GEN1) of at least one first message (M1) from the received dataset (ENS1) by application of at least one first large language model (LLM1); ▪ Generation (GEN2) of a plurality of variations of the first message (M1) by application of at least one large variation language model; ▪ Generation (GEN3) of a plurality of message sequences ({SEQ1k}, k∈[1;N]) by application of a large language model to be tested (LLMt); ▪ Calculation (TEST1) of a first error indicator (IND1); ▪ Calculation (TEST2) of a second conformity indicator (IND2).
Inventors
- MESSIAEN, Kevin
- LE JEUNE, Pierre
- DORA, Matteo
- JOHN MATHEWS, Jean-Marie
Assignees
- Giskard AI
Dates
- Publication Date
- 20260513
- Application Date
- 20251107
Claims (15)
- A computer-implemented method for testing the performance of a large language model (LLMt) implemented within a conversational agent, comprising: ▪ Receiving a dataset (ENS1) from at least one data source (Si), said data corresponding to an encoding of discrete symbol sequences in natural language, the dataset (ENS1) having previously been extracted from at least the data source (Si) automatically, according to a given frequency (TEMP1) and from a selection of a natural language (LG1); ▪ Generation (GEN1) of at least one first test message (M1) from the received dataset (ENS1), by application of at least one first large language model (LLM1) configured with a main context (CT1) comprising a definition of a language and a given instruction specific to the data source (Si); ▪ Generation (GEN2) of a plurality of variation sequences (SEQVARi) by applying at least a plurality of large variation language models (LLMvk) configured from a plurality of secondary contexts (CT2i) making it possible to generate, on the one hand, the variations (VARi) of the first test message (M1) and, on the other hand, the associated responses generated by at least one large language model; ▪ Generation (GEN3) of a plurality of message sequences (SEQi) by application of a large language model to be tested (LLMt), a message sequence comprising an input and the corresponding generated output of a large language model; ▪ Calculation (TEST1) of a first error indicator (IND1) evaluating a set of error criteria by comparing the sequences produced by the large language model to be tested (LLMt) with the variation sequences (SEQVARi); ▪ Calculation (TEST2) of a second conformity indicator (IND2) from a conformity domain (DOMc) containing conformity rules (RGL1) defining validity sets for the sequences produced by the large language model to be tested (LLMt); ▪ Generation of an alert when at least one first error indicator and/or one second conformity indicator is generated.
- A method according to claim 1, characterized in that the responses of the variation sequences (SEQVARi) are generated by: ▪ the plurality of large variation language models (LLMvk), and/or; ▪ at least one large evaluation language model (LLMe) taking as input a variation produced by a large variation language model (LLMvk) and producing as output an associated response.
- A method according to any one of claims 1 to 2, characterized in that the data source (Si) is pre-selected from a uniform resource locator (URL1) within a data network (NET1) and from an organization name enabling the selection of a subset of the data accessible from the uniform resource locator, the given frequency being used to select, from the data sources (Si), data published from a given date.
- A method according to any one of claims 1 to 3, characterized in that the received dataset (ENS1) originates from one of the following data sources: ▪ a data source accessible from a social network via an authentication process; ▪ a data source defining comments or opinions from a plurality of individuals; ▪ a data source of freely accessible information; ▪ a data source defining one or more internal databases of an organization, such as a product or item database, a service database, or a database of vehicles in stock; ▪ a data source defining conversational agent data recorded in production or in a test environment; ▪ a data source defining electronic documentation.
- A method according to any one of claims 1 to 4, characterized in that it comprises the execution of a large language model for source data processing (LLMs) in order to filter, format and/or normalize the datasets extracted from the data sources (Si).
- A method according to any one of claims 1 to 5, characterized in that each message sequence (SEQi) comprises a sequence of natural language symbols defining a question, said sequence of natural language symbols being produced by at least one first large variation language model (LLMvk), and an answer produced by the large language model to be tested (LLMt).
- A method according to any one of claims 1 to 6, characterized in that the main context (CT1) of the first large language model (LLM1) includes the definition of a domain associated with a lexical field or a set of keywords.
- A method according to any one of claims 1 to 7, characterized in that: ▪ a first large variation language model (LLMv1) includes a configured context making it possible to automatically generate a sequence of discrete symbols in a natural language from one or more paraphrases of the first message (M1), and/or; ▪ a second large variation language model (LLMv2) includes a configured context making it possible to automatically generate a sequence of discrete symbols in a natural language from a translation of the first message (M1) into another natural language, and/or; ▪ a third large variation language model (LLMv3) includes a configured context making it possible to automatically generate a sequence of discrete symbols in a natural language from an exaggeration of the first message (M1), and/or; ▪ a fourth large variation language model (LLMv4) includes a configured context making it possible to automatically generate a sequence of discrete symbols in a natural language from a change in tone of the first message (M1), and/or; ▪ a fifth large variation language model (LLMv5) includes a configured context making it possible to automatically generate a sequence of discrete symbols in a natural language from the introduction of at least one insult into the first message (M1), and/or; ▪ a sixth large variation language model (LLMv6) includes a configured context making it possible to automatically generate a sequence of discrete symbols in a natural language from the introduction of at least one error into the first message (M1), said error being, for example, a spelling or grammatical error in a natural language.
- A method according to any one of claims 1 to 8, characterized in that it comprises the generation of a plurality of variations of the message (M1) for each large variation language model (LLMvi).
- A method according to any one of claims 1 to 9, characterized in that the conformity domain (DOMc) is defined from: ▪ a set of reference responses produced by another large language model, called the conformity model, configured from a context defining a conformity domain, and/or; ▪ conformity rules (RGL1) defining predefined validity sets of natural language symbol sequences and/or predefined invalidity sets of natural language symbol sequences.
- A method according to any one of claims 1 to 10, characterized in that the conformity rules (RGL1) include the specification of a response language, the specification of a topic to be excluded from the response field, or the requirement that a link to a resource of a data network (NET1) be present in a given response type, and in that the conformity rules (RGL1) defining invalidity sets include a knowledge base listing a set of topics, categories, labels or keywords, each defining a sequence of discrete symbols in a natural language and, possibly, variations of this sequence.
- A method according to any one of claims 1 to 11, characterized in that an error criterion of the first error indicator (IND1) includes a check that a set of common concepts is present, on the one hand, in the response of the variation sequence (SEQVARk) produced and, on the other hand, in the response produced by the large language model to be tested (LLMt) to which a variation of the first message (M1) has been provided, and in that, when an error indicator and/or a conformity indicator is generated, a notification is automatically issued to a remote server or to a memory resource of the equipment on which the method is executed, and/or an error counter is updated to produce an evaluation of the conversational agent over a given period.
- A computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 12.
- A computer-readable medium on which a computer program is stored, containing instructions that, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 12.
- A system comprising an electronic user terminal (PC1) having a user interface; at least one data server (SERV1) hosting all or part of a first data source (Si); a second data server (SERV2) having at least one computer and a memory in which the large language model to be tested (LLMt) is executed; and at least one third data server (SERV3) having means for executing a variation language model (LLMvi) and having a computer for executing the steps of the method of any one of claims 1 to 12.
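The variation-generation step (GEN2) detailed in claim 8 can be illustrated with a minimal Python sketch. The six secondary contexts (CT2i) are modelled here as prompt templates; the `complete` callable and all template wordings are hypothetical placeholders standing in for the large variation language models (LLMv1 to LLMv6), not part of the claimed method.

```python
# Sketch of the variation-generation step (GEN2): one secondary
# context CT2i per large variation language model LLMv1..LLMv6.
# Template texts and the `complete` callable are illustrative
# assumptions; any LLM client could be plugged in as `complete`.
from typing import Callable, Dict, List

SECONDARY_CONTEXTS: Dict[str, str] = {
    "paraphrase":   "Rewrite the question, keeping its meaning: {msg}",
    "translation":  "Translate the question into another natural language: {msg}",
    "exaggeration": "Rewrite the question with strong exaggeration: {msg}",
    "tone_change":  "Rewrite the question in a very informal tone: {msg}",
    "insult":       "Rewrite the question, inserting a mild insult: {msg}",
    "typos":        "Rewrite the question, introducing spelling errors: {msg}",
}

def generate_variations(first_message: str,
                        complete: Callable[[str], str]) -> List[str]:
    """Apply each secondary context CT2i to the first test message M1."""
    return [complete(template.format(msg=first_message))
            for template in SECONDARY_CONTEXTS.values()]
```

In practice each template would be sent to a separately configured model; a single `complete` function is used above only to keep the sketch self-contained.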
Description
Scope of the invention

The field of the invention relates to that of automatically generated tests of conversational agents implementing large language models, in order to strengthen their robustness over time.

State of the art

Currently, solutions exist for interacting with a conversational agent, called a "chatbot" in English-language literature, to assist humans in gathering specific information. These conversational agents require precise contextual information depending on how they are used. A known challenge is the ability of a conversational agent to provide a persistent service over time, capable of adapting to variations related to new concepts, such as those arising from news published on data sources accessible via a data exchange network like the internet.

Summary of the invention

According to a first aspect, the invention relates to a computer-implemented method for testing the performance of a large language model implemented within a conversational agent, comprising: ▪ Receiving a dataset from at least one data source, said data corresponding to an encoding of discrete symbol sequences in natural language, the dataset having previously been extracted from at least the data source automatically, at a given frequency and from a selection of a natural language; ▪ Generation of at least one first test message from the received dataset, by applying at least one first large language model configured with a main context including a definition of a language and a given instruction specific to the data source; ▪ Generation of a plurality of variation sequences by applying at least a plurality of large variation language models configured from a plurality of secondary contexts making it possible to generate, on the one hand, the variations of the first test message and, on the other hand, the associated responses generated by at least one large language model; ▪ Generation of a plurality of message sequences by applying a large language model to be tested, a sequence comprising an input and the corresponding generated output of a large language model; ▪ Calculation of a first error indicator evaluating a set of error criteria by comparing the sequences produced by the large language model to be tested and the variation sequences; ▪ Calculation of a second conformity indicator from a conformity domain comprising conformity rules defining validity sets of the sequences produced by the large language model to be tested.

One advantage of the invention is that it allows the robustness of a chatbot to be evaluated over time by automatically generating tests. These tests make it possible to diagnose and identify areas of validity for a chatbot. Furthermore, the tests allow the redefinition or refinement of a chatbot's prompt so that it can automatically generate reliable responses.

In one embodiment, the responses of the variation sequences are generated by a plurality of large variation language models. One advantage is that this extends the testing domain. According to one embodiment, the responses of the variation sequences are generated by at least one large evaluation language model taking as input a variation produced by a large variation language model and producing as output an associated response. According to one embodiment, the first message is a first sequence of symbols in natural language defining a question in a natural language. According to one embodiment, the method is executed at a predefined frequency on a set of predefined sources. According to one embodiment, the frequency is used to select data from data sources published from a given date. According to one embodiment, each source is associated with a given frequency. According to one embodiment, the data source is pre-selected from a uniform resource locator within a data network and from an organization name allowing selection of a subset of the data accessible from the uniform resource locator.
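The two indicators of the summary above can be sketched in Python. This is a minimal illustration, not the claimed implementation: the function names are hypothetical, the error criterion is reduced to the common-concept check of claim 12, the conformity rules are reduced to an excluded topic and a required link as in claim 11, and lowercase substring matching is a simplifying assumption.

```python
# Sketch of the two indicators: IND1 (error) compares the reference
# response of a variation sequence with the response of the model
# under test for a set of common concepts; IND2 (conformity) checks
# simplified conformity rules RGL1. Substring matching is an
# illustrative assumption, not part of the claims.
from typing import Iterable, Optional, Set

def error_indicator(reference_response: str,
                    tested_response: str,
                    concepts: Iterable[str]) -> bool:
    """IND1: True when an expected concept is missing from either response."""
    ref, tst = reference_response.lower(), tested_response.lower()
    return any(c.lower() not in ref or c.lower() not in tst for c in concepts)

def conformity_indicator(response: str,
                         excluded_topics: Set[str],
                         required_link: Optional[str] = None) -> bool:
    """IND2: True when a conformity rule RGL1 is violated."""
    low = response.lower()
    if any(topic.lower() in low for topic in excluded_topics):
        return True  # a topic excluded from the response field is present
    if required_link is not None and required_link not in response:
        return True  # a mandatory data-network link is missing
    return False
```

When either function returns True, the method of claim 1 would raise the corresponding alert; in a production test harness the concept check would typically rely on an evaluation model rather than substring matching.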
According to one embodiment, the received dataset originates from one of the following data sources: ▪ a data source accessible from a social network via an authentication process; ▪ a data source defining comments or opinions from a plurality of individuals; ▪ a data source of freely accessible information; ▪ a data source defining one or more internal databases of an organization, such as a product or item database, a service database, or a database of vehicles in stock; ▪ a data source defining conversational agent data recorded in production or in a test environment; ▪ a data source defining electronic documentation.

One advantage is that it allows the generation of tests whose heterogeneity is obtained through the diversity of the selected sources. According to one embodiment, the method includes generating an alert when at least one first error indicator and/or a second conformity indicator is generated. According to one embodiment, the method includes configuring access to a data source. According to one embodiment, the method includes executing a large data pr