
US-20260127086-A1 - METHOD FOR TESTING A LARGE LANGUAGE MODEL IMPLEMENTED IN A CONVERSATIONAL AGENT

US 20260127086 A1

Abstract

A computer-implemented method for testing the performance of a large language model includes receiving a data set from at least one data source over a period of time; generating at least a first test message from the received data set by applying at least a first large language model; generating a plurality of variation sequences by applying at least one large variation language model; generating a plurality of message sequences by applying the large language model to be tested; calculating a first error indicator; and calculating a second conformance indicator.

Inventors

  • Kevin MESSIAEN
  • Pierre LE JEUNE
  • Matteo DORA
  • Jean-Marie JOHN MATHEWS

Assignees

  • GISKARD AI

Dates

Publication Date
May 7, 2026
Application Date
Dec. 2, 2024
Priority Date
Nov. 7, 2024

Claims (20)

  1. A computer-implemented method for testing the performance of a large language model implemented within a conversational agent, the method comprising: receiving a data set from at least one data source, said data corresponding to an encoding of sequences of discrete symbols in natural language, the data set being previously extracted from at least the data source automatically according to a given frequency and from a selection of a predefined natural language; generating at least a first test message from the received data set by application of at least a first large language model configured with a main context comprising a definition of a language and a given instruction specific to the data source; generating a plurality of variation sequences by application of a plurality of large variation language models configured from a plurality of secondary contexts making it possible to generate the variations of the first test message and the associated responses generated by at least one large language model; generating a plurality of message sequences by application of a large language model to be tested, a sequence comprising an input and the corresponding generated output of a large language model; calculating a first error indicator evaluating a set of error criteria by comparing sequences produced by the large language model to be tested with variation sequences; calculating a second conformance indicator from a conformance domain containing conformance rules defining validity sets for sequences produced by the large language model to be tested; and generating an alert when at least the first error indicator and/or the second conformance indicator is generated.
  2. The method according to claim 1, wherein the variation sequence responses are generated by the plurality of large variation language models.
  3. The method according to claim 1, wherein the variation sequence responses are generated by at least one large evaluation language model considering as input a variation produced by a large variation language model and producing as output an associated response.
  4. The method according to claim 1, wherein the predefined frequency is used to select data from data sources published from a given date.
  5. The method according to claim 1, wherein the data source is pre-selected from a uniform resource locator within a data network and an organization name for selecting a sub-part of the data accessible from the uniform resource locator.
  6. The method according to claim 1, wherein the data reception comes from one of the data sources characterized by: a data source accessible from a social network using an authentication process; a data source defining comments or opinions from a plurality of individuals; an open-access information data source; a data source defining one or more databases internal to an organization, such as a product or item database, a service database, or a stock vehicle database; a data source defining conversational agent(s) conversation data recorded in production or in a test environment; a data source defining electronic documentation.
  7. The method according to claim 1, comprising executing a large source data processing language model in order to filter, format and/or normalize the data sets extracted from the data sources.
  8. The method according to claim 1, wherein each exchange sequence comprises a sequence of natural language symbols defining a question, said sequence of natural language symbols being generated from at least a first large variation language model and an answer generated by using a large test language model.
  9. The method according to claim 1, wherein the main context of a first large variation language model comprises the definition of a domain associated with a lexical field or a set of keywords.
  10. The method according to claim 1, wherein a first large variation language model comprises a context configured to automatically generate a sequence of discrete symbols in a natural language from one or more paraphrases of the first message.
  11. The method according to claim 1, wherein a second large variation language model comprises a context configured to automatically generate a sequence of discrete symbols in a natural language from a translation of the first message into another natural language.
  12. The method according to claim 1, wherein a third large variation language model comprises a context configured to automatically generate a sequence of discrete symbols in a natural language from an exaggeration of the first message.
  13. The method according to claim 1, wherein a fourth large variation language model comprises a context configured to automatically generate a sequence of discrete symbols in a natural language from a change in tone of the first message.
  14. The method according to claim 1, wherein a fifth large variation language model comprises a context configured to automatically generate a sequence of discrete symbols in a natural language from an introduction of at least one insult in the first message.
  15. The method according to claim 1, wherein a sixth large variation language model comprises a context configured to automatically generate a sequence of discrete symbols in a natural language from an introduction of at least one error in the first message, said error being for example a spelling or grammatical error in a natural language.
  16. The method according to claim 1, comprising generating a plurality of message variations for each large variation language model.
  17. The method according to claim 1, wherein the conformance domain is defined from the response of a third large language model configured from a context defining a conformance domain.
  18. The method according to claim 1, wherein the conformance domain is defined from a set of rules defining validity sets of predefined natural language symbol sequences and/or invalidity sets of predefined natural language symbol sequences.
  19. The method according to claim 1, wherein the set of rules comprises the specification of a response language, the specification of topics to be excluded from the response field, or the requirement that a link to a data network resource be present in a given response type.
  20. The method according to claim 1, wherein the set of rules defining invalidity sets comprises a knowledge base listing a set of themes, categories, labels or keywords, each defining a sequence of discrete symbols in a natural language and possibly variations of this sequence.
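Claims 10 through 15 each recite a large variation language model whose secondary context transforms the first test message in a different way (paraphrase, translation, exaggeration, tone change, insult injection, error injection). The shape of that step can be sketched as follows; the `llm` placeholder and the context strings are illustrative assumptions, not the patented implementation:

```python
# Sketch of the variation-generation step recited in claims 10-15.
# The `llm` callable stands in for a real large-language-model API call;
# here it merely echoes the instruction so the pipeline shape is testable.

VARIATION_CONTEXTS = {
    "paraphrase": "Rewrite the message as one or more paraphrases.",      # claim 10
    "translation": "Translate the message into another natural language.",  # claim 11
    "exaggeration": "Rewrite the message with exaggeration.",             # claim 12
    "tone_change": "Rewrite the message with a different tone.",          # claim 13
    "insult": "Rewrite the message introducing at least one insult.",     # claim 14
    "error": "Introduce a spelling or grammatical error in the message.", # claim 15
}

def llm(context: str, message: str) -> str:
    """Placeholder for a large variation language model configured with
    a secondary context."""
    return f"[{context}] {message}"

def generate_variations(first_message: str) -> dict:
    """Apply each secondary context to the first test message (claim 16
    allows several variations per model; one per model is generated here)."""
    return {name: llm(ctx, first_message) for name, ctx in VARIATION_CONTEXTS.items()}

variations = generate_variations("What is your refund policy?")
for name, text in variations.items():
    print(name, "->", text)
```

In a real pipeline, each variation would then be paired with a response from the large language model under test to form the message sequences of claim 1.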

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to French Patent Application No. 2412218, filed Nov. 7, 2024, the entire content of which is incorporated herein by reference.

FIELD

The field of the invention relates to that of automatically generated tests of conversational agents implementing large language models, in order to enhance their robustness over time.

BACKGROUND

Currently, there are solutions that enable interaction with a conversational agent, called a "chatbot", to help a human gather specific information. These conversational agents require contextual accuracy, depending on how they are to be used. A well-known problem is the ability of a conversational agent to offer a persistent service over time, capable of taking into account variations linked to new concepts emanating from news published on data sources accessible from a data exchange network such as the Internet.

SUMMARY

According to a first aspect, the invention relates to a computer-implemented method for testing the performance of a large language model implemented within a conversational agent, the method comprising:

  • receipt of a set of data from at least one data source, said data corresponding to an encoding of sequences of discrete symbols in natural language, the set of data being previously extracted from at least a data source automatically at a given frequency and from a selection of a natural language;
  • generation of at least a first test message from the received data set by applying at least a first large language model configured with a main context comprising a definition of a language and a given instruction specific to the data source;
  • generation of a plurality of variation sequences by application of a plurality of large variation language models configured on the basis of a plurality of secondary contexts making it possible to generate the variations of the first test message and the associated responses generated by at least one large language model;
  • generation of a plurality of message sequences by applying a large language model to be tested, a sequence comprising an input and the corresponding generated output of a large language model;
  • calculation of a first error indicator evaluating a set of error criteria by comparing sequences produced by the large language model to be tested and sequences of variations;
  • calculation of a second conformance indicator from a conformance domain comprising conformance rules defining validity sets of sequences produced by the large language model to be tested.

In an embodiment, the invention comprises computing errors by comparing the output of the tested model and the expected output. A benefit of the invention is that it enables the robustness of a conversational agent to be assessed over time by automatically generating tests. In an embodiment, these tests are used to diagnose and identify validity domains of a conversational agent. The tests also make it possible to redefine or specify a conversational agent prompt so that it can automatically generate reliable responses.

According to an embodiment, variation sequence responses are generated by the plurality of large variation language models. A benefit is to extend the test domain. According to an embodiment, the responses of the variation sequences are generated by at least one large evaluation language model that considers as input a variation produced by a large variation language model and produces as output an associated response.

In an embodiment, the first message is a first sequence of natural language symbols defining a question in a natural language. In an embodiment, the process is run at a predefined frequency on a set of predefined sources. In an embodiment, the frequency is used to select data from data sources published from a given date. In an embodiment, each source is associated with a given frequency. In an embodiment, the data source is pre-selected from a uniform resource locator within a data network and an organization name for selecting a subset of the data accessible from the uniform resource locator.

According to an embodiment, the data reception comes from one of the data sources characterized by:

  • a data source accessible from a social network using an authentication process;
  • a data source defining comments or opinions from a plurality of individuals;
  • an open-access information data source;
  • a data source defining one or more databases internal to an organization, such as a product or item database, a service database, or a stock vehicle database;
  • a data source defining conversational agent(s) conversation data recorded in production or in a test environment;
  • a data source defining electronic documentation.

A benefit is that one can generate tests that are heterogeneous thanks to the diversity of the sources selected. In an embodiment, the method comprises generating an alert when at least a first error indicator and/or a second com
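The two indicators described in the summary can be sketched as follows. The patent does not fix a comparison metric or a concrete rule set at this level of detail, so the exact-match comparison and the two rules below (a topic-exclusion check and a required-link check, in the spirit of claim 19) are illustrative assumptions:

```python
import re

# Sketch of the first (error) and second (conformance) indicators.
# The rule set and the similarity measure are illustrative assumptions,
# not the patented method.

def error_indicator(tested_outputs, reference_outputs):
    """Fraction of exchanges where the tested model's output diverges from
    the reference (variation-sequence) response. Exact string match stands
    in for whatever semantic comparison a real evaluator would use."""
    errors = sum(1 for t, r in zip(tested_outputs, reference_outputs) if t != r)
    return errors / len(tested_outputs)

CONFORMANCE_RULES = [
    # Response must not touch an excluded topic (cf. claim 19).
    lambda out: "politics" not in out.lower(),
    # A response that refers to documentation must contain a link (cf. claim 19).
    lambda out: ("see the docs" not in out.lower())
                or bool(re.search(r"https?://", out)),
]

def conformance_indicator(outputs):
    """Fraction of outputs satisfying every conformance rule."""
    ok = sum(1 for out in outputs if all(rule(out) for rule in CONFORMANCE_RULES))
    return ok / len(outputs)

outs = ["Refunds take 5 days.", "Let's talk politics."]
refs = ["Refunds take 5 days.", "Refunds take 5 days."]
print(error_indicator(outs, refs))   # 0.5
print(conformance_indicator(outs))   # 0.5
```

An alert, as recited in claim 1, would then be raised whenever either indicator crosses a chosen threshold.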