WO-2026094025-A1 - IMPROVEMENTS IN DATA PROCESSING
Abstract
A method of producing a trained artificial neural network (ANN) is disclosed. The method receives model data defining a base trained ANN having at least 3 billion parameters, the model data defining an ordered sequence of at least 6 layers including one or more input layers, a plurality of core layers, and one or more output layers, each model layer associated with respective model weights and adapted to process layer inputs according to the model weights. At least one block of one or more layers is selected from the core layers in the sequence of model layers, the or each block having respective starting and ending layers selected in dependence on the base trained ANN. Variant model data is generated defining a variant ANN, wherein the variant ANN is a variant of the base ANN in which the selected block of layers is repeated as a block of repeated layers, the repeated layers having the same model weights as the corresponding base model layers. The method outputs the variant model data, and may then create the variant model according to the variant model data or may execute the variant model dynamically.
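As an illustrative sketch only (not the claimed implementation, and all names are hypothetical), the variant model data of the abstract can be expressed as a "layer map": a list mapping each position in the variant layer sequence to a base-model layer index, with the selected block of core layers appearing twice so the repeated layers share the base weights.

```python
def build_variant_layer_map(num_layers, block_start, block_end):
    """Return a list mapping each variant-layer position to a base-layer index.

    `block_start` and `block_end` are inclusive indices of the selected block
    of core layers; the patent leaves the actual selection to depend on the
    base trained ANN, so these are placeholder parameters.
    """
    if not (0 <= block_start <= block_end < num_layers):
        raise ValueError("block must lie within the base layer sequence")
    base = list(range(num_layers))
    block = list(range(block_start, block_end + 1))
    # Variant sequence: layers up to and including the block, then the block
    # again (same base indices, hence the same weights), then the remainder.
    return base[: block_end + 1] + block + base[block_end + 1 :]

# Example: an 8-layer base model with the core block [2..4] repeated.
layer_map = build_variant_layer_map(8, 2, 4)
# layer_map is [0, 1, 2, 3, 4, 2, 3, 4, 5, 6, 7]
```

Because repeated positions point at the same base indices, no weight data need be copied to define the variant, which is what allows the later claims to store or execute the variant without duplicating layers.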
Inventors
- NG, DAVID
Dates
- Publication Date
- 20260507
- Application Date
- 20251104
- Priority Date
- 20241104
Claims (20)
- CLAIMS
- 1. A method of producing a trained artificial neural network (ANN), comprising:
- receiving model data defining a base trained ANN having at least 3 billion parameters, the model data defining an ordered sequence of at least 6 layers including one or more input layers, a plurality of core layers, and one or more output layers, each core layer associated with respective model weights and adapted to process layer inputs according to the model weights;
- selecting at least one block of one or more layers from the core layers in the sequence of model layers, the or each block having a respective starting and ending layer selected in dependence on the base trained ANN;
- generating variant model data defining a variant ANN, wherein the variant ANN is a variant of the base ANN in which the selected block of layers is repeated as a block of repeated layers, the repeated layers having the same model weights as the corresponding base model layers, and
- outputting the variant model data.
- 2. A method according to claim 1, wherein said at least one block includes at least one block comprising a plurality of contiguous layers of the ANN.
- 3. A method according to claim 1 or 2, wherein the block of repeated layers immediately follows the selected block of layers in the variant ANN.
- 4. A method according to any of the preceding claims, wherein generating variant model data comprises outputting a layer map which identifies for each of a sequence of layers of the variant model a corresponding layer of the base model.
- 5. A method according to claim 4, comprising executing the variant model dynamically in accordance with the layer map.
- 6. A method according to claim 5, comprising applying layers of the base model to inputs in a sequence defined by the layer map.
- 7. A method according to claim 5 or 6, wherein executing comprises:
- after a first execution of the last layer of the selected block of layers, directing execution of the ANN to return to the first layer of the selected block of layers; after a second execution of the last layer of the selected block of layers, directing execution of the ANN to continue at the next layer of the ANN following the last layer of the selected block.
- 8. A method according to any of Claims 5 to 7 wherein said executing is performed substantially without duplicating repeated layers in a stored representation of the ANN.
- 9. A method according to any of Claims 1 to 7 comprising generating the variant ANN and outputting and/or storing the variant ANN, wherein the repeated block layers are duplicated as separate layers in a storage representation of the ANN.
- 10. A method according to any of Claims 1 to 7 wherein layers of the repeated block are duplicated in VRAM of a processing arrangement for executing the ANN but not duplicated in a long term storage representation of the ANN.
- 11. A method according to any preceding claim wherein said at least one block includes at least one block comprising a single layer of the ANN.
- 12. A method according to any preceding claim wherein said at least one block includes a plurality of non-contiguous blocks.
- 13. A method according to any preceding claim further comprising tuning the variant model data by modifying weights of the model preferentially in or adjacent the or each repeated block.
- 14. A method according to Claim 13 wherein tuning comprises fine tuning arranged to focus tuning on the or each repeated block or layers adjacent thereto.
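The dynamic execution of Claims 5 to 8 can be sketched as follows, under the assumption (names are illustrative, not taken from the patent) that each base layer is a callable and the layer map of Claim 4 is a list of base-layer indices. Applying the base layers in layer-map order re-executes the selected block without ever duplicating it in a stored representation, which subsumes the first-pass/second-pass control flow of Claim 7.

```python
def execute_variant(base_layers, layer_map, x):
    """Apply the base model's layers to input `x` in layer-map order.

    `base_layers` is the ordered sequence of the base ANN's layer functions;
    `layer_map` maps each variant position to a base-layer index. Repeated
    indices reuse the same stored layer (and hence the same weights).
    """
    for base_index in layer_map:
        x = base_layers[base_index](x)
    return x

# Toy example with scalar "layers": layer k adds k to its input, and the
# block [1, 2] is repeated, so its weights are applied twice.
layers = [lambda v, k=k: v + k for k in range(4)]
result = execute_variant(layers, [0, 1, 2, 1, 2, 3], 0.0)
# result is 9.0  (0 + 0 + 1 + 2 + 1 + 2 + 3)
```

In a real LLM runtime the callables would be, for example, transformer blocks held once in VRAM (Claim 10), with only the activations flowing through the repeated positions.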
Description
Improvements in data processing

FIELD OF THE INVENTION

The present application relates to methods, systems and tools for producing improved trained artificial neural networks, for example large language models, to improve their performance and/or to reduce the compute resource required to create them, to improved models produced thereby, and to a variety of tools, methods and components useful in various elements of the production or execution process.

BACKGROUND OF THE INVENTION

Large Language Models (LLMs) have revolutionized the field of natural language processing (NLP), enabling machines to understand and generate human-like text. These models are built on deep learning architectures, often using transformer networks, which are trained on vast amounts of data to predict and generate coherent sequences of words, sentences, or entire paragraphs. LLMs such as GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) are based on billions of parameters that allow them to perform complex language-related tasks including text generation, translation, summarization, and question-answering.

LLMs are typically structured with multiple layers of attention mechanisms and feedforward networks. Each layer captures contextual information from the input text at a different level of granularity. The training process for these models involves optimizing weights across the entire network by exposing them to massive datasets. This process requires significant computational resources and time, often making it infeasible to train specific models for different applications. As an alternative, general models can be fine-tuned for specific tasks, but while this makes the models better at those tasks, overall performance is typically degraded. Even where it is feasible, training a bespoke model takes significant computing resource, time and energy.
Methods and tools which could provide a trained model with better performance for a task, and with less energy input, than conventional techniques would be extremely beneficial. However, despite the vast human (and machine!) intelligence and resource devoted to the problem, and many billions of dollars spent on computing resource, hardware and some of the brightest teams on the planet, and although numerous advances have been made, making a "better" model generally requires a great deal of compute time to obtain an appreciable benefit.

A further problem that has hitherto defeated the many leading researchers in this highly active and competitive field is that even evaluating the performance of LLMs presents several serious challenges. First, their large scale means that they require significant computational resources for inference, especially when operating in real-time or resource-constrained environments. Second, while LLMs are highly effective at generating fluent and coherent text, they may sometimes produce irrelevant, incorrect, or biased content. By their very nature, these issues are difficult to detect and measure systematically as the models are not simple: given two complex models with billions of parameters, nobody would reasonably expect that they could sensibly be evaluated with a few quick questions, and ever more complex and comprehensive evaluation metrics and methodologies are being developed to give a reliable method of evaluating complex models. Furthermore, the black-box nature of LLMs, due to their complex internal structure, makes it challenging to pinpoint areas of underperformance or inefficiency.

Thus, as models become more capable, some with hundreds of billions of parameters, enormous compute power is required simply to run them in inference mode, vastly more is needed to train them over extended periods of many months, and evaluation of the resulting models also requires extended time and compute resource.
In addition to the technical, hardware and energy resources, the cost of this exercise has meant that, while there is a great deal of open-source sharing of trained models and techniques, the cost and resource of training and evaluation alone mean that the ability to produce the best models may become the preserve of only a few of the largest corporations.

Instead of fine-tuning by retraining a model, techniques for merging trained models have been proposed. These techniques are aimed at leveraging the strengths of multiple models to improve overall performance without retraining from scratch. One common approach is model ensembling, where predictions from several trained models are combined, often through averaging or weighted voting, to produce more robust outputs. This technique enhances accuracy by reducing the variance and bias associated with individual models. Another method involves knowledge distillation, where a smaller, more efficient "student" model is trained to mimic the behavior of a larger "teacher" model, capturing its knowledge in a compressed form. Additionally, model fusion