EP-4009183-B1 - NETWORK-ON-CHIP DATA PROCESSING METHOD AND DEVICE
Inventors
- ZHANG, Yao
- LIU, Shaoli
- LIANG, Jun
- CHEN, Yu
- LI, Zhen
Dates
- Publication Date
- 20260506
- Application Date
- 20191018
Claims (14)
- A data processing method, comprising: receiving (S101) a data operation signal sent by a device or processing circuit which includes one or more machine learning units, wherein said data operation signal is received by a transmission circuit, wherein the data operation signal includes an operation field and an opcode, the opcode includes a first-type flag bit, and the operation field includes a second-type flag bit and a data reception flag bit, wherein the first-type flag bit is used to indicate whether the data operation signal is an I/O instruction, and the second-type flag bit is used to indicate whether the data operation signal is a broadcast or multicast instruction in the I/O instruction in the case that the first-type flag bit indicates that the data operation signal is an I/O instruction; and the data reception flag bit is used to indicate one or more machine learning units that are to receive input data; wherein the operation field further includes information of data to be operated, said information comprising: a source address of the data to be operated in a memory, a length of the data to be operated, and a data return address for use after the data is operated; and performing (S102) a corresponding operation according to the data operation signal on the data to be operated in the memory to obtain required input data.
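The signal layout recited in claim 1 can be sketched as a simple data class. This is an illustrative model only; the field names are hypothetical, and only their roles (first-type flag bit, second-type flag bit, data reception flag bits, and the operated-data information) come from the claim.

```python
from dataclasses import dataclass

@dataclass
class DataOperationSignal:
    # opcode: the first-type flag bit marks the signal as an I/O instruction
    is_io_instruction: bool
    # operation field: the second-type flag bit is meaningful only when
    # is_io_instruction is True
    is_broadcast_or_multicast: bool
    # data reception flag bits, one per machine learning unit
    receiver_flags: list[bool]
    # information of the data to be operated
    source_address: int   # source address in memory
    data_length: int      # length of the data to be operated
    return_address: int   # data return address used after the operation

# hypothetical example: broadcast 256 bytes from 0x1000 to units 0 and 2
sig = DataOperationSignal(True, True, [True, False, True], 0x1000, 256, 0x8000)
```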
- The method of claim 1, wherein a count of data reception flag bits represents a count of machine learning units that can interact with the memory.
- The method of claim 1 or 2, wherein performing a corresponding operation on the data to be operated in the memory according to the data operation signal to obtain the required input data includes: reading (S201) the memory from the source address to obtain input data that satisfies the data length; determining (S202) one or more machine learning units that receive the input data according to the data reception flag bit; and according to the data return address, returning (S203) the input data to a storage space corresponding to the data return address in the one or more machine learning units.
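Steps S201-S203 of claim 3 can be sketched as follows, modeling memory as a bytearray and each machine learning unit's storage space as a dict keyed by address. All names are illustrative assumptions, not terms from the claim.

```python
def broadcast_read(memory, src, length, receiver_flags, units, return_addr):
    """Read `length` bytes at `src` and return them to every flagged unit."""
    data = bytes(memory[src:src + length])                      # S201: read from source address
    targets = [u for u, f in zip(units, receiver_flags) if f]   # S202: select receiving units
    for unit in targets:                                        # S203: write to the return address
        unit[return_addr] = data
    return data

memory = bytearray(range(16))
units = [{}, {}, {}]
broadcast_read(memory, 4, 3, [True, False, True], units, 0x80)
# units[0][0x80] and units[2][0x80] now hold bytes 4..6; units[1] is untouched
```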
- The method of claim 3, wherein each machine learning unit includes a primary processing circuit and a plurality of secondary processing circuits.
- The method of claim 4, wherein the operation field further includes a jump sub-operation-field, and the jump sub-operation-field includes a jump stride and a jump data length which is obtained after each jump operation is performed, and the reading the memory from the source address to obtain input data that satisfies the data length includes: reading (S301) the memory from the source address, and obtaining first jump data according to a jump data length after a current jump; obtaining (S302) a last address of the jump data, and jumping from the last address to a target jump address according to the jump stride; and starting from the target jump address, obtaining (S303) second jump data according to a length of jump data after the jump until the length of the jump data obtained after each jump satisfies the data length.
- The method of claim 5, wherein the jump sub-operation-field includes a stride operation field and/or a segment operation field, wherein the stride operation field is used to indicate a stride for each jump of the data operation signal, and the segment operation field is used to indicate a preset size for each segment of the data operation signal.
- The method of claim 6, wherein the operation field further includes a function flag bit which is used to indicate a processing operation performed on data that is read.
- The method of claim 7, comprising: if a value of the first-type flag bit is I/O, determining that the data operation signal is an I/O instruction, and in the case that the data operation signal is an I/O instruction, if a value of the second-type flag bit is 1, determining that the data operation signal is a broadcast or multicast instruction in the I/O instruction.
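The two-stage decode in claim 8 can be sketched as below. The literal values "I/O" and 1 come from the claim; the function and its return strings are illustrative.

```python
def classify(first_flag, second_flag):
    """Decode the first-type and second-type flag bits (claim 8)."""
    if first_flag != "I/O":
        return "not an I/O instruction"
    if second_flag == 1:
        return "broadcast/multicast I/O instruction"
    return "ordinary I/O instruction"

classify("I/O", 1)   # broadcast/multicast I/O instruction
classify("I/O", 0)   # ordinary I/O instruction
```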
- The method of claim 8, wherein the receiving a data operation signal sent by the device or processing circuit includes: parsing (S401) the data operation signal to obtain a type flag bit of the data operation signal and information of data to be operated; and executing (S402) the parsed data operation signal according to an instruction queue, where the instruction queue is used to indicate an execution order of the data operation signal.
- The method of claim 9, wherein before executing the parsed data operation signal according to the instruction queue, the method further includes: determining (S501) a dependency of adjacent parsed data operation signals to obtain a determination result, where the dependency represents whether there is an association between an s-th data operation signal and an (s-1)-th data operation signal before the s-th data operation signal; and if the determination result is that there is an association between the s-th data operation signal and the (s-1)-th data operation signal, caching (S502) the s-th data operation signal, and after the (s-1)-th data operation signal is executed, fetching the s-th data operation signal.
- The method of claim 10, wherein the determining a dependency of adjacent parsed data operation signals includes: obtaining a first storage address interval of required data in the s-th data operation signal fetched according to the s-th data operation signal, and obtaining a zeroth storage address interval of required data in the (s-1)-th data operation signal fetched according to the (s-1)-th data operation signal, respectively; if the first storage address interval and the zeroth storage address interval have an overlapping area, determining that there is a dependency between the s-th data operation signal and the (s-1)-th data operation signal; and if the first storage address interval and the zeroth storage address interval do not have an overlapping area, determining that there is no dependency between the s-th data operation signal and the (s-1)-th data operation signal.
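The dependency test of claims 10 and 11 reduces to an interval-overlap check: each signal touches a storage address interval, and two adjacent signals depend on each other exactly when their intervals overlap. A minimal sketch, with each signal modeled as a hypothetical (source_address, data_length) pair and intervals treated as half-open:

```python
def intervals_overlap(a_start, a_len, b_start, b_len):
    """True iff [a_start, a_start+a_len) and [b_start, b_start+b_len) overlap."""
    return a_start < b_start + b_len and b_start < a_start + a_len

def has_dependency(sig_s, sig_prev):
    # each signal is an illustrative (source_address, data_length) pair
    return intervals_overlap(sig_s[0], sig_s[1], sig_prev[0], sig_prev[1])

has_dependency((0x100, 64), (0x120, 64))   # True: intervals overlap
has_dependency((0x100, 16), (0x200, 16))   # False: disjoint intervals
```

Under this model, a True result corresponds to caching the s-th signal until the (s-1)-th has executed.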
- A neural network operation device, comprising a processor and a memory, wherein the memory stores a computer program, and the processor implements the steps of any of claims 1-11 when executing the computer program.
- A combined processing device, comprising the neural network operation device of claim 12, a universal interconnection interface, and other processing devices, wherein the neural network operation device interacts with the other processing devices to cooperatively perform computations specified by users.
- The combined processing device of claim 13, comprising a storage device, wherein the storage device is connected to the neural network operation device and the other processing devices respectively, and is configured to store data of the neural network operation device and the other processing devices.
Description
TECHNICAL FIELD

The present disclosure relates to the field of information processing technology, and particularly to a network-on-chip data processing method and device.

BACKGROUND

With the development of semiconductor technology, it has become a reality to integrate hundreds of millions of transistors on a single chip. A network-on-chip (NoC) can integrate plenty of computation resources on a single chip and implement on-chip communication.

US 2012/0303933 A1 relates to a processor comprising processing elements that execute instructions in parallel and are connected together with point-to-point communication links called data communication links (DCLs). The instructions use DCLs to communicate data between them: they specify the DCLs from which they take their operands, and the DCLs to which they write their results. The DCLs allow the instructions to synchronize their executions and to explicitly manage the data they manipulate. Communications are explicit and are used to realize the storage of temporary variables, which is decoupled from the storage of long-living variables.

US 2017/083338 A1 relates to prefetching data associated with predicated loads of programs in block-based processor architectures. In one example of the disclosed technology, a processor includes a block-based processor core for executing an instruction block comprising a plurality of instructions. The block-based processor core includes decode logic and prefetch logic. The decode logic is configured to detect a predicated load instruction of the instruction block. The prefetch logic is configured to calculate a target address of the predicated load instruction and issue a prefetch request to a memory hierarchy of the processor for data at the calculated target address.
As a neural network requires a large number of computations, some of them, such as a forward operation, a backward operation, and weight update, need to be processed in parallel. In a chip architecture with a large number of transistors, chip design may face problems such as high memory access overhead, high bandwidth blockage, and low data reading/writing efficiency.

SUMMARY

In order to at least overcome the problems existing in the related technology to a certain extent, the present disclosure provides an interaction method, a device, and a smart terminal. The invention is set out in the independent claims. Preferred embodiments are defined by the dependent claims.

An embodiment of the present disclosure provides a network-on-chip (NoC) processing system. The system includes a storage device and a plurality of computation devices, where the storage device and the plurality of computation devices are arranged on the same chip. At least one computation device is connected to the storage device, and at least two computation devices are connected to each other.

In an embodiment, any two of the plurality of computation devices are directly connected to each other. In an embodiment, the plurality of computation devices include a first computation device and a plurality of second computation devices, where the first computation device is connected to the storage device, and at least one of the plurality of second computation devices is connected to the first computation device. In an embodiment, at least two of the plurality of second computation devices are connected to each other, and are connected to the storage device through the first computation device. In an embodiment, any two of the plurality of second computation devices are directly connected to the first computation device. In an embodiment, each of the plurality of computation devices is connected to the storage device, and at least two computation devices are connected to each other.
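One of the topologies described above (a first computation device attached to the storage device, with second computation devices reaching storage through it) can be sketched as a small undirected graph. The node names are purely illustrative.

```python
# Hypothetical NoC topology: storage <-> compute0 (first computation
# device); compute1 and compute2 (second computation devices) connect
# to storage only through compute0, and also to each other.
edges = {
    ("storage", "compute0"),
    ("compute0", "compute1"),
    ("compute0", "compute2"),
    ("compute1", "compute2"),
}

def connected(a, b):
    """True iff a direct link exists between nodes a and b."""
    return (a, b) in edges or (b, a) in edges

connected("storage", "compute0")   # True: direct link
connected("storage", "compute1")   # False: reaches storage only via compute0
```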
An embodiment of the present disclosure provides a data processing method. The method includes: receiving a data operation signal sent by an internal or external device, where the data operation signal includes an operation field and an opcode, the opcode includes a first-type flag bit, and the operation field includes a second-type flag bit. The first-type flag bit is used to indicate whether the data operation signal is an I/O instruction, and the second-type flag bit is used to indicate whether the data operation signal is a broadcast or multicast instruction in the I/O instruction; and performing a corresponding operation according to the data operation signal on data to be operated in the memory to obtain required input data.

In an embodiment, the operation field further includes a data reception flag bit which is used to indicate a device or a processing circuit that receives the input data. In an embodiment, a count of data reception flag bits represents a count of devices or processing circuits that can interact with the memory. In an embodiment, the operation field further includes information of data to be operated, where the in