US-12619277-B2 - Computer system and control method therefor
Abstract
An embodiment is a computer system that processes input data and includes a plurality of arithmetic parts and a host part connected to the plurality of arithmetic parts and configured to control them. Processed data is transferred between the plurality of arithmetic parts. Each arithmetic part includes trace parts which record trace data using detection of a predetermined event from the input data as a trigger. The trace data has a timestamp value, which is the detection time of the event based on an operating frequency of the arithmetic part, and the timestamp values of the plurality of arithmetic parts are synchronized.
Inventors
- Yuki Arikawa
- Naoki Miura
- Kenji Tanaka
- Tsuyoshi Ito
- Takeshi Sakamoto
- Yusuke Muranaka
Assignees
- NTT, INC.
Dates
- Publication Date
- 20260505
- Application Date
- 20211112
Claims (17)
- 1. A computer system, comprising: a plurality of arithmetic parts; and a host part connected to the plurality of arithmetic parts and configured to control the plurality of arithmetic parts, wherein processed data is transferred between the plurality of arithmetic parts, each arithmetic part includes a trace part which records trace data using detection of a predetermined event from input data as a trigger, the trace data has a timestamp value which is a detection time of the event based on an operating frequency of a respective arithmetic part, and the timestamp values of the plurality of arithmetic parts are synchronized, wherein the computer system processes input data through the plurality of arithmetic parts, the computer system further comprises an internal communication part configured to transfer the processed data between the plurality of arithmetic parts, each trace part comprises event generators disposed at observation points, a timestamp part, and a trace buffer configured to record the trace data, the trace data including an instance ID identifying a location where the event is detected, and the host part is configured to measure a processing time by calculating a difference between timestamp values from different observation points and detect a failure when the processing time exceeds a predetermined threshold.
- 2. The computer system according to claim 1, wherein the trace data further includes a type of the input data, information indicating a location in which the event is detected, information distinguishing the content of the event, or arbitrary data.
- 3. The computer system according to claim 1, wherein the host part sets the timestamp value to a predetermined value when the timestamp value exceeds an allowable range of a reference value.
- 4. The computer system according to claim 1, wherein either the host part or each arithmetic part is configured to adjust a difference between counter values which are different for each of the arithmetic parts.
- 5. The computer system according to claim 1, further comprising: a plurality of host parts.
- 6. A method for controlling a computer system which includes a plurality of arithmetic parts and a host part, wherein the computer system processes input data through the plurality of arithmetic parts and further comprises an internal communication part configured to transfer processed data between the plurality of arithmetic parts, wherein each arithmetic part includes a trace part comprising event generators disposed at observation points, a timestamp part having a clock counter, and a trace buffer configured to record trace data, the trace data including an instance ID identifying a location where an event is detected, in which the plurality of arithmetic parts obtain a timestamp value on the basis of an operating frequency of the arithmetic parts using detection of a predetermined event from input data as a trigger, and the data is processed and recorded, the method comprising: measuring, by the host part, a processing time by calculating a difference between timestamp values from different observation points; detecting, by the host part, a failure when the processing time exceeds a predetermined threshold; setting, by the host part, a predetermined value to the timestamp value; comparing, by the host part, the timestamp value with a reference value; and setting, by the host part, the timestamp value to the predetermined value when the timestamp value exceeds an allowable range of the reference value to synchronize the timestamp values of the plurality of arithmetic parts.
- 7. The method of claim 6, wherein either one of the host part and the arithmetic parts multiplies a timestamp value acquired by each of the other arithmetic parts by a coefficient set in each of the other arithmetic parts so that an operating frequency of one arithmetic part of operating frequencies of the plurality of arithmetic parts is the same as operating frequencies of the other arithmetic parts.
- 8. The method of claim 6, wherein the trace data further includes arbitrary data, and the method further comprises: restarting processing using the arbitrary data recorded in the trace buffer when the failure is detected.
- 9. The method of claim 6, further comprising: calculating, by the host part, a data flow rate by dividing a data volume by a data passage time, wherein the data passage time is calculated from a difference between a leading timestamp value and a trailing timestamp value of data passing through one of the observation points; and comparing the data flow rate with a predetermined threshold value.
- 10. The method of claim 9, further comprising: setting, by the host part, a data flow path that avoids a location where the data flow rate exceeds the predetermined threshold value, wherein the location is identified by the instance ID.
- 11. The method of claim 6, wherein the trace data further includes a data type, and the method further comprises: monitoring, by the host part, the trace buffer at predetermined intervals; determining whether a plurality of pieces of trace data for a same data type are recorded in the trace buffer; and when the plurality of pieces of trace data for the same data type are recorded, erasing trace data with older timestamp values while retaining trace data with a latest timestamp value.
- 12. The method of claim 7, wherein: the operating frequency of a first arithmetic part of the plurality of arithmetic parts is different from an operating frequency of a second arithmetic part of the plurality of arithmetic parts, and the coefficient is set to convert a timestamp value from the second arithmetic part to correspond to the operating frequency of the first arithmetic part.
- 13. A computer system, comprising: a plurality of arithmetic parts, each arithmetic part including: a trace part comprising event generators disposed at observation points and configured to detect a predetermined event from input data, a timestamp part having a clock counter configured to generate a timestamp value based on an operating frequency of the arithmetic part, and a trace buffer configured to record trace data including the timestamp value and an instance ID identifying the observation point where the event is detected; an internal communication part configured to transfer processed data between the plurality of arithmetic parts; and a host part connected to the plurality of arithmetic parts and configured to: calculate a data flow rate at one of the observation points by dividing a data volume by a data passage time, wherein the data passage time is calculated from a difference between a leading timestamp value and a trailing timestamp value of data passing through the observation point, compare the data flow rate with a predetermined threshold value, and when the data flow rate exceeds the predetermined threshold value, set a data flow path that avoids the observation point identified by the instance ID.
- 14. The computer system according to claim 13, wherein the trace data further includes a data type, and the host part is configured to ascertain an operation state for each data type based on the data type included in the trace data.
- 15. The computer system according to claim 13, wherein the trace data further includes an event type, and the event type distinguishes between a head of a stream and an end of the stream.
- 16. The computer system according to claim 13, wherein the host part is further configured to: read the trace data from the trace buffer; and visualize the trace data in a time graph format.
- 17. The computer system according to claim 13, wherein: the trace part is configured to detect the predetermined event based on at least one of: a head of data passing through the observation point, an end of data passing through the observation point, a data type associated with the input data, or a detection flag in the input data.
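The host-side calculations recited in the claims (a processing time as a difference between timestamp values from different observation points, and a data flow rate as a data volume divided by a data passage time) can be illustrated with a minimal sketch. This is not the patented implementation: the `TraceRecord` layout, the `"head"`/`"end"` event encoding, the use of bytes and seconds as units, and all function names are assumptions introduced only for illustration, and the timestamps are assumed to be already synchronized across arithmetic parts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical trace record; the field layout is an assumption."""
    instance_id: int  # observation point where the event was detected
    event_type: str   # "head" or "end" of a stream (assumed encoding)
    timestamp: int    # synchronized clock-counter value, in ticks


def processing_time(upstream: TraceRecord, downstream: TraceRecord) -> int:
    """Ticks elapsed between events at two different observation points."""
    return downstream.timestamp - upstream.timestamp


def is_failure(upstream: TraceRecord, downstream: TraceRecord,
               threshold_ticks: int) -> bool:
    """Flag a failure when the processing time exceeds the threshold."""
    return processing_time(upstream, downstream) > threshold_ticks


def flow_rate(head: TraceRecord, end: TraceRecord,
              volume_bytes: int, clock_hz: float) -> float:
    """Data volume divided by passage time at one observation point (bytes/s)."""
    if head.instance_id != end.instance_id:
        raise ValueError("head and end must come from the same observation point")
    passage_ticks = end.timestamp - head.timestamp
    if passage_ticks <= 0:
        raise ValueError("trailing timestamp must be later than leading timestamp")
    return volume_bytes / (passage_ticks / clock_hz)


def points_to_avoid(rates: dict[int, float], threshold: float) -> set[int]:
    """Instance IDs whose flow rate exceeds the threshold (candidates to route around)."""
    return {instance_id for instance_id, rate in rates.items() if rate > threshold}
```

In this sketch the host would pass the result of `points_to_avoid` to whatever routing logic sets the data flow path; that routing step is outside the scope of the example.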
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This patent application is a national phase filing under section 371 of PCT/JP2021/041776, filed Nov. 12, 2021, which application is hereby incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present invention relates to a computer system having a plurality of arithmetic parts and a control method thereof.
BACKGROUND
Technological innovation is progressing in many fields such as machine learning, artificial intelligence (AI), and the Internet of Things (IoT). In addition, services are actively being made more sophisticated and given added value by utilizing various information and data. Such processing requires a large amount of calculation, so an information processing infrastructure that can support it is essential. For example, although NPL 1 describes an attempt to update existing information processing infrastructures, current computers cannot cope with the rapidly increasing amount of data. It has been pointed out that, for future development, a “post-Moore technique” which goes beyond Moore's law needs to be established.
As a post-Moore technique, for example, NPL 2 discloses a technique called flow-centric computing. Flow-centric computing introduces the new concept of moving data to a place in which the computing power resides, rather than the traditional computing idea of doing processing where the data resides. To realize flow-centric computing, a broadband communication network is necessary for data movement; in addition, computational resources must be controlled efficiently to obtain the desired computational performance. NPL 2, for example, discloses a technique for interlocking a plurality of arithmetic functions.
CITATION LIST
Non Patent Literature
[NPL 1] “NTT Technology Report for Smart World 2020,” Nippon Telegraph and Telephone Corporation, 2020.
https://www.rd.ntt/_assets/pdf/techreport/NTT_TRFSW_2020_EN_W.pdf.
[NPL 2] R. Takano and T. Kudoh, “Flow-centric computing leveraged by photonic circuit switching for the post-moore era,” Tenth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), Nara, 2016, pp. 1-3. https://ieeexplore.ieee.org/abstract/document/7579339.
SUMMARY
Technical Problem
However, in a computer system in which a plurality of arithmetic parts work together, it has been difficult to identify a fault which has occurred in the arithmetic parts because the arithmetic parts independently move data without going through a host part. In addition, it is difficult to ascertain an internal state of the computer system, such as identifying the arithmetic part through which input data has passed at a certain time.
Solution to Problem
In order to solve the problems described above, a computer system according to embodiments of the present invention is a computer system which processes input data and which includes a plurality of arithmetic parts; and a host part connected to the plurality of arithmetic parts and configured to control the plurality of arithmetic parts, in which the processed data is transferred between the plurality of arithmetic parts, the arithmetic part includes trace parts which record trace data using detection of a predetermined event from the input data as a trigger, the trace data has a timestamp value which is a detection time of the event based on an operating frequency of the arithmetic part, and the timestamp values of the plurality of arithmetic parts are synchronized.
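Because each timestamp value is a counter value tied to the operating frequency of its own arithmetic part, values from parts running at different frequencies are not directly comparable. The following minimal sketch shows one way such values could be brought onto a common scale. The function names are hypothetical, and the choice of coefficient (reference frequency divided by source frequency) is an assumption made for illustration; the text above does not fix a formula.

```python
def ticks_to_seconds(ticks: int, clock_hz: float) -> float:
    """Convert a clock-counter value to seconds for a given operating frequency."""
    return ticks / clock_hz


def coefficient(reference_hz: float, source_hz: float) -> float:
    """Assumed scaling factor: reference frequency over source frequency."""
    return reference_hz / source_hz


def to_reference_ticks(ticks: int, reference_hz: float, source_hz: float) -> float:
    """Express a counter value from one part in ticks of the reference part."""
    return ticks * coefficient(reference_hz, source_hz)


# Example: a part clocked at 100 MHz reports 500 ticks; the reference part
# runs at 200 MHz, so the same elapsed time corresponds to 1000 reference ticks.
scaled = to_reference_ticks(500, reference_hz=200e6, source_hz=100e6)
```

Scaling to a single reference frequency keeps later comparisons (for example, differences between observation points on different arithmetic parts) in one unit, at the cost of fractional tick values when the frequencies are not integer multiples.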
Also, a control method of a computer system according to embodiments of the present invention is a control method of a computer system which includes a plurality of arithmetic parts and a host part, in which the plurality of arithmetic parts obtain a timestamp value on the basis of an operating frequency of the arithmetic parts using detection of a predetermined event from input data as a trigger, and the data is processed and recorded, the method including: a step of setting, by the host part, a predetermined value to the timestamp value; a step of comparing, by the host part, the timestamp value with a reference value; and a step of setting, by the host part, the timestamp value to the predetermined value when the timestamp value exceeds an allowable range of the reference value. Furthermore, a control method of a computer system according to embodiments of the present invention is a control method of a computer system which includes a plurality of arithmetic parts and a host part, in which the plurality of arithmetic parts obtain a timestamp value on the basis of an operating frequency of the arithmetic parts using detection of a predetermined event from input data as a trigger, and the data is processed and recorded, and in which either one of the host part and the arithmetic parts multiplies a timestamp value acquired by each of the other arithmetic parts by a coefficient set in each of the other arithmetic parts so that a