
CN-121996199-A - Floating point number processing method, device, computing equipment and storage medium


Abstract

The application discloses a floating-point number processing method, a floating-point number processing apparatus, a computing device, and a storage medium, and relates to the field of computer technologies. A computing device obtains a first floating-point number and decodes it to obtain a second floating-point number. The first floating-point number includes a first sign field, a bit width indication field, a first exponent field, and a first mantissa field; the second floating-point number includes a second sign field, a second exponent field, and a second mantissa field. The absolute value of the exponent represented by the first exponent field of the first floating-point number is greater than or equal to a set value, the bit width of the first mantissa field is greater than the difference between the total bit width of the first floating-point number and a first bit width, and the first bit width is the sum of the bit width of the first sign field, the bit width of the bit width indication field, and the value of the bit width indication field. When the absolute value of the exponent of a floating-point number is large, the bit width of the first mantissa field can be increased accordingly, so that the numerical precision requirement on the floating-point number is met.

Inventors

  • HU TIANCHI
  • LUO YUANYONG
  • WANG JUNSONG

Assignees

  • Huawei Technologies Co., Ltd. (华为技术有限公司)

Dates

Publication Date
20260508
Application Date
20241105

Claims (20)

  1. A floating-point number processing method, the method comprising: obtaining a first floating-point number, wherein a data format of the first floating-point number comprises a first sign field, a bit width indication field, a first exponent field, and a first mantissa field, the first sign field represents the sign of the first floating-point number, the first exponent field represents the exponent of the first floating-point number, the first mantissa field represents the mantissa of the first floating-point number, the bit width indication field indicates the bit width of the first exponent field, and, when the absolute value of the exponent of the first floating-point number is greater than or equal to a set value, the bit width of the first mantissa field is greater than the difference between the total bit width of the first floating-point number and a first bit width, the first bit width being the sum of the bit width of the first sign field, the bit width of the bit width indication field, and the value of the bit width indication field; and decoding the first floating-point number to obtain a second floating-point number, wherein a data format of the second floating-point number comprises a second sign field, a second exponent field, and a second mantissa field, and the second floating-point number represents the same data as the first floating-point number.
  2. The method according to claim 1, wherein the method further comprises: obtaining a third floating-point number, wherein the data format of the third floating-point number is the same as that of the first floating-point number, the absolute value of the exponent of the third floating-point number is smaller than the set value, and the bit width of the first mantissa field of the third floating-point number is equal to the difference between the total bit width of the third floating-point number and the first bit width; and decoding the third floating-point number to obtain a fourth floating-point number, wherein the data format of the fourth floating-point number is the same as that of the second floating-point number, and the fourth floating-point number represents the same data as the third floating-point number.
  3. The method of claim 2, wherein the first exponent field of the third floating-point number includes a sign bit that represents the sign of the exponent of the third floating-point number.
  4. The method of any of claims 1-3, wherein the first exponent field of the first floating-point number does not include a sign bit, and the exponent represented by the first exponent field of the first floating-point number is negative.
  5. The method of any of claims 1-4, wherein the value of the bit width indication field is positively correlated with the absolute value of the exponent represented by the first exponent field.
  6. The method of any one of claims 1-5, wherein the value of the bit width indication field is inversely related to the bit width of the bit width indication field.
  7. The method according to any one of claims 1-6, wherein obtaining the first floating-point number comprises: reading the first floating-point number from memory or obtaining it through a communication network.
  8. The method of any of claims 1-7, wherein the second floating-point number is a normalized floating-point number, the method further comprising: using the second floating-point number in a computing task.
  9. A floating-point number processing method, the method comprising: obtaining a second floating-point number, wherein the second floating-point number is a normalized floating-point number, and a data format of the second floating-point number comprises a second sign field, a second exponent field, and a second mantissa field, the second sign field represents the sign of the second floating-point number, the second exponent field represents the exponent of the second floating-point number, and the second mantissa field represents the mantissa of the second floating-point number; and obtaining a first floating-point number based on the second sign field, the second exponent field, and the second mantissa field, wherein a data format of the first floating-point number comprises a first sign field, a bit width indication field, a first exponent field, and a first mantissa field, the bit width indication field indicates the bit width of the first exponent field, the absolute value of the exponent represented by the first exponent field of the first floating-point number is greater than or equal to a set value, the bit width of the first mantissa field is greater than the difference between the total bit width of the first floating-point number and a first bit width, the first bit width is the sum of the bit width of the first sign field, the bit width of the bit width indication field, and the value of the bit width indication field, and the first floating-point number and the second floating-point number represent the same data.
  10. The method according to claim 9, wherein the method further comprises: obtaining a fourth floating-point number, wherein the data format of the fourth floating-point number is the same as that of the second floating-point number; and obtaining a third floating-point number based on the second sign field, the second exponent field, and the second mantissa field of the fourth floating-point number, wherein the third floating-point number and the fourth floating-point number represent the same data, the data format of the third floating-point number is the same as that of the first floating-point number, the absolute value of the exponent of the third floating-point number is smaller than the set value, and the bit width of the first mantissa field of the third floating-point number is equal to the difference between the total bit width of the third floating-point number and the first bit width.
  11. The method of claim 10, wherein the first exponent field of the third floating-point number comprises a sign bit that represents the sign of the exponent of the third floating-point number.
  12. The method of any of claims 9-11, wherein the first exponent field of the first floating-point number does not include a sign bit, and the exponent represented by the first exponent field of the first floating-point number is negative.
  13. A floating-point number processing apparatus, the apparatus comprising: an acquisition module, configured to obtain a first floating-point number, wherein a data format of the first floating-point number comprises a first sign field, a bit width indication field, a first exponent field, and a first mantissa field, the first sign field represents the sign of the first floating-point number, the first exponent field represents the exponent of the first floating-point number, the first mantissa field represents the mantissa of the first floating-point number, the bit width indication field indicates the bit width of the first exponent field, and, when the absolute value of the exponent of the first floating-point number is greater than or equal to a set value, the bit width of the first mantissa field is greater than the difference between the total bit width of the first floating-point number and a first bit width, the first bit width being the sum of the bit width of the first sign field, the bit width of the bit width indication field, and the value of the bit width indication field; and a decoding module, configured to decode the first floating-point number to obtain a second floating-point number, wherein a data format of the second floating-point number comprises a second sign field, a second exponent field, and a second mantissa field, and the second floating-point number represents the same data as the first floating-point number.
  14. The apparatus of claim 13, wherein the second floating-point number is a normalized floating-point number, the apparatus further comprising a calculation module configured to use the second floating-point number in a computing task.
  15. A floating-point number processing apparatus, the apparatus comprising: a data acquisition module, configured to obtain a second floating-point number, wherein the second floating-point number is a normalized floating-point number, and a data format of the second floating-point number comprises a second sign field, a second exponent field, and a second mantissa field, the second sign field represents the sign of the second floating-point number, the second exponent field represents the exponent of the second floating-point number, and the second mantissa field represents the mantissa of the second floating-point number; and an encoding module, configured to obtain a first floating-point number based on the second sign field, the second exponent field, and the second mantissa field, wherein a data format of the first floating-point number comprises a first sign field, a bit width indication field, a first exponent field, and a first mantissa field, the bit width indication field indicates the bit width of the first exponent field, and, when the absolute value of the exponent represented by the first exponent field of the first floating-point number is greater than or equal to a set value, the bit width of the first mantissa field is greater than the difference between the total bit width of the first floating-point number and a first bit width, the first bit width is the sum of the bit width of the first sign field, the bit width of the bit width indication field, and the value of the bit width indication field, and the first floating-point number and the second floating-point number represent the same data.
  16. The apparatus of claim 15, wherein the first exponent field of the first floating-point number does not include a sign bit, and the exponent represented by the first exponent field of the first floating-point number is negative.
  17. A computing device comprising a processor and a memory, wherein the memory stores a computer program, and the processor is configured to read the computer program stored in the memory and perform the method of any one of claims 1 to 8 or the method of any one of claims 9 to 12.
  18. A chip comprising a processor and power supply circuitry for powering the processor, the processor being configured to execute a computer program to implement the method of any one of claims 1 to 8 or the method of any one of claims 9 to 12.
  19. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 8 or the method of any one of claims 9 to 12.
  20. A computer program product comprising computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 8 or the method of any one of claims 9 to 12.
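
The width relations in claims 1 and 2 can be checked with simple arithmetic. The following is a minimal sketch, not the patented encoding: the class name Layout, the function width_relation_holds, and the 8-bit example numbers are all hypothetical, chosen only to illustrate how the "first bit width" and the mantissa width relate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical description of one encoded value's field widths (in bits)."""
    total_bits: int        # total bit width of the floating-point number
    sign_bits: int         # bit width of the sign field
    indicator_bits: int    # bit width of the bit width indication field
    indicator_value: int   # value stored in the bit width indication field
    mantissa_bits: int     # bit width actually given to the mantissa field

    @property
    def first_bit_width(self) -> int:
        # "First bit width" as defined in the claims: sign width
        # + indication-field width + indication-field value.
        return self.sign_bits + self.indicator_bits + self.indicator_value

    @property
    def baseline_mantissa_bits(self) -> int:
        # Difference between the total bit width and the first bit width.
        return self.total_bits - self.first_bit_width


def width_relation_holds(layout: Layout, abs_exponent: int, set_value: int) -> bool:
    """Check the mantissa-width relation stated in claims 1 and 2."""
    if abs_exponent >= set_value:
        # Claim 1: the mantissa is wider than the baseline difference.
        return layout.mantissa_bits > layout.baseline_mantissa_bits
    # Claim 2: the mantissa width equals the baseline difference.
    return layout.mantissa_bits == layout.baseline_mantissa_bits


# Assumed example: 8-bit value, 1 sign bit, 2 indicator bits, indicator value 3.
small_exp = Layout(total_bits=8, sign_bits=1, indicator_bits=2,
                   indicator_value=3, mantissa_bits=2)
large_exp = Layout(total_bits=8, sign_bits=1, indicator_bits=2,
                   indicator_value=3, mantissa_bits=3)
```

With these assumed numbers the first bit width is 6 and the baseline mantissa width is 2, so a mantissa of 3 bits satisfies the claim 1 relation only when the absolute value of the exponent reaches the set value. How the extra mantissa bit is obtained (for example, by omitting the exponent sign bit as in claim 4) is not modeled here.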

Description

Floating point number processing method, device, computing equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a floating-point number processing method, apparatus, computing device, and storage medium.

Background

In a computer system, floating point (FP) is an approximate numerical representation of real numbers, also known as a floating-point data representation. Illustratively, floating-point data representations include FP8, FP16, and FP32, where FP8 denotes an 8-bit floating-point number, FP16 a 16-bit floating-point number, and FP32 a 32-bit floating-point number. Typically, a floating-point number contains three fields: a sign field, an exponent field, and a mantissa field. The value of the exponent field represents an exponent of a certain radix; for example, the radix may be 2, in which case the value of the exponent field is typically an integer and represents an integer power of 2. The value of the mantissa field is multiplied by the radix raised to the exponent to obtain a value, and the sign field indicates whether that value is positive or negative.

In each floating-point data representation, the bit width of each field is fixed. For example, FP16 comprises a 1-bit sign field, a 5-bit exponent field, and a 10-bit mantissa field; FP32 comprises a 1-bit sign field, an 8-bit exponent field, and a 23-bit mantissa field; and FP8 has two variants, one comprising a 1-bit sign field, a 5-bit exponent field, and a 2-bit mantissa field, and the other comprising a 1-bit sign field, a 4-bit exponent field, and a 3-bit mantissa field. The bit width of the exponent field determines the numerical range that the floating-point number can represent, and the bit width of the mantissa field determines the numerical precision that the floating-point number can represent.
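
As an illustration of the fixed-width formats described above, the following minimal sketch decodes a bit pattern given only its exponent and mantissa widths. It assumes IEEE 754-style conventions (a biased exponent, an implicit leading 1 for normal values, and subnormals when the exponent field is zero), which the text above does not specify; it does not handle infinities or NaN. The names decode_fixed and FORMATS are hypothetical.

```python
def decode_fixed(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a 1-bit-sign / exp_bits / man_bits pattern (IEEE 754-style assumption)."""
    sign = (bits >> (exp_bits + man_bits)) & 1
    exp_field = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man_field = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1          # assumed IEEE 754-style bias
    fraction = man_field / (1 << man_bits)    # mantissa bits as a fraction in [0, 1)
    if exp_field == 0:
        # Subnormal: no implicit leading 1, fixed minimum exponent.
        value = fraction * 2.0 ** (1 - bias)
    else:
        # Normal: implicit leading 1, exponent field minus bias.
        value = (1.0 + fraction) * 2.0 ** (exp_field - bias)
    return -value if sign else value


# (exponent bits, mantissa bits) for the formats named in the text.
FORMATS = {
    "FP32": (8, 23),
    "FP16": (5, 10),
    "FP8 (5-bit exponent)": (5, 2),
    "FP8 (4-bit exponent)": (4, 3),
}
```

For example, decode_fixed(0x3C00, *FORMATS["FP16"]) decodes the FP16 pattern for 1.0. Moving a bit from the mantissa to the exponent, as in the two FP8 variants, trades precision for range.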
With the spread of artificial intelligence (AI) applications and the rapid development of neural network training and inference, the scale of data used in AI scenarios (such as network parameters) is growing rapidly. Performing AI training and inference with floating-point numbers of smaller bit width reduces data storage and data transfer costs. However, in AI scenarios, the precision of the data is also critical to the performance of model training and inference. How to meet precision requirements within a limited floating-point bit width is therefore a problem to be solved.

Disclosure of Invention

The embodiments of the application provide a floating-point number processing method, an apparatus, a computing device, and a storage medium, which can meet data precision requirements within a limited floating-point bit width.

In a first aspect, embodiments of the present application provide a floating-point number processing method, which may be performed by a computing device, or by a chip, a system-on-chip, or a circuit in the computing device.
In the floating-point number processing method, a computing device obtains a first floating-point number. The data format of the first floating-point number comprises a first sign field, a bit width indication field, a first exponent field, and a first mantissa field. The first sign field indicates the sign of the first floating-point number, the first exponent field represents the exponent of the first floating-point number, the first mantissa field represents the mantissa of the first floating-point number, and the bit width indication field indicates the bit width of the first exponent field. The absolute value of the exponent of the first floating-point number is greater than or equal to a set value, the bit width of the first mantissa field is greater than the difference between the total bit width of the first floating-point number and a first bit width, and the first bit width is the sum of the bit width of the first sign field, the bit width of the bit width indication field, and the value of the bit width indication field. After obtaining the first floating-point number, the computing device may decode it to obtain a second floating-point number, whose data format comprises a second sign field, a second exponent field, and a second mantissa field. The second floating-point number represents the same data as the first floating-point number. In the embodiment of the application, the first floating-point number comprises a first sign field, a bit width indication field, a first exponent field, and a first mantissa field, wherein the bit width indication field is used for indicating the bit width