What are FP16, FP32, and FP64?
Handling numerical data efficiently and accurately is a cornerstone of modern computing, particularly in scientific simulations, machine learning, and graphics. One of the most common ways to represent real (non-integer) numbers digitally is through floating-point (FP) formats.
In this post, we’ll explore the basics of floating-point representation, various precision modes (FP16, FP32, FP64), and even take a brief look at integer representations (INT8, INT16, INT32).
What is a Floating Point Number?
"Floating point" refers to a number whose decimal (or binary) point can "float" to different positions, allowing the representation of very large values, very small values, and fractions. Most floating-point implementations follow the IEEE 754 standard, which typically uses base 2 (binary) for calculations.
The Scientific Notation Approach
A floating-point number is often represented in a binary scientific notation:
mantissa (significand) × base^exponent
- Mantissa (or significand): A normalized binary fraction that captures the significant digits.
- Exponent: Determines the power of the base (commonly base 2).
Because of this format, floating-point numbers can represent an extremely wide range of values, from tiny fractions to astronomically large magnitudes, albeit with certain limitations on precision.
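To make this concrete, here is a small illustrative Python snippet (standard library only) that decomposes a few values into their mantissa-times-power-of-two form:

```python
import math

# Decompose a few floats into mantissa * 2**exponent form.
for x in (0.25, 6.5, -1.25):
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, with 0.5 <= |mantissa| < 1
    print(f"{x} = {mantissa} * 2**{exponent}")
    print(f"  IEEE-style hex view: {x.hex()}")  # e.g. 0x1.0000000000000p-2 for 0.25
```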
Common Floating-Point Precision Modes
FP16 (Half Precision)
- Uses 16 bits per floating-point number.
- Breakdown:
  - 1 bit for sign
  - 5 bits for exponent
  - 10 bits for mantissa
- Benefits: Significantly smaller memory footprint and reduced memory bandwidth requirements compared to higher-precision modes.
- Drawbacks: Lower numerical precision and a narrower representable range, which can lead to rounding errors and reduced accuracy in some applications (see the short demonstration below).
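To illustrate those drawbacks, here is a brief NumPy sketch (assuming NumPy is installed) showing FP16's limited range and granularity:

```python
import numpy as np

print(np.finfo(np.float16).max)   # 65504.0, the largest finite FP16 value
print(np.float16(70000.0))        # inf: 70000 exceeds the FP16 range (may emit an overflow warning)
print(np.float16(1e-8))           # 0.0: too small even for FP16 subnormals, underflows to zero
print(float(np.float16(1000.1)))  # 1000.0: values near 1000 are spaced 0.5 apart in FP16
```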
FP32 (Single Precision)
- Uses 32 bits per floating-point number.
- Offers higher numerical precision than FP16.
- More memory usage than FP16, but still widely used (often the default in many GPU computing frameworks).
- Breakdown:
  - 1 bit for sign
  - 8 bits for exponent
  - 23 bits for mantissa
FP64 (Double Precision)
- Uses 64 bits per floating-point number.
- Breakdown:
  - 1 bit for sign
  - 11 bits for exponent
  - 52 bits for mantissa
- Provides even higher precision than FP32.
- Requires more memory and higher bandwidth, so it’s often reserved for workloads that genuinely require double precision (e.g., high-accuracy scientific simulations).
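For a rough side-by-side comparison, the following NumPy sketch (illustrative only) stores the same value at all three precisions and reports the rounding error and the memory footprint of a million-element array:

```python
import numpy as np

# Rounding error when storing 0.1 at each precision.
for dtype in (np.float16, np.float32, np.float64):
    stored = float(dtype(0.1))
    print(f"{np.dtype(dtype).name}: stored={stored!r}, error={abs(stored - 0.1):.2e}")

# Memory footprint scales directly with the bit width.
for dtype in (np.float16, np.float32, np.float64):
    arr = np.ones(1_000_000, dtype=dtype)
    print(f"{arr.dtype.name}: {arr.nbytes / 1e6:.0f} MB for 1,000,000 elements")
```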
Integer Representations (INT8, INT16, INT32)
Although floating-point formats are crucial for representing fractional values and supporting vast dynamic range, integer formats remain essential in many other scenarios (e.g., counting, indexing, certain inference operations in machine learning).
- INT8:
  - 8 bits total in two's complement: 1 sign bit and 7 value bits.
  - Range: -128 to +127.
- INT16:
  - 16 bits total in two's complement: 1 sign bit and 15 value bits.
  - Range: -32,768 to +32,767.
- INT32:
  - 32 bits total in two's complement: 1 sign bit and 31 value bits.
  - Range: -2,147,483,648 to +2,147,483,647.
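The negative ends of those ranges come from two's complement encoding. Here is a minimal Python sketch (the helper name to_twos_complement is just an illustrative choice) showing how the bit patterns are formed:

```python
def to_twos_complement(value: int, bits: int) -> str:
    """Return the two's-complement bit pattern of `value` as a binary string."""
    mask = (1 << bits) - 1          # e.g. 0xFF for 8 bits
    return format(value & mask, f"0{bits}b")

print(to_twos_complement(5, 8))     # 00000101
print(to_twos_complement(-5, 8))    # 11111011  (invert the bits of 5, then add 1)
print(to_twos_complement(-5, 16))   # 1111111111111011
print(to_twos_complement(-128, 8))  # 10000000  (the most negative INT8 value)
```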
Binary Representation Examples
While the exact binary encoding can differ based on hardware architecture and endianness, here are some illustrative examples:
- FP16
  - 0.25 → 0011010000000000
  - -1.25 → 1011110100000000
- FP32 (in hexadecimal form)
  - 0.25 → 0x3e800000
  - -1.25 → 0xbfa00000
- FP64 (in hexadecimal form)
  - 0.25 → 0x3fd0000000000000
  - -1.25 → 0xbff4000000000000
- INT8
  - 5 → 00000101
  - -5 → 11111011
- INT16
  - 5 → 0000000000000101
  - -5 → 1111111111111011
- INT32 (in hexadecimal form)
  - 5 → 0x00000005
  - -5 → 0xfffffffb
Note: The prefix 0x indicates a hexadecimal representation. Each hexadecimal digit encodes four binary bits, so a value such as 0x3e800000 can be expanded back into binary (and then into decimal) using standard base conversion.
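If you want to reproduce these encodings yourself, Python's struct module can pack values into their IEEE 754 and two's-complement byte patterns. A small verification sketch (big-endian packing, so the hex digits read in the same order as above):

```python
import struct

def hex_bits(fmt: str, value) -> str:
    """Pack `value` big-endian with the given struct format code and return it as hex."""
    return struct.pack(">" + fmt, value).hex()

print(hex_bits("e", 0.25), hex_bits("e", -1.25))  # 3400 bd00                          (FP16)
print(hex_bits("f", 0.25), hex_bits("f", -1.25))  # 3e800000 bfa00000                  (FP32)
print(hex_bits("d", 0.25), hex_bits("d", -1.25))  # 3fd0000000000000 bff4000000000000  (FP64)
print(hex_bits("b", 5), hex_bits("b", -5))        # 05 fb                              (INT8)
print(hex_bits("h", 5), hex_bits("h", -5))        # 0005 fffb                          (INT16)
print(hex_bits("i", 5), hex_bits("i", -5))        # 00000005 fffffffb                  (INT32)
```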
Differences Between FP16 and INT16
FP16:
- 16 bits split into sign (1 bit), exponent (5 bits), mantissa (10 bits).
- Can represent fractional values and large ranges (though less than FP32 or FP64).
INT16:
- 16 bits in two's complement: 1 sign bit and 15 value bits.
- Represents only integer values between -32,768 and +32,767.
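A short NumPy comparison (illustrative) makes the practical difference visible: FP16 can approximate fractions and reach beyond ±32,767, but it cannot represent every integer in its range exactly, whereas INT16 is exact but integer-only:

```python
import numpy as np

print(float(np.float16(0.1)))    # 0.0999755859375: the nearest FP16 value to 0.1
print(int(np.int16(0.1)))        # 0: the fractional part is simply discarded

print(np.finfo(np.float16).max)  # 65504.0: FP16 reaches further than INT16...
print(np.iinfo(np.int16).max)    # 32767

print(float(np.float16(2049)))   # 2048.0: ...but with only ~11 significand bits,
                                 # not every integer above 2048 is representable
```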
What is Mixed Precision?
Mixed precision is the practice of using multiple floating-point precisions (e.g., FP16 and FP32) in a single computation workflow, commonly seen in deep learning. By performing most operations in lower precision (FP16) and only resorting to higher precision (FP32) where necessary (such as maintaining a master copy of weights or calculating delicate operations like loss), mixed precision cuts memory usage and computational costs. This leads to faster training or inference speeds and often allows for larger batch sizes without significant drops in model accuracy. GPU architectures with specialized hardware units (e.g., NVIDIA Tensor Cores) see especially large gains from mixed precision.
Behind the scenes, frameworks like PyTorch and TensorFlow offer automatic mixed precision (AMP), which selectively casts tensors to lower precision and carefully handles potential numerical instability issues (such as underflow) through techniques like "loss scaling." As a result, users can enable mixed precision with minimal code changes while reaping substantial performance benefits. Beyond deep learning, some high-performance computing (HPC) applications also adopt mixed precision for iterative tasks where the bulk of the computation can safely be done in lower precision.
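As a concrete sketch, the snippet below shows the common PyTorch AMP training-loop pattern. The model, data, and hyperparameters are invented purely for illustration, and it assumes a CUDA-capable GPU (newer PyTorch releases also expose the same functionality under torch.amp):

```python
import torch

# Toy model and optimizer, purely for illustration.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # handles loss scaling to avoid FP16 underflow

for _ in range(100):
    inputs = torch.randn(32, 1024, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # ops run in FP16 where safe, FP32 where needed
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, targets)

    scaler.scale(loss).backward()  # backpropagate on the scaled loss
    scaler.step(optimizer)         # unscale gradients, then take the optimizer step
    scaler.update()                # adjust the scale factor for the next iteration
```

Scaling the loss before backpropagation keeps small FP16 gradients from underflowing to zero; the scaler unscales them again before the optimizer updates the FP32 weights.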
Conclusion
Floating-point formats—FP16, FP32, and FP64—form the backbone of numerical representation in modern GPUs and CPUs, striking a balance between precision, range, and resource usage. Integer formats (INT8, INT16, INT32) remain indispensable for tasks where fractional or exponential representation isn’t required.
- FP16 saves memory and bandwidth but compromises on precision.
- FP32 strikes a balance suitable for many machine learning and graphics applications.
- FP64 provides the highest precision for demanding scientific computations.
- FP8 (not covered in detail here) is extremely limited and is rarely used outside of very specialized scenarios.
Understanding these trade-offs ensures that developers and data scientists can make informed decisions when optimizing performance and accuracy in their applications.