What are FP16, FP32, and FP64?
Handling numerical data efficiently and accurately is a cornerstone of modern computing, particularly in scientific simulations, machine learning, and graphics. One of the most common ways to represent real (non-integer) numbers digitally is through floating-point (FP) formats.
In this post, we’ll explore the basics of floating-point representation, various precision modes (FP16, FP32, FP64), and even take a brief look at integer representations (INT8, INT16, INT32).
What is a Floating Point Number?
"Floating point" refers to a number whose decimal (or binary) point can "float" to different positions, allowing the representation of very large values, very small values, and fractions. Most floating-point implementations follow the IEEE 754 standard, which typically uses base 2 (binary) for calculations.
The Scientific Notation Approach
A floating-point number is often represented in a binary scientific notation:
mantissa (significand) × base^exponent
- Mantissa (or significand): A normalized binary fraction that captures the significant digits.
- Exponent: Determines the power of the base (commonly base 2).
Because of this format, floating-point numbers can represent an extremely wide range of values, from tiny fractions to astronomically large magnitudes, albeit with certain limitations on precision.
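To make this concrete, here is a small illustrative Python snippet (standard library only) that decomposes a few values into their mantissa-times-power-of-two form:

```python
import math

# Decompose a few floats into mantissa * 2**exponent form.
for x in (0.25, 6.5, -1.25):
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, with 0.5 <= |mantissa| < 1
    print(f"{x} = {mantissa} * 2**{exponent}")
    print(f"  IEEE-style hex view: {x.hex()}")  # e.g. 0x1.0000000000000p-2 for 0.25
```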
Common Floating-Point Precision Modes
FP16 (Half Precision)
- Uses 16 bits per floating-point number.
- Breakdown:
  - 1 bit for sign
  - 5 bits for exponent
  - 10 bits for mantissa
- Benefits: Significantly smaller memory footprint and reduced memory bandwidth requirements compared to higher-precision modes.
- Drawbacks: Lower numerical precision and a narrower representable range, which can lead to rounding errors and reduced accuracy in some applications (see the short demonstration below).
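To illustrate those drawbacks, here is a brief NumPy sketch (assuming NumPy is installed) showing FP16's limited range and granularity:

```python
import numpy as np

print(np.finfo(np.float16).max)   # 65504.0, the largest finite FP16 value
print(np.float16(70000.0))        # inf: 70000 exceeds the FP16 range (may emit an overflow warning)
print(np.float16(1e-8))           # 0.0: too small even for FP16 subnormals, underflows to zero
print(float(np.float16(1000.1)))  # 1000.0: values near 1000 are spaced 0.5 apart in FP16
```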
FP32 (Single Precision)
- Uses 32 bits per floating-point number.
- Offers higher numerical precision than FP16.
- More memory usage than FP16, but still widely used (often the default in many GPU computing frameworks).
- Breakdown:
  - 1 bit for sign
  - 8 bits for exponent
  - 23 bits for mantissa
FP64 (Double Precision)
- Uses 64 bits per floating-point number.
- Breakdown:
  - 1 bit for sign
  - 11 bits for exponent
  - 52 bits for mantissa
- Provides even higher precision than FP32.
- Requires more memory and higher bandwidth, so it’s often reserved for workloads that genuinely require double precision (e.g., high-accuracy scientific simulations).
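For a rough side-by-side comparison, the following NumPy sketch (illustrative only) stores the same value at all three precisions and reports the rounding error and the memory footprint of a million-element array:

```python
import numpy as np

# Rounding error when storing 0.1 at each precision.
for dtype in (np.float16, np.float32, np.float64):
    stored = float(dtype(0.1))
    print(f"{np.dtype(dtype).name}: stored={stored!r}, error={abs(stored - 0.1):.2e}")

# Memory footprint scales directly with the bit width.
for dtype in (np.float16, np.float32, np.float64):
    arr = np.ones(1_000_000, dtype=dtype)
    print(f"{arr.dtype.name}: {arr.nbytes / 1e6:.0f} MB for 1,000,000 elements")
```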
Integer Representations (INT8, INT16, INT32)
Although floating-point formats are crucial for representing fractional values and supporting vast dynamic range, integer formats remain essential in many other scenarios (e.g., counting, indexing, certain inference operations in machine learning).
- INT8:
  - 8 bits total in two's complement: 1 sign bit and 7 value bits.
  - Range: -128 to +127.
- INT16:
  - 16 bits total in two's complement: 1 sign bit and 15 value bits.
  - Range: -32,768 to +32,767.
- INT32:
  - 32 bits total in two's complement: 1 sign bit and 31 value bits.
  - Range: -2,147,483,648 to +2,147,483,647.
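The negative ends of those ranges come from two's complement encoding. Here is a minimal Python sketch (the helper name to_twos_complement is just an illustrative choice) showing how the bit patterns are formed:

```python
def to_twos_complement(value: int, bits: int) -> str:
    """Return the two's-complement bit pattern of `value` as a binary string."""
    mask = (1 << bits) - 1          # e.g. 0xFF for 8 bits
    return format(value & mask, f"0{bits}b")

print(to_twos_complement(5, 8))     # 00000101
print(to_twos_complement(-5, 8))    # 11111011  (invert the bits of 5, then add 1)
print(to_twos_complement(-5, 16))   # 1111111111111011
print(to_twos_complement(-128, 8))  # 10000000  (the most negative INT8 value)
```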
Binary Representation Examples
While the exact binary encoding can differ based on hardware architecture and endianness, here are some illustrative examples:
- FP16
  - 0.25 → 0011010000000000
  - -1.25 → 1011110100000000
- FP32 (in hexadecimal form)
  - 0.25 → 0x3e800000
  - -1.25 → 0xbfa00000
- FP64 (in hexadecimal form)
  - 0.25 → 0x3fd0000000000000
  - -1.25 → 0xbff4000000000000
- INT8
  - 5 → 00000101
  - -5 → 11111011
- INT16
  - 5 → 0000000000000101
  - -5 → 1111111111111011
- INT32 (in hexadecimal form)
  - 5 → 0x00000005
  - -5 → 0xfffffffb
Note: The prefix 0x indicates a hexadecimal representation. Each hexadecimal digit encodes four binary bits, so a value such as 0x3e800000 can be expanded back into binary (and then into decimal) using standard base conversion.
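If you want to reproduce these encodings yourself, Python's struct module can pack values into their IEEE 754 and two's-complement byte patterns. A small verification sketch (big-endian packing, so the hex digits read in the same order as above):

```python
import struct

def hex_bits(fmt: str, value) -> str:
    """Pack `value` big-endian with the given struct format code and return it as hex."""
    return struct.pack(">" + fmt, value).hex()

print(hex_bits("e", 0.25), hex_bits("e", -1.25))  # 3400 bd00                          (FP16)
print(hex_bits("f", 0.25), hex_bits("f", -1.25))  # 3e800000 bfa00000                  (FP32)
print(hex_bits("d", 0.25), hex_bits("d", -1.25))  # 3fd0000000000000 bff4000000000000  (FP64)
print(hex_bits("b", 5), hex_bits("b", -5))        # 05 fb                              (INT8)
print(hex_bits("h", 5), hex_bits("h", -5))        # 0005 fffb                          (INT16)
print(hex_bits("i", 5), hex_bits("i", -5))        # 00000005 fffffffb                  (INT32)
```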
Differences Between FP16 and INT16
FP16:
- 16 bits split into sign (1 bit), exponent (5 bits), mantissa (10 bits).
- Can represent fractional values and large ranges (though less than FP32 or FP64).
INT16:
- 16 bits in two's complement: 1 sign bit and 15 value bits.
- Represents only integer values between -32,768 and +32,767.
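A short NumPy comparison (illustrative) makes the practical difference visible: FP16 can approximate fractions and reach beyond ±32,767, but it cannot represent every integer in its range exactly, whereas INT16 is exact but integer-only:

```python
import numpy as np

print(float(np.float16(0.1)))    # 0.0999755859375: the nearest FP16 value to 0.1
print(int(np.int16(0.1)))        # 0: the fractional part is simply discarded

print(np.finfo(np.float16).max)  # 65504.0: FP16 reaches further than INT16...
print(np.iinfo(np.int16).max)    # 32767

print(float(np.float16(2049)))   # 2048.0: ...but with only ~11 significand bits,
                                 # not every integer above 2048 is representable
```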
What is Mixed Precision?
Mixed precision is the practice of using multiple floating-point precisions (e.g., FP16 and FP32) in a single computation workflow, commonly seen in deep learning. By performing most operations in lower precision (FP16) and only resorting to higher precision (FP32) where necessary (such as maintaining a master copy of weights or calculating delicate operations like loss), mixed precision cuts memory usage and computational costs. This leads to faster training or inference speeds and often allows for larger batch sizes without significant drops in model accuracy. GPU architectures with specialized hardware units (e.g., NVIDIA Tensor Cores) see especially large gains from mixed precision.
Behind the scenes, frameworks like PyTorch and TensorFlow offer automatic mixed precision (AMP), which selectively casts tensors to lower precision and carefully handles potential numerical instability issues (such as underflow) through techniques like "loss scaling." As a result, users can enable mixed precision with minimal code changes while reaping substantial performance benefits. Beyond deep learning, some high-performance computing (HPC) applications also adopt mixed precision for iterative tasks where the bulk of the computation can safely be done in lower precision.
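As a concrete sketch, the snippet below shows the common PyTorch AMP training-loop pattern. The model, data, and hyperparameters are invented purely for illustration, and it assumes a CUDA-capable GPU (newer PyTorch releases also expose the same functionality under torch.amp):

```python
import torch

# Toy model and optimizer, purely for illustration.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # handles loss scaling to avoid FP16 underflow

for _ in range(100):
    inputs = torch.randn(32, 1024, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # ops run in FP16 where safe, FP32 where needed
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, targets)

    scaler.scale(loss).backward()  # backpropagate on the scaled loss
    scaler.step(optimizer)         # unscale gradients, then take the optimizer step
    scaler.update()                # adjust the scale factor for the next iteration
```

Scaling the loss before backpropagation keeps small FP16 gradients from underflowing to zero; the scaler unscales them again before the optimizer updates the FP32 weights.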
Conclusion
Floating-point formats—FP16, FP32, and FP64—form the backbone of numerical representation in modern GPUs and CPUs, striking a balance between precision, range, and resource usage. Integer formats (INT8, INT16, INT32) remain indispensable for tasks where fractional or exponential representation isn’t required.
- FP16 saves memory and bandwidth but compromises on precision.
- FP32 strikes a balance suitable for many machine learning and graphics applications.
- FP64 provides the highest precision for demanding scientific computations.
- FP8 (not covered in detail here) is extremely limited and is rarely used outside of very specialized scenarios.
Understanding these trade-offs ensures that developers and data scientists can make informed decisions when optimizing performance and accuracy in their applications.