Floating point tables and links

By Wolfgang Keller
Draft
Originally written 2023-08-15
Last modified 2024-01-18

Basics

In IEEE 754, a binary non-denormalized 32- or 64-bit floating point number consists of one sign bit, \(n_e\) exponent bits and \(n_s\) significand bits, where \(1 + n_e + n_s = n\) and \(n\) is the total number of bits.

For binary32 (single-precision) numbers, the values are \(n_e = 8\) and \(n_s = 23\).

For binary64 (double-precision) numbers, the values are \(n_e = 11\) and \(n_s = 52\).

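Let \(s\) be the value of the sign bit, \(e\) the value of the exponent bits read as an unsigned integer, and \(f\) the value of the significand bits read as an unsigned integer.

Then the encoded value is:

\((-1)^s \cdot 0\) (signed zero) if \(e = 0\) and \(f = 0\),
\((-1)^s \cdot 2^{1-(2^{n_e-1}-1)} \cdot \frac{f}{2^{n_s}}\) (subnormal number) if \(e = 0\) and \(f \neq 0\),
\((-1)^s \cdot 2^{e-(2^{n_e-1}-1)} \cdot \left(1 + \frac{f}{2^{n_s}}\right)\) (normal number) if \(0 < e < 2^{n_e}-1\),
\((-1)^s \cdot \infty\) if \(e = 2^{n_e}-1\) and \(f = 0\),
NaN if \(e = 2^{n_e}-1\) and \(f \neq 0\).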

The number \(2^{n_e-1}-1\) occurring in the exponent of \(2^{e-(2^{n_e-1}-1)}\) in the formula for normal numbers is called the exponent bias. It equals \(127\) for binary32 and \(1023\) for binary64.
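As a rough illustration of how the exponent bias enters the decoding, the following Python sketch splits a bit pattern into its three fields and computes the value. The function name `decode` and the field names `s`, `e` and `f` (sign, exponent and significand fields, each read as an unsigned integer) are assumptions made for this example, not notation taken from the standard.

```python
import math

def decode(bits: int, n_e: int, n_s: int):
    """Decode an IEEE 754 binary bit pattern into (s, e, f, value).

    s, e and f are the raw sign, exponent and significand fields,
    each read as an unsigned integer (hypothetical naming).
    """
    s = bits >> (n_e + n_s)               # topmost bit
    e = (bits >> n_s) & ((1 << n_e) - 1)  # next n_e bits
    f = bits & ((1 << n_s) - 1)           # lowest n_s bits
    bias = 2 ** (n_e - 1) - 1             # exponent bias
    sign = -1.0 if s else 1.0

    if e == (1 << n_e) - 1:
        # all exponent bits set: infinity (f == 0) or NaN (f != 0)
        value = sign * math.inf if f == 0 else math.nan
    elif e == 0:
        # zero or subnormal: no implicit leading 1, exponent fixed at 1 - bias
        value = sign * math.ldexp(f, 1 - bias - n_s)
    else:
        # normal number: implicit leading 1, exponent e - bias
        value = sign * math.ldexp((1 << n_s) + f, e - bias - n_s)
    return s, e, f, value

print(decode(0x3F800000, 8, 23))           # binary32 pattern of +1.0
print(decode(0x3FF0000000000000, 11, 52))  # binary64 pattern of +1.0
```

Using `math.ldexp` instead of floating point powers keeps the scaling exact, because the integer significand is only shifted by a power of two.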

Interesting side effects of this encoding are the following:

Important floating point numbers

Let us tabulate some important floating point numbers in the binary32 and binary64 format:

binary32 format

| Description | Number | Value | Binary encoding | Hexadecimal encoding |
|---|---|---|---|---|
| \(+1.0\) | \(2^0\) | +1*10^0 | 0 01111111 00000000000000000000000 | 3F 80 00 00 |
| \(-1.0\) | \(-2^0\) | -1*10^0 | 1 01111111 00000000000000000000000 | BF 80 00 00 |
| \(+0.5\) | \(2^{-1}=\frac{1}{2}\) | +5*10^-1 | 0 01111110 00000000000000000000000 | 3F 00 00 00 |
| \(-0.5\) | \(-2^{-1}=-\frac{1}{2}\) | -5*10^-1 | 1 01111110 00000000000000000000000 | BF 00 00 00 |
| \(+2.0\) | \(2^1\) | +2*10^0 | 0 10000000 00000000000000000000000 | 40 00 00 00 |
| \(-2.0\) | \(-2^1\) | -2*10^0 | 1 10000000 00000000000000000000000 | C0 00 00 00 |
| \(+0\) | | +0 | 0 00000000 00000000000000000000000 | 00 00 00 00 |
| \(-0\) | | -0 | 1 00000000 00000000000000000000000 | 80 00 00 00 |
| largest number less than \(1\) | | | | |
| smallest number larger than \(1\) | | | | |
| smallest positive normal number | | | | |
| smallest integral \(w\) such that all normal \(w' \geq w\) are integral | | | | |
| largest number that is smaller than \(w\) | | | | |
| smallest positive integral \(x\) such that \(x+1\) cannot be represented | | | | |
| \(x-1\) | | | | |
| \(x+2\) | | | | |
| largest normal number | | | | |
| smallest positive subnormal number | \(2^{-149} = \frac{1}{713\_623\_846\_352\_979\_940\_529\_142\_984\_724\_747\_568\_191\_373\_312}\) | +1.40129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125*10^-45 | 0 00000000 00000000000000000000001 | 00 00 00 01 |
| largest subnormal number | | | 0 00000000 11111111111111111111111 | 00 7F FF FF |
| \(+\infty\) | | +Inf | 0 11111111 00000000000000000000000 | 7F 80 00 00 |
| \(-\infty\) | | -Inf | 1 11111111 00000000000000000000000 | FF 80 00 00 |
| sNaN with minimum possible value for \(f\) (typical encoding) | | sNaN | 0 11111111 00000000000000000000001 | 7F 80 00 01 |
| sNaN with maximum possible value for \(f\) | | | 0 11111111 01111111111111111111111 | 7F BF FF FF |
| qNaN with minimum possible value for \(f\) | | | 0 11111111 10000000000000000000000 | 7F C0 00 00 |
| qNaN with typical encoding | | qNaN | 0 11111111 10000000000000000000001 | 7F C0 00 01 |
| qNaN with maximum possible value for \(f\) | | | 0 11111111 11111111111111111111111 | 7F FF FF FF |
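The binary32 encodings above can be cross-checked with Python's standard `struct` module, and the rows still left blank in this draft can be explored the same way. The following is only a sketch: the helper names `f32_bits`, `f32_from_bits` and `show32` are made up for this example, and it assumes that `struct` packs `"f"` as IEEE 754 binary32 with round-to-nearest, which is the case on common platforms.

```python
import struct

def f32_bits(x: float) -> int:
    """Return the binary32 bit pattern of x as an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def f32_from_bits(bits: int) -> float:
    """Return the binary32 number with the given bit pattern (as a Python float)."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def show32(x: float) -> str:
    """Format x as 'sign exponent significand | hex bytes'."""
    bits = f32_bits(x)
    b = format(bits, "032b")
    h = format(bits, "08X")
    hex_bytes = " ".join(h[i:i + 2] for i in range(0, 8, 2))
    return f"{b[0]} {b[1:9]} {b[9:]} | {hex_bytes}"

for x in (1.0, -1.0, 0.5, -0.5, 2.0, -2.0, 0.0, -0.0, 2.0 ** -149):
    print(show32(x))

# Neighbours of 1.0: step the bit pattern down or up by one.
print(show32(f32_from_bits(f32_bits(1.0) - 1)))  # largest number less than 1
print(show32(f32_from_bits(f32_bits(1.0) + 1)))  # smallest number larger than 1
```

For positive finite numbers, adding or subtracting one from the bit pattern moves to the adjacent representable number, which is why the neighbours of \(1\) can be found without any floating point arithmetic.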

binary64 format

| Description | Number | Value | Binary encoding | Hexadecimal encoding |
|---|---|---|---|---|
| \(+1.0\) | \(2^0\) | +1*10^0 | 0 01111111111 0000000000000000000000000000000000000000000000000000 | 3F F0 00 00 00 00 00 00 |
| \(-1.0\) | \(-2^0\) | -1*10^0 | 1 01111111111 0000000000000000000000000000000000000000000000000000 | BF F0 00 00 00 00 00 00 |
| \(+0.5\) | \(2^{-1}=\frac{1}{2}\) | +5*10^-1 | | |
| \(-0.5\) | \(-2^{-1}=-\frac{1}{2}\) | -5*10^-1 | | |
| \(+2.0\) | \(2^1\) | +2*10^0 | | |
| \(-2.0\) | \(-2^1\) | -2*10^0 | | |
| \(+0\) | | +0 | 0 00000000000 0000000000000000000000000000000000000000000000000000 | 00 00 00 00 00 00 00 00 |
| \(-0\) | | -0 | 1 00000000000 0000000000000000000000000000000000000000000000000000 | 80 00 00 00 00 00 00 00 |
| largest number less than \(1\) | | | | |
| smallest number larger than \(1\) | | | | |
| smallest positive normal number | | | | |
| smallest integral \(w\) such that all normal \(w' \geq w\) are integral | | | | |
| largest number that is smaller than \(w\) | | | | |
| smallest positive integral \(x\) such that \(x+1\) cannot be represented | | | | |
| \(x-1\) | | | | |
| \(x+2\) | | | | |
| largest normal number | | | | |
| smallest positive subnormal number | | | 0 00000000000 0000000000000000000000000000000000000000000000000001 | 00 00 00 00 00 00 00 01 |
| largest subnormal number | | | 0 00000000000 1111111111111111111111111111111111111111111111111111 | 00 0F FF FF FF FF FF FF |
| \(+\infty\) | | +Inf | 0 11111111111 0000000000000000000000000000000000000000000000000000 | 7F F0 00 00 00 00 00 00 |
| \(-\infty\) | | -Inf | 1 11111111111 0000000000000000000000000000000000000000000000000000 | FF F0 00 00 00 00 00 00 |
| sNaN with minimum possible value for \(f\) (typical encoding) | | sNaN | 0 11111111111 0000000000000000000000000000000000000000000000000001 | 7F F0 00 00 00 00 00 01 |
| sNaN with maximum possible value for \(f\) | | | 0 11111111111 0111111111111111111111111111111111111111111111111111 | 7F F7 FF FF FF FF FF FF |
| qNaN with minimum possible value for \(f\) | | | 0 11111111111 1000000000000000000000000000000000000000000000000000 | 7F F8 00 00 00 00 00 00 |
| qNaN with typical encoding | | qNaN | 0 11111111111 1000000000000000000000000000000000000000000000000001 | 7F F8 00 00 00 00 00 01 |
| qNaN with maximum possible value for \(f\) | | | 0 11111111111 1111111111111111111111111111111111111111111111111111 | 7F FF FF FF FF FF FF FF |
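Since Python's `float` is a binary64 number on common platforms, several of the entries above can be inspected directly. The sketch below is one possible way to do this; the helper name `f64_hex` and the dictionary `candidates` are hypothetical, and `math.nextafter` requires Python 3.9 or later.

```python
import math
import struct
import sys

def f64_hex(x: float) -> str:
    """Return the big-endian binary64 encoding of x as space-separated hex bytes."""
    h = struct.pack(">d", x).hex().upper()
    return " ".join(h[i:i + 2] for i in range(0, 16, 2))

# Candidate values for some of the table rows (assumes float is binary64).
candidates = {
    "+1.0": 1.0,
    "largest number less than 1": math.nextafter(1.0, 0.0),
    "smallest number larger than 1": math.nextafter(1.0, 2.0),
    "smallest positive normal number": sys.float_info.min,
    "largest normal number": sys.float_info.max,
    "smallest positive subnormal number": math.nextafter(0.0, 1.0),
    "+Inf": math.inf,
}

for description, value in candidates.items():
    print(f"{description}: {value!r} -> {f64_hex(value)}")

# From 2**53 on, not every integer is representable: adding 1 is lost to rounding.
x = 2.0 ** 53
print(x + 1 == x)
```

The final comparison illustrates the row about the smallest positive integral \(x\) such that \(x+1\) cannot be represented: in binary64 the spacing between adjacent numbers becomes \(2\) at \(2^{53}\).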