# Floating point tables and links

By Wolfgang Keller
Draft
Originally written 2023-08-15

## Basics

In IEEE 754, a binary 16/32/64-bit floating point number that is not subnormal (denormalized) consists of

• $$1$$ sign bit,
• $$n_e$$ exponent bits,
• $$n_s$$ explicitly stored significand bits (the fraction). Note that the significand precision is $$n_s+1$$ bits: the $$n_s$$ stored bits plus an implicit leading 1.

where $$1 + n_e + n_s = n$$, with $$n$$ denoting the total number of bits.

• For binary16 (half-precision) numbers, the values for $$n_e$$ and $$n_s$$ are:
  • $$n_e = 5$$
  • $$n_s = 10$$
• For binary32 (single-precision) numbers, the values for $$n_e$$ and $$n_s$$ are:
  • $$n_e = 8$$
  • $$n_s = 23$$
• For binary64 (double-precision) numbers, the values for $$n_e$$ and $$n_s$$ are:
  • $$n_e = 11$$
  • $$n_s = 52$$
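As a quick sanity check of these parameters, the following minimal sketch tabulates the three formats and derives the total width $$n$$ and the significand precision $$n_s+1$$ from them. The class name `BinaryFormat` and the list `FORMATS` are hypothetical helpers introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryFormat:
    """Parameters of an IEEE 754 binary format (hypothetical helper)."""
    name: str
    n_e: int  # number of exponent bits
    n_s: int  # number of explicitly stored significand (fraction) bits

    @property
    def n(self) -> int:
        # Total width: 1 sign bit + exponent bits + fraction bits.
        return 1 + self.n_e + self.n_s

    @property
    def precision(self) -> int:
        # Stored fraction bits plus the implicit leading 1.
        return self.n_s + 1


FORMATS = [
    BinaryFormat("binary16", n_e=5, n_s=10),
    BinaryFormat("binary32", n_e=8, n_s=23),
    BinaryFormat("binary64", n_e=11, n_s=52),
]

for fmt in FORMATS:
    print(f"{fmt.name}: n = {fmt.n}, precision = {fmt.precision} bits")
```

The widths come out as 16, 32 and 64 bits, and the precisions as 11, 24 and 53 bits.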

Let

• $$s$$ denote the value of the sign bit ($$s \in \{0,1\}$$)
• $$e$$ denote the value of the exponent bits, read as an unsigned integer ($$e \in \{0, \ldots, 2^{n_e}-1\}$$)
• $$f$$ denote the value of the fraction bits, read as an unsigned integer ($$f \in \{0, \ldots, 2^{n_s}-1\}$$)

Then:

• If $$e \in \{1, \ldots, 2^{n_e}-2\}$$, a normal value is encoded: $$(-1)^s \cdot 2^{e-(2^{n_e-1}-1)} \cdot (1 + 2^{-n_s} \cdot f). \label{eq:normal}$$
• If $$e = 0$$ and …
  • … $$f = 0$$, $$\pm 0$$ is encoded.
  • … $$f \neq 0$$, a subnormal number is encoded: $$(-1)^s \cdot 2^{-(2^{n_e-1}-2)} \cdot 2^{-n_s} \cdot f = (-1)^s \cdot 2^{-(2^{n_e-1} + n_s) + 2} \cdot f. \label{eq:subnormal}$$
• If $$e = 2^{n_e}-1$$ and …
  • … $$f = 0$$, $$\pm \infty$$ is encoded.
  • … $$f \neq 0$$, a NaN is encoded, either a signaling NaN (sNaN) or a quiet NaN (qNaN). The IEEE 754-2008 and IEEE 754-2019 standards define the following requirement for encoding a signaling/quiet NaN, where $$f_{n_s-1}$$ denotes the most significant bit of $$f$$:
    • if $$f_{n_s-1} = 0$$, the NaN is signaling (sNaN),
    • if $$f_{n_s-1} = 1$$, the NaN is quiet (qNaN).

The number $$2^{n_e-1}-1$$ occurring in the exponent of $$2^{e-(2^{n_e-1}-1)}$$ in $$\eqref{eq:normal}$$ is called the exponent bias.
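The case analysis above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_fields` and `decode` are made up for this example, encodings are passed as unsigned integers, and finite values are returned as exact `Fraction` objects so that no rounding hides mistakes.

```python
import math
import struct
from fractions import Fraction


def split_fields(bits: int, n_e: int, n_s: int):
    """Split an encoding (given as an unsigned integer) into (s, e, f)."""
    f = bits & ((1 << n_s) - 1)
    e = (bits >> n_s) & ((1 << n_e) - 1)
    s = bits >> (n_s + n_e)
    return s, e, f


def decode(bits: int, n_e: int, n_s: int):
    """Decode an encoding by following the case analysis in the text.

    Finite values are returned as exact Fractions (so the sign of zero
    is lost); infinities and NaNs are returned as Python floats.
    """
    s, e, f = split_fields(bits, n_e, n_s)
    sign = -1 if s else 1
    bias = 2 ** (n_e - 1) - 1
    if e == 2 ** n_e - 1:
        if f == 0:
            return sign * math.inf
        return math.nan  # the sNaN/qNaN distinction is not preserved here
    if e == 0:
        # Zero (f == 0) and subnormal numbers share the subnormal formula.
        return sign * f * Fraction(2) ** (-(2 ** (n_e - 1) + n_s) + 2)
    return sign * Fraction(2) ** (e - bias) * (1 + Fraction(f, 2 ** n_s))


# -0.5 as binary32 (BF 00 00 00) and the smallest positive binary32 subnormal.
bits = struct.unpack(">I", struct.pack(">f", -0.5))[0]
print(split_fields(bits, 8, 23))                           # (1, 126, 0)
print(decode(bits, 8, 23))                                 # -1/2
print(decode(0x00000001, 8, 23) == Fraction(1, 2 ** 149))  # True
```

Working with `Fraction` rather than `float` is a deliberate choice for this sketch: it lets the result be compared exactly against `Fraction(x)` for any Python float `x`.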

Interesting side effects of this encoding are the following:

• TODO

## Important floating point numbers

Let us tabulate some important floating point numbers in the binary32 and binary64 formats:

• Simple normal numbers:
  • $$\pm 1.0$$
  • $$\pm 0.5$$
  • $$\pm 2.0$$
• Zero: $$\pm 0$$
• Normal numbers:
  • the largest number less than $$1$$:
  • the smallest number larger than $$1$$:
  • the smallest positive normal number:
  • the smallest integral number $$w$$ such that all normal floating point numbers $$w' \geq w$$ are integral:
  • the largest (normal) number that is smaller than $$w$$:
  • the smallest positive integral (normal) number $$x$$ such that $$x+1$$ cannot be represented as floating point number of the respective type:
  • $$x-1$$:
  • $$x+2$$:
  • the largest normal number:
• Subnormal numbers:
  • the smallest positive subnormal number: $$2^{-(2^{n_e-1} + n_s) + 2}$$
  • the largest subnormal number: $$2^{-(2^{n_e-1} + n_s) + 2} \cdot (2^{n_s}-1)$$
• Infinity: $$\pm \infty$$
• NaNs (with $$s$$ set to $$0$$):
  • sNaNs:
    • sNaN with minimum possible value for $$f$$ (typical encoding on most processors, such as x86 and ARM processors)
    • sNaN with maximum possible value for $$f$$
  • qNaNs:
    • qNaN with minimum possible value for $$f$$
    • qNaN with typical encoding on most processors, such as x86 and ARM processors
    • qNaN with maximum possible value for $$f$$
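The finite numbers in this list can be computed from $$n_e$$ and $$n_s$$ alone. The sketch below does so with exact rational arithmetic. Apart from the two subnormal formulas, the closed forms in it are not taken from the list above; they are our own derivation from the formula for normal values, so treat them as assumptions to be checked. The function name `important_numbers` and the dictionary keys are hypothetical.

```python
import sys
from fractions import Fraction


def important_numbers(n_e: int, n_s: int) -> dict:
    """Exact values of the finite numbers listed above (derived, see lead-in)."""
    bias = 2 ** (n_e - 1) - 1
    e_min = 1 - bias               # exponent of the smallest normal number
    e_max = (2 ** n_e - 2) - bias  # exponent of the largest normal number
    two = Fraction(2)
    w = two ** n_s         # from here on, neighbouring numbers are >= 1 apart
    x = two ** (n_s + 1)   # from here on, neighbouring numbers are >= 2 apart
    tiny = two ** (-(2 ** (n_e - 1) + n_s) + 2)  # smallest positive subnormal
    return {
        "largest number less than 1": 1 - two ** -(n_s + 1),
        "smallest number larger than 1": 1 + two ** -n_s,
        "smallest positive normal number": two ** e_min,
        "w": w,
        "largest number smaller than w": w - Fraction(1, 2),
        "x": x,
        "x - 1": x - 1,
        "x + 2": x + 2,
        "largest normal number": two ** e_max * (2 - two ** -n_s),
        "smallest positive subnormal number": tiny,
        "largest subnormal number": tiny * (2 ** n_s - 1),
    }


# Print approximate decimal values for binary32.
for name, value in important_numbers(8, 23).items():
    print(f"{name}: {float(value)!r}")

# Cross-check the binary64 results against what Python reports for its float type
# (assuming, as on common platforms, that Python floats are binary64).
n64 = important_numbers(11, 52)
print(n64["largest normal number"] == Fraction(sys.float_info.max))
print(n64["smallest positive normal number"] == Fraction(sys.float_info.min))
```

For binary32 this gives $$w = 2^{23}$$ and $$x = 2^{24}$$; for binary64, $$w = 2^{52}$$ and $$x = 2^{53}$$.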

TODO

### binary32 format

| Description | Number | Value | Binary encoding | Hexadecimal encoding |
| --- | --- | --- | --- | --- |
| $$+1.0$$ | $$2^0$$ | +1*10^0 | 0 01111111 00000000000000000000000 | 3F 80 00 00 |
| $$-1.0$$ | $$-2^0$$ | -1*10^0 | 1 01111111 00000000000000000000000 | BF 80 00 00 |
| $$+0.5$$ | $$2^{-1}=\frac{1}{2}$$ | +5*10^-1 | 0 01111110 00000000000000000000000 | 3F 00 00 00 |
| $$-0.5$$ | $$-2^{-1}=-\frac{1}{2}$$ | -5*10^-1 | 1 01111110 00000000000000000000000 | BF 00 00 00 |
| $$+2.0$$ | $$2^1$$ | +2*10^0 | 0 10000000 00000000000000000000000 | 40 00 00 00 |
| $$-2.0$$ | $$-2^1$$ | -2*10^0 | 1 10000000 00000000000000000000000 | C0 00 00 00 |
| $$+0$$ | | +0 | 0 00000000 00000000000000000000000 | 00 00 00 00 |
| $$-0$$ | | -0 | 1 00000000 00000000000000000000000 | 80 00 00 00 |
| largest number less than $$1$$ | | | | |
| smallest number larger than $$1$$ | | | | |
| smallest positive normal number | | | | |
| smallest integral $$w$$ such that all normal $$w' \geq w$$ are integral | | | | |
| largest number that is smaller than $$w$$ | | | | |
| smallest positive integral $$x$$ such that $$x+1$$ cannot be represented | | | | |
| $$x-1$$ | | | | |
| $$x+2$$ | | | | |
| largest normal number | | | | |
| smallest positive subnormal number | $$2^{-149} = \frac{1}{713\_623\_846\_352\_979\_940\_529\_142\_984\_724\_747\_568\_191\_373\_312}$$ | +1.40129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125*10^-45 | 0 00000000 00000000000000000000001 | 00 00 00 01 |
| largest subnormal number | | | 0 00000000 11111111111111111111111 | 00 7F FF FF |
| $$+\infty$$ | | +Inf | 0 11111111 00000000000000000000000 | 7F 80 00 00 |
| $$-\infty$$ | | -Inf | 1 11111111 00000000000000000000000 | FF 80 00 00 |
| sNaN with minimum possible value for $$f$$ (typical encoding) | | sNaN | 0 11111111 00000000000000000000001 | 7F 80 00 01 |
| sNaN with maximum possible value for $$f$$ | | | 0 11111111 01111111111111111111111 | 7F BF FF FF |
| qNaN with minimum possible value for $$f$$ | | | 0 11111111 10000000000000000000000 | 7F C0 00 00 |
| qNaN with typical encoding | | qNaN | 0 11111111 10000000000000000000001 | 7F C0 00 01 |
| qNaN with maximum possible value for $$f$$ | | | 0 11111111 11111111111111111111111 | 7F FF FF FF |
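Rows of this table can be reproduced, or the empty cells explored, with Python's standard `struct` module. The following is a minimal sketch; the helper name `encode_binary32` is hypothetical, and the output layout (sign, exponent and fraction separated by spaces, hex bytes in big-endian order) is chosen to match the two encoding columns above.

```python
import struct


def encode_binary32(x: float):
    """Return (binary string, hex string) in the layout used in the table."""
    raw = struct.pack(">f", x)  # big-endian binary32; rounds x if necessary
    bits = f"{int.from_bytes(raw, 'big'):032b}"
    binary = f"{bits[0]} {bits[1:9]} {bits[9:]}"  # sign, exponent, fraction
    hexa = " ".join(f"{byte:02X}" for byte in raw)
    return binary, hexa


for value in (1.0, -0.5, 2.0, -0.0, float("inf")):
    binary, hexa = encode_binary32(value)
    print(f"{value!r:>5}  {binary}  {hexa}")
```

NaN rows are better checked from the hex side, because a round trip through a Python `float` is not guaranteed to preserve the payload or the signaling bit on every platform.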

### binary64 format

| Description | Number | Value | Binary encoding | Hexadecimal encoding |
| --- | --- | --- | --- | --- |
| $$+1.0$$ | $$2^0$$ | +1*10^0 | 0 01111111111 0000000000000000000000000000000000000000000000000000 | 3F F0 00 00 00 00 00 00 |
| $$-1.0$$ | $$-2^0$$ | -1*10^0 | 1 01111111111 0000000000000000000000000000000000000000000000000000 | BF F0 00 00 00 00 00 00 |
| $$+0.5$$ | $$2^{-1}=\frac{1}{2}$$ | +5*10^-1 | | |
| $$-0.5$$ | $$-2^{-1}=-\frac{1}{2}$$ | -5*10^-1 | | |
| $$+2.0$$ | $$2^1$$ | +2*10^0 | | |
| $$-2.0$$ | $$-2^1$$ | -2*10^0 | | |
| $$+0$$ | | +0 | 0 00000000000 0000000000000000000000000000000000000000000000000000 | 00 00 00 00 00 00 00 00 |
| $$-0$$ | | -0 | 1 00000000000 0000000000000000000000000000000000000000000000000000 | 80 00 00 00 00 00 00 00 |
| largest number less than $$1$$ | | | | |
| smallest number larger than $$1$$ | | | | |
| smallest positive normal number | | | | |
| smallest integral $$w$$ such that all normal $$w' \geq w$$ are integral | | | | |
| largest number that is smaller than $$w$$ | | | | |
| smallest positive integral $$x$$ such that $$x+1$$ cannot be represented | | | | |
| $$x-1$$ | | | | |
| $$x+2$$ | | | | |
| largest normal number | | | | |
| smallest positive subnormal number | | | 0 00000000000 0000000000000000000000000000000000000000000000000001 | 00 00 00 00 00 00 00 01 |
| largest subnormal number | | | 0 00000000000 1111111111111111111111111111111111111111111111111111 | 00 0F FF FF FF FF FF FF |
| $$+\infty$$ | | +Inf | 0 11111111111 0000000000000000000000000000000000000000000000000000 | 7F F0 00 00 00 00 00 00 |
| $$-\infty$$ | | -Inf | 1 11111111111 0000000000000000000000000000000000000000000000000000 | FF F0 00 00 00 00 00 00 |
| sNaN with minimum possible value for $$f$$ (typical encoding) | | sNaN | 0 11111111111 0000000000000000000000000000000000000000000000000001 | 7F F0 00 00 00 00 00 01 |
| sNaN with maximum possible value for $$f$$ | | | 0 11111111111 0111111111111111111111111111111111111111111111111111 | 7F F7 FF FF FF FF FF FF |
| qNaN with minimum possible value for $$f$$ | | | 0 11111111111 1000000000000000000000000000000000000000000000000000 | 7F F8 00 00 00 00 00 00 |
| qNaN with typical encoding | | qNaN | 0 11111111111 1000000000000000000000000000000000000000000000000001 | 7F F8 00 00 00 00 00 01 |
| qNaN with maximum possible value for $$f$$ | | | 0 11111111111 1111111111111111111111111111111111111111111111111111 | 7F FF FF FF FF FF FF FF |
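To check the classification of a binary64 row directly from its hexadecimal encoding, one can inspect the exponent and fraction fields without ever building a float. The sketch below assumes the hex bytes are given in the big-endian order used in the table; the function name `classify_binary64` is hypothetical.

```python
def classify_binary64(hex_bytes: str) -> str:
    """Classify a binary64 encoding given as hex bytes, e.g. '7F F8 00 00 00 00 00 01'."""
    n_e, n_s = 11, 52
    bits = int.from_bytes(bytes.fromhex(hex_bytes), "big")
    e = (bits >> n_s) & (2 ** n_e - 1)
    f = bits & (2 ** n_s - 1)
    if e == 2 ** n_e - 1:
        if f == 0:
            return "infinity"
        # Most significant fraction bit: 1 means quiet, 0 means signaling.
        return "qNaN" if (f >> (n_s - 1)) & 1 else "sNaN"
    if e == 0:
        return "zero" if f == 0 else "subnormal"
    return "normal"


for row in ("3F F0 00 00 00 00 00 00",
            "00 0F FF FF FF FF FF FF",
            "7F F7 FF FF FF FF FF FF",
            "7F F8 00 00 00 00 00 01"):
    print(row, "->", classify_binary64(row))
```

Staying on the integer side is intentional here: it sidesteps the question of whether a signaling NaN survives being loaded into a floating point register.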