Two’s Complement

The bit sequence $b_{n - 1} b_{n - 2} \dots b_{0}$ represents the number $- b_{n - 1} \cdot 2^{n - 1} + b_{n - 2} \cdot 2^{n - 2} + \dots + b_{0} \cdot 2^{0}$

If the most significant bit is 1, the overall value is negative, because its weight $- 2^{n - 1}$ outweighs the sum of all the other weights, which is at most $2^{n - 1} - 1$

The value $- x$ is represented as $\neg x + 1$
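The two rules above can be checked with a short sketch. The helper names `from_twos_complement` and `negate` are made up for this illustration, and the fixed width `n` is modeled with a mask because Python integers are unbounded.

```python
def from_twos_complement(bits: str) -> int:
    """Decode a bit string b_{n-1}...b_0 as a two's-complement integer."""
    n = len(bits)
    # The most significant bit carries the negative weight -2^(n-1).
    value = -int(bits[0]) * 2 ** (n - 1)
    # Every remaining bit b_i carries the usual positive weight 2^i.
    for i, b in enumerate(reversed(bits[1:])):
        value += int(b) * 2 ** i
    return value


def negate(x: int, n: int) -> int:
    """Return the n-bit pattern of -x, computed as (NOT x) + 1."""
    mask = (1 << n) - 1  # keeps only the low n bits
    return ((~x & mask) + 1) & mask


print(from_twos_complement("1011"))      # -8 + 0 + 2 + 1 = -5
print(format(negate(0b0101, 4), "04b"))  # 1011, the 4-bit pattern of -5
```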

Fixed-Point

It uses the notation $b_{m} b_{m - 1} \dots b_{1} b_{0} . b_{- 1} b_{- 2} \dots b_{- n + 1} b_{- n}$ to represent the number $\sum_{i = - n}^{m} 2^{i} \cdot b_{i}$

It can represent numbers of similar magnitudes with the same precision. For example, numbers with 4 bits after the binary point have precision up to $1/ 2^{4} = 1/16$

It can only exactly represent numbers of the form $x \cdot 2^{y}$, where $x$ and $y$ are integers. Other rational numbers have repeating bit representations

The binary point has a fixed position, so a lot of digits are needed to represent very small or very large numbers
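As a rough illustration of the points above, the sketch below decodes a fixed-point bit string and shows that $1/10$ has a repeating expansion. Both helper names are hypothetical, and exact arithmetic is done with `fractions.Fraction` so that no floating-point rounding sneaks in.

```python
from fractions import Fraction


def from_fixed_point(bits: str) -> Fraction:
    """Decode 'b_m...b_0.b_-1...b_-n' as the sum of 2^i * b_i."""
    int_part, _, frac_part = bits.partition(".")
    value = Fraction(0)
    for i, b in enumerate(reversed(int_part)):   # weights 2^0, 2^1, ...
        value += int(b) * Fraction(2) ** i
    for i, b in enumerate(frac_part, start=1):   # weights 2^-1, 2^-2, ...
        value += int(b) * Fraction(1, 2 ** i)
    return value


def fraction_bits(x: Fraction, n: int) -> str:
    """First n bits after the binary point of 0 <= x < 1 (truncated)."""
    bits = []
    for _ in range(n):
        x *= 2
        bit = int(x >= 1)
        bits.append(str(bit))
        x -= bit
    return "".join(bits)


print(from_fixed_point("101.11"))         # 23/4, i.e. 5.75
print(fraction_bits(Fraction(1, 10), 12)) # 000110011001, the group 0011 repeats
```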

Floating Point (IEEE-754)

It represents numbers of the form $x \cdot 2^{y}$ by specifying $x$ and $y$

The number is represented as $(- 1)^{S} \cdot M \cdot 2^{E}$

The bit representation is divided into three fields to encode these values:

The single sign bit $s$ directly encodes the sign $S$
The $k$-bit exponent field $exp = e_{k - 1} \dots e_{1} e_{0}$ encodes the exponent $E$
The $n$-bit fraction field $frac = f_{n - 1} \dots f_{1} f_{0}$ encodes the significand (mantissa) $M$
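The three fields can be pulled out of a real value with shifts and masks. This is a minimal sketch for single precision ($k = 8$, $n = 23$); the function name `float32_fields` is an assumption of this example, and `struct` is used only to obtain the 32-bit pattern.

```python
import struct


def float32_fields(x: float) -> tuple:
    """Return (s, exp, frac) for x rounded to single precision."""
    # Reinterpret the 4 bytes of the float as one unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31              # 1 sign bit
    exp = (bits >> 23) & 0xFF   # 8 exponent bits
    frac = bits & 0x7FFFFF      # 23 fraction bits
    return s, exp, frac


s, exp, frac = float32_fields(-6.5)
print(s, exp, format(frac, "023b"))  # 1 129 10100000000000000000000
```

Here $-6.5 = -1.101_2 \cdot 2^{2}$, so the sign bit is 1, the exponent field holds $2 + 127 = 129$, and the fraction field holds the bits after the leading 1.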

Normalized Values

$exp$ is neither all zeros nor all ones

$E = exp - Bias$ where $Bias = 2^{k - 1} - 1$

$M = 1.frac$

Denormalized Numbers

$exp$ is all zeros

$E = 1 - Bias$ where $Bias = 2^{k - 1} - 1$

$M = 0.frac$

These are used to represent $\pm 0$ and numbers close to zero

The smallest normalized value $2^{1 - Bias} \cdot 1.0$ comes right after the biggest denormalized number $2^{1 - Bias} \cdot 0.1 \dots 1$
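The smooth transition can be observed directly in single precision, where $1 - Bias = -126$. The sketch below builds the two adjacent bit patterns by hand; `bits_to_float32` is a hypothetical helper name.

```python
import struct


def bits_to_float32(bits: int) -> float:
    """Interpret a 32-bit unsigned integer as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


largest_denorm = bits_to_float32(0x007FFFFF)  # exp all zeros, frac all ones
smallest_norm = bits_to_float32(0x00800000)   # exp = 1, frac all zeros

# The two patterns differ by 1 as integers, and the values differ by
# exactly one unit of the denormalized spacing, 2^-149.
print(smallest_norm == 2.0 ** -126)
print(smallest_norm - largest_denorm == 2.0 ** -149)
```

Using $E = 1 - Bias$ rather than $E = -Bias$ for denormalized numbers is what makes the two scales line up.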

Special Values

$exp$ is all ones

If $frac$ is all zeros, it represents $\pm \infty$ depending on sign $S$
If $frac$ is nonzero, it represents $NaN$
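Putting the three cases together, a decoder might look like the following sketch. The function `decode` and its default widths ($k = 8$, $n = 23$, i.e. single precision) are assumptions of this example, and the result is returned as a Python float, which is itself a double.

```python
import math


def decode(s: int, exp: int, frac: int, k: int = 8, n: int = 23) -> float:
    """Value of a (s, exp, frac) triple under the rules described above."""
    bias = 2 ** (k - 1) - 1
    sign = (-1) ** s
    if exp == 2 ** k - 1:            # special values: exp is all ones
        return sign * math.inf if frac == 0 else math.nan
    if exp == 0:                     # denormalized: exp is all zeros
        E = 1 - bias
        M = frac / 2 ** n            # M = 0.frac
    else:                            # normalized
        E = exp - bias
        M = 1 + frac / 2 ** n        # M = 1.frac
    return sign * M * 2.0 ** E


print(decode(1, 129, 0x500000))  # -6.5
print(decode(0, 0, 0))           # 0.0
print(decode(0, 255, 0))         # inf
```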

Comparison

The IEEE format was designed so that floating-point numbers could be sorted using an integer sorting routine. If we interpret the bit representations of nonnegative values as unsigned integers, they occur in ascending order, as do the values they represent as floating-point numbers. Negative values have a leading 1 and occur in descending order, so they need separate handling

This is why we use $Bias$ to represent negative $E$ instead of two’s complement
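A small experiment makes the ordering property concrete. The sketch below sorts a handful of nonnegative values by their single-precision bit patterns; the sample values and the helper name `float32_bits` are arbitrary choices for this illustration, and the list is deliberately restricted to nonnegative numbers.

```python
import struct


def float32_bits(x: float) -> int:
    """Bit pattern of x in single precision, as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


values = [1.5, 0.0, 1e-40, 3.0e38, 0.1, 2.0, float("inf")]

# Sorting by the integer key gives the same order as sorting by value.
by_bits = sorted(values, key=float32_bits)
print(by_bits == sorted(values))

# With the sign bit set, the integer order runs the other way.
print(float32_bits(-2.0) > float32_bits(-1.0))
```

The list mixes a denormalized value (`1e-40`), normalized values, and infinity, so the comparison crosses all three encodings.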

Conversions

From int to float, the number cannot overflow, but it may be rounded
From int or float to double, the exact numeric value can be preserved because double has both greater range (i.e., the range of representable values) and greater precision (i.e., the number of significant bits)
From double to float, the value can overflow to $\pm \infty$, since the range is smaller. Otherwise, it may be rounded because the precision is smaller
From float or double to int, the value will be rounded toward zero. Furthermore, the value may overflow
From a smaller unsigned integer to a larger integer, the value is zero-extended
From a smaller signed integer to a larger integer, the value is sign-extended
From a larger integer to a smaller integer, the value is truncated
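Some of these rules can be imitated in Python, with the caveat that Python has no fixed-width integer or single-precision types, so the sketch below models them with `struct` and bit masks. The helper names are hypothetical, and the overflow cases are left out because Python does not reproduce them faithfully.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def sign_extend(bits: int, src: int, dst: int) -> int:
    """Widen a src-bit pattern to dst bits by copying the sign bit."""
    if (bits >> (src - 1)) & 1:
        # Fill the new high bits (positions src..dst-1) with ones.
        bits |= ((1 << dst) - 1) & ~((1 << src) - 1)
    return bits


def truncate(bits: int, dst: int) -> int:
    """Narrow a bit pattern to dst bits by dropping the high bits."""
    return bits & ((1 << dst) - 1)


# int to float: 2^24 + 1 needs 25 significant bits, so it is rounded.
print(to_float32(float(2 ** 24 + 1)))        # 16777216.0

# float to int: Python's int() also rounds toward zero.
print(int(-2.7), int(2.7))                   # -2 2

# Sign extension of the 8-bit pattern for -5, then truncation.
print(hex(sign_extend(0xFB, 8, 16)))         # 0xfffb
print(hex(truncate(0x1234, 8)))              # 0x34
```

Zero extension needs no helper in this model, since widening an unsigned pattern leaves the integer unchanged.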
