Two's Complement
The bit sequence $x_{w-1} x_{w-2} \cdots x_0$ represents the number $-x_{w-1} 2^{w-1} + \sum_{i=0}^{w-2} x_i 2^i$
If the most significant bit of a $w$-bit number is 1, the overall value is negative, because that bit contributes $-2^{w-1}$, which outweighs the sum of all the remaining bits (at most $2^{w-1} - 1$)
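The weighted sum described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the bits arrive as a string with the most significant bit first; `twos_complement_value` is a hypothetical helper name, not something from the source.

```python
def twos_complement_value(bits: str) -> int:
    """Interpret a string of '0'/'1' characters as a two's-complement integer."""
    w = len(bits)
    # The most significant bit carries the negative weight -2^(w-1).
    value = -int(bits[0]) * 2 ** (w - 1)
    # Every remaining bit at position i carries the usual positive weight 2^i.
    for i, bit in enumerate(reversed(bits[1:])):
        value += int(bit) * 2 ** i
    return value


print(twos_complement_value("0101"))  # 5
print(twos_complement_value("1011"))  # -8 + 2 + 1 = -5
print(twos_complement_value("1000"))  # -8, the most negative 4-bit value
```

With the top bit set, the result stays negative even when every other bit is 1: `"1111"` evaluates to -8 + 7 = -1.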
The value is represented as
Fixed-Point
It uses the notation $b_m b_{m-1} \cdots b_1 b_0 . b_{-1} b_{-2} \cdots b_{-n}$ to represent the number $b = \sum_{i=-n}^{m} b_i \times 2^i$
It can represent numbers of similar magnitudes with the same precision. For example, numbers with 4 digits after the decimal point have precision up to $10^{-4}$
It can only exactly represent numbers of the form $x \times 2^y$. Other rational numbers have repeating bit representations; for example, $\frac{1}{5} = 0.\overline{0011}_2$
The binary point has a fixed position, so a lot of digits are needed to represent very small or very large numbers
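As a rough sketch of how a fixed-point bit string maps to a value, the snippet below sums the weight of each bit on either side of the binary point. The helper name `fixed_point_value` and the string input format are assumptions made for illustration; `Fraction` is used so that the results are exact.

```python
from fractions import Fraction


def fixed_point_value(bits: str) -> Fraction:
    """Evaluate a binary string such as '101.11' exactly."""
    integer_part, _, fraction_part = bits.partition(".")
    value = Fraction(0)
    # Bits left of the binary point have weights 2^0, 2^1, ...
    for i, bit in enumerate(reversed(integer_part)):
        value += int(bit) * Fraction(2) ** i
    # Bits right of the binary point have weights 2^-1, 2^-2, ...
    for i, bit in enumerate(fraction_part, start=1):
        value += int(bit) * Fraction(1, 2 ** i)
    return value


print(fixed_point_value("101.11"))  # 23/4, i.e. 5.75
print(fixed_point_value("0.0011"))  # 3/16, a 4-bit approximation of 1/5
```

The second call shows the limitation noted above: with four fractional bits the closest pattern to 1/5 is 3/16, and adding more bits only repeats the `0011` group without ever reaching 1/5 exactly.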
Floating Point (IEEE-754)
It represents numbers of the form $x \times 2^y$ by specifying $x$ and $y$
The number is represented as $V = (-1)^s \times M \times 2^E$
The bit representation is divided into three fields to encode these values:
- The single sign bit $s$ directly encodes the sign $s$
- The $k$-bit exponent field $\text{exp} = e_{k-1} \cdots e_1 e_0$ encodes the exponent $E$
- The $n$-bit fraction field $\text{frac} = f_{n-1} \cdots f_1 f_0$ encodes the significand (mantissa) $M$
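To make the three fields concrete, here is a small sketch that pulls them out of a single-precision value. It assumes the IEEE 754 binary32 layout (1 sign bit, 8 exponent bits, 23 fraction bits) and uses Python's `struct` module to reinterpret the bytes; `float32_fields` is a hypothetical helper name.

```python
import struct


def float32_fields(x: float) -> tuple:
    """Return (sign, exponent field, fraction field) of x stored as a 32-bit float."""
    # Pack as big-endian binary32, then reread the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31            # 1 bit
    exp = (bits >> 23) & 0xFF    # 8-bit exponent field
    frac = bits & 0x7FFFFF       # 23-bit fraction field
    return sign, exp, frac


print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-0.75))  # (1, 126, 4194304)
```

For -0.75 = -1.5 x 2^-1, the exponent field holds 126 (that is, -1 plus a bias of 127) and the fraction field holds 0.5 x 2^23 = 4194304.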
Normalized Values
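$\text{exp}$ is neither all zeros nor all ones
$E = e - \text{Bias}$ and $M = 1 + f$, where $e$ is the unsigned value of the $k$-bit exponent field, $\text{Bias} = 2^{k-1} - 1$, and $0 \le f < 1$ is the fraction field read as the binary fraction $0.f_{n-1} \cdots f_1 f_0$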
Denormalized Numbers
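$\text{exp}$ is all zeros
$E = 1 - \text{Bias}$ and $M = f$, where $\text{Bias} = 2^{k-1} - 1$ for a $k$-bit exponent field and $f$ is the fraction field read as a binary fraction, with no implied leading 1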
These are used to represent $\pm 0$ and numbers close to zero
The smallest normalized value comes right after the biggest denormalized number
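The following sketch decodes a single-precision bit pattern by hand, using the standard rules for the two cases: a normalized value has exponent `exp - Bias` and an implied leading 1, while a denormalized value has exponent `1 - Bias` and no implied 1. The constants assume binary32 (8 exponent bits, 23 fraction bits), and `decode_float32` is a hypothetical helper name.

```python
K, N = 8, 23             # exponent and fraction widths for single precision
BIAS = 2 ** (K - 1) - 1  # 127


def decode_float32(bits: int) -> float:
    """Decode a 32-bit pattern whose exponent field is not all ones."""
    sign = bits >> 31
    exp = (bits >> N) & (2 ** K - 1)
    f = (bits & (2 ** N - 1)) / 2 ** N  # fraction value, 0 <= f < 1
    if exp == 0:
        # Denormalized: E = 1 - Bias, M = f (no implied leading 1)
        e, m = 1 - BIAS, f
    else:
        # Normalized: E = exp - Bias, M = 1 + f
        e, m = exp - BIAS, 1 + f
    return (-1) ** sign * m * 2.0 ** e


largest_denorm = decode_float32(0x007FFFFF)
smallest_norm = decode_float32(0x00800000)
print(largest_denorm < smallest_norm)                 # True
print(smallest_norm - largest_denorm == 2.0 ** -149)  # True: one denormalized step
```

Using `1 - Bias` rather than `-Bias` for the denormalized exponent is what makes the two ranges meet without a gap: the step from the largest denormalized value to the smallest normalized value is the same size as the steps between neighbouring denormalized values.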
Special Values
$\text{exp}$ is all ones
- If $\text{frac}$ is all zeros, it represents $\pm\infty$ depending on the sign bit
- If $\text{frac}$ is nonzero, it represents $\text{NaN}$
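A short sketch of the special-value patterns, assuming the single-precision layout and using `struct` to reinterpret raw bits as a float. The helper name `float32_from_bits` is made up for this example.

```python
import math
import struct


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer as a single-precision float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value


pos_inf = float32_from_bits(0x7F800000)  # sign 0, exponent all ones, fraction zero
neg_inf = float32_from_bits(0xFF800000)  # sign 1, exponent all ones, fraction zero
nan = float32_from_bits(0x7FC00000)      # exponent all ones, fraction nonzero

print(pos_inf, neg_inf)  # inf -inf
print(math.isnan(nan))   # True
print(nan == nan)        # False: NaN compares unequal to everything, itself included
```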
Comparison
The IEEE format was designed so that floating-point numbers could be sorted using an integer sorting routine. If we interpret the bit representations of nonnegative values as unsigned integers, they occur in ascending order, as do the values they represent as floating-point numbers. Negative values have the sign bit set and occur in descending order, so they need separate handling
This is why we use a biased exponent to represent negative exponents instead of two's complement
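The ordering property can be checked with a small experiment. The sketch below assumes single precision and restricts the first check to nonnegative values, where the unsigned bit patterns and the float values sort the same way; `float32_bits` is a hypothetical helper name.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the binary32 bit pattern of x as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


# Nonnegative values listed in ascending order, from zero up to infinity.
values = [0.0, 1e-40, 1.5, 2.0, 1e10, float("inf")]
keys = [float32_bits(v) for v in values]
print(keys == sorted(keys))  # True: the bit patterns are ascending too

# With the sign bit set, the order of the unsigned patterns is reversed.
print(float32_bits(-2.0) > float32_bits(-1.0))  # True
```

The list includes a denormalized value (`1e-40`) and infinity on purpose: because the exponent field sits above the fraction field and is stored with a bias, the ordering holds across all three categories of nonnegative values.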
Conversions
- From `int` to `float`, the number cannot overflow, but it may be rounded
- From `int` or `float` to `double`, the exact numeric value can be preserved because `double` has both greater range (i.e., the range of representable values) and greater precision (i.e., the number of significant bits)
- From `double` to `float`, the value can overflow to $\pm\infty$, since the range is smaller. Otherwise, it may be rounded because the precision is smaller
- From `float` or `double` to `int`, the value will be rounded toward zero. Furthermore, the value may overflow
- From a smaller unsigned integer to a larger integer, the value is zero-extended
- From a smaller signed integer to a larger number, the value is sign-extended
- From a larger integer to a smaller integer, the value is truncated
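The conversion rules above describe C types, but they can be emulated in Python as a rough illustration. In the sketch below, `to_float32` and `widen` are hypothetical helpers. `struct` refuses to pack out-of-range values, so `to_float32` maps that case to infinity by hand; this is an assumption that mirrors what an IEEE 754 conversion with default rounding produces. Python integers do not overflow, so the overflow case of float-to-int conversion is not shown.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Emulate a cast to single precision: round a double to the nearest float."""
    try:
        (value,) = struct.unpack(">f", struct.pack(">f", x))
    except OverflowError:
        # Assumption: model the out-of-range case as +/-infinity, as an
        # IEEE 754 conversion with default rounding would.
        value = math.copysign(math.inf, x)
    return value


def widen(byte: int, signed: bool) -> int:
    """Extend an 8-bit pattern to a 16-bit pattern, as a cast to a wider type would."""
    value = int.from_bytes(bytes([byte]), "big", signed=signed)
    return int.from_bytes(value.to_bytes(2, "big", signed=signed), "big")


print(to_float32(float(2 ** 24 + 1)))  # 16777216.0: int -> float may round
print(to_float32(1e39))                # inf: double -> float may overflow
print(int(-2.7), int(2.7))             # -2 2: float -> int rounds toward zero
print(hex(widen(0xF0, signed=False)))  # 0xf0: zero-extension
print(hex(widen(0xF0, signed=True)))   # 0xfff0: sign-extension
print(hex(0x1234 & 0xFF))              # 0x34: truncation keeps the low-order bits
```

The first line rounds because 2^24 + 1 needs 25 significant bits while a single-precision significand holds only 24.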
References
- Exponent bias - Wikipedia
- The Floating-Point Guide - What Every Programmer Should Know About Floating-Point Arithmetic
- Computer Systems: A Programmer's Perspective, Global Edition (3rd ed). Randal E. Bryant, David R. O'Hallaron
- IEEE floating-point standard - Wikipedia, the free encyclopedia
- Single-precision floating-point format - Wikipedia
- Floating-point arithmetic - Wikipedia
- IEEE 754 - Wikipedia