Two’s Complement

The bit sequence $x_{w-1} x_{w-2} \dots x_0$ represents the number $-x_{w-1} 2^{w-1} + \sum_{i=0}^{w-2} x_i 2^i$

If the most significant bit of a $w$-bit number is 1, the overall value is negative, because that bit's weight $-2^{w-1}$ is larger in magnitude than the sum of all the other weights, $2^{w-1} - 1$

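For $x > 0$, the value $-x$ is represented in $w$ bits by the same bit pattern as the unsigned number $2^w - x$, which is what inverting every bit of $x$ and adding 1 produces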
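
As a minimal sketch of the weighted sum described above, the following Python helpers decode and encode two's-complement bit strings. The function names `from_twos_complement` and `to_twos_complement` are illustrative choices, not part of any library.

```python
def from_twos_complement(bits: str) -> int:
    """Interpret a string of '0'/'1' characters as a two's-complement integer."""
    w = len(bits)
    # The most significant bit carries weight -2^(w-1); all other weights are positive.
    value = -int(bits[0]) * 2 ** (w - 1)
    for i, bit in enumerate(reversed(bits[1:])):
        value += int(bit) * 2 ** i
    return value


def to_twos_complement(x: int, w: int) -> str:
    """Encode x in w bits; a negative x is stored as the unsigned number 2^w + x."""
    if not -(2 ** (w - 1)) <= x < 2 ** (w - 1):
        raise ValueError("value does not fit in w bits")
    return format(x % (2 ** w), f"0{w}b")
```

For instance, `from_twos_complement("1011")` evaluates $-8 + 2 + 1$, and `to_twos_complement(-5, 4)` yields the same four bits.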

Fixed-Point

It uses the notation $b_m b_{m-1} \dots b_1 b_0 . b_{-1} b_{-2} \dots b_{-n}$ to represent the number $\sum_{i=-n}^{m} b_i 2^i$

It can represent numbers of similar magnitudes with the same precision. For example, numbers with 4 digits after the binary point have precision up to $2^{-4}$

It can only exactly represent numbers of the form $x \times 2^y$, where $x$ and $y$ are integers. Other rational numbers, such as $\frac{1}{10}$, have repeating bit representations

The binary point has a fixed position, so a lot of digits are needed to represent very small or very large numbers
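
The fixed-point rules above can be checked with exact rational arithmetic. This is a sketch using Python's `fractions` module; `parse_fixed_point` and `is_exactly_representable` are hypothetical helper names introduced for illustration.

```python
from fractions import Fraction


def parse_fixed_point(text: str) -> Fraction:
    """Evaluate a binary string such as '101.11' exactly."""
    int_part, _, frac_part = text.partition(".")
    value = Fraction(0)
    # Bits left of the binary point have weights 2^0, 2^1, ...
    for i, bit in enumerate(reversed(int_part)):
        value += int(bit) * Fraction(2) ** i
    # Bits right of the binary point have weights 2^-1, 2^-2, ...
    for i, bit in enumerate(frac_part, start=1):
        value += int(bit) * Fraction(1, 2 ** i)
    return value


def is_exactly_representable(q: Fraction) -> bool:
    """A rational has a finite binary expansion iff its reduced denominator is a power of two."""
    d = q.denominator
    return d & (d - 1) == 0
```

Here `parse_fixed_point("101.11")` gives $5\frac{3}{4}$, while `is_exactly_representable(Fraction(1, 10))` is false because the denominator 10 is not a power of two.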

Floating Point (IEEE-754)

It represents numbers of the form $x \times 2^y$ by specifying $x$ and $y$

The number is represented as $V = (-1)^s \times M \times 2^E$

The bit representation is divided into three fields to encode these values:

  • The single sign bit $s$ directly encodes the sign $s$
  • The $k$-bit exponent field $\text{exp}$ encodes the exponent $E$
  • The $n$-bit fraction field $\text{frac}$ encodes the significand (mantissa) $M$
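
To make the three fields concrete, here is a small sketch that splits a single-precision value (1 sign bit, 8 exponent bits, 23 fraction bits) into its raw fields using the standard `struct` module. The helper name `float32_fields` is an illustrative choice.

```python
import struct


def float32_fields(x: float):
    """Return (sign, exp, frac) of x after rounding it to single precision."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31            # 1 bit
    exp = (bits >> 23) & 0xFF    # 8-bit exponent field
    frac = bits & 0x7FFFFF       # 23-bit fraction field
    return sign, exp, frac
```

For example, `float32_fields(1.0)` has an exponent field of 127 and an all-zero fraction field, and `float32_fields(-0.75)` has the sign bit set.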

Normalized Values

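$\text{exp}$ is neither all zeros nor all ones

$E = e - \text{Bias}$ and $M = 1 + f$, where $e$ is the unsigned value of the exponent field, $\text{Bias} = 2^{k-1} - 1$ for a $k$-bit exponent field, and $f$ is the fractional value $0.f_{n-1} \dots f_1 f_0$ of the $n$-bit fraction field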

Denormalized Numbers

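$\text{exp}$ is all zeros

$E = 1 - \text{Bias}$ and $M = f$, where $\text{Bias} = 2^{k-1} - 1$ for a $k$-bit exponent field and $f$ is the fractional value of the fraction field, with no implied leading 1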

These are used to represent $\pm 0$ and numbers close to zero

The smallest normalized value comes right after the biggest denormalized number, because the denormalized exponent is defined as $1 - \text{Bias}$ rather than $-\text{Bias}$
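
The smooth transition can be observed in single precision by looking at two adjacent bit patterns. This is a sketch using the standard `struct` module; `float32_from_bits` and the variable names are illustrative.

```python
import struct


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision value."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x


largest_denorm = float32_from_bits(0x007FFFFF)   # exp all zeros, frac all ones
smallest_norm = float32_from_bits(0x00800000)    # exp = 1, frac all zeros

# The two patterns are adjacent integers, and the values differ by one
# denormalized step, 2^-149.
gap = smallest_norm - largest_denorm
```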

Special Values

$\text{exp}$ is all ones

  • If $\text{frac}$ is all zeros, it represents $\pm\infty$ depending on sign
  • If $\text{frac}$ is nonzero, it represents NaN
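
Putting the three cases together, the following sketch decodes a single-precision bit pattern by hand, assuming an 8-bit exponent field and a 23-bit fraction field. The function name `decode_float32` is hypothetical, and the code is meant to mirror the case analysis rather than replace a library routine.

```python
import math


def decode_float32(bits: int) -> float:
    """Compute the value of a 32-bit pattern from its sign, exp, and frac fields."""
    k, n = 8, 23
    bias = 2 ** (k - 1) - 1                    # 127
    s = bits >> 31
    e = (bits >> n) & (2 ** k - 1)             # unsigned value of the exponent field
    f = (bits & (2 ** n - 1)) / 2 ** n         # fraction in [0, 1)
    sign = -1.0 if s else 1.0

    if e == 2 ** k - 1:                        # special values: exp all ones
        return sign * math.inf if f == 0 else math.nan
    if e == 0:                                 # denormalized: exp all zeros
        return sign * f * 2.0 ** (1 - bias)
    return sign * (1 + f) * 2.0 ** (e - bias)  # normalized
```

For example, the pattern `0xC0200000` has a set sign bit, an exponent field of 128, and a fraction of 0.25, so it decodes to $-1.25 \times 2^1$.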

Comparison

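The IEEE format was designed so that floating-point numbers could be sorted using an integer sorting routine. If we interpret the bit representations of nonnegative values as unsigned integers, they occur in ascending order, as do the values they represent as floating-point numbers. Negative values have a leading 1 and occur in descending order, so they need a small adjustment before they can be compared the same way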

This is why the exponent field uses $\text{Bias}$ to represent negative exponents instead of two’s complement
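
The ordering property can be tried out on single-precision bit patterns. The sketch below uses the standard `struct` module; `float32_bits` and `sort_key` are illustrative names, and the bit-flipping adjustment for negative values is one common technique, not something prescribed by the notes above. NaN is left out because it has no numeric order.

```python
import struct


def float32_bits(x: float) -> int:
    """Bit pattern of x (rounded to single precision) as an unsigned 32-bit integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


def sort_key(x: float) -> int:
    """Map a float32 to an unsigned integer whose order matches numeric order."""
    bits = float32_bits(x)
    if bits >> 31:
        # Negative: flip every bit so that larger magnitudes sort first.
        return bits ^ 0xFFFFFFFF
    # Nonnegative: set the top bit so these sort after all negatives.
    return bits | 0x80000000
```

Sorting nonnegative values with `key=float32_bits` already matches numeric order; `sort_key` extends this to lists that mix signs.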

Conversions

  • From int to float, the number cannot overflow, but it may be rounded
  • From int or float to double, the exact numeric value can be preserved because double has both greater range (i.e., the range of representable values) and greater precision (i.e., the number of significant bits)
  • From double to float, the value can overflow to $\pm\infty$, since the range is smaller. Otherwise, it may be rounded because the precision is smaller
  • From float or double to int, the value will be rounded toward zero. Furthermore, the value may overflow, in which case the C standard does not define the result
  • From a smaller unsigned integer to a larger integer, the value is zero-extended
  • From a smaller signed integer to a larger integer, the value is sign-extended
  • From a larger integer to a smaller integer, the value is truncated
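
Several of these conversion rules can be imitated in Python, with the fixed-width types from the standard `ctypes` module standing in for C's `int`, `float`, and narrower integer types. This is only a sketch: the variable names are illustrative, and Python's own integers never overflow, so the float-to-int overflow case is not shown.

```python
import ctypes
import math

# int -> float: no overflow, but only 24 significant bits fit, so 2^24 + 1 is rounded.
int_to_float = ctypes.c_float(float(2 ** 24 + 1)).value

# double -> float: out-of-range values become infinity, in-range values may be rounded.
overflowed = ctypes.c_float(1e40).value
rounded = ctypes.c_float(0.1).value

# float/double -> int: the fractional part is discarded (rounding toward zero).
toward_zero = (int(2.7), int(-2.7))

# Smaller unsigned -> larger: zero-extension keeps the value 0xF0.
zero_extended = ctypes.c_uint32(ctypes.c_uint8(0xF0).value).value

# Smaller signed -> larger: sign-extension copies the top bit into the new bits.
sign_extended = ctypes.c_int32(ctypes.c_int8(0xF0).value).value & 0xFFFFFFFF

# Larger -> smaller: only the low-order bits are kept.
truncated = ctypes.c_uint8(0x1234).value
```

The same byte `0xF0` widens to `0x000000F0` when treated as unsigned and to `0xFFFFFFF0` when treated as signed, which is the difference between zero-extension and sign-extension.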

References