Cache Hierarchy

![[Intel Core i7 cache hierarchy.png|500]]

The L1 cache is usually split into two parts: the instruction cache (L1i) and the data cache (L1d). This split allows better use of locality, since instructions and data are handled separately.

Cache Memory Organization

Caches save a subset of data from a lower level

| Parameter | Description |
| --- | --- |
| $S = 2^s$ | Number of sets |
| $s = \log_2 S$ | Number of set index bits |
| $E$ | Number of cache lines per set |
| $B = 2^b$ | Block size (bytes) |
| $m = \log_2 M$ | Number of physical (main memory) address bits |
| $b = \log_2 B$ | Number of block offset bits |
| $t = m - (s + b)$ | Number of tag bits |
| $C = B \times E \times S$ | Cache size (bytes), not including overhead such as the valid and tag bits |

![[general organization of cache.png|700]]

Middle bits are used for the set index because this allows contiguous blocks to be spread evenly among the cache sets. Indexing with the high-order bits would cause contiguous memory blocks to all map to the same cache set, leaving the rest of the cache unused.
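
To make the bit layout concrete, here is a minimal C sketch that splits an address into its tag, set index, and block offset fields (the parameter values $s = 6$ and $b = 6$ are hypothetical):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical parameters: s = 6 set index bits (S = 64 sets) and
 * b = 6 block offset bits (B = 64-byte blocks). */
#define S_BITS 6
#define B_BITS 6

int main(void)
{
    /* Contiguous 64-byte blocks land in consecutive sets because the
     * set index is taken from the middle bits of the address. */
    for (uint64_t addr = 0; addr < 4 * 64; addr += 64) {
        uint64_t off = addr & ((1ULL << B_BITS) - 1);
        uint64_t set = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
        uint64_t tag = addr >> (B_BITS + S_BITS);
        printf("addr=%4" PRIu64 " -> tag=%" PRIu64 " set=%2" PRIu64
               " offset=%" PRIu64 "\n", addr, tag, set, off);
    }
    return 0;
}
```

Running it shows the contiguous block addresses 0, 64, 128, 192 landing in consecutive sets 0, 1, 2, 3.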

The cache loads and stores a whole data block (cache line) at a time from/to the lower level

Cache Data Block Starting Address

Cache data blocks are $B$-byte aligned; for example, for a block of size $B = 64$ bytes, the starting address can be 0, 64, 128, etc.

Without this alignment requirement, set selection would be complex, because one cache data block could span multiple sets. For example, if $S = 8$, $B = 4$, and $E = 1$, a block spanning addresses 2–5 would straddle two sets: addresses 2–3 fall in set 0, while addresses 4–5 fall in set 1
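
In code, the aligned start of the block containing an address can be computed with a single mask, assuming $B$ is a power of two:

```c
#include <stdint.h>

/* Round an address down to the start of its containing cache block.
 * Because B is a power of two, ~(B - 1) clears the b block offset bits. */
uint64_t block_start(uint64_t addr, uint64_t B)
{
    return addr & ~(B - 1);
}

/* block_start(100, 64) == 64, block_start(128, 64) == 128 */
```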

Request Processing

Set Selection

Determine which set contains the requested word $w$ that we want to read or write (update)

Line Matching

Determine which cache line (if any) contains $w$. If some cache line contains $w$, we have a cache hit; otherwise, a cache miss

Line Replacement (Eviction)

On read misses and write misses in write-allocate caches, we need to retrieve the line containing the requested word from the next level in the memory hierarchy

If the set is full, we need to replace (evict) some line based on the replacement policy

Word Extraction

Cache data block size is larger than word size, so we need to determine where in a cache block the word is contained

Direct-Mapped Cache Request Processing

A cache with one cache line per set:

Set Selection

Set index bits in the address of $w$ are interpreted as an unsigned integer that corresponds to a set number

![[direct-mapped cache set selection.png|600]]

Line Matching and Word Extraction

The word $w$ is contained in the line if and only if the valid bit is set, and the tag bits in the cache line match the tag bits in the address of $w$

Block offset bits in the address of $w$ determine the offset of the first byte in the word

![[direct-mapped cache line matching word extraction.png|600]]

Line Replacement

There is only one cache line per set, so the current line is replaced by the newly fetched line
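
Putting set selection, line matching, and word extraction together, a toy direct-mapped read lookup might look like the C sketch below (structure layout and parameter values are hypothetical; real hardware does this in combinational logic rather than a function call):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define S_BITS 6
#define B_BITS 6
#define S (1u << S_BITS)   /* 64 sets        */
#define B (1u << B_BITS)   /* 64-byte blocks */

struct line {
    bool     valid;
    uint64_t tag;
    uint8_t  data[B];
};

static struct line cache[S];   /* E = 1: exactly one line per set */

/* Read lookup: returns true on a hit and copies the requested word
 * into *word (assumes addr is word-aligned). */
static bool lookup(uint64_t addr, uint32_t *word)
{
    uint64_t off = addr & (B - 1);              /* word extraction */
    uint64_t set = (addr >> B_BITS) & (S - 1);  /* set selection   */
    uint64_t tag = addr >> (B_BITS + S_BITS);   /* line matching   */

    struct line *l = &cache[set];
    if (l->valid && l->tag == tag) {
        memcpy(word, &l->data[off], sizeof *word);
        return true;                            /* cache hit  */
    }
    return false;                               /* cache miss */
}
```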

Set Associative Cache Request Processing

A cache with several cache lines per set:

Set Selection

Set index bits in the address of $w$ are interpreted as an unsigned integer that corresponds to a set number

![[set associative cache set selection.png|600]]

Line Matching and Word Extraction

The word $w$ is contained in the line if and only if the valid bit is set, and the tag bits in the cache line match the tag bits in the address of $w$

Because there are multiple lines in the set, each of them must be checked in parallel, which makes the circuitry more complex

Block offset bits in the address of $w$ determine the offset of the first byte in the word

![[set associative cache line matching word extraction.png|600]]

Line Replacement

The line to be replaced is chosen by the replacement policy (e.g., LRU, LFU) among the lines in the set. Additional information must be stored to support replacement: access counters, age bits, etc.
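
A sketch of the set-associative lookup with an LRU victim choice (parameters again hypothetical; real hardware compares all $E$ tags in parallel instead of looping):

```c
#include <stdbool.h>
#include <stdint.h>

#define S_BITS 6
#define B_BITS 6
#define S (1u << S_BITS)
#define B (1u << B_BITS)
#define E 4                /* 4-way set associative */

struct line {
    bool     valid;
    uint64_t tag;
    uint64_t last_used;    /* extra state needed to support LRU */
    uint8_t  data[B];
};

static struct line cache[S][E];
static uint64_t now;       /* logical clock for LRU stamps */

/* Probe one set: on a hit, refresh the line's LRU stamp and return the
 * matching way; on a miss, return the way to replace (invalid or LRU). */
static int probe(uint64_t addr, bool *hit)
{
    uint64_t set = (addr >> B_BITS) & (S - 1);
    uint64_t tag = addr >> (B_BITS + S_BITS);
    int victim = 0;

    for (int way = 0; way < E; way++) {
        struct line *l = &cache[set][way];
        if (l->valid && l->tag == tag) {   /* check every line in the set */
            l->last_used = ++now;
            *hit = true;
            return way;
        }
        /* Invalid lines (last_used == 0) win over any valid line. */
        if (!l->valid || l->last_used < cache[set][victim].last_used)
            victim = way;
    }
    *hit = false;
    return victim;   /* caller fills this way from the next lower level */
}
```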

Fully Associative Cache Request Processing

A cache with one set containing all cache lines:

Set Selection

There is only one set and no set index bits in the address of $w$, so it is always selected

![[fully associative cache set selection.png|600]]

Line Matching and Word Extraction

The word $w$ is contained in the line if and only if the valid bit is set, and the tag bits in the cache line match the tag bits in the address of $w$

All cache lines must be checked in parallel, which makes the circuitry very complex

Block offset bits in the address of $w$ determine the offset of the first byte in the word

![[fully associative cache line matching word extraction.png|600]]

Line Replacement

The line to be replaced is chosen by the replacement policy (e.g., LRU, LFU) among all the lines. Additional information must be stored to support replacement: access counters, age bits, etc.

Write-Hit Policies

When we write a word whose copy is already cached (a write hit), we must decide when to update the copy of the data in the next lower level of the memory hierarchy

Write-Through

Write is done synchronously both to the cache and to the next lower level

Inefficient, because every write causes a transaction to the next lower level, which requires very high bandwidth

Write-Back

Write is initially done only to the cache. Write to the next lower level is postponed for as long as possible and done only when the line is evicted from the cache by the replacement algorithm

The cache maintains a dirty bit for each cache line that indicates whether or not the line has been modified
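
A minimal sketch of the write-back bookkeeping; flush_to_lower_level is a hypothetical stub:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define B 64

struct line {
    bool     valid;
    bool     dirty;    /* set when the cached copy diverges from memory */
    uint64_t tag;
    uint8_t  data[B];
};

/* Hypothetical helper: write the block back to the next lower level. */
static void flush_to_lower_level(struct line *l) { (void)l; }

/* Write-back write hit: update only the cached copy and mark it dirty. */
static void write_hit(struct line *l, uint64_t off, uint32_t word)
{
    memcpy(&l->data[off], &word, sizeof word);
    l->dirty = true;
}

/* On eviction, flush the line first if (and only if) it is dirty. */
static void evict(struct line *l)
{
    if (l->dirty)
        flush_to_lower_level(l);
    l->valid = l->dirty = false;
}
```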

Write Buffer

A write miss in a write-back cache can involve two bus transactions:

  1. Incoming line - the line requested by the CPU
  2. Outgoing line - the evicted dirty line that must be flushed

Ideally, we would like the CPU to continue as soon as possible without waiting for the flush to the lower level

The solution is a write buffer:

  1. Save the line to be flushed in a write buffer
  2. Immediately load the requested line, which allows the CPU to continue
  3. Flush the write buffer at a later time, as sketched below
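
A toy model of this sequence with a one-entry buffer; the lower_level_* helpers are hypothetical stubs:

```c
#include <stdbool.h>
#include <stdint.h>

#define B 64

struct line { bool valid, dirty; uint64_t tag; uint8_t data[B]; };

/* One-entry write buffer holding a line that waits to be flushed. */
static struct line write_buffer;
static bool buffer_full;

/* Hypothetical stand-ins for the next lower level. */
static void lower_level_write(const struct line *l) { (void)l; }
static void lower_level_read(uint64_t addr, struct line *l) { (void)addr; (void)l; }

/* Miss whose victim line is dirty: park the victim in the write buffer
 * and load the requested line immediately, so the CPU can continue. */
static void miss_with_dirty_victim(uint64_t addr, struct line *victim)
{
    write_buffer = *victim;            /* 1. save the outgoing line */
    buffer_full = true;
    lower_level_read(addr, victim);    /* 2. load the incoming line */
}

/* Called later, e.g. when the bus is idle. */
static void drain_write_buffer(void)
{
    if (buffer_full) {                 /* 3. flush the buffered line */
        lower_level_write(&write_buffer);
        buffer_full = false;
    }
}
```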

Write-Miss Policies

When we write a word that is not in the cache (a write miss), we must decide whether to load the data into the cache, since no data is returned on write operations

Write Allocate

Data at the missed-write location is loaded into the cache, followed by a write-hit operation

In this approach, write misses are similar to read misses

No-Write Allocate

Data at the missed-write location is not loaded into the cache and is written directly to the next lower level

In this approach, data is loaded into the cache on read misses only
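
The difference between the two policies fits in a few lines. The sketch below uses a toy one-line cache; all names are illustrative, not a real API:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy one-line "cache", just enough to make the sketch self-contained. */
struct line { bool valid; uint64_t tag; uint32_t word; };
static struct line the_line;

static struct line *find_line(uint64_t addr)   /* NULL on a miss */
{
    return (the_line.valid && the_line.tag == addr) ? &the_line : NULL;
}

static struct line *fill_from_lower_level(uint64_t addr)   /* like a read miss */
{
    the_line = (struct line){ .valid = true, .tag = addr };
    return &the_line;
}

static void lower_level_write(uint64_t addr, uint32_t w)
{
    printf("write %u directly to address %llu\n",
           (unsigned)w, (unsigned long long)addr);
}

static void write_word(uint64_t addr, uint32_t w, bool write_allocate)
{
    struct line *l = find_line(addr);
    if (!l && write_allocate)
        l = fill_from_lower_level(addr);   /* write allocate: load first  */
    if (l)
        l->word = w;                       /* then proceed as a write hit */
    else
        lower_level_write(addr, w);        /* no-write allocate: bypass   */
}
```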

Cache Inclusion Policy

Inclusive

A lower-level cache is inclusive of a higher-level cache if it contains all entries present in the higher-level cache (the read path is sketched in code after this list)

  • If the block is found in L1, it’s returned to the CPU
  • If the block is not found in L1 but found in L2, it’s returned to the CPU and placed in L1
  • If the block is not found in either L1 or L2, it’s fetched from memory and placed in both L1 and L2
  • Eviction from L1 doesn’t involve L2
  • Eviction from L2 is propagated to L1 if the evicted block also exists in L1
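
A toy model of the inclusive read path, using single-entry stand-ins for L1 and L2 (all helpers are hypothetical; back-invalidation on L2 eviction is omitted):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy single-entry "caches", just enough to show the inclusive flow. */
struct entry { bool valid; uint64_t addr; uint32_t word; };
static struct entry l1, l2;

static bool hit(struct entry *c, uint64_t addr, uint32_t *w)
{
    if (c->valid && c->addr == addr) { *w = c->word; return true; }
    return false;
}

static void fill(struct entry *c, uint64_t addr, uint32_t w)
{
    c->valid = true; c->addr = addr; c->word = w;  /* may evict old entry */
}

static uint32_t memory_read(uint64_t addr) { return (uint32_t)addr; /* stub */ }

/* Inclusive read path: every block placed in L1 is also present in L2. */
static uint32_t read_inclusive(uint64_t addr)
{
    uint32_t w;
    if (hit(&l1, addr, &w)) return w;   /* L1 hit                    */
    if (hit(&l2, addr, &w)) {           /* L2 hit: also place in L1  */
        fill(&l1, addr, w);
        return w;
    }
    w = memory_read(addr);              /* miss in both levels       */
    fill(&l2, addr, w);                 /* place in both L2 and L1,  */
    fill(&l1, addr, w);                 /* maintaining inclusion     */
    return w;
}
```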

Exclusive

A lower-level cache is exclusive of a higher-level cache if it contains only entries not present in a higher-level cache

  • If the block is found in L1, it’s returned to the CPU
  • If the block is not found in L1 but found in L2, it’s moved from L2 to L1
  • If the block is not found in either L1 or L2, it’s fetched from memory and placed in L1 only
  • A block evicted from L1 is placed into L2; this is the only way L2 gets populated
  • Eviction from L2 doesn’t involve L1

NINE

A lower-level cache is non-inclusive non-exclusive if it’s neither strictly inclusive nor exclusive

  • If the block is found in L1, it’s returned to the CPU
  • If the block is not found in L1 but found in L2, it’s returned to the CPU and placed in L1
  • If the block is not found in either L1 or L2, it’s fetched from memory and placed in both L1 and L2
  • Eviction from L1 doesn’t involve L2
  • Eviction from L2 doesn’t involve L1

Addressing

Instructions use virtual addresses; caches can use either virtual or physical addresses for the index and the tag

Physically Indexed, Physically Tagged (PIPT)

Use the physical address for both the index and the tag

Simple but slow, because the virtual address must first be translated to a physical address, which may involve a TLB miss and a RAM access

Virtually Indexed, Virtually Tagged (VIVT)

Use the virtual address for both the index and the tag

Fast, because the MMU is not needed to translate the virtual address to a physical one before the lookup

Complex design:

  • Different virtual addresses can refer to the same physical address (aliasing), which can cause coherency problems
  • The same virtual address can map to different physical addresses (homonyms)
  • The virtual-to-physical mapping can change (e.g., on a context switch), requiring cache flushes

Virtually Indexed, Physically Tagged (VIPT)

Use the virtual address for the index and the physical address for the tag

The advantage over PIPT is lower latency, as the cache line can be looked up in parallel with the TLB translation, but the tag cannot be compared until the physical address is available

The advantage over VIVT is that since the tag has the physical address, the cache can detect homonyms
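
A common consequence, assuming 4 KiB pages: for the index lookup to start before translation completes, the set index and block offset bits must lie within the untranslated page offset. With 12 page-offset bits and 64-byte lines ($b = 6$), at most $12 - 6 = 6$ set index bits remain, i.e. 64 sets, so the cache size is capped at $64 \times 64\,\mathrm{B} \times E = 4\,\mathrm{KiB} \times E$; an 8-way design gives a 32 KiB L1.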

Physically Indexed, Virtually Tagged (PIVT)

Of no practical use: indexing with the physical address requires the address translation anyway, so there is no latency advantage over PIPT, while the virtual tag reintroduces the aliasing and homonym problems. Not used in practice

Types of Misses

Compulsory (Cold) Miss

The very first access to a block can never hit in the cache, so the block must be brought into the cache

Capacity Miss

If the cache cannot contain all the blocks needed during the execution of a program, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved

Conflict Miss

If the block placement strategy is not fully associative, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if multiple blocks map to its set and accesses to the different blocks are intermingled

Coherency Miss

Misses due to invalidations when an invalidation-based cache coherency protocol is used

Cache Coloring

[Cache coloring - Wikipedia](https://en.wikipedia.org/wiki/Cache_coloring) TODO

Prefetching

A technique used by the CPU to boost execution performance by fetching instructions or data into the cache before they are actually needed

Cache prefetching can be accomplished either by hardware or by software:

  • Hardware-based prefetching is accomplished by having a prefetcher in the CPU that watches the stream of instructions or data being accessed, recognizes the next few elements the program might need based on this stream, and prefetches them into the cache
  • Software-based prefetching is accomplished by inserting prefetch instructions into the program during compilation, for example with the intrinsic function __builtin_prefetch in GCC (see the sketch after this list)
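
A sketch of software prefetching with GCC's __builtin_prefetch; the lookahead distance AHEAD is a hypothetical tuning parameter whose best value depends on the hardware:

```c
#include <stddef.h>

#define AHEAD 16   /* hypothetical prefetch distance, in elements */

long sum(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint that a[i + AHEAD] will be read soon: rw = 0 (read),
         * locality = 3 (keep the line in all cache levels). */
        if (i + AHEAD < n)
            __builtin_prefetch(&a[i + AHEAD], 0, 3);
        s += a[i];
    }
    return s;
}
```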
