Cache Hierarchy

![[Intel Core i7 cache hierarchy.png|500]]

The L1 cache is usually split into two parts: the instruction cache (L1i) and the data cache (L1d). This split allows better use of locality, since instructions and data are handled separately.

Cache Memory Organization

Caches save a subset of data from a lower level

| Parameter | Description |
| --- | --- |
| $S = 2^s$ | Number of sets |
| $s = \log_2 S$ | Number of set index bits |
| $E$ | Number of cache lines per set |
| $B = 2^b$ | Block size (bytes) |
| $m = \log_2 M$ | Number of physical (main memory) address bits |
| $b = \log_2 B$ | Number of block offset bits |
| $t = m - (s + b)$ | Number of tag bits |
| $C = B \times E \times S$ | Cache size (bytes), not including overhead such as the valid and tag bits |

![[general organization of cache.png|700]]

Middle bits are used for the set index because this allows contiguous blocks to be spread evenly among the cache sets. Indexing with the high-order bits would cause contiguous memory blocks to all map to the same cache set, leaving the rest of the cache unused.
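
To make the bit layout concrete, here is a minimal C sketch that splits an address into its tag, set index, and block offset fields (the parameter values $s = 6$ and $b = 6$ are hypothetical):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical parameters: s = 6 set index bits (S = 64 sets) and
 * b = 6 block offset bits (B = 64-byte blocks). */
#define S_BITS 6
#define B_BITS 6

int main(void)
{
    /* Contiguous 64-byte blocks land in consecutive sets because the
     * set index is taken from the middle bits of the address. */
    for (uint64_t addr = 0; addr < 4 * 64; addr += 64) {
        uint64_t off = addr & ((1ULL << B_BITS) - 1);
        uint64_t set = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
        uint64_t tag = addr >> (B_BITS + S_BITS);
        printf("addr=%4" PRIu64 " -> tag=%" PRIu64 " set=%2" PRIu64
               " offset=%" PRIu64 "\n", addr, tag, set, off);
    }
    return 0;
}
```

Running it shows the contiguous block addresses 0, 64, 128, 192 landing in consecutive sets 0, 1, 2, 3.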

The cache loads and stores a whole data block (cache line) at a time from/to the lower level

Cache Data Block Starting Address

Cache data blocks are $B$-byte aligned; for example, for a block of size $B = 64$ bytes, the starting address can be 0, 64, 128, etc.

Without this alignment requirement, set selection would be complex, because one cache data block could span multiple sets. For example, if $S = 8$, $B = 4$, and $E = 1$, a block spanning addresses 2–5 would straddle two sets: addresses 2–3 fall in set 0, while addresses 4–5 fall in set 1
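
In code, the aligned start of the block containing an address can be computed with a single mask, assuming $B$ is a power of two:

```c
#include <stdint.h>

/* Round an address down to the start of its containing cache block.
 * Because B is a power of two, ~(B - 1) clears the b block offset bits. */
uint64_t block_start(uint64_t addr, uint64_t B)
{
    return addr & ~(B - 1);
}

/* block_start(100, 64) == 64, block_start(128, 64) == 128 */
```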

Request Processing

Set Selection

Determine which set contains the requested word $w$ that we want to read or write (update)

Line Matching

Determine which cache line (if any) contains $w$. If some cache line contains $w$, we have a cache hit; otherwise, a cache miss

Line Replacement (Eviction)

On read misses and write misses in write-allocate caches, we need to retrieve the line containing the requested word from the next level in the memory hierarchy

If the set is full, we need to replace (evict) some line based on the replacement policy

Word Extraction

Cache data block size is larger than word size, so we need to determine where in a cache block the word is contained

Direct-Mapped Cache Request Processing

A cache with one cache line per set:

Set Selection

Set index bits in the address of $w$ are interpreted as an unsigned integer that corresponds to a set number

![[direct-mapped cache set selection.png|600]]

Line Matching and Word Extraction

The word $w$ is contained in the line if and only if the valid bit is set, and the tag bits in the cache line match the tag bits in the address of $w$

Block offset bits in the address of $w$ determine the offset of the first byte in the word

![[direct-mapped cache line matching word extraction.png|600]]

Line Replacement

There is only one cache line per set, so the current line is replaced by the newly fetched line
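
Putting set selection, line matching, and word extraction together, a toy direct-mapped read lookup might look like the C sketch below (structure layout and parameter values are hypothetical; real hardware does this in combinational logic rather than a function call):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define S_BITS 6
#define B_BITS 6
#define S (1u << S_BITS)   /* 64 sets        */
#define B (1u << B_BITS)   /* 64-byte blocks */

struct line {
    bool     valid;
    uint64_t tag;
    uint8_t  data[B];
};

static struct line cache[S];   /* E = 1: exactly one line per set */

/* Read lookup: returns true on a hit and copies the requested word
 * into *word (assumes addr is word-aligned). */
static bool lookup(uint64_t addr, uint32_t *word)
{
    uint64_t off = addr & (B - 1);              /* word extraction */
    uint64_t set = (addr >> B_BITS) & (S - 1);  /* set selection   */
    uint64_t tag = addr >> (B_BITS + S_BITS);   /* line matching   */

    struct line *l = &cache[set];
    if (l->valid && l->tag == tag) {
        memcpy(word, &l->data[off], sizeof *word);
        return true;                            /* cache hit  */
    }
    return false;                               /* cache miss */
}
```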

Set Associative Cache Request Processing

A cache with several cache lines per set:

Set Selection

Set index bits in the address of $w$ are interpreted as an unsigned integer that corresponds to a set number

![[set associative cache set selection.png|600]]

Line Matching and Word Extraction

The word $w$ is contained in the line if and only if the valid bit is set, and the tag bits in the cache line match the tag bits in the address of $w$

Because there are multiple lines in the set, each of them must be checked in parallel, which makes the circuitry more complex

Block offset bits in the address of $w$ determine the offset of the first byte in the word

![[set associative cache line matching word extraction.png|600]]

Line Replacement

The line to be replaced is chosen by the replacement policy (e.g., LRU, LFU) among the lines in the set. Additional information must be stored to support replacement: access counters, age bits, etc.
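
A sketch of the set-associative lookup with an LRU victim choice (parameters again hypothetical; real hardware compares all $E$ tags in parallel instead of looping):

```c
#include <stdbool.h>
#include <stdint.h>

#define S_BITS 6
#define B_BITS 6
#define S (1u << S_BITS)
#define B (1u << B_BITS)
#define E 4                /* 4-way set associative */

struct line {
    bool     valid;
    uint64_t tag;
    uint64_t last_used;    /* extra state needed to support LRU */
    uint8_t  data[B];
};

static struct line cache[S][E];
static uint64_t now;       /* logical clock for LRU stamps */

/* Probe one set: on a hit, refresh the line's LRU stamp and return the
 * matching way; on a miss, return the way to replace (invalid or LRU). */
static int probe(uint64_t addr, bool *hit)
{
    uint64_t set = (addr >> B_BITS) & (S - 1);
    uint64_t tag = addr >> (B_BITS + S_BITS);
    int victim = 0;

    for (int way = 0; way < E; way++) {
        struct line *l = &cache[set][way];
        if (l->valid && l->tag == tag) {   /* check every line in the set */
            l->last_used = ++now;
            *hit = true;
            return way;
        }
        /* Invalid lines (last_used == 0) win over any valid line. */
        if (!l->valid || l->last_used < cache[set][victim].last_used)
            victim = way;
    }
    *hit = false;
    return victim;   /* caller fills this way from the next lower level */
}
```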

Fully Associative Cache Request Processing

A cache with one set containing all cache lines:

Set Selection

There is only one set and no set index bits in the address of $w$, so it is always selected

![[fully associative cache set selection.png|600]]

Line Matching and Word Extraction

The word $w$ is contained in the line if and only if the valid bit is set, and the tag bits in the cache line match the tag bits in the address of $w$

All cache lines must be checked in parallel, which makes the circuitry very complex

Block offset bits in the address of $w$ determine the offset of the first byte in the word

![[fully associative cache line matching word extraction.png|600]]

Line Replacement

The line to be replaced is chosen by the replacement policy (e.g., LRU, LFU) among all the lines. Additional information must be stored to support replacement: access counters, age bits, etc.

Write-Hit Policies

When we write a word whose copy is already cached (a write hit), we must decide when to update the copy of the data in the next lower level of the memory hierarchy

Write-Through

Write is done synchronously both to the cache and to the next lower level

Inefficient, because every write causes a transaction to the next lower level, which requires very high bandwidth

Write-Back

Write is initially done only to the cache. Write to the next lower level is postponed for as long as possible and done only when the line is evicted from the cache by the replacement algorithm

The cache maintains a dirty bit for each cache line that indicates whether or not the line has been modified
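
A minimal sketch of the write-back bookkeeping; flush_to_lower_level is a hypothetical stub:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define B 64

struct line {
    bool     valid;
    bool     dirty;    /* set when the cached copy diverges from memory */
    uint64_t tag;
    uint8_t  data[B];
};

/* Hypothetical helper: write the block back to the next lower level. */
static void flush_to_lower_level(struct line *l) { (void)l; }

/* Write-back write hit: update only the cached copy and mark it dirty. */
static void write_hit(struct line *l, uint64_t off, uint32_t word)
{
    memcpy(&l->data[off], &word, sizeof word);
    l->dirty = true;
}

/* On eviction, flush the line first if (and only if) it is dirty. */
static void evict(struct line *l)
{
    if (l->dirty)
        flush_to_lower_level(l);
    l->valid = l->dirty = false;
}
```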

Write Buffer

A write miss in a write-back cache can involve two bus transactions:

  1. Incoming line - the line requested by the CPU
  2. Outgoing line - the evicted dirty line that must be flushed

Ideally, we would like the CPU to continue as soon as possible without waiting for the flush to the lower level

The solution is a write buffer:

  1. Save the line to be flushed in a write buffer
  2. Immediately load the requested line, which allows the CPU to continue
  3. Flush the write buffer at a later time, as sketched below
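
A toy model of this sequence with a one-entry buffer; the lower_level_* helpers are hypothetical stubs:

```c
#include <stdbool.h>
#include <stdint.h>

#define B 64

struct line { bool valid, dirty; uint64_t tag; uint8_t data[B]; };

/* One-entry write buffer holding a line that waits to be flushed. */
static struct line write_buffer;
static bool buffer_full;

/* Hypothetical stand-ins for the next lower level. */
static void lower_level_write(const struct line *l) { (void)l; }
static void lower_level_read(uint64_t addr, struct line *l) { (void)addr; (void)l; }

/* Miss whose victim line is dirty: park the victim in the write buffer
 * and load the requested line immediately, so the CPU can continue. */
static void miss_with_dirty_victim(uint64_t addr, struct line *victim)
{
    write_buffer = *victim;            /* 1. save the outgoing line */
    buffer_full = true;
    lower_level_read(addr, victim);    /* 2. load the incoming line */
}

/* Called later, e.g. when the bus is idle. */
static void drain_write_buffer(void)
{
    if (buffer_full) {                 /* 3. flush the buffered line */
        lower_level_write(&write_buffer);
        buffer_full = false;
    }
}
```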

Write-Miss Policies

When we write a word that is not in the cache (a write miss), we must decide whether to load the data into the cache, since no data is returned on write operations

Write Allocate

Data at the missed-write location is loaded into the cache, followed by a write-hit operation

In this approach, write misses are similar to read misses

No-Write Allocate

Data at the missed-write location is not loaded into the cache and is written directly to the next lower level

In this approach, data is loaded into the cache on read misses only
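
The difference between the two policies fits in a few lines. The sketch below uses a toy one-line cache; all names are illustrative, not a real API:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy one-line "cache", just enough to make the sketch self-contained. */
struct line { bool valid; uint64_t tag; uint32_t word; };
static struct line the_line;

static struct line *find_line(uint64_t addr)   /* NULL on a miss */
{
    return (the_line.valid && the_line.tag == addr) ? &the_line : NULL;
}

static struct line *fill_from_lower_level(uint64_t addr)   /* like a read miss */
{
    the_line = (struct line){ .valid = true, .tag = addr };
    return &the_line;
}

static void lower_level_write(uint64_t addr, uint32_t w)
{
    printf("write %u directly to address %llu\n",
           (unsigned)w, (unsigned long long)addr);
}

static void write_word(uint64_t addr, uint32_t w, bool write_allocate)
{
    struct line *l = find_line(addr);
    if (!l && write_allocate)
        l = fill_from_lower_level(addr);   /* write allocate: load first  */
    if (l)
        l->word = w;                       /* then proceed as a write hit */
    else
        lower_level_write(addr, w);        /* no-write allocate: bypass   */
}
```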

Cache Inclusion Policy

Inclusive

A lower-level cache is inclusive of a higher-level cache if it contains all entries present in the higher-level cache (the read path is sketched in code after this list)

  • If the block is found in L1, it’s returned to the CPU
  • If the block is not found in L1 but found in L2, it’s returned to the CPU and placed in L1
  • If the block is not found in either L1 or L2, it’s fetched from memory and placed in both L1 and L2
  • Eviction from L1 doesn’t involve L2
  • Eviction from L2 is propagated to L1 if the evicted block also exists in L1
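
A toy model of the inclusive read path, using single-entry stand-ins for L1 and L2 (all helpers are hypothetical; back-invalidation on L2 eviction is omitted):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy single-entry "caches", just enough to show the inclusive flow. */
struct entry { bool valid; uint64_t addr; uint32_t word; };
static struct entry l1, l2;

static bool hit(struct entry *c, uint64_t addr, uint32_t *w)
{
    if (c->valid && c->addr == addr) { *w = c->word; return true; }
    return false;
}

static void fill(struct entry *c, uint64_t addr, uint32_t w)
{
    c->valid = true; c->addr = addr; c->word = w;  /* may evict old entry */
}

static uint32_t memory_read(uint64_t addr) { return (uint32_t)addr; /* stub */ }

/* Inclusive read path: every block placed in L1 is also present in L2. */
static uint32_t read_inclusive(uint64_t addr)
{
    uint32_t w;
    if (hit(&l1, addr, &w)) return w;   /* L1 hit                    */
    if (hit(&l2, addr, &w)) {           /* L2 hit: also place in L1  */
        fill(&l1, addr, w);
        return w;
    }
    w = memory_read(addr);              /* miss in both levels       */
    fill(&l2, addr, w);                 /* place in both L2 and L1,  */
    fill(&l1, addr, w);                 /* maintaining inclusion     */
    return w;
}
```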

Exclusive

A lower-level cache is exclusive of a higher-level cache if it contains only entries not present in a higher-level cache

  • If the block is found in L1, it’s returned to the CPU
  • If the block is not found in L1 but found in L2, it’s moved from L2 to L1
  • If the block is not found in either L1 or L2, it’s fetched from memory and placed in L1 only
  • A block evicted from L1 is placed into L2; this is the only way L2 gets populated
  • Eviction from L2 doesn’t involve L1

NINE

A lower-level cache is non-inclusive non-exclusive if it’s neither strictly inclusive nor exclusive

  • If the block is found in L1, it’s returned to the CPU
  • If the block is not found in L1 but found in L2, it’s returned to the CPU and placed in L1
  • If the block is not found in either L1 or L2, it’s fetched from memory and placed in both L1 and L2
  • Eviction from L1 doesn’t involve L2
  • Eviction from L2 doesn’t involve L1

Addressing

Instructions use virtual addresses; caches can use either virtual or physical addresses for the index and the tag

Physically Indexed, Physically Tagged (PIPT)

Use the physical address for both the index and the tag

Simple but slow, because the virtual address must first be translated to a physical address, which may involve a TLB miss and a RAM access

Virtually Indexed, Virtually Tagged (VIVT)

Use the virtual address for both the index and the tag

Fast, because the MMU is not needed to translate the virtual address to a physical one before the lookup

Complex design:

  • Different virtual addresses can refer to the same physical address (aliasing), which can cause coherency problems
  • The same virtual address can map to different physical addresses (homonyms)
  • The virtual-to-physical mapping can change (e.g., on a context switch), requiring cache flushes

Virtually Indexed, Physically Tagged (VIPT)

Use the virtual address for the index and the physical address for the tag

The advantage over PIPT is lower latency, as the cache line can be looked up in parallel with the TLB translation, but the tag cannot be compared until the physical address is available

The advantage over VIVT is that since the tag has the physical address, the cache can detect homonyms
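
A common consequence, assuming 4 KiB pages: for the index lookup to start before translation completes, the set index and block offset bits must lie within the untranslated page offset. With 12 page-offset bits and 64-byte lines ($b = 6$), at most $12 - 6 = 6$ set index bits remain, i.e. 64 sets, so the cache size is capped at $64 \times 64\,\mathrm{B} \times E = 4\,\mathrm{KiB} \times E$; an 8-way design gives a 32 KiB L1.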

Physically Indexed, Virtually Tagged (PIVT)

Of no practical use: indexing with the physical address requires the address translation anyway, so there is no latency advantage over PIPT, while the virtual tag reintroduces the aliasing and homonym problems. Not used in practice

Types of Misses

Compulsory (Cold) Miss

The very first access to a block can never hit in the cache, so the block must be brought into the cache

Capacity Miss

If the cache cannot contain all the blocks needed during the execution of a program, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved

Conflict Miss

If the block placement strategy is not fully associative, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if multiple blocks map to its set and accesses to the different blocks are intermingled

Coherency Miss

Misses due to invalidations when an invalidation-based cache coherency protocol is used

Cache Coloring

[Cache coloring - Wikipedia](https://en.wikipedia.org/wiki/Cache_coloring) TODO

Prefetching

A technique used by the CPU to boost execution performance by fetching instructions or data into the cache before they are actually needed

Cache prefetching can be accomplished either by hardware or by software:

  • Hardware-based prefetching is accomplished by having a prefetcher in the CPU that watches the stream of instructions or data being accessed, recognizes the next few elements the program might need based on this stream, and prefetches them into the cache
  • Software-based prefetching is accomplished by inserting prefetch instructions into the program during compilation, for example with the intrinsic function __builtin_prefetch in GCC (see the sketch after this list)
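
A sketch of software prefetching with GCC's __builtin_prefetch; the lookahead distance AHEAD is a hypothetical tuning parameter whose best value depends on the hardware:

```c
#include <stddef.h>

#define AHEAD 16   /* hypothetical prefetch distance, in elements */

long sum(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint that a[i + AHEAD] will be read soon: rw = 0 (read),
         * locality = 3 (keep the line in all cache levels). */
        if (i + AHEAD < n)
            __builtin_prefetch(&a[i + AHEAD], 0, 3);
        s += a[i];
    }
    return s;
}
```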
