Internals
The runtime.g
Struct
Represents a goroutine. When a goroutine exits, its g
object is returned to a pool of free g
s and can later be reused for some other goroutine. It contains the fields necessary to keep track of its stack and current status. It also contains reference to the code that it is responsible for running
The runtime.m
Struct
Represents an OS thread (machine). It has pointers to fields such as the g
that it is currently running and associated p
When m
executes code, it has an associated p
. When m
is idle or in a system call, it does not need a p
Spinning
The runtime.p
Struct
Represents a processor abstraction, resources required to execute a g
. Contains a lock-free local run-queue p.runq
and caches
There are exactly GOMAXPROCS
p
s, thus even if we had a lot of m
s, there are no scalability issues with work stealing and per p
caches accesses
The runtime.schedt
Struct
Represents a scheduler itself. Contains a global run-queue schedt.runq
protected by a mutex schedt.lock
Scheduling Algorithm
Handoff and System Calls
If OS thread m
itself is blocked, e.g. in a system call, the scheduler has to handoff p
to other m
in order to prevent CPU underutilization
However, not all system calls are blocking in practice, e.g. vDSO. Having goroutines go through the full work of releasing their current p
and then re-acquiring one for these system calls has two problems:
- There’s a bunch of overhead involved in locking (and releasing) all of the data structures involved
- If there’s more runnable
g
s thanp
s, ag
that makes a syscall won’t be able to re-acquire ap
and will have to park itself; the moment it released thep
, something else was scheduled onto it
So the scheduler has two ways of handling blocking system calls, a pessimistic way for system calls that are expected to be slow; an optimistic way for ones that are expected to be fast
The pessimistic system call path implements the approach where p
is handed off m
before the system call
The optimistic system call path doesn’t release the p
; instead, it sets a special p
state flag and just makes the system call. The sysmon
goroutine periodically looks for p
s that have been in this state for too long and hands them off m
When m
returns from a syscall the scheduler will check if the old p
is available. If it is, the m
will associate itself with it. If it is not, m
will associate itself with any idle p
. And if there are no idle p
s, g
will be dissociated from m
and added to the schedt.runq
Thread m
itself becomes idle, so we won’t run more g
s than our target parallelism level
Fairness
The scheduler uses following mechanisms to provide fairness at minimal cost. Fairness helps preventing goroutines livelocks and starvation
Goroutine Preemption
Scheduler uses a 10ms time slice after which it preempts the current goroutine, enqueues it to global schedt.runq
and schedules another goroutine
To preempt a g
, a scheduler spoofs g
s stack limit by setting g.stackguard0
to stackPreempt
which causes stack overflow in function prologue check code. At runtime (already in slow path) scheduler recognizes that actually there is enough stack and it was a preemption signal. This spoof technique is an optimization to reduce the overhead of checking the preemption status on each function invocation explicitly
todo this allows preemptable scheduler instead of cooperative scheduler used before, where g
s could be preempted only in specific blocking cases (for example, channel send or receive, I/O, waiting to acquire a mutex)
Loop Preemption
Above approach doesn’t work for loops because, there are no function calls, hence no runtime.stachmap
metadata needed for Go Garbage Collector
Go uses signal-based preemption with conservative innermost frame scanning to preempt tight loops:
- Send OS signal to preempt the goroutine
- Conservatively scan the function stack: assume that anything that looks like a pointer is a pointer. Only needed for innermost frame
- Precisely scan the rest of the stack
Local Run Queue Design
To fairly schedule g
s we need FIFO, because LIFO would mean that two g
s constantly respawning each other would starve other g
s. We would always pop this two g
s from the top never reaching the bottom
On the other hand, from CPU perspective we need LIFO, because it provides better cache locality. When g1
writes data to memory and then g2
resumes, using FIFO may result in g2
finding that the data is no longer in the cache, while LIFO ensures that data will still be in a cache
To provide this LIFO/FIFO behavior the scheduler saves a runnable g'
, that was ready’d by the current g
and should be run next instead of what’s in p.runq
, into the one element LIFO slot p.runnext
. To ensure g'
is not stolen by other m
, goroutines from p.runnext
are not allowed to be stolen for 3us
However, this p.runnext
LIFO behavior can cause starvation. In this code, g1
and g2
always respawn each other in p.runnext
never checking p.runq
:
To prevent p.runq
starvation, g
taken from p.runnext
inherits current time slice and will be preempted within 10ms
Global Run Queue Polling
To avoid schedt.runq
starvation, scheduler checks it every scheduling tick
Network Poller Async Notifications
To avoid network poller starvation, network poller uses background thread to epoll
the network if it wasn’t polled by the main worker thread
Polling used for global run queue is not applicable here, because checking the network involves system calls like epoll
which adds an overhead to the scheduler fast path
Goroutines Stack
todo See: I don’t know where goroutine stacks are allocated in virtual memory : r/golang
A traditional paging based stacks can’t be used:
- Not enough address space to support millions of 1Gb stacks. Because we need to reserve the virtual memory for all the stacks at the program startup, and we have 48 bits address space on 64 bit CPU, we can support only 128000 stacks
- Suboptimal granularity due to minimum page size of 4Kb
- Performance overhead due to system calls used to release unused pages
Instead, Go uses growable stacks allocated on the heap, which allows a goroutine to start with a single 2Kb stack which grows and shrinks as needed. A compiler inserts a prologue at the start of each function, which checks for the overflow by comparing the current stack pointer to g.stackguard0
. When a stack overflow check triggers, runtime calls runtime⋅morestack
to grow a stack:
- Allocate a new stack 2x the size of the old stack. This amortizes the cost of copying
- Copy the old stack to the new stack
- Adjust all pointers on the stack by utilizing
runtime.stackmap
metadata - Adjust all pointers into the stack from outside. There are a few despite escape analysis
- Deallocate the old stack
The p.mcache.stackcache
is a cache of small stacks
References
- Source code: proc.go
- Source code: runtime2.go
- Source code: stack.go
- GopherCon 2018: Kavya Joshi - The Scheduler Saga - YouTube
- Dmitry Vyukov — Go scheduler: Implementing language with lightweight concurrency - YouTube
- runtime: tight loops should be preemptible · Issue #10958 · golang/go · GitHub
- dotGo 2017 - JBD - Go’s work stealing scheduler - YouTube
- Go’s work-stealing scheduler · rakyll.org
- Scheduling In Go : Part I - OS Scheduler
- The Golang Scheduler
- Внутреннее устройство планировщика Go // Демо-занятие курса «Golang Developer. Professional» - YouTube
- ТиПМС 9. Планировщик Go - YouTubetodo
- Contiguous Stacks Design Doc - Google Docstodo
- Go Preemptive Scheduler Design Doc - Google Docs
- Scalable Go Scheduler Design Doc - Google Docs
- HACKING.md
- GopherCon 2020: Austin Clements - Pardon the Interruption: Loop Preemption in Go 1.14 - YouTubeTODO
- Chris’s Wiki :: blog/programming/GoSchedulerAndSyscalls
- 100 Go Mistakes and How to Avoid Them. Teiva Harsanyi