The current scheduler implementation

Let's look at the concrete scheduler implementation in src/runtime/proc.go:

package runtime

// The annotations below add some clarifications and adjustments to the original comments; the source is in /src/runtime/proc.go

// Goroutine scheduler
// The scheduler's job is to distribute ready-to-run goroutines over worker threads.
//
// The goroutine scheduler ("goroutine" is often rendered as "coroutine" in Chinese).
// The scheduler's job is to distribute ready-to-run goroutines across worker threads; worker threads here are real, working threads, i.e. OS threads.
//
//
// The main concepts are:
// G - goroutine.
// M - worker thread, or machine.
// P - processor, a resource that is required to execute Go code.
//     M must have an associated P to execute Go code, however it can be
//     blocked or in a syscall w/o an associated P.
//
// Design doc at https://golang.org/s/go11sched.
//
// The main concepts, annotated:
// G - goroutine: a lightweight, virtual thread, not a real OS thread.
// M - worker thread, or (crudely) machine: the real thread that actually executes the program.
// P - processor: a resource that is required to execute Go code.
//     An M must have an associated P to execute Go code, but it may be without one while it is blocked or in a syscall.
//
//
// Worker thread parking/unparking.
//
// Parking/unparking of worker threads: parking stops a thread (think of parking a car), unparking gets it moving again.
//
// We need to balance between keeping enough running worker threads to utilize
// available hardware parallelism and parking excessive running worker threads
// to conserve CPU resources and power. This is not simple for two reasons:
//
// We need to keep enough worker threads running to exploit the available hardware parallelism,
// while also parking excess worker threads to conserve CPU and power; we have to balance the two.
// This is not simple, for two reasons:
//
// (1) scheduler state is intentionally distributed (in particular, per-P work
// queues), so it is not possible to compute global predicates on fast paths;
// (2) for optimal thread management we would need to know the future (don't park
// a worker thread when a new goroutine will be readied in near future).
//
//
// (1) Scheduler state is intentionally distributed (in particular, each P has its own work queue), so global predicates cannot be computed on fast paths.
//	Since each P keeps its own run queue of G's, with many P's each responsible for a pile of G's, it is hard to know globally how many worker threads are needed, and therefore hard to control the real number of threads.
// (2) For optimal thread management we would need to know the future, i.e. what is about to happen (for example, do not park a worker thread when a new goroutine will become ready to run in the near future).
//
//
// Three rejected approaches that would work badly:
//
// Three rejected approaches that would work badly:
//
// 1. Centralize all scheduler state (would inhibit scalability).
// 2. Direct goroutine handoff. That is, when we ready a new goroutine and there
//    is a spare P, unpark a thread and handoff it the thread and the goroutine.
//    This would lead to thread state thrashing, as the thread that readied the
//    goroutine can be out of work the very next moment, we will need to park it.
//    Also, it would destroy locality of computation as we want to preserve
//    dependent goroutines on the same thread; and introduce additional latency.
// 3. Unpark an additional thread whenever we ready a goroutine and there is an
//    idle P, but don't do handoff. This would lead to excessive thread parking/
//    unparking as the additional threads will instantly park without discovering
//    any work to do.
//

// 1. Centralize all scheduler state (this would inhibit scalability).
// 2. Direct goroutine handoff. That is, when a new goroutine becomes ready and there is a spare P, immediately unpark a worker thread and hand the goroutine to it.
//	  This would lead to thread-state thrashing, because the thread that readied the goroutine may be out of work the very next moment and would have to be parked again.
//	  It would also destroy locality of computation, since we want to keep dependent goroutines on the same thread, and it would introduce extra latency from the frequent parking and unparking.
// 3. When a goroutine becomes ready and there is an idle P, unpark an additional thread but do not hand the goroutine to it directly. This would lead to excessive thread parking/unparking,
//    because the additional thread would immediately park again after finding no work to do.
//
//
//    Put plainly: when a new goroutine appears, handing it to a worker thread either immediately or a moment later both cause problems, leading to frequent parking and unparking.
//
// The current approach:
//
// The current approach:
//
// We unpark an additional thread when we ready a goroutine if (1) there is an
// idle P and there are no "spinning" worker threads. A worker thread is considered
// spinning if it is out of local work and did not find work in global run queue/
// netpoller; the spinning state is denoted in m.spinning and in sched.nmspinning.
// Threads unparked this way are also considered spinning; we don't do goroutine
// handoff so such threads are out of work initially. Spinning threads do some
// spinning looking for work in per-P run queues before parking. If a spinning
// thread finds work it takes itself out of the spinning state and proceeds to
// execution. If it does not find work it takes itself out of the spinning state
// and then parks.
//
// When a goroutine becomes ready, we unpark an additional thread if there is an idle P and no "spinning" worker threads.
// A worker thread is considered spinning when it has run out of local work and has also failed to find work in the global run queue or the netpoller.
// The spinning state is recorded in m.spinning and sched.nmspinning; the thread is not parked, but it counts as spinning. Threads unparked this way are also considered spinning: since we do not hand off goroutines, they start out with no work.
// Before parking, a spinning worker thread keeps searching the per-P run queues of G's for work. If a spinning thread finds work, it leaves the spinning state and starts executing that work.
// If it finds no work, it also leaves the spinning state and then parks.
//
// In short: when a new goroutine appears and there is no spinning thread, a new thread is started.
// Once started, these OS threads keep looking for work, run what they find, look again after finishing, and park only after a full round of searching turns up nothing.
//
// If there is at least one spinning thread (sched.nmspinning>1), we don't unpark
// new threads when readying goroutines. To compensate for that, if the last spinning
// thread finds work and stops spinning, it must unpark a new spinning thread.
// This approach smooths out unjustified spikes of thread unparking,
// but at the same time guarantees eventual maximal CPU parallelism utilization.
//
// If there is at least one spinning thread (sched.nmspinning>1), we do not unpark a new thread when a goroutine becomes ready.
// To compensate, if the last spinning thread finds work and stops spinning, it must unpark a new spinning thread.
// This approach smooths out unjustified spikes of thread unparking while still guaranteeing maximal CPU parallelism utilization eventually.
//
//
// The main implementation complication is that we need to be very careful during
// spinning->non-spinning thread transition. This transition can race with submission
// of a new goroutine, and either one part or another needs to unpark another worker
// thread. If they both fail to do that, we can end up with semi-persistent CPU
// underutilization. The general pattern for goroutine readying is: submit a goroutine
// to local work queue, #StoreLoad-style memory barrier, check sched.nmspinning.
// The general pattern for spinning->non-spinning transition is: decrement nmspinning,
// #StoreLoad-style memory barrier, check all per-P work queues for new work.
// Note that all this complexity does not apply to global run queue as we are not
// sloppy about thread unparking when submitting to global queue. Also see comments
// for nmspinning manipulation.
//
// The main implementation complication is that we need to be very careful during the spinning -> non-spinning transition. This transition can race with the submission of a new goroutine,
// and either one side or the other needs to unpark another worker thread. If both fail to do so, we can end up with semi-persistent CPU underutilization.
//
// In other words, we must manage carefully when a thread changes its spinning state and when a new thread is started, so that the CPU stays fully utilized.
//
// The general pattern for readying a goroutine is: submit the goroutine to the local work queue, issue a #StoreLoad-style memory barrier,
// then check sched.nmspinning. That is, after the goroutine has been pushed into the queue (with a memory barrier in between), check the number of spinning threads.
// The general pattern for the spinning -> non-spinning transition is: decrement nmspinning, issue a #StoreLoad-style memory barrier, then check all per-P work queues for new work, i.e. go look for something to run.
// Note that none of this complexity applies to the global run queue, because we are not sloppy about unparking threads when submitting to the global queue; see also the comments on nmspinning manipulation.
//
// To restate: each P has its own run queue of G's, and there is also a global G run queue to which the rules below do not apply.
// For each per-P run queue, when a goroutine is added and no spinning thread exists, a new spinning thread is started.
// These spinning threads keep scanning every P for runnable G's; once one finds work it executes it and leaves the spinning state, and if it was the last spinning thread it unparks a new spinning thread before doing so.
//
//

Summary:

Each P has its own run queue of G's. When a goroutine is added, the scheduler checks whether a spinning M already exists; if not, it starts (or unparks) a new thread M.

Spinning M's scan every P's queue looking for a G. When one finds a G it runs it, and if it was the last spinning M it first wakes another spinning M; if a full round of searching turns up no G, the M leaves the spinning state and parks.
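
To make these rules concrete, here is a heavily simplified Go sketch of the wake-up logic described in the comments. It is only an illustration: the real logic lives in functions such as wakep and resetspinning in src/runtime/proc.go, while the counters and helper below (nmspinning, npidle, startSpinningM) are loose stand-ins with locking and corner cases omitted.

package sketch

import "sync/atomic"

// Simplified global scheduler state; the real runtime keeps these counters in
// its sched structure.
var (
	nmspinning int32 // number of spinning M's
	npidle     int32 // number of idle P's
)

// wakeIfNeeded is called right after a goroutine has been pushed onto a per-P
// run queue. The atomic operations stand in for the #StoreLoad-style barrier:
// publish the work first, then look at the spinning count.
func wakeIfNeeded() {
	if atomic.LoadInt32(&npidle) == 0 {
		return // no idle P, so no thread to hand the work to
	}
	// Wake a worker only if nobody is spinning yet; the worker we wake
	// starts out in the spinning state itself.
	if atomic.CompareAndSwapInt32(&nmspinning, 0, 1) {
		startSpinningM()
	}
}

// stopSpinning is called by a spinning M that has just found work. If it was
// the last spinning M, it must wake another one (the compensation rule), so
// that goroutines readied in the meantime are not stranded.
func stopSpinning() {
	if atomic.AddInt32(&nmspinning, -1) == 0 && atomic.LoadInt32(&npidle) > 0 {
		if atomic.CompareAndSwapInt32(&nmspinning, 0, 1) {
			startSpinningM()
		}
	}
}

// startSpinningM would unpark a parked worker thread or create a new one.
func startSpinningM() { /* omitted */ }

The important property is the pairing: whoever readies work checks nmspinning after publishing it, and whoever stops spinning re-checks for work (or wakes a successor) after decrementing nmspinning, so the race described in the comments cannot leave runnable work unclaimed.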

Scalable Go Scheduler Design Doc

The design document mentioned above lives at https://golang.org/s/go11sched; the full text follows:

The document assumes some prior knowledge of the Go language and current goroutine scheduler implementation.

Problems with current scheduler

Current goroutine scheduler limits scalability of concurrent programs written in Go, in particular, high-throughput servers and parallel computational programs. Vtocc server maxes out at 70% CPU on 8-core box, while profile shows 14% is spent in runtime.futex(). In general, the scheduler may inhibit users from using idiomatic fine-grained concurrency where performance is critical.

What’s wrong with current implementation:

  1. Single global mutex (Sched.Lock) and centralized state. The mutex protects all goroutine-related operations (creation, completion, rescheduling, etc).
  2. Goroutine (G) hand-off (G.nextg). Worker threads (M’s) frequently hand-off runnable goroutines between each other, this may lead to increased latencies and additional overheads. Every M must be able to execute any runnable G, in particular the M that just created the G.
  3. Per-M memory cache (M.mcache). Memory cache and other caches (stack alloc) are associated with all M’s, while they need to be associated only with M’s running Go code (an M blocked inside of syscall does not need mcache). A ratio between M’s running Go code and all M’s can be as high as 1:100. This leads to excessive resource consumption (each MCache can suck up up to 2M) and poor data locality.
  4. Aggressive thread blocking/unblocking. In presence of syscalls worker threads are frequently blocked and unblocked. This adds a lot of overhead.

Design

Processors

The general idea is to introduce a notion of P (Processors) into runtime and implement work-stealing scheduler on top of Processors.

M represents OS thread (as it is now). P represents a resource that is required to execute Go code. When M executes Go code, it has an associated P. When M is idle or in syscall, it does not need P.

There are exactly GOMAXPROCS P's. All P's are organized into an array; that is a requirement of work-stealing. A GOMAXPROCS change involves stopping/starting the world to resize the array of P's.
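
GOMAXPROCS is still the knob that sizes the set of P's; it can be inspected or changed from user code with the standard runtime package, for example:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) queries the current value without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	// Since Go 1.5 the default is the number of logical CPUs.
	fmt.Println("NumCPU:    ", runtime.NumCPU())
}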

Some variables from sched are de-centralized and moved to P. Some variables from M are moved to P (the ones that relate to active execution of Go code).

struct P
{
    Lock;
    G *gfree; // freelist, moved from sched
    G *ghead; // runnable, moved from sched
    G *gtail;
    MCache *mcache; // moved from M
    FixAlloc *stackalloc; // moved from M
    uint64 ncgocall;
    GCStats gcstats;
    // etc
...
};

P *allp; // [GOMAXPROCS]

There is also a lock-free list of idle P’s:

P *idlep; // lock-free list

When an M is willing to start executing Go code, it must pop a P from the list. When an M ends executing Go code, it pushes the P to the list. So, when M executes Go code, it necessarily has an associated P. This mechanism replaces sched.atomic (mcpu/mcpumax).
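
A minimal sketch of this pop/push protocol, assuming a mutex-protected idle list for simplicity (the implementation plan below also starts with a mutex-protected idlep before making it lock-free). The G, M and P types and the helper names here are simplified stand-ins, not the real runtime structures.

package sketch

import "sync"

type G struct{}

type P struct {
	runq []*G // runnable G's, moved from sched to P
}

type M struct {
	p *P // non-nil only while this M is running Go code
}

var (
	idleMu sync.Mutex
	idleP  []*P // idle P's; the design calls for a lock-free list eventually
)

// acquireP: an M must pop a P before it may execute Go code.
func (m *M) acquireP() bool {
	idleMu.Lock()
	defer idleMu.Unlock()
	if len(idleP) == 0 {
		return false // no free P: the M stays idle or blocks
	}
	m.p = idleP[len(idleP)-1]
	idleP = idleP[:len(idleP)-1]
	return true
}

// releaseP: when the M stops running Go code (idle or entering a syscall),
// it pushes its P back so another M can pick it up.
func (m *M) releaseP() {
	idleMu.Lock()
	idleP = append(idleP, m.p)
	m.p = nil
	idleMu.Unlock()
}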

Scheduling

When a new G is created or an existing G becomes runnable, it is pushed onto a list of runnable goroutines of current P. When P finishes executing G, it first tries to pop a G from its own list of runnable goroutines; if the list is empty, P chooses a random victim (another P) and tries to steal a half of runnable goroutines from it.
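
The same stand-in types can be used to sketch this work-stealing rule. Locking of the victim's queue is omitted, and findRunnable below is only meant to show the "own queue first, then steal half from a random victim" shape of the algorithm.

package sketch

import "math/rand"

// findRunnable pops from the own run queue first; if it is empty, it picks a
// random victim P and steals half of its runnable G's. The real scheduler
// protects the victim's queue with per-P locks/atomics; that is omitted here.
func (p *P) findRunnable(allp []*P) *G {
	if g := p.pop(); g != nil {
		return g
	}
	victim := allp[rand.Intn(len(allp))]
	if victim == p || len(victim.runq) == 0 {
		return nil // caller retries, checks the global queue, or parks
	}
	n := (len(victim.runq) + 1) / 2 // steal half, at least one
	stolen := victim.runq[:n]
	victim.runq = victim.runq[n:]
	p.runq = append(p.runq, stolen[1:]...)
	return stolen[0]
}

func (p *P) pop() *G {
	if len(p.runq) == 0 {
		return nil
	}
	g := p.runq[0]
	p.runq = p.runq[1:]
	return g
}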

Syscalls/M Parking and Unparking

When an M creates a new G, it must ensure that there is another M to execute the G (if not all M’s are already busy). Similarly, when an M enters syscall, it must ensure that there is another M to execute Go code.

There are two options, we can either promptly block and unblock M’s, or employ some spinning.

There is an inherent conflict between performance and burning unnecessary CPU cycles. The idea is to use spinning and do burn CPU cycles. However, it should not affect programs running with GOMAXPROCS=1 (command line utilities, appengine, etc).

Spinning is two-level: (1) an idle M with an associated P spins looking for new G’s, (2) an M w/o an associated P spins waiting for available P’s. There are at most GOMAXPROCS spinning M’s (both (1) and (2)). Idle M’s of type (1) do not block while there are idle M’s of type (2).

When a new G is spawned, or M enters syscall, or M transitions from idle to busy, it ensures that there is at least 1 spinning M (or all P’s are busy). This ensures that there are no runnable G’s that can be otherwise running; and avoids excessive M blocking/unblocking at the same time.

Spinning is mostly passive (yield to OS, sched_yield()), but may include a little bit of active spinning (loop burning CPU) (requires investigation and tuning).
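
Under the same stand-ins, level (1) spinning might be sketched as below. spinRounds is a made-up bound, and runtime.Gosched() merely plays the role of the sched_yield() mentioned above, so this shows the shape of the loop rather than the runtime's actual behavior; level (2), an M without a P waiting for one, is not shown.

package sketch

import "runtime"

// spinRounds is a made-up bound on how long an idle M keeps spinning before
// it parks; the real runtime tunes this differently.
const spinRounds = 4

// spin models an idle M that still owns a P: a few rounds of steal attempts,
// yielding between rounds, then give up so the caller can park the M.
func (p *P) spin(allp []*P) *G {
	for i := 0; i < spinRounds; i++ {
		if g := p.findRunnable(allp); g != nil {
			return g // leave the spinning state and run g
		}
		runtime.Gosched() // stand-in for sched_yield(): mostly passive spinning
	}
	return nil // nothing found: the caller parks this M
}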

Termination/Deadlock Detection

Termination/deadlock detection is more problematic in a distributed system. The general idea is to do the checks only when all P’s are idle (global atomic counter of idle P’s), this allows to do more expensive checks that involve aggregation of per-P state.

No details yet.
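
Even so, the "check only when all P's are idle" idea can be sketched with a global atomic counter; idlePCount, gomaxprocsVal and checkDeadlock below are placeholders rather than real runtime names.

package sketch

import "sync/atomic"

var (
	gomaxprocsVal int32 = 4 // placeholder for the real GOMAXPROCS value
	idlePCount    int32     // global count of idle P's
)

// pBecameIdle is called whenever a P goes idle. Only when the counter reaches
// GOMAXPROCS (i.e. every P is idle) do we pay for the expensive check that
// aggregates per-P state.
func pBecameIdle() {
	if atomic.AddInt32(&idlePCount, 1) == gomaxprocsVal {
		checkDeadlock()
	}
}

// pBecameBusy is called when a P starts running Go code again.
func pBecameBusy() {
	atomic.AddInt32(&idlePCount, -1)
}

// checkDeadlock would walk per-P state looking for runnable or waiting G's;
// details are omitted, as in the document.
func checkDeadlock() {}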

LockOSThread

This functionality is not performance-critical.

  1. Locked G becomes non-runnable (Gwaiting). M instantly returns P to idle list, wakes up another M and blocks.
  2. Locked G becomes runnable (and reaches head of the runq). Current M hands off own P and locked G to the M associated with the locked G, and unblocks it. Current M becomes idle.

Idle G

This functionality is not performance-critical.

There is a global queue of (or a single?) idle G. An M that looks for work checks the queue after several unsuccessful steal attempts.

Implementation Plan

The goal is to split the whole thing into minimal parts that can be independently reviewed and submitted.

  1. Introduce the P struct (empty for now); implement allp/idlep containers (idlep is mutex-protected for starters); associate a P with M running Go code. Global mutex and atomic state is still preserved.
  2. Move G freelist to P.
  3. Move mcache to P.
  4. Move stackalloc to P.
  5. Move ncgocall/gcstats to P.
  6. Decentralize run queue, implement work-stealing. Eliminate G hand off. Still under global mutex.
  7. Remove global mutex, implement distributed termination detection, LockOSThread.
  8. Implement spinning instead of prompt blocking/unblocking.

The plan may turn out to not work, there are a lot of unexplored details.

Potential Further Improvements

  1. Try out LIFO scheduling, this will improve locality. However, it still must provide some degree of fairness and gracefully handle yielding goroutines.
  2. Do not allocate G and stack until the goroutine first runs. For a newly created goroutine we need just callerpc, fn, narg, nret and args, that is, about 6 words. This will allow to create a lot of running-to-completion goroutines with significantly lower memory overhead.
  3. Better locality of G-to-P. Try to enqueue an unblocked G to a P on which it was last running.
  4. Better locality of P-to-M. Try to execute P on the same M it was last running.
  5. Throttling of M creation. The scheduler can be easily forced to create thousands of M’s per second until OS refuses to create more threads. M’s must be created promptly up to k*GOMAXPROCS, after that new M’s may be added by a timer.

Random Notes

  • GOMAXPROCS won’t go away as a result of this work.