【K8s源码分析（一）】-K8s调度框架及调度器初始化介绍

本次分析参考的K8s版本是v1.27.0。

调度框架介绍

这是官方对于v1.27调度框架的介绍文档：https://v1-27.docs.kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/

将调度器的实现转化为插件的形式有助于加强调度器的拓展性、灵活性，同时也使得调度核心的实现更加的轻量、可维护。

下图展示了Pod的调度上下文以及调度框架暴露的扩展点。图中“Filter”相当于“Predicate”，“Scoring”相当于“Priority function”。

Scheduling framework extension points

总体而言，首先新创建的Pod或还没有调度的Pod会存在队列中，然后经过调度周期的筛选得到符合条件的Node，然后在调度周期内再对各个符合条件的Node进行打分，最高分的Node就是需要调度到的Node，然后经过绑定周期将Pod放置到Node上。

各个拓展点的具体介绍建议参考上面提到的官方介绍文档，这里不再赘述。

K8s scheduler 介绍

首先需要明确的一个点，K8s中的scheduler是以pod的形式运行在系统中的，通过如下的命令能找到其对应的pod。

# kubectl get pod -n kube-system
NAME                                       READY   STATUS             RESTARTS         AGE
...
kube-scheduler-master                      1/1     Running            0                2d4h
...

Pod中的容器会存在一个scheduler程序并一直在前台运行，接收要调度的pod并给出调度结果。本文主要分析的也就是这个scheduler程序所对应的源代码。

这是官方对K8s scheduler代码层次结构的介绍文档：Scheduler code hierarchy overview。也很推荐观看！

整体的关键代码的结构如下所示：

.
├── cmd
│   └── kube-scheduler
│       └── app - 控制器代码位置以及命令行接口参数定义（遵循所有Kubernetes控制器的标准设置）
├── pkg
│   └── scheduler - 默认调度器代码库的根目录
│       ├── core - 默认调度算法的位置
│       ├── framework - 调度框架及其插件
│       └── internal - 缓存、队列和其他内部元素的实现
├── staging
│   └── src
│       └── k8s.io
│           └── kube-scheduler - ComponentConfig API类型的所在位置
└── test
    ├── e2e
    │   └── scheduling - 端到端调度测试
    │
    ├── integration
        ├── scheduler - 调度器集成测试
        └── scheduler_perf - 调度性能基准测试

K8s scheduler的初始化

Cobra介绍

K8s中大部分组件其实都采用的是Cobra结构。Cobra是一个用于创建现代命令行应用程序的库，云原生中很多项目都采用了它，包括Kubernetes、Hugo、GitHub CLI等，目前都有36.2k个start了。而K8s中的scheduler实际上也是通过Cobra构建的。

Cobra的具体介绍可以参见万字长文——Go 语言现代命令行框架 Cobra 详解。

这边以一个小demo为例进行简单介绍。一个demo项目定义了一个名为hugo的命令行工具，代码如下所示：

.
├── cmd
│   ├── root.go
│   └── version.go
├── go.mod
├── go.sum
└── main.go

main.go的内容如下：

package main

import (
	"hugo/cmd"
)

func main() {
	cmd.Execute()
}

root.go的内容如下：

package cmd

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

var rootCmd = &cobra.Command{
	Use:   "hugo",
	Short: "Hugo is a very fast static site generator",
	Long: `A Fast and Flexible Static Site Generator built with
				  love by spf13 and friends in Go.
				  Complete documentation is available at https://gohugo.io`,
	RunE: func(cmd *cobra.Command, args []string) error {
		fmt.Println("run hugo...")
		return nil
	},
}

func Execute() {
	if err := rootCmd.Execute(); err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
}

version.go的内容如下：

package cmd

import (
	"fmt"

	"github.com/spf13/cobra"
)

var versionCmd = &cobra.Command{
	Use:   "version",
	Short: "Print the version number of Hugo",
	Long:  `All software has versions. This is Hugo's`,
	RunE: func(cmd *cobra.Command, args []string) error {
		fmt.Println("Hugo Static Site Generator v0.9 -- HEAD")
		return nil
	},
}

func init() {
	rootCmd.AddCommand(versionCmd)
}

可以看到main.go的主要内容就是调用root.go中的Execute()函数，然后这个函数又是调用cobra定义的rootCmd对其进行执行。rootCmd是一个cobra.Command类，它定义时写了自己的说明文本，然后Run函数是最关键的，定义了自己的运行内容，也就是打印一句字符，这就是单独在命令行中输入hugo后需要执行的程序。如果想要进行命令嵌套，那么就得像version.go文件中的处理方法一样再定义另一个cobra的cmd变量versionCmd ，然后通过AddCommand函数就可以加入进去，如此之后就可以通过hugo version来运行versionCmd 中的Run对应的函数。

项目build之后得到执行文件hugo，运行结果如下

# ./hugo 
run hugo..
# ./hugo -h
A Fast and Flexible Static Site Generator built with
                                  love by spf13 and friends in Go.
                                  Complete documentation is available at https://gohugo.io

Usage:
  hugo [flags]
  hugo [command]

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  help        Help about any command
  version     Print the version number of Hugo

Flags:
  -h, --help   help for hugo

Use "hugo [command] --help" for more information about a command.
# ./hugo version
Hugo Static Site Generator v0.9 -- HEAD

K8s scheduler中初始化的源代码解析

K8s的scheduler也是类似于上面的hugo程序，只不过更加复杂。

首先在cmd/kube-scheduler/scheduler.go:29中我们能看见scheduler的入口函数：

func main() {
	command := app.NewSchedulerCommand()
	code := cli.Run(command)
	os.Exit(code)
}

这里也是通过app.NewSchedulerCommand得到了一个cobra.Command 类，然后让这个类运行起来。

具体看cmd/kube-scheduler/app/server.go:76

// NewSchedulerCommand creates a *cobra.Command object with default parameters and registryOptions
func NewSchedulerCommand(registryOptions ...Option) *cobra.Command {
	opts := options.NewOptions()

	cmd := &cobra.Command{
		Use: "kube-scheduler",
		Long: `The Kubernetes scheduler is a control plane process which assigns
Pods to Nodes. The scheduler determines which Nodes are valid placements for
each Pod in the scheduling queue according to constraints and available
resources. The scheduler then ranks each valid Node and binds the Pod to a
suitable Node. Multiple different schedulers may be used within a cluster;
kube-scheduler is the reference implementation.
See [scheduling](https://kubernetes.io/docs/concepts/scheduling-eviction/)
for more information about scheduling and the kube-scheduler component.`,
		RunE: func(cmd *cobra.Command, args []string) error {
			return runCommand(cmd, opts, registryOptions...)
		},
		Args: func(cmd *cobra.Command, args []string) error {
			for _, arg := range args {
				if len(arg) > 0 {
					return fmt.Errorf("%q does not take any arguments, got %q", cmd.CommandPath(), args)
				}
			}
			return nil
		},
	}
	
	//...
	
	return cmd
}

这里定义了一个cobra.Command，与之前的示例类似，主要的内容还是在runCommand中。

查看其对应的内容，cmd/kube-scheduler/app/server.go:121

// runCommand runs the scheduler.
func runCommand(cmd *cobra.Command, opts *options.Options, registryOptions ...Option) error {
	verflag.PrintAndExitIfRequested()

	// Activate logging as soon as possible, after that
	// show flags with the final logging configuration.
	if err := logsapi.ValidateAndApply(opts.Logs, utilfeature.DefaultFeatureGate); err != nil {
		fmt.Fprintf(os.Stderr, "%v\n", err)
		os.Exit(1)
	}
	cliflag.PrintFlags(cmd.Flags())

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go func() {
		stopCh := server.SetupSignalHandler()
		<-stopCh
		cancel()
	}()

	cc, sched, err := Setup(ctx, opts, registryOptions...)
	if err != nil {
		return err
	}
	// add feature enablement metrics
	utilfeature.DefaultMutableFeatureGate.AddMetrics()
	return Run(ctx, cc, sched)
}

前面的内容主要是一些配置文件，其中最主要的初始化配置函数是Setup(ctx, opts, registryOptions...) ，初始化完毕后就会返回一个scheduler。

具体的内容在cmd/kube-scheduler/app/server.go:309 ，对这部分代码的一些解释放在了注释里。

// Setup creates a completed config and a scheduler based on the command args and options
func Setup(ctx context.Context, opts *options.Options, outOfTreeRegistryOptions ...Option) (*schedulerserverconfig.CompletedConfig, *scheduler.Scheduler, error) {
    // 尝试获取默认的调度器配置
    if cfg, err := latest.Default(); err != nil {
        return nil, nil, err
    } else {
        opts.ComponentConfig = cfg // 如果没有错误，将配置赋值给opts
    }

    // 验证opts中的选项是否有效
    if errs := opts.Validate(); len(errs) > 0 {
        return nil, nil, utilerrors.NewAggregate(errs) // 如果有验证错误，返回它们
    }

    // 从opts创建一个调度器的配置对象
    c, err := opts.Config(ctx)
    if err != nil {
        return nil, nil, err
    }

    // 从调度器配置对象中获取完整的配置
    cc := c.Complete()

    // 创建一个用于存放外部插件的注册表
    outOfTreeRegistry := make(runtime.Registry)
    for _, option := range outOfTreeRegistryOptions {
        if err := option(outOfTreeRegistry); err != nil {
            return nil, nil, err
        }
    }

    // 获取事件记录器工厂
    recorderFactory := getRecorderFactory(&cc)

    // 创建一个空的调度器配置概要切片
    completedProfiles := make([]kubeschedulerconfig.KubeSchedulerProfile, 0)

    // 使用一系列参数和配置选项创建一个新的调度器实例
    sched, err := scheduler.New(
        cc.Client,                                 // 客户端对象
        cc.InformerFactory,                       // Informer工厂
        cc.DynInformerFactory,                    // 动态Informer工厂
        recorderFactory,                          // 事件记录器工厂
        ctx.Done(),                               // 上下文取消通道
        scheduler.WithComponentConfigVersion(cc.ComponentConfig.TypeMeta.APIVersion),  // 组件配置版本
        scheduler.WithKubeConfig(cc.KubeConfig),                                      // Kube配置
        scheduler.WithProfiles(cc.ComponentConfig.Profiles...),                       // 调度器配置概要
        scheduler.WithPercentageOfNodesToScore(cc.ComponentConfig.PercentageOfNodesToScore), // 节点评分百分比
        scheduler.WithFrameworkOutOfTreeRegistry(outOfTreeRegistry),                // 外部插件注册表
        scheduler.WithPodMaxBackoffSeconds(cc.ComponentConfig.PodMaxBackoffSeconds),  // Pod最大退避秒数
        scheduler.WithPodInitialBackoffSeconds(cc.ComponentConfig.PodInitialBackoffSeconds),  // Pod初始退避秒数
        scheduler.WithPodMaxInUnschedulablePodsDuration(cc.PodMaxInUnschedulablePodsDuration), // Pod在不可调度Pod列表中的最大持续时间
        scheduler.WithExtenders(cc.ComponentConfig.Extenders...),  // 扩展器
        scheduler.WithParallelism(cc.ComponentConfig.Parallelism),  // 并行度
        scheduler.WithBuildFrameworkCapturer(func(profile kubeschedulerconfig.KubeSchedulerProfile) {
            // 在框架实例化期间处理概要以设置默认插件和配置，并捕获它们以记录日志
            completedProfiles = append(completedProfiles, profile)
        }),
    )
    if err != nil {
        return nil, nil, err
    }

    // 记录或写入配置和概要信息
    if err := options.LogOrWriteConfig(klog.FromContext(ctx), opts.WriteConfigTo, &cc.ComponentConfig, completedProfiles); err != nil {
        return nil, nil, err
    }

    // 返回完整的配置和调度器实例
    return &cc, sched, nil
}

得到scheduler后运行的函数还是在后面的Run(ctx, cc, sched)里。

查看其对应的内容，cmd/kube-scheduler/app/server.go:150 ，补充了一部分解释放在代码的注释里。

// Run executes the scheduler based on the given configuration. It only returns on error or when context is done.
func Run(ctx context.Context, cc *schedulerserverconfig.CompletedConfig, sched *scheduler.Scheduler) error {
    logger := klog.FromContext(ctx) // 从上下文中获取日志记录器

    // 为了帮助调试，立即记录版本信息
    logger.Info("Starting Kubernetes Scheduler", "version", version.Get())

    // 记录 Golang 的设置，这些环境变量会影响 Go 运行时的行为
    logger.Info("Golang settings", "GOGC", os.Getenv("GOGC"), "GOMAXPROCS", os.Getenv("GOMAXPROCS"), "GOTRACEBACK", os.Getenv("GOTRACEBACK"))

    // Configz 注册，Configz 允许通过 HTTP 端点公开当前的配置
    if cz, err := configz.New("componentconfig"); err == nil {
        cz.Set(cc.ComponentConfig) // 设置调度器的组件配置
    } else {
        return fmt.Errorf("unable to register configz: %s", err) // 如果注册失败，返回错误
    }

    // 启动事件处理流水线
    cc.EventBroadcaster.StartRecordingToSink(ctx.Done()) // 开始录制事件
    defer cc.EventBroadcaster.Shutdown()                   // 延后关闭事件广播

    // 设置健康检查
    var checks []healthz.HealthChecker
    if cc.ComponentConfig.LeaderElection.LeaderElect {
        checks = append(checks, cc.LeaderElection.WatchDog) // 如果启用了领导者选举，添加 WatchDog 健康检查
    }

    // 等待领导者选举的通道
    waitingForLeader := make(chan struct{})
    isLeader := func() bool {
        select {
        case _, ok := <-waitingForLeader:
            // 如果通道关闭，我们是领导者
            return !ok
        default:
            // 通道是打开的，我们正在等待领导者
            return false
        }
    }

    // 启动健康检查服务器
    if cc.SecureServing != nil {
        // 构建处理函数链
        handler := buildHandlerChain(newHealthzAndMetricsHandler(&cc.ComponentConfig, cc.InformerFactory, isLeader, checks...), cc.Authentication.Authenticator, cc.Authorization.Authorizer)
        // 启动安全服务器，注意处理返回的 stoppedCh 和 listenerStoppedCh
        if _, _, err := cc.SecureServing.Serve(handler, 0, ctx.Done()); err != nil {
            return fmt.Errorf("failed to start secure server: %v", err) // 如果启动失败，返回错误
        }
    }

    // 启动所有的 informer
    cc.InformerFactory.Start(ctx.Done()) // 启动 informer 工厂
    // DynInformerFactory 可以在测试中为 nil
    if cc.DynInformerFactory != nil {
        cc.DynInformerFactory.Start(ctx.Done()) // 启动动态 informer 工厂
    }

    // 等待所有缓存同步后再进行调度
    cc.InformerFactory.WaitForCacheSync(ctx.Done()) // 等待 informer 工厂的缓存同步
    if cc.DynInformerFactory != nil {
        cc.DynInformerFactory.WaitForCacheSync(ctx.Done()) // 等待动态 informer 工厂的缓存同步
    }

    // 如果启用了领导者选举，通过 LeaderElector 运行直到完成并退出
    if cc.LeaderElection != nil {
        // 设置领导者选举的回调
        cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                close(waitingForLeader) // 关闭等待领导者的通道，表示我们现在是领导者
                sched.Run(ctx)           // 运行调度器
            },
            OnStoppedLeading: func() {
                select {
                case <-ctx.Done():
                    // 我们被请求终止。退出 0。
                    logger.Info("Requested to terminate, exiting")
                    os.Exit(0)
                default:
                    // 我们失去了锁。
                    logger.Error(nil, "Leaderelection lost")
                    klog.FlushAndExit(klog.ExitFlushTimeout, 1)
                }
            },
        }
        // 创建新的领导者选举
        leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
        if err != nil {
            return fmt.Errorf("couldn't create leader elector: %v", err) // 如果创建失败，返回错误
        }

        leaderElector.Run(ctx) // 运行领导者选举

        return fmt.Errorf("lost lease") // 如果失去租约，返回错误
    }

    // 领导者选举被禁用，因此内联运行直到完成
    close(waitingForLeader) // 关闭等待领导者的通道
    sched.Run(ctx)           // 运行调度器
    return fmt.Errorf("finished without leader") // 如果没有领导者，返回错误
}

这个函数首先设置日志记录器，记录版本和 Golang 环境设置，然后注册配置以供调试使用。接着，它启动事件处理流水线，并设置健康检查和健康检查服务器。之后，函数启动 informer 并等待缓存同步。如果配置了领导者选举，它会通过领导者选举器运行调度器，否则直接运行调度器。如果在任何步骤中出现错误，函数会返回该错误。

查看sched.Run(ctx) 这部分调度器实际运行的内容，pkg/scheduler/scheduler.go:355 ，补充了一部分解释放在代码的注释里。

// Run begins watching and scheduling. It starts scheduling and blocked until the context is done.
func (sched *Scheduler) Run(ctx context.Context) {
    // 启动调度队列，这将允许调度器观察新的、需要调度的 Pods
    sched.SchedulingQueue.Run()

	  // We need to start scheduleOne loop in a dedicated goroutine,
	  // because scheduleOne function hangs on getting the next item
  	// from the SchedulingQueue.
  	// If there are no new pods to schedule, it will be hanging there
  	// and if done in this goroutine it will be blocking closing
  	// SchedulingQueue, in effect causing a deadlock on shutdown.
    // 翻译：
    // 我们需要在一个独立的 goroutine 中启动 scheduleOne 循环，
    // 因为 scheduleOne 函数在从 SchedulingQueue 获取下一个项目时会挂起。
    // 如果没有新的 Pods 需要调度，它会在那里挂起，
    // 如果在这个 goroutine 中执行，它将阻止关闭 SchedulingQueue，
    // 从而在关闭时造成死锁。
    go wait.UntilWithContext(ctx, sched.scheduleOne, 0)

    // 当上下文完成（即 ctx.Done() 通道关闭）时，阻塞直到收到信号
    <-ctx.Done()

    // 关闭调度队列，这将停止调度器的事件循环
    sched.SchedulingQueue.Close()
}

可以看到到了这里就剩下了两个主要的实体：调度队列和调度算法。

调度队列收集需要调度的Pod，然后提交给scheduler调度，具体将在后面进行介绍。
go wait.UntilWithContext(ctx, sched.scheduleOne, 0)启用了一个go协程，然后负责一个一个调度pod。注意一下go wait.UntilWithContext ，它是 Kubernetes 项目中用于周期性运行函数的工具。它是一个包装了 time.Ticker 和 context.Context 的机制，允许在给定的时间间隔内重复执行某个函数，直到提供的上下文被取消。函数的基本签名如下：
1
func UntilWithContext(ctx context.Context, f func(context.Context), period time.Duration)
当 ctx.Done() 通道关闭时，wait.UntilWithContext 将停止执行其周期性的任务sched.scheduleOne，0 表示两次迭代之间没有间隔，sched.scheduleOne 将尽可能快地被调用。具体sched.scheduleOne的介绍将在后面进行。

k8s > 源码分析

#k8s #源码分析

【K8s源码分析（一）】-K8s调度框架及调度器初始化介绍

http://example.com/2024/05/10/k8sSource1/

作者

滑滑蛋

发布于

2024年5月10日

许可协议

【K8s源码分析（二）】-K8s调度队列介绍上一篇

【论文阅读】Gödel:Unified Large-Scale Resource Management and Scheduling at ByteDance 下一篇