


shixinzhang / 2968 reads











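runC is a lightweight tool for running containers. It does that one thing and aims to do it well. You can think of it as a small command-line tool that runs containers directly, without going through the Docker engine. runC is in fact a product of standardization: it creates and runs containers according to the OCI specifications. OCI (the Open Container Initiative) is an organization that aims to establish open industry standards around container formats and runtimes.
OCI was founded in 2015 by Docker, CoreOS and other container companies. It currently maintains two main specification documents: the container runtime specification (runtime spec) and the container image specification (image spec).
runC is written in Go and is based on the libcontainer library. The Docker architecture since Docker 1.11:
[Figure: Docker architecture diagram]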


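runc currently supports the Linux platform on various architectures. It must be built with Go 1.6 or later for some features to work properly.

To enable seccomp support, libseccomp must be installed on your platform,
e.g. libseccomp-devel for CentOS, or libseccomp-dev for Ubuntu

Otherwise, if you do not want to build runc with seccomp support, you can add BUILDTAGS="" when running make.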

# create a "github.com/opencontainers" in your GOPATH/src
cd github.com/opencontainers
git clone https://github.com/opencontainers/runc
cd runc

make
sudo make install


Optional features are selected with build tags, passed to make through the BUILDTAGS variable:

make BUILDTAGS="seccomp apparmor"

Build Tag   Feature                               Dependency
seccomp     Syscall filtering                     libseccomp
selinux     selinux process and mount labeling    -
apparmor    apparmor profile support              -
ambient     ambient capability support            kernel 4.3

Creating an OCI bundle with runC


# create the top most bundle directory
mkdir /mycontainer
cd /mycontainer

# create the rootfs directory
mkdir rootfs

# export busybox via Docker into the rootfs directory
docker export $(docker create busybox) | tar -C rootfs -xvf -


runc spec

First, prepare a working directory; every command below is run inside it. For example, mycontainer:

# mkdir mycontainer

Next, prepare the filesystem for the container image. Here we extract it from a Docker image:

# mkdir rootfs
# docker export $(docker create busybox) | tar -C rootfs -xvf -
# ls rootfs 
bin  dev  etc  home  proc  root  sys  tmp  usr  var

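With the rootfs in place, the OCI specification also requires a configuration file, config.json, that describes how to run the container: the command to run, permissions, environment variables and so on. runc provides a command that generates it for us: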

# runc spec
# ls
config.json  rootfs

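Together these make up an OCI runtime bundle. The bundle is very simple and consists of just the two items above: the config.json file and the rootfs filesystem. The contents of config.json are long, so they are not reproduced here; we will not modify the file and simply use the generated defaults. With this information runc knows how to run the container. Let's start with the simplest approach, runc run (this command requires root privileges). It is similar to docker run: it creates and starts a container: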

runc run simplebusybox
/ # ls
bin   dev   etc   home  proc  root  sys   tmp   usr   var
/ # hostname
/ # whoami
/ # pwd
/ # ip addr
1: lo:  mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
/ # ps aux
    1 root       0:00 sh
   11 root       0:00 ps aux


runc list
ID              PID         STATUS      BUNDLE                                    CREATED                          OWNER
simplebusybox   18073       running     /home/cizixs/Workspace/runc/mycontainer   2017-11-02T06:54:52.023379345Z   root


    app.Commands = []cli.Command{

Underneath, these commands call the libcontainer library to perform the actual operations.
Take the create command as an example:

var createCommand = cli.Command{
    Name:  "create",
    Usage: "create a container",
    ArgsUsage: `<container-id>

Where "<container-id>" is your name for the instance of the container that you
are starting. The name you provide for the container instance must be unique on
your host.`,
    Description: `The create command creates an instance of a container for a bundle. The bundle
is a directory with a specification file named "` + specConfig + `" and a root
filesystem.

The specification file includes an args parameter. The args parameter is used
to specify command(s) that get run when the container is started. To change the
command(s) that get executed on start, edit the args parameter of the spec. See
"runc spec --help" for more explanation.`,
    Flags: []cli.Flag{
        cli.StringFlag{
            Name:  "bundle, b",
            Value: "",
            Usage: `path to the root of the bundle directory, defaults to the current directory`,
        },
        cli.StringFlag{
            Name:  "console-socket",
            Value: "",
            Usage: "path to an AF_UNIX socket which will receive a file descriptor referencing the master end of the console's pseudoterminal",
        },
        cli.StringFlag{
            Name:  "pid-file",
            Value: "",
            Usage: "specify the file to write the process id to",
        },
        cli.BoolFlag{
            Name:  "no-pivot",
            Usage: "do not use pivot root to jail process inside rootfs.  This should be used whenever the rootfs is on top of a ramdisk",
        },
        cli.BoolFlag{
            Name:  "no-new-keyring",
            Usage: "do not create a new session keyring for the container.  This will cause the container to inherit the calling processes session key",
        },
        cli.IntFlag{
            Name:  "preserve-fds",
            Usage: "Pass N additional file descriptors to the container (stdio + $LISTEN_FDS + N in total)",
        },
    },
    Action: func(context *cli.Context) error {
        // create takes exactly one argument: the container ID.
        if err := checkArgs(context, 1, exactArgs); err != nil {
            return err
        }
        if err := revisePidFile(context); err != nil {
            return err
        }
        // Load config.json from the bundle.
        spec, err := setupSpec(context)
        if err != nil {
            return err
        }
        status, err := startContainer(context, spec, CT_ACT_CREATE, nil)
        if err != nil {
            return err
        }
        // exit with the container's exit status so any external supervisor is
        // notified of the exit with the correct exit status.
        os.Exit(status)
        return nil
    },
}









The action first calls spec, err := setupSpec(context) to load the contents of the configuration file config.json. This is the link to the OCI bundle described earlier.

        spec, err := setupSpec(context)
        if err != nil {
            return err
        }


// Spec is the base configuration for the container.
type Spec struct {
    // Version of the Open Container Runtime Specification with which the bundle complies.
    Version string `json:"ociVersion"`
    // Process configures the container process.
    Process *Process `json:"process,omitempty"`
    // Root configures the container's root filesystem.
    Root *Root `json:"root,omitempty"`
    // Hostname configures the container's hostname.
    Hostname string `json:"hostname,omitempty"`
    // Mounts configures additional mounts (on top of Root).
    Mounts []Mount `json:"mounts,omitempty"`
    // Hooks configures callbacks for container lifecycle events.
    Hooks *Hooks `json:"hooks,omitempty" platform:"linux,solaris"`
    // Annotations contains arbitrary metadata for the container.
    Annotations map[string]string `json:"annotations,omitempty"`

    // Linux is platform-specific configuration for Linux based containers.
    Linux *Linux `json:"linux,omitempty" platform:"linux"`
    // Solaris is platform-specific configuration for Solaris based containers.
    Solaris *Solaris `json:"solaris,omitempty" platform:"solaris"`
    // Windows is platform-specific configuration for Windows based containers.
    Windows *Windows `json:"windows,omitempty" platform:"windows"`
}
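
To make the mapping between config.json and the Spec struct concrete, here is a minimal sketch that decodes a small JSON document into a cut-down copy of the struct. The types miniSpec, miniProcess and miniRoot, the function decodeSpec and the sample document are all made up for illustration; only the JSON keys are taken from the struct tags shown in this article, plus the path key of the root object.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// miniSpec mirrors only a few fields of the Spec struct (illustrative copy).
type miniSpec struct {
	Version  string       `json:"ociVersion"`
	Process  *miniProcess `json:"process,omitempty"`
	Root     *miniRoot    `json:"root,omitempty"`
	Hostname string       `json:"hostname,omitempty"`
}

// miniProcess keeps just the command line and working directory.
type miniProcess struct {
	Args []string `json:"args"`
	Cwd  string   `json:"cwd"`
}

// miniRoot holds the path of the root filesystem inside the bundle.
type miniRoot struct {
	Path string `json:"path"`
}

// decodeSpec parses a JSON document into a miniSpec.
func decodeSpec(data string) (*miniSpec, error) {
	var s miniSpec
	if err := json.NewDecoder(strings.NewReader(data)).Decode(&s); err != nil {
		return nil, err
	}
	return &s, nil
}

// sample is an illustrative fragment, not a complete config.json.
const sample = `{
  "ociVersion": "1.0.0",
  "process": {"args": ["sh"], "cwd": "/"},
  "root": {"path": "rootfs"},
  "hostname": "runc"
}`

func main() {
	s, err := decodeSpec(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("command:", s.Process.Args, "rootfs:", s.Root.Path, "hostname:", s.Hostname)
}
```

Fields that are absent from the JSON stay at their zero value, which is why the optional sections of the real struct are pointers tagged with omitempty.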

Next, status, err := startContainer(context, spec, CT_ACT_CREATE, nil) is called to create the container. CT_ACT_CREATE indicates the create action; it is a value of the enumeration type CtAct.

type CtAct uint8

const (
    CT_ACT_CREATE CtAct = iota + 1
    CT_ACT_RUN
    CT_ACT_RESTORE
)

        status, err := startContainer(context, spec, CT_ACT_CREATE, nil)
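
The enumeration relies on Go's iota. A small standalone sketch of the same pattern follows; the type action and the names actCreate, actRun and actRestore are hypothetical stand-ins, not runc identifiers. Starting at iota + 1 means the zero value of the type is never a valid action, so an uninitialized value falls through to the default branch.

```go
package main

import "fmt"

// action is an illustrative enumeration in the style of CtAct.
type action uint8

const (
	actCreate  action = iota + 1 // 1
	actRun                       // 2
	actRestore                   // 3
)

// describe maps an action to a name; anything else is reported as unknown.
func describe(a action) string {
	switch a {
	case actCreate:
		return "create"
	case actRun:
		return "run"
	case actRestore:
		return "restore"
	default:
		return "unknown"
	}
}

func main() {
	// The trailing 0 is the zero value, which matches no constant.
	for _, a := range []action{actCreate, actRun, actRestore, 0} {
		fmt.Printf("%d -> %s\n", a, describe(a))
	}
}
```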


func startContainer(context *cli.Context, spec *specs.Spec, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
    id := context.Args().First()
    if id == "" {
        return -1, errEmptyID
    }

    notifySocket := newNotifySocket(context, os.Getenv("NOTIFY_SOCKET"), id)
    if notifySocket != nil {
        notifySocket.setupSpec(context, spec)
    }

    container, err := createContainer(context, id, spec)
    if err != nil {
        return -1, err
    }

    if notifySocket != nil {
        err := notifySocket.setupSocket()
        if err != nil {
            return -1, err
        }
    }

    // Support on-demand socket activation by passing file descriptors into the container init process.
    listenFDs := []*os.File{}
    if os.Getenv("LISTEN_FDS") != "" {
        listenFDs = activation.Files(false)
    }
    r := &runner{
        enableSubreaper: !context.Bool("no-subreaper"),
        shouldDestroy:   true,
        container:       container,
        listenFDs:       listenFDs,
        notifySocket:    notifySocket,
        consoleSocket:   context.String("console-socket"),
        detach:          context.Bool("detach"),
        pidFile:         context.String("pid-file"),
        preserveFDs:     context.Int("preserve-fds"),
        action:          action,
        criuOpts:        criuOpts,
        init:            true,
    }
    return r.run(spec.Process)
}

It first calls container, err := createContainer(context, id, spec) to create the container, and then fills in the runner struct r.

func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
    rootless, err := isRootless(context)
    if err != nil {
        return nil, err
    }
    // Convert the OCI spec into libcontainer's own configuration.
    config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
        CgroupName:       id,
        UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
        NoPivotRoot:      context.Bool("no-pivot"),
        NoNewKeyring:     context.Bool("no-new-keyring"),
        Spec:             spec,
        Rootless:         rootless,
    })
    if err != nil {
        return nil, err
    }

    factory, err := loadFactory(context)
    if err != nil {
        return nil, err
    }
    return factory.Create(id, config)
}

Note factory, err := loadFactory(context) and factory.Create(id, config): these correspond to the factory.go mentioned above. The factory creates the concrete container from the configuration config.
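
The factory idea can be sketched independently of libcontainer. Everything in the example below (the config, container and factory types, and the in-memory memFactory) is hypothetical and only mirrors the shape of the Create(id, config) call; the real factory also sets up state on disk, which this sketch does not attempt.

```go
package main

import (
	"errors"
	"fmt"
)

// config is a stand-in for the container configuration.
type config struct {
	Rootfs   string
	Hostname string
}

// container is the minimal view of a created container.
type container interface {
	ID() string
}

// factory creates containers from an ID and a configuration.
type factory interface {
	Create(id string, cfg *config) (container, error)
}

type memContainer struct {
	id  string
	cfg *config
}

func (c *memContainer) ID() string { return c.id }

// memFactory keeps created containers in memory and rejects duplicate IDs.
type memFactory struct {
	containers map[string]*memContainer
}

func newMemFactory() *memFactory {
	return &memFactory{containers: make(map[string]*memContainer)}
}

func (f *memFactory) Create(id string, cfg *config) (container, error) {
	if id == "" {
		return nil, errors.New("empty container id")
	}
	if _, ok := f.containers[id]; ok {
		return nil, fmt.Errorf("container %q already exists", id)
	}
	c := &memContainer{id: id, cfg: cfg}
	f.containers[id] = c
	return c, nil
}

func main() {
	var f factory = newMemFactory()
	c, err := f.Create("simplebusybox", &config{Rootfs: "rootfs", Hostname: "runc"})
	fmt.Println(c.ID(), err)
	// A second create with the same ID is refused.
	_, err = f.Create("simplebusybox", &config{Rootfs: "rootfs"})
	fmt.Println(err)
}
```

Hiding creation behind an interface is what lets the caller stay the same while the concrete container type varies by platform.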


// Process contains information to start a specific application inside the container.
type Process struct {
    // Terminal creates an interactive terminal for the container.
    Terminal bool `json:"terminal,omitempty"`
    // ConsoleSize specifies the size of the console.
    ConsoleSize *Box `json:"consoleSize,omitempty"`
    // User specifies user information for the process.
    User User `json:"user"`
    // Args specifies the binary and arguments for the application to execute.
    Args []string `json:"args"`
    // Env populates the process environment for the process.
    Env []string `json:"env,omitempty"`
    // Cwd is the current working directory for the process and must be
    // relative to the container's root.
    Cwd string `json:"cwd"`
    // Capabilities are Linux capabilities that are kept for the process.
    Capabilities *LinuxCapabilities `json:"capabilities,omitempty" platform:"linux"`
    // Rlimits specifies rlimit options to apply to the process.
    Rlimits []POSIXRlimit `json:"rlimits,omitempty" platform:"linux,solaris"`
    // NoNewPrivileges controls whether additional privileges could be gained by processes in the container.
    NoNewPrivileges bool `json:"noNewPrivileges,omitempty" platform:"linux"`
    // ApparmorProfile specifies the apparmor profile for the container.
    ApparmorProfile string `json:"apparmorProfile,omitempty" platform:"linux"`
    // Specify an oom_score_adj for the container.
    OOMScoreAdj *int `json:"oomScoreAdj,omitempty" platform:"linux"`
    // SelinuxLabel specifies the selinux context that the container process is run as.
    SelinuxLabel string `json:"selinuxLabel,omitempty" platform:"linux"`
}


process, err := newProcess(*config, r.init)

newProcess mainly fills in the libcontainer.Process struct, including the arguments, environment variables, user, working directory, capabilities, resource limits and so on.
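
The conversion can be pictured as copying fields from the spec-side description into a runtime-side struct. The sketch below is a simplified stand-in: the types specProcess, specUser and runtimeProcess and the function newRuntimeProcess are hypothetical, and the "uid:gid" string form for the user is an assumption made for this example.

```go
package main

import "fmt"

// specUser and specProcess stand in for the spec-side types.
type specUser struct {
	UID, GID uint32
}

type specProcess struct {
	Args []string
	Env  []string
	Cwd  string
	User specUser
}

// runtimeProcess stands in for the runtime-side process description.
type runtimeProcess struct {
	Args []string
	Env  []string
	Cwd  string
	User string // "uid:gid"
	Init bool   // true for the container's first process
}

// newRuntimeProcess copies the fields it understands and rejects an empty
// command line, since there would be nothing to execute.
func newRuntimeProcess(p specProcess, init bool) (*runtimeProcess, error) {
	if len(p.Args) == 0 {
		return nil, fmt.Errorf("process args must not be empty")
	}
	return &runtimeProcess{
		// Copy the slices so later changes to the spec do not leak in.
		Args: append([]string(nil), p.Args...),
		Env:  append([]string(nil), p.Env...),
		Cwd:  p.Cwd,
		User: fmt.Sprintf("%d:%d", p.User.UID, p.User.GID),
		Init: init,
	}, nil
}

func main() {
	p, err := newRuntimeProcess(specProcess{
		Args: []string{"sh"},
		Env:  []string{"PATH=/bin"},
		Cwd:  "/",
	}, true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *p)
}
```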

    switch r.action {
    case CT_ACT_CREATE:
        err = r.container.Start(process)
    case CT_ACT_RESTORE:
        err = r.container.Restore(process, r.criuOpts)
    case CT_ACT_RUN:
        err = r.container.Run(process)
    default:
        panic("Unknown action")
    }


func (c *linuxContainer) start(process *Process) error {
    parent, err := c.newParentProcess(process)
    if err != nil {
        return newSystemErrorWithCause(err, "creating new parent process")
    }
    if err := parent.start(); err != nil {
        // terminate the process to ensure that it properly is reaped.
        if err := ignoreTerminateErrors(parent.terminate()); err != nil {
            logrus.Warn(err)
        }
        return newSystemErrorWithCause(err, "starting container process")
    }
    // generate a timestamp indicating when the container was started
    c.created = time.Now().UTC()
    if process.Init {
        c.state = &createdState{
            c: c,
        }
        state, err := c.updateState(parent)
        if err != nil {
            return err
        }
        c.initProcessStartTime = state.InitProcessStartTime

        if c.config.Hooks != nil {
            bundle, annotations := utils.Annotations(c.config.Labels)
            s := configs.HookState{
                Version:     c.config.Version,
                ID:          c.id,
                Pid:         parent.pid(),
                Bundle:      bundle,
                Annotations: annotations,
            }
            // Run the poststart hooks; a failing hook aborts the start.
            for i, hook := range c.config.Hooks.Poststart {
                if err := hook.Run(s); err != nil {
                    if err := ignoreTerminateErrors(parent.terminate()); err != nil {
                        logrus.Warn(err)
                    }
                    return newSystemErrorWithCausef(err, "running poststart hook %d", i)
                }
            }
        }
    }
    return nil
}


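1. Create a pair of pipes, parentPipe and childPipe, as the communication channel between the runc start process and the init process inside the container.
2. Create a command template that is used to launch the parent process.
3. newInitProcess wraps this in an initProcess. Its main job is to add the init-type environment variable and to package the namespaces, uid/gid mappings and related information into an io.Reader using bootstrapData.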


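In more detail, newInitProcess adds the init-type environment variable and uses the bootstrapData function to package the namespaces, uid/gid mappings and related information into an io.Reader. The data is encoded with netlink, which is normally used for communication with the kernel. The function returns an initProcess struct.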

Finally, the method func (l *linuxStandardInit) Init() error is called; this is in the init_linux.go file mentioned above.

func (l *linuxStandardInit) Init() error {
    if !l.config.Config.NoNewKeyring {
        ringname, keepperms, newperms := l.getSessionRingParams()

        // Do not inherit the parent's session keyring.
        sessKeyId, err := keys.JoinSessionKeyring(ringname)
        if err != nil {
            return errors.Wrap(err, "join session keyring")
        }
        // Make session keyring searcheable.
        if err := keys.ModKeyringPerm(sessKeyId, keepperms, newperms); err != nil {
            return errors.Wrap(err, "mod keyring permissions")
        }
    }

    if err := setupNetwork(l.config); err != nil {
        return err
    }
    if err := setupRoute(l.config.Config); err != nil {
        return err
    }

    if err := prepareRootfs(l.pipe, l.config); err != nil {
        return err
    }
    // Set up the console. This has to be done *before* we finalize the rootfs,
    // but *after* we've given the user the chance to set up all of the mounts
    // they wanted.
    if l.config.CreateConsole {
        if err := setupConsole(l.consoleSocket, l.config, true); err != nil {
            return err
        }
        if err := system.Setctty(); err != nil {
            return errors.Wrap(err, "setctty")
        }
    }

    // Finish the rootfs setup.
    if l.config.Config.Namespaces.Contains(configs.NEWNS) {
        if err := finalizeRootfs(l.config.Config); err != nil {
            return err
        }
    }

    if hostname := l.config.Config.Hostname; hostname != "" {
        if err := unix.Sethostname([]byte(hostname)); err != nil {
            return errors.Wrap(err, "sethostname")
        }
    }
    if err := apparmor.ApplyProfile(l.config.AppArmorProfile); err != nil {
        return errors.Wrap(err, "apply apparmor profile")
    }
    if err := label.SetProcessLabel(l.config.ProcessLabel); err != nil {
        return errors.Wrap(err, "set process label")
    }

    for key, value := range l.config.Config.Sysctl {
        if err := writeSystemProperty(key, value); err != nil {
            return errors.Wrapf(err, "write sysctl key %s", key)
        }
    }
    for _, path := range l.config.Config.ReadonlyPaths {
        if err := readonlyPath(path); err != nil {
            return errors.Wrapf(err, "readonly path %s", path)
        }
    }
    for _, path := range l.config.Config.MaskPaths {
        if err := maskPath(path, l.config.Config.MountLabel); err != nil {
            return errors.Wrapf(err, "mask path %s", path)
        }
    }
    pdeath, err := system.GetParentDeathSignal()
    if err != nil {
        return errors.Wrap(err, "get pdeath signal")
    }
    if l.config.NoNewPrivileges {
        if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
            return errors.Wrap(err, "set nonewprivileges")
        }
    }
    // Tell our parent that we're ready to Execv. This must be done before the
    // Seccomp rules have been applied, because we need to be able to read and
    // write to a socket.
    if err := syncParentReady(l.pipe); err != nil {
        return errors.Wrap(err, "sync ready")
    }
    // Without NoNewPrivileges seccomp is a privileged operation, so we need to
    // do this before dropping capabilities; otherwise do it as late as possible
    // just before execve so as few syscalls take place after it as possible.
    if l.config.Config.Seccomp != nil && !l.config.NoNewPrivileges {
        if err := seccomp.InitSeccomp(l.config.Config.Seccomp); err != nil {
            return err
        }
    }
    if err := finalizeNamespace(l.config); err != nil {
        return err
    }
    // finalizeNamespace can change user/group which clears the parent death
    // signal, so we restore it here.
    if err := pdeath.Restore(); err != nil {
        return errors.Wrap(err, "restore pdeath signal")
    }
    // Compare the parent from the initial start of the init process and make
    // sure that it did not change.  if the parent changes that means it died
    // and we were reparented to something else so we should just kill ourself
    // and not cause problems for someone else.
    if unix.Getppid() != l.parentPid {
        return unix.Kill(unix.Getpid(), unix.SIGKILL)
    }
    // Check for the arg before waiting to make sure it exists and it is
    // returned as a create time error.
    name, err := exec.LookPath(l.config.Args[0])
    if err != nil {
        return err
    }
    // Close the pipe to signal that we have completed our init.
    l.pipe.Close()
    // Wait for the FIFO to be opened on the other side before exec-ing the
    // user process. We open it through /proc/self/fd/$fd, because the fd that
    // was given to us was an O_PATH fd to the fifo itself. Linux allows us to
    // re-open an O_PATH fd through /proc.
    fd, err := unix.Open(fmt.Sprintf("/proc/self/fd/%d", l.fifoFd), unix.O_WRONLY|unix.O_CLOEXEC, 0)
    if err != nil {
        return newSystemErrorWithCause(err, "open exec fifo")
    }
    if _, err := unix.Write(fd, []byte("0")); err != nil {
        return newSystemErrorWithCause(err, "write 0 exec fifo")
    }
    // Close the O_PATH fifofd fd before exec because the kernel resets
    // dumpable in the wrong order. This has been fixed in newer kernels, but
    // we keep this to ensure CVE-2016-9962 doesn't re-emerge on older kernels.
    // N.B. the core issue itself (passing dirfds to the host filesystem) has
    // since been resolved.
    // https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
    unix.Close(l.fifoFd)
    // Set seccomp as close to execve as possible, so as few syscalls take
    // place afterward (reducing the amount of syscalls that users need to
    // enable in their seccomp profiles).
    if l.config.Config.Seccomp != nil && l.config.NoNewPrivileges {
        if err := seccomp.InitSeccomp(l.config.Config.Seccomp); err != nil {
            return newSystemErrorWithCause(err, "init seccomp")
        }
    }
    if err := syscall.Exec(name, l.config.Args[0:], os.Environ()); err != nil {
        return newSystemErrorWithCause(err, "exec user process")
    }
    return nil
}

(1) The function first handles l.config.Config.NoNewKeyring and l.config.Console, then calls setupNetwork, setupRoute and label.Init().

(2) If l.config.Config.Namespaces.Contains(configs.NEWNS) is true, it calls setupRootfs(l.config.Config, console, l.pipe).

(3) It sets the hostname, calls apparmor.ApplyProfile(...) and label.SetProcessLabel(...), and applies the entries in l.config.Config.Sysctl.

(4) It calls remountReadonly(path) to remount the ReadonlyPaths, which the configuration file sets to /proc/asound, /proc/bus, /proc/fs and so on.

(5) It calls maskPath(path) for the maskedPaths, obtains pdeath := system.GetParentDeathSignal(), and handles l.config.NoNewPrivileges.

(6) It calls syncParentReady(l.pipe) to tell the parent process that the container is ready to Execv. From the parent's point of view, create is now complete.

(7) It handles l.config.Config.Seccomp and l.config.NoNewPrivileges, calls finalizeNamespace(l.config) and pdeath.Restore(), checks whether syscall.Getppid() equals l.parentPid, resolves name, err := exec.LookPath(l.config.Args[0]), and finally calls l.pipe.Close(). Init is then finished, and create is complete in the child process as well.

(8) fd, err := syscall.Openat(l.stateDirFD, execFifoFilename, os.O_WRONLY|syscall.O_CLOEXEC, 0) waits for the FIFO to be opened on the other side before exec'ing the user process; in effect it is waiting for the start command. It then writes one byte to fd for synchronization: syscall.Write(fd, []byte("0")).

(9) It calls syscall.Exec(name, l.config.Args[0:], os.Environ()) to execute the container command.






