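I've recently been implementing an OCI runtime in Rust. I ran into a lot of problems right from the start, and that is where I began to understand what systems programming really is.

So what is systems programming?

To borrow Kamal Marhubi's answer: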

Systems programming is programming where you spend more time reading man pages than reading the internet.

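I really have been reading a lot of man pages lately, because the things I want to know can't be found by Googling.

My initial plan was simple: create a user namespace, set the appropriate flags, mount the devices and filesystems, then put the PID under cgroup v2, and done. I didn't expect to get stuck at the very first step.

I started from the user namespace example and then began reworking its logic in Rust.

The flow for creating a new user namespace: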

  • unshare + all flags
  • update uid_map|gid_map
  • mount proc
  • mount / MS_PRIVATE
  • mount rootfs
  • mkdir rootfs/oldrootfs
  • pivot_root
  • chdir /
  • umount2 oldrootfs
  • rmdir oldrootfs
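
The mount-related steps above (from "mount / MS_PRIVATE" onward) could be sketched roughly as follows. This is only a sketch under several assumptions, not the runtime's actual code: it targets Linux on x86_64 or aarch64, declares the libc symbols by hand instead of using a crate, treats "mount rootfs" as a recursive bind mount of the directory onto itself, and adds MS_REC to the MS_PRIVATE step. The helper names (enter_rootfs, old_root_dir, cstr, check) are made up for illustration.

```rust
use std::env;
use std::ffi::CString;
use std::fs;
use std::io;
use std::os::raw::{c_char, c_int, c_long, c_ulong, c_void};
use std::os::unix::ffi::OsStrExt;
use std::path::{Path, PathBuf};
use std::ptr;

// Flag values from the Linux <sys/mount.h> header.
const MS_BIND: c_ulong = 4096;
const MS_REC: c_ulong = 16384;
const MS_PRIVATE: c_ulong = 1 << 18;
const MNT_DETACH: c_int = 2;

// pivot_root has no libc wrapper, so it goes through syscall(2).
// Only these two architectures are covered by this sketch.
#[cfg(target_arch = "x86_64")]
const SYS_PIVOT_ROOT: c_long = 155;
#[cfg(target_arch = "aarch64")]
const SYS_PIVOT_ROOT: c_long = 41;

unsafe extern "C" {
    fn mount(
        src: *const c_char,
        target: *const c_char,
        fstype: *const c_char,
        flags: c_ulong,
        data: *const c_void,
    ) -> c_int;
    fn umount2(target: *const c_char, flags: c_int) -> c_int;
    fn syscall(num: c_long, ...) -> c_long;
}

/// Convert a path into a NUL-terminated C string.
fn cstr(p: &Path) -> io::Result<CString> {
    CString::new(p.as_os_str().as_bytes())
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))
}

/// Turn a negative return value into the current errno.
fn check(ret: c_long) -> io::Result<()> {
    if ret < 0 { Err(io::Error::last_os_error()) } else { Ok(()) }
}

/// Directory under the new root where the old root gets parked.
fn old_root_dir(rootfs: &Path) -> PathBuf {
    rootfs.join("oldrootfs")
}

/// Hypothetical helper: run inside the new mount namespace.
fn enter_rootfs(rootfs: &Path) -> io::Result<()> {
    let root = cstr(Path::new("/"))?;
    let new_root = cstr(rootfs)?;
    let old_root = old_root_dir(rootfs);
    let old_root_c = cstr(&old_root)?;
    let parked = cstr(Path::new("/oldrootfs"))?;
    unsafe {
        // mount / MS_PRIVATE: stop mount events from leaking to the host.
        let flags = MS_REC | MS_PRIVATE;
        check(mount(ptr::null(), root.as_ptr(), ptr::null(), flags, ptr::null()) as c_long)?;
        // mount rootfs: bind it onto itself so that it is a mount point.
        let flags = MS_BIND | MS_REC;
        check(mount(new_root.as_ptr(), new_root.as_ptr(), ptr::null(), flags, ptr::null()) as c_long)?;
    }
    // mkdir rootfs/oldrootfs
    fs::create_dir_all(&old_root)?;
    // pivot_root, then chdir /
    unsafe { check(syscall(SYS_PIVOT_ROOT, new_root.as_ptr(), old_root_c.as_ptr()))? };
    env::set_current_dir("/")?;
    // umount2 oldrootfs, then rmdir oldrootfs
    unsafe { check(umount2(parked.as_ptr(), MNT_DETACH) as c_long)? };
    fs::remove_dir("/oldrootfs")
}
```

Something like enter_rootfs(Path::new("/path/to/rootfs")) would be called from the child after the namespaces and ID maps are in place; in a real project the libc or nix crate would probably replace the hand-written declarations.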

The first problem I hit was that I couldn't modify gid_map.

This one is simple:

Linux 3.19 made a change in the handling of setgroups(2) and the ‘gid_map’ file to address a security issue. The issue allowed unprivileged users to employ user namespaces in order to drop groups. The upshot of the 3.19 changes is that in order to update the ‘gid_maps’ file, use of the setgroups() system call in this user namespace must first be disabled by writing “deny” to one of the /proc/PID/setgroups files for this namespace. That is the purpose of the following function.

Writing "deny" to setgroups before modifying gid_map is all it takes.
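
In Rust, the ordering might look like the minimal sketch below. It assumes a single-ID mapping to root inside the namespace, and the function names (id_map_line, write_id_maps) are hypothetical, not taken from the example program.

```rust
use std::fs;
use std::io;

/// Format one line of a uid_map/gid_map file:
/// "<first ID inside the namespace> <first ID outside> <length of the range>".
fn id_map_line(inside: u32, outside: u32, count: u32) -> String {
    format!("{} {} {}\n", inside, outside, count)
}

/// Map `outside_uid`/`outside_gid` to 0 (root) inside the user namespace
/// that the process `pid` lives in.
fn write_id_maps(pid: u32, outside_uid: u32, outside_gid: u32) -> io::Result<()> {
    let proc_dir = format!("/proc/{}", pid);
    fs::write(format!("{}/uid_map", proc_dir), id_map_line(0, outside_uid, 1))?;
    // Since Linux 3.19, setgroups must be denied first; otherwise an
    // unprivileged write to gid_map is rejected.
    fs::write(format!("{}/setgroups", proc_dir), "deny")?;
    fs::write(format!("{}/gid_map", proc_dir), id_map_line(0, outside_gid, 1))?;
    Ok(())
}
```

The parent would call something like write_id_maps(child_pid, uid, gid) once the child has entered its new user namespace. Each map file can only be written once per namespace, so the whole mapping has to go out in a single write.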

The second problem was that proc can't be mounted directly.

It's easy to reproduce on the command line:

$ unshare -r --pid --mount-proc readlink /proc/self bash
#unshare: mount /proc failed: Operation not permitted

But with --fork it works:

$ unshare -r --fork --pid --mount-proc readlink /proc/self bash
#1

Here you can see that inside the newly created namespace, PIDs start from 1.

From the man page:

-f, --fork Fork the specified program as a child process of unshare rather than running it directly. This is useful when creating a new PID namespace.

At first, in userns_exec_child.c, I tried adding a mount call just before the execv:

if (mount("proc", "/proc", "proc", MS_REC|MS_BIND, NULL) < 0) {
  fprintf(stderr, "ERROR: %s\n",
      strerror(errno));
  exit(EXIT_FAILURE);
}

The result was the same: permission denied. After some Googling I found a Red Hat Bugzilla entry. Look at the date, quite something: 2016-10-31.

Karel Zak explains:

IMHO require CLONE_NEWPID to unshare also the /proc makes sense. And I guess that CLONE_NEWPID requires fork() to create a “init” process in the new pid namespace – without fork() the new /proc will not contain any process.

$ unshare --user --pid --fork --mount-proc

seems correct.

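So mounting proc here depends on fork, and the dependency comes from the new PID namespace rather than from the user namespace itself. Before the fork, the only process around is the current one, and how would its PID be changed to 1? It can't be: a PID is effectively immutable for the lifetime of a process, so unshare with CLONE_NEWPID does not move the caller into the new PID namespace. Only a newly forked child enters it, and the first such child becomes PID 1, the init of that namespace.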
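
The effect can be observed from Rust with a sketch like the one below. It is only an illustration under assumptions: Linux, unprivileged user namespaces enabled, a single-threaded caller (unshare with CLONE_NEWUSER fails in a multithreaded process), and hand-declared libc symbols rather than a crate. The helper names (pid_seen_by_first_child, exit_code) are hypothetical, and passing the PID back through the exit status is just a shortcut for the demo.

```rust
use std::io;
use std::os::raw::c_int;

// Flag values from the Linux <sched.h> header.
const CLONE_NEWUSER: c_int = 0x1000_0000;
const CLONE_NEWPID: c_int = 0x2000_0000;

unsafe extern "C" {
    fn unshare(flags: c_int) -> c_int;
    fn fork() -> c_int;
    fn getpid() -> c_int;
    fn waitpid(pid: c_int, status: *mut c_int, options: c_int) -> c_int;
    fn _exit(status: c_int) -> !;
}

/// Extract the exit code from a wait status (what WEXITSTATUS does in C).
fn exit_code(status: c_int) -> c_int {
    (status >> 8) & 0xff
}

/// Unshare the user and PID namespaces, fork once, and report the PID
/// that the child saw for itself.
fn pid_seen_by_first_child() -> io::Result<c_int> {
    unsafe {
        if unshare(CLONE_NEWUSER | CLONE_NEWPID) < 0 {
            return Err(io::Error::last_os_error());
        }
        // The caller keeps its old PID and stays in its old PID namespace.
        match fork() {
            -1 => Err(io::Error::last_os_error()),
            0 => {
                // First child after the unshare: it runs inside the new PID
                // namespace. Hand its own PID back via the exit status.
                _exit(getpid())
            }
            child => {
                let mut status: c_int = 0;
                if waitpid(child, &mut status, 0) < 0 {
                    return Err(io::Error::last_os_error());
                }
                Ok(exit_code(status))
            }
        }
    }
}
```

Calling pid_seen_by_first_child() from main should show the child reporting itself as the init of the new namespace, and that child is the place where the proc mount belongs.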

One more piece of information:

The --mount-proc has been designed for CLONE_NEWPID