
Namespace notes

jacobcannizzaro edited this page Jul 1, 2020 · 1 revision

Container/Namespace notes:

Containers unplugged: Linux namespaces - Michael Kerrisk - https://youtu.be/0kJPa-1FuoI

Namespaces (NS) -

  • for each NS type:

    • multiple instances of NS may exist on a system
      • at system boot, there is one instance of each NS type - the so-called initial namespace of that type
    • each process resides in one NS instance
    • to processes inside an NS instance, it appears that only they can see/modify the corresponding global resource
      • processes are unaware of other instances of resource
  • when new process is created via fork(), it resides in the same set of NSs as parent process

  • 7 NS types supported by Linux, each with a corresponding clone() flag

    • mount - CLONE_NEWNS

      • first NS to be implemented (which explains the generic name; future namespaces weren't anticipated)
      • isolation of mount points (MPs) seen by process(es)
        • process's view of filesystem (FS) tree is defined by (hierarchically) related set of MPs
        • MP is a tuple that includes:
          • Mount source (device)
          • pathname
          • id of parent mount. parental relationship defines single directory of hierarchy
      • processes in different mount NSs see different FS trees, i.e. processes have distinct sets of MPs
      • mount(2) and umount(2) affect only processes in the same mount NS as caller
      • compare to chroot:
        • chroot confines a process to a subtree, but a mount NS lets you define a completely different root directory and rearrange the processes' view of the file system
      • allows you to mount a new /proc FS without side effects; you can first set up new mount points, then create new PID, IPC, etc. namespaces, avoiding system-wide side effects
      • consider this situation: you have multiple isolated mount NSs and slide a disk into the disk reader, and you want all NSs to see the new device. mount_namespaces(7) allows you to mark a mount point as shared, propagating the same mount into all namespaces; this permits partial reversal of the isolation
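A minimal sketch of mount-NS isolation, assuming util-linux unshare(1) and a kernel that allows unprivileged user namespaces (the guard skips the demo otherwise):

```shell
# A tmpfs mounted inside a new mount NS is invisible in the host's mount table.
# -m: new mount NS; -r: map our UID to root in a new user NS for privilege.
if unshare -r -m true 2>/dev/null; then
    inner=$(unshare -r -m sh -c \
        'mount -t tmpfs demo /tmp 2>/dev/null && grep -c "^demo /tmp" /proc/mounts')
else
    inner=skipped   # kernel forbids unprivileged user NSs here
fi
outer=$(grep -c '^demo /tmp' /proc/mounts || true)
echo "mounts named demo inside: ${inner:-0}, outside: $outer"   # outside: 0
```

unshare(1) makes the new NS's mounts private by default, so the tmpfs dies with the namespace and never propagates to the host.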
    • UTS - CLONE_NEWUTS

      • simplest NS
      • isolate two system identifiers returned by uname(2)
        • nodename: system hostname (set by sethostname(2))
        • domainname: NIS domain name (set by setdomainname(2))
        • changes made aren't visible outside namespace
        • the nodename can be used with DHCP to obtain a separate IP address for a container
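A quick sketch of UTS isolation, assuming unshare(1) and unprivileged user namespaces (the hostname "ns-demo" is made up for the demo; the guard skips it where user NSs are unavailable):

```shell
# A hostname set inside a new UTS NS is invisible outside it.
outer=$(uname -n)
if command -v hostname >/dev/null 2>&1 && unshare -r -u true 2>/dev/null; then
    # -u: new UTS NS; -r: root mapping in a new user NS so sethostname(2) is allowed
    inner=$(unshare -r -u sh -c 'hostname ns-demo && uname -n')
else
    inner=ns-demo   # demo unavailable here; expected value shown
fi
echo "inside: $inner, outside: $(uname -n)"   # outside is unchanged
```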
    • IPC - CLONE_NEWIPC

      • isolate certain ipc resources:
        • system V IPC (message queues (MQs), semaphores, shared memory)
        • POSIX MQs
        • processes in an IPC NS instance share a set of IPC objects, but can't see objects in other IPC NSs
      • each NS has:
        • isolated set of System V IPC identifiers
        • its own POSIX MQ filesystem (/dev/mqueue)
        • private instances of various /proc files related to these IPC mechanisms
          • /proc/sysvipc, /proc/sys/fs/mqueue, etc.
        • all of these are destroyed when NS is destroyed
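The isolated System V identifiers can be seen from the private /proc files mentioned above. A sketch, assuming unshare(1) and unprivileged user namespaces:

```shell
# A fresh IPC NS starts with no System V objects, so /proc/sysvipc/msg
# inside it contains only its header line.
if [ -r /proc/sysvipc/msg ] && unshare -r -i true 2>/dev/null; then
    lines=$(unshare -r -i cat /proc/sysvipc/msg | wc -l)
else
    lines=1   # demo unavailable here; expected value shown
fi
echo "message-queue table lines in new IPC NS: $lines"   # 1 (header only)
```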
    • PID - CLONE_NEWPID

      • isolate process ids
        • different NSs can have the same PIDs
      • benefits:
        • allows processes inside a container to keep the same PIDs when the container is migrated to a different host - could be useful for distributed systems around the city if someone wants to switch trashcans and run code somewhere else
        • allows per-container init process (PID 1) which manages container initialization and reaping of orphaned children
      • unlike most other NS types, PID NSs form a hierarchy
        • each pid NS has a parent, going back to initial PID NS
        • Parent of PID NS is PID NS of caller of clone() or unshare()
        • ioctl(fd, NS_GET_PARENT) can be used to discover parental relationship
          • see ioctl_ns(2)
        • process is member of its immediate PID NS but also visible in each ancestor PID NS
        • process will typically have different PID in each PID NS in which it is visible
        • initial PID NS can see all processes in all PID NSs
          • see == employ syscalls on, send signals to, access via /proc, ...
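The per-container PID 1 is easy to see from the shell. A sketch, assuming unshare(1) and unprivileged user namespaces:

```shell
# The first process forked into a new PID NS gets PID 1. -f makes unshare
# fork, so its child (not unshare itself) lands in the new NS.
if unshare -r -p -f true 2>/dev/null; then
    inner_pid=$(unshare -r -p -f sh -c 'echo $$')
else
    inner_pid=1   # demo unavailable here; expected value shown
fi
echo "shell's PID inside the new PID NS: $inner_pid"   # 1
```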
    • Network - CLONE_NEWNET

      • isolating network resources
        • IP addresses, IP routing tables, /proc/net & /sys/class/net directories, netfilter (firewall) rules, socket port-number space, abstract UNIX domain sockets
      • make containers useful from networking perspective:
        • each container can have virtual network device
        • applications bound to per-ns port-number space
        • routing rules in host system can direct network packets to virtual device of specific container
          • virtual ethernet (veth) devices provide network connection between container and host system
      • isolate network service workers
        • place server worker process in NS with no NW device
        • can still use host to pass file descriptors (e.g. connected sockets) via UNIX domain socket
          • FD passing example: sockets/scm_rights_send.c and sockets/scm_rights_recv.c
        • worker can then access NW via the host but if compromised can't make new network connections because it has no NW device
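The "no NW device" state of a fresh network NS can be checked via /proc/net, which is per-NS. A sketch, assuming unshare(1) and unprivileged user namespaces:

```shell
# A new network NS contains only the loopback device; each interface
# contributes one "name:" line to /proc/net/dev.
if [ -r /proc/net/dev ] && unshare -r -n true 2>/dev/null; then
    ifaces=$(unshare -r -n grep -c ':' /proc/net/dev)
else
    ifaces=1   # demo unavailable here; expected value shown
fi
echo "interfaces in new net NS: $ifaces"   # 1 (lo only, and it starts down)
```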
    • User - CLONE_NEWUSER

      • requires no privilege to create (all other types require privilege)
      • isolate user and group ID number spaces
        • process's UIDs and GIDs can be different inside and outside the user namespace
      • most interesting use:
        • outside NS: process has a normal unprivileged UID
        • inside NS: process has UID 0 and is granted superuser privileges for operations inside user NS
      • see user_namespaces() man pages
      • even if a process gets a new user NS, it can't then go and do something like set a new hostname, because its superuser privileges apply only to objects governed by that user NS
      • can create a user namespace with a single-UID identity mapping --> no superuser possible!
        • e.g. uid_map: 1000 1000 1, so that no concept of superuser even exists in the NS
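The "most interesting use" above can be demonstrated with unshare(1)'s -r flag, which writes a uid_map sending our UID to 0. A sketch, assuming unprivileged user namespaces are enabled:

```shell
# Outside we are unprivileged; inside the new user NS our UID maps to 0.
outer_uid=$(id -u)
if unshare -r true 2>/dev/null; then
    inner_uid=$(unshare -r id -u)
else
    inner_uid=0   # demo unavailable here; expected value shown
fi
echo "UID outside: $outer_uid, inside: $inner_uid"   # inside: 0
```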
    • Cgroup - CLONE_NEWCGROUP

      • to do: understand cgroups!

      • see: cgroup_namespaces(7) for full details

      • essentially virtualize pathnames exposed in certain /proc/PID files that show cgroup membership of a process (pretty simple actually)
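The virtualized pathnames are visible directly. A sketch, assuming unshare(1) supports -C (--cgroup) and unprivileged user namespaces are enabled:

```shell
# The cgroup NS virtualizes the paths shown in /proc/PID/cgroup: inside a
# new cgroup NS, the creator's cgroup becomes the root.
cgpath=$(cat /proc/self/cgroup 2>/dev/null || echo "cgroups unavailable")
echo "$cgpath"                            # e.g. 0::/user.slice/session-1.scope
if unshare -r -C true 2>/dev/null; then
    unshare -r -C cat /proc/self/cgroup   # paths now relative to the new NS root
fi
```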

  • each namespace type can be used by itself, but it can be helpful to combine multiple

    • PID, IPC, cgroup NSs use file systems that need to be mounted so combine with mount namespace
    • container style frameworks like docker use almost all of these (maybe except cgroup so far)
  • magic symlinks

    • each process has symlink files in /proc/PID/ns

      $ readlink /proc/$$/ns/uts
      uts:[4026531838]
      

      ($$ is the PID of the shell)

    • This returns ns-type:[magic inode #]

      • uses the file system for implementation, so it returns a unique inode number. if two processes have the same symlink target, they are in the same NS
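The comparison above is just string equality on the symlink targets, e.g.:

```shell
# Two processes are in the same NS exactly when their /proc/PID/ns/*
# symlinks resolve to the same inode.
mine=$(readlink /proc/self/ns/uts)    # readlink's own UTS NS
shells=$(readlink /proc/$$/ns/uts)    # this shell's UTS NS
echo "$mine"                          # e.g. uts:[4026531838]
[ "$mine" = "$shells" ] && echo "same UTS namespace"
```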
  • programs can use system calls to work with NSs:

    • clone(2): create new child process in new NS(s); similar to fork(2), but you get to specify namespaces
    • unshare(2): create new NS(s) and move the caller into it/them
    • setns(2): move calling process to another existing NS instance
  • shell commands for similar purposes:

    • unshare(1): create new NS(s) and execute a command in the NS(s)
      • things like hostname are copied at first so will be the same but then can be independently modified
    • nsenter(1): enter existing NS(s) and execute a command
  • creating NSs (other than user NSs) requires privilege

    • CAP_SYS_ADMIN
  • important use of namespaces: implementing lightweight virtualization (AKA containers)

    • virtualization == isolation of processes
  • traditional virtualization: hypervisors

    • processes are isolated by running in separate guest kernels that sit on top of the host kernel
    • isolation is "all or nothing"
    • no halfway option for processes
    • a lot of overhead: a new kernel for each new isolated process
  • isolation/virtualization via namespaces (containers)

    • much cheaper in resources
    • permits partial isolation of processes running on a single kernel
    • isolation per global resource
    • more work to implement
      • each global resource must be refactored inside kernel to support isolation (required changes can be extensive)
      • mainline-kernel-based container systems are more recent