-
Namespace notes
Containers unplugged: Linux namespaces - Michael Kerrisk - https://youtu.be/0kJPa-1FuoI
-
for each NS type:
- multiple instances of NS may exist on a system
- at system boot, there is one instance of each NS type - the so-called initial namespace of that type
- each process resides in one NS instance
- to the processes inside an NS instance, it appears that only they can see/modify the corresponding global resource
- processes are unaware of other instances of resource
-
when new process is created via fork(), it resides in the same set of NSs as parent process
-
7 NS types are supported by Linux, each with a corresponding clone() flag
-
mount - CLONE_NEWNS
- first NS type to be implemented (which explains the generic name CLONE_NEWNS: at the time, future namespaces weren't anticipated)
- isolation of mount points (MPs) seen by process(es)
- process's view of filesystem (FS) tree is defined by (hierarchically) related set of MPs
- MP is a tuple that includes:
- Mount source (device)
- pathname
- ID of parent mount; the parental relationship defines a single directed hierarchy (tree) of mounts
- processes in different mount NSs see different FS trees, i.e. processes have distinct sets of MPs
- mount(2) and umount(2) affect only processes in the same mount NS as caller
- compare to chroot:
- chroot only confines a process to a subtree of the existing FS tree, but a mount NS lets you give processes a completely different set of mounts (including the root) and rearrange their view of the file system
- allows you to mount a new /proc FS without system-wide side effects; this lets you set up the needed mounts when creating new PID, IPC, etc. namespaces while avoiding side effects for the rest of the system
- consider this situation: you have multiple isolated mount NSs and insert a disk into a drive; you want all NSs to see the resulting mount. mount_namespaces(7) shared subtrees allow a mount to be propagated into the other namespaces - a partial reversal of the isolation (see the sketch below)
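A minimal sketch of mount-NS isolation (not from the talk; it assumes CAP_SYS_ADMIN and that /mnt exists): unshare the mount NS, mark all mounts private so nothing propagates back, then add a tmpfs mount that only this process and its children can see.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    /* Needs CAP_SYS_ADMIN (run as root, or combine with a new user NS). */
    if (unshare(CLONE_NEWNS) == -1) { perror("unshare"); exit(EXIT_FAILURE); }

    /* Many distros make "/" a shared mount, so first mark everything
       private to keep the mount below from propagating back out. */
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
        perror("mount MS_PRIVATE"); exit(EXIT_FAILURE);
    }

    /* Visible only in this mount NS; the rest of the system still sees
       the original /mnt (assumed to exist). */
    if (mount("tmpfs", "/mnt", "tmpfs", 0, NULL) == -1) {
        perror("mount tmpfs"); exit(EXIT_FAILURE);
    }

    /* Show the private mount from inside the NS. */
    execlp("grep", "grep", "/mnt", "/proc/self/mounts", (char *) NULL);
    perror("execlp");
    exit(EXIT_FAILURE);
}
```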
-
UTS - CLONE_NEWUTS
- simplest NS
- isolate two system identifiers returned by uname(2)
- nodename: system hostname (set by sethostname(2))
- domainname: NIS domain name (set by setdomainname(2))
- changes made aren't visible outside namespace
- the per-container nodename (hostname) can be used with DHCP to obtain a separate IP address for the container
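A minimal sketch (assumes CAP_SYS_ADMIN or a new user NS; the hostname string is arbitrary): unshare the UTS NS and change the nodename without affecting the rest of the system.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

int main(void)
{
    struct utsname uts;

    /* Needs CAP_SYS_ADMIN (run as root, or combine with CLONE_NEWUSER). */
    if (unshare(CLONE_NEWUTS) == -1) { perror("unshare"); exit(EXIT_FAILURE); }

    /* The new hostname is visible only inside this UTS NS; processes in
       the initial NS still see the old one. */
    if (sethostname("inside-ns", strlen("inside-ns")) == -1) {
        perror("sethostname"); exit(EXIT_FAILURE);
    }

    if (uname(&uts) == -1) { perror("uname"); exit(EXIT_FAILURE); }
    printf("nodename inside new UTS NS: %s\n", uts.nodename);
    return 0;
}
```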
-
IPC - CLONE_NEWIPC
- isolate certain ipc resources:
- system V IPC (message queues (MQs), semaphores, shared memory)
- POSIX MQs
- processes in an IPC NS instance share a set of IPC objects, but can't see objects in other IPC NSs
- each NS has:
- isolated set of System V IPC identifiers
- its own POSIX MQ filesystem (/dev/mqueue)
- private instances of various /proc files related to these IPC mechanisms
- /proc/sysvipc, /proc/sys/fs/mqueue, etc.
- all of these are destroyed when NS is destroyed
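A minimal sketch of the isolation (assumes CAP_SYS_ADMIN; the key value is arbitrary): a System V message queue created before unshare(CLONE_NEWIPC) is no longer visible afterwards.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/msg.h>

int main(void)
{
    key_t key = 0x1234;                      /* arbitrary key for the demo */

    /* Create a System V message queue in the current (initial) IPC NS. */
    int id = msgget(key, IPC_CREAT | 0600);
    if (id == -1) { perror("msgget (create)"); exit(EXIT_FAILURE); }
    printf("before unshare: queue id = %d\n", id);

    /* Needs CAP_SYS_ADMIN (run as root, or combine with CLONE_NEWUSER). */
    if (unshare(CLONE_NEWIPC) == -1) { perror("unshare"); exit(EXIT_FAILURE); }

    /* Same key, but the new IPC NS has its own identifier space, so the
       lookup fails: the queue created above is not visible here. */
    if (msgget(key, 0600) == -1)
        perror("msgget in new IPC NS (expected ENOENT)");

    /* The queue still exists in the original NS; remove it with ipcrm(1). */
    return 0;
}
```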
-
PID - CLONE_NEWPID
- isolate process ids
- processes in different PID NSs can have the same PID
- benefits:
- allows processes inside a container to keep the same PIDs when the container is migrated to a different host - could be useful for distributed systems around the city if someone wants to switch trashcans and run the code somewhere else
- allows per-container init process (PID 1) which manages container initialization and reaping of orphaned children
- unlike most other NS types, PID NSs form a hierarchy
- each pid NS has a parent, going back to initial PID NS
- Parent of PID NS is PID NS of caller of clone() or unshare()
- ioctl(fd, NS_GET_PARENT) can be used to discover parental relationship
- see ioctl_ns(2)
- process is a member of its immediate PID NS but is also visible in each ancestor PID NS
- process will typically have different PID in each PID NS in which it is visible
- the initial PID NS can see all processes in all PID NSs
- see == employ syscalls on, send signals to, access via /proc, ...
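A minimal sketch of creating a child in a new PID NS with clone() (assumes CAP_SYS_ADMIN, or add CLONE_NEWUSER to the flags): the child sees itself as PID 1, while the parent sees an ordinary PID for the same process.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int child(void *arg)
{
    /* Inside the new PID NS this process is PID 1 (the per-NS "init"). */
    printf("child: getpid() inside new PID NS = %ld\n", (long) getpid());
    return 0;
}

static char stack[1024 * 1024];              /* child's stack */

int main(void)
{
    /* Needs CAP_SYS_ADMIN (or combine CLONE_NEWUSER into the flags). */
    pid_t pid = clone(child, stack + sizeof(stack),
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); exit(EXIT_FAILURE); }

    /* In the parent's (initial) PID NS the same child has an ordinary PID. */
    printf("parent: child's PID in my PID NS = %ld\n", (long) pid);
    waitpid(pid, NULL, 0);
    return 0;
}
```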
-
Network - CLONE_NEWNET
- isolating network resources
- IP addresses, IP routing tables, /proc/net & /sys/class/net directories, netfilter (firewall) rules, socket port-number space, abstract UNIX domain sockets
- make containers useful from networking perspective:
- each container can have virtual network device
- applications bound to per-ns port-number space
- routing rules in host system can direct network packets to virtual device of specific container
- virtual ethernet (veth) devices provide network connection between container and host system
- isolate network service workers
- place server worker process in NS with no NW device
- can still use host to pass file descriptors (e.g. connected sockets) via UNIX domain socket
- FD passing example: sockets/scm_rights_send.c and sockets/scm_rights_recv.c
- worker can then access the NW via the host (through the passed FDs), but if compromised it can't make new network connections because it has no NW device
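A minimal sketch of the isolation (assumes CAP_SYS_ADMIN or a new user NS): after unshare(CLONE_NEWNET) the process sees only the new NS's own devices, typically just a (down) loopback interface.

```c
#define _GNU_SOURCE
#include <net/if.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Needs CAP_SYS_ADMIN (run as root, or combine with CLONE_NEWUSER). */
    if (unshare(CLONE_NEWNET) == -1) { perror("unshare"); exit(EXIT_FAILURE); }

    /* if_nameindex() enumerates interfaces via a socket created in the
       caller's (new) network NS, so only that NS's devices appear --
       typically just "lo". */
    struct if_nameindex *ifs = if_nameindex();
    if (ifs == NULL) { perror("if_nameindex"); exit(EXIT_FAILURE); }

    for (struct if_nameindex *p = ifs; p->if_index != 0; p++)
        printf("%u: %s\n", p->if_index, p->if_name);

    if_freenameindex(ifs);
    return 0;
}
```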
-
User - CLONE_NEWUSER
- requires no privilege to create (all others require privilege)
- isolate user and group ID number spaces
- process's UIDs and GIDs can be different inside and outside the user namespace
- most interesting use:
- outside NS: process has a normal unprivileged UID
- inside NS: process has UID 0 and is granted superuser privileges for operations inside the user NS
- see the user_namespaces(7) man page
- even if a process gains UID 0 in a new user NS, it can't then go and do something like set a new hostname: its superuser privileges apply only to objects governed by that user NS (the hostname is governed by a UTS NS, so it would also need to create/own a new UTS NS)
- can create a user NS with a single identity UID mapping --> no superuser possible!
- e.g.
uid_map: 1000 1000 1
so that no concept of superuser even exists in the NS (see the sketch below)
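A minimal sketch of that identity mapping (write_file is a hypothetical helper; mapping "0 <uid> 1" instead would give the process UID 0 inside the NS):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write a string to a /proc file, exiting on error. */
static void write_file(const char *path, const char *s)
{
    int fd = open(path, O_WRONLY);
    if (fd == -1 || write(fd, s, strlen(s)) == -1) { perror(path); exit(EXIT_FAILURE); }
    close(fd);
}

int main(void)
{
    uid_t uid = geteuid();               /* UID/GID as seen outside the new NS */
    gid_t gid = getegid();
    char map[64];

    /* No privilege needed: any process may create a new user NS. */
    if (unshare(CLONE_NEWUSER) == -1) { perror("unshare"); exit(EXIT_FAILURE); }

    /* Identity-map only our own UID/GID ("1000 1000 1" style), so no UID 0
       exists inside the NS at all. */
    write_file("/proc/self/setgroups", "deny");     /* required before gid_map */
    snprintf(map, sizeof(map), "%d %d 1", (int) gid, (int) gid);
    write_file("/proc/self/gid_map", map);
    snprintf(map, sizeof(map), "%d %d 1", (int) uid, (int) uid);
    write_file("/proc/self/uid_map", map);

    printf("inside user NS: euid=%d egid=%d\n", (int) geteuid(), (int) getegid());
    return 0;
}
```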
-
Cgroup - CLONE_NEWCGROUP
-
to do: understand cgroups!
-
see: cgroup_namespaces(7) for full details
-
essentially virtualizes the cgroup pathnames exposed in certain /proc/PID files (e.g. /proc/PID/cgroup) that show the cgroup membership of a process (pretty simple actually)
-
namespace types can be used on their own, but it is often helpful to combine several
- PID, IPC, and cgroup NSs have associated filesystems (/proc, /dev/mqueue, cgroupfs) that need to be (re)mounted, so they are usually combined with a mount NS
- container-style frameworks like Docker use almost all of these (maybe except cgroup NSs so far)
-
magic symlinks
-
each process has symlink files in /proc/PID/ns
$ readlink /proc/$$/ns/uts
uts:[4026531838]
($$ is the PID of the shell)
-
the symlink target has the form ns-type:[magic inode #]
- the implementation uses a filesystem, so each NS instance gets a unique inode number; if two processes have the same symlink target, they are in the same NS (see the sketch below)
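A small sketch of that comparison (hypothetical utility, pass two PIDs): stat() the magic symlinks and compare device plus inode numbers, which namespaces(7) documents as identifying a NS instance.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Usage (hypothetical): ./same_uts_ns <pid1> <pid2> */
int main(int argc, char *argv[])
{
    char path1[64], path2[64];
    struct stat st1, st2;

    if (argc != 3) {
        fprintf(stderr, "usage: %s pid1 pid2\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    snprintf(path1, sizeof(path1), "/proc/%s/ns/uts", argv[1]);
    snprintf(path2, sizeof(path2), "/proc/%s/ns/uts", argv[2]);

    /* stat() follows the magic symlink; the device + inode pair uniquely
       identifies the NS instance (the same number readlink shows). */
    if (stat(path1, &st1) == -1 || stat(path2, &st2) == -1) {
        perror("stat"); exit(EXIT_FAILURE);
    }
    printf("same UTS NS? %s\n",
           (st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino) ? "yes" : "no");
    return 0;
}
```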
-
programs can use system calls to work with NSs:
- clone(2): create a new child process in new NS(s); similar to fork(), but you can specify NS flags
- unshare(2): create new NS(s) and move the caller into it/them
- setns(2): move calling process to another existing NS instance
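A minimal sketch of setns(2) (a hypothetical join-and-exec utility, similar in spirit to nsenter(1); needs privilege in the user NS owning the target NS): open a /proc/PID/ns/* file, join that NS, then exec a command inside it.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Usage (hypothetical): ./join_ns /proc/<PID>/ns/<file> <command> [args...] */
int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s /proc/PID/ns/FILE cmd [args...]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    int fd = open(argv[1], O_RDONLY);    /* fd now refers to a NS instance */
    if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }

    if (setns(fd, 0) == -1) {            /* 0: join whatever NS type fd refers to */
        perror("setns"); exit(EXIT_FAILURE);
    }

    execvp(argv[2], &argv[2]);           /* run the command inside the joined NS */
    perror("execvp");
    exit(EXIT_FAILURE);
}
```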
-
shell commands for similar purposes:
- unshare(1): create new NS(s) and execute a command in the NS(s)
- things like the hostname are copied from the parent NS at first, so they start out the same but can then be modified independently
- nsenter(1): enter existing NS(s) and execute a command
-
creating NSs of any type other than user NSs requires privilege
- CAP_SYS_ADMIN
-
important use of namespaces: implementing lightweight virtualization (AKA containers)
- virtualization == isolation of processes
-
traditional virtualization: hypervisors
- processes are isolated by running in separate guest kernels that sit on top of the host kernel
- isolation is "all or nothing"
- no halfway option for processes
- a lot of overhead, new kernel for each new isolated process
-
isolation/virtualization via namespaces (containers)
- much cheaper in resources
- permits partial isolation of processes running on a single kernel
- isolation per global resource
- more work to implement
- each global resource must be refactored inside kernel to support isolation (required changes can be extensive)
- mainline-kernel-based container systems are more recent