-
Namespace notes
Containers unplugged: Linux namespaces - Michael Kerrisk - https://youtu.be/0kJPa-1FuoI
-
for each NS type:
- multiple instances of NS may exist on a system
- at system boot, there is one instance of each NS type - the so-called initial namespace of that type
- each process resides in one NS instance
- to the processes inside an NS instance, it appears that only they can see/modify the corresponding global resource
- processes are unaware of other instances of resource
-
when new process is created via fork(), it resides in the same set of NSs as parent process
-
7 NS types are supported by Linux, each with a corresponding clone() flag
-
mount - CLONE_NEWNS
- first NS type to be implemented (which explains the generic name CLONE_NEWNS: at the time, future namespaces weren't anticipated)
- isolation of mount points (MPs) seen by process(es)
- process's view of filesystem (FS) tree is defined by (hierarchically) related set of MPs
- MP is a tuple that includes:
- Mount source (device)
- pathname
- ID of parent mount; the parental relationship defines a single directed hierarchy (tree) of mounts
- processes in different mount NSs see different FS trees, i.e. processes have distinct sets of MPs
- mount(2) and umount(2) affect only processes in the same mount NS as caller
- compare to chroot:
- chroot only confines a process to a subtree of the existing FS tree, but a mount NS lets you give processes a completely different set of mounts (including the root) and rearrange their view of the file system
- allows you to mount a new /proc FS without system-wide side effects; this lets you set up the needed mounts when creating new PID, IPC, etc. namespaces while avoiding side effects for the rest of the system
- consider this situation: you have multiple isolated mount NSs and insert a disk into a drive; you want all NSs to see the resulting mount. mount_namespaces(7) shared subtrees allow a mount to be propagated into the other namespaces - a partial reversal of the isolation (see the sketch below)
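A minimal sketch of mount-NS isolation (not from the talk; it assumes CAP_SYS_ADMIN and that /mnt exists): unshare the mount NS, mark all mounts private so nothing propagates back, then add a tmpfs mount that only this process and its children can see.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    /* Needs CAP_SYS_ADMIN (run as root, or combine with a new user NS). */
    if (unshare(CLONE_NEWNS) == -1) { perror("unshare"); exit(EXIT_FAILURE); }

    /* Many distros make "/" a shared mount, so first mark everything
       private to keep the mount below from propagating back out. */
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
        perror("mount MS_PRIVATE"); exit(EXIT_FAILURE);
    }

    /* Visible only in this mount NS; the rest of the system still sees
       the original /mnt (assumed to exist). */
    if (mount("tmpfs", "/mnt", "tmpfs", 0, NULL) == -1) {
        perror("mount tmpfs"); exit(EXIT_FAILURE);
    }

    /* Show the private mount from inside the NS. */
    execlp("grep", "grep", "/mnt", "/proc/self/mounts", (char *) NULL);
    perror("execlp");
    exit(EXIT_FAILURE);
}
```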
-
UTS - CLONE_NEWUTS
- simplest NS
- isolate two system identifiers returned by uname(2)
- nodename: system hostname (set by sethostname(2))
- domainname: NIS domain name (set by setdomainname(2))
- changes made aren't visible outside namespace
- the per-container nodename (hostname) can be used with DHCP to obtain a separate IP address for the container
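A minimal sketch (assumes CAP_SYS_ADMIN or a new user NS; the hostname string is arbitrary): unshare the UTS NS and change the nodename without affecting the rest of the system.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

int main(void)
{
    struct utsname uts;

    /* Needs CAP_SYS_ADMIN (run as root, or combine with CLONE_NEWUSER). */
    if (unshare(CLONE_NEWUTS) == -1) { perror("unshare"); exit(EXIT_FAILURE); }

    /* The new hostname is visible only inside this UTS NS; processes in
       the initial NS still see the old one. */
    if (sethostname("inside-ns", strlen("inside-ns")) == -1) {
        perror("sethostname"); exit(EXIT_FAILURE);
    }

    if (uname(&uts) == -1) { perror("uname"); exit(EXIT_FAILURE); }
    printf("nodename inside new UTS NS: %s\n", uts.nodename);
    return 0;
}
```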
-
IPC - CLONE_NEWIPC
- isolate certain ipc resources:
- system V IPC (message queues (MQs), semaphores, shared memory)
- POSIX MQs
- processes in an IPC NS instance share a set of IPC objects, but can't see objects in other IPC NSs
- each NS has:
- isolated set of System V IPC identifiers
- its own POSIX MQ filesystem (/dev/mqueue)
- private instances of various /proc files related to these IPC mechanisms
- /proc/sysvipc, /proc/sys/fs/mqueue, etc.
- all of these are destroyed when NS is destroyed
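A minimal sketch of the isolation (assumes CAP_SYS_ADMIN; the key value is arbitrary): a System V message queue created before unshare(CLONE_NEWIPC) is no longer visible afterwards.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/msg.h>

int main(void)
{
    key_t key = 0x1234;                      /* arbitrary key for the demo */

    /* Create a System V message queue in the current (initial) IPC NS. */
    int id = msgget(key, IPC_CREAT | 0600);
    if (id == -1) { perror("msgget (create)"); exit(EXIT_FAILURE); }
    printf("before unshare: queue id = %d\n", id);

    /* Needs CAP_SYS_ADMIN (run as root, or combine with CLONE_NEWUSER). */
    if (unshare(CLONE_NEWIPC) == -1) { perror("unshare"); exit(EXIT_FAILURE); }

    /* Same key, but the new IPC NS has its own identifier space, so the
       lookup fails: the queue created above is not visible here. */
    if (msgget(key, 0600) == -1)
        perror("msgget in new IPC NS (expected ENOENT)");

    /* The queue still exists in the original NS; remove it with ipcrm(1). */
    return 0;
}
```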
-
PID - CLONE_NEWPID
- isolate process ids
- processes in different PID NSs can have the same PID
- benefits:
- allows processes inside a container to keep the same PIDs when the container is migrated to a different host - could be useful for distributed systems around the city if someone wants to switch trashcans and run the code somewhere else
- allows per-container init process (PID 1) which manages container initialization and reaping of orphaned children
- unlike most other NS types, PID NSs form a hierarchy
- each pid NS has a parent, going back to initial PID NS
- Parent of PID NS is PID NS of caller of clone() or unshare()
- ioctl(fd, NS_GET_PARENT) can be used to discover parental relationship
- see ioctl_ns(2)
- process is a member of its immediate PID NS but is also visible in each ancestor PID NS
- process will typically have different PID in each PID NS in which it is visible
- the initial PID NS can see all processes in all PID NSs
- see == employ syscalls on, send signals to, access via /proc, ...
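A minimal sketch of creating a child in a new PID NS with clone() (assumes CAP_SYS_ADMIN, or add CLONE_NEWUSER to the flags): the child sees itself as PID 1, while the parent sees an ordinary PID for the same process.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int child(void *arg)
{
    /* Inside the new PID NS this process is PID 1 (the per-NS "init"). */
    printf("child: getpid() inside new PID NS = %ld\n", (long) getpid());
    return 0;
}

static char stack[1024 * 1024];              /* child's stack */

int main(void)
{
    /* Needs CAP_SYS_ADMIN (or combine CLONE_NEWUSER into the flags). */
    pid_t pid = clone(child, stack + sizeof(stack),
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); exit(EXIT_FAILURE); }

    /* In the parent's (initial) PID NS the same child has an ordinary PID. */
    printf("parent: child's PID in my PID NS = %ld\n", (long) pid);
    waitpid(pid, NULL, 0);
    return 0;
}
```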
-
Network - CLONE_NEWNET
- isolating network resources
- IP addresses, IP routing tables, /proc/net & /sys/class/net directories, netfilter (firewall) rules, socket port-number space, abstract UNIX domain sockets
- make containers useful from networking perspective:
- each container can have virtual network device
- applications bound to per-ns port-number space
- routing rules in host system can direct network packets to virtual device of specific container
- virtual ethernet (veth) devices provide network connection between container and host system
- isolate network service workers
- place server worker process in NS with no NW device
- can still use host to pass file descriptors (e.g. connected sockets) via UNIX domain socket
- FD passing example: sockets/scm_rights_send.c and sockets/scm_rights_recv.c
- worker can then access the NW via the host (through the passed FDs), but if compromised it can't make new network connections because it has no NW device
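A minimal sketch of the isolation (assumes CAP_SYS_ADMIN or a new user NS): after unshare(CLONE_NEWNET) the process sees only the new NS's own devices, typically just a (down) loopback interface.

```c
#define _GNU_SOURCE
#include <net/if.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Needs CAP_SYS_ADMIN (run as root, or combine with CLONE_NEWUSER). */
    if (unshare(CLONE_NEWNET) == -1) { perror("unshare"); exit(EXIT_FAILURE); }

    /* if_nameindex() enumerates interfaces via a socket created in the
       caller's (new) network NS, so only that NS's devices appear --
       typically just "lo". */
    struct if_nameindex *ifs = if_nameindex();
    if (ifs == NULL) { perror("if_nameindex"); exit(EXIT_FAILURE); }

    for (struct if_nameindex *p = ifs; p->if_index != 0; p++)
        printf("%u: %s\n", p->if_index, p->if_name);

    if_freenameindex(ifs);
    return 0;
}
```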
-
User - CLONE_NEWUSER
- requires no privilege to create (all others require privilege)
- isolate user and group ID number spaces
- process's UIDs and GIDs can be different inside and outside the user namespace
- most interesting use:
- outside NS: process has a normal unprivileged UID
- inside NS: process has UID 0 and is granted superuser privileges for operations inside the user NS
- see the user_namespaces(7) man page
- even if a process gains UID 0 in a new user NS, it can't then go and do something like set a new hostname: its superuser privileges apply only to objects governed by that user NS (the hostname is governed by a UTS NS, so it would also need to create/own a new UTS NS)
- can create a user NS with a single identity UID mapping --> no superuser possible!
- e.g.
uid_map: 1000 1000 1
so that no concept of superuser even exists in the NS (see the sketch below)
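A minimal sketch of that identity mapping (write_file is a hypothetical helper; mapping "0 <uid> 1" instead would give the process UID 0 inside the NS):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write a string to a /proc file, exiting on error. */
static void write_file(const char *path, const char *s)
{
    int fd = open(path, O_WRONLY);
    if (fd == -1 || write(fd, s, strlen(s)) == -1) { perror(path); exit(EXIT_FAILURE); }
    close(fd);
}

int main(void)
{
    uid_t uid = geteuid();               /* UID/GID as seen outside the new NS */
    gid_t gid = getegid();
    char map[64];

    /* No privilege needed: any process may create a new user NS. */
    if (unshare(CLONE_NEWUSER) == -1) { perror("unshare"); exit(EXIT_FAILURE); }

    /* Identity-map only our own UID/GID ("1000 1000 1" style), so no UID 0
       exists inside the NS at all. */
    write_file("/proc/self/setgroups", "deny");     /* required before gid_map */
    snprintf(map, sizeof(map), "%d %d 1", (int) gid, (int) gid);
    write_file("/proc/self/gid_map", map);
    snprintf(map, sizeof(map), "%d %d 1", (int) uid, (int) uid);
    write_file("/proc/self/uid_map", map);

    printf("inside user NS: euid=%d egid=%d\n", (int) geteuid(), (int) getegid());
    return 0;
}
```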
-
Cgroup - CLONE_NEWCGROUP
-
to do: understand cgroups!
-
see: cgroup_namespaces(7) for full details
-
essentially virtualizes the cgroup pathnames exposed in certain /proc/PID files (e.g. /proc/PID/cgroup) that show the cgroup membership of a process (pretty simple actually)
-
namespace types can be used on their own, but it is often helpful to combine several
- PID, IPC, and cgroup NSs have associated filesystems (/proc, /dev/mqueue, cgroupfs) that need to be (re)mounted, so they are usually combined with a mount NS
- container-style frameworks like Docker use almost all of these (maybe except cgroup NSs so far)
-
magic symlinks
-
each process has symlink files in /proc/PID/ns
$ readlink /proc/$$/ns/uts
uts:[4026531838]
($$ is the PID of the shell)
-
the symlink target has the form ns-type:[magic inode #]
- the implementation uses a filesystem, so each NS instance gets a unique inode number; if two processes have the same symlink target, they are in the same NS (see the sketch below)
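A small sketch of that comparison (hypothetical utility, pass two PIDs): stat() the magic symlinks and compare device plus inode numbers, which namespaces(7) documents as identifying a NS instance.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Usage (hypothetical): ./same_uts_ns <pid1> <pid2> */
int main(int argc, char *argv[])
{
    char path1[64], path2[64];
    struct stat st1, st2;

    if (argc != 3) {
        fprintf(stderr, "usage: %s pid1 pid2\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    snprintf(path1, sizeof(path1), "/proc/%s/ns/uts", argv[1]);
    snprintf(path2, sizeof(path2), "/proc/%s/ns/uts", argv[2]);

    /* stat() follows the magic symlink; the device + inode pair uniquely
       identifies the NS instance (the same number readlink shows). */
    if (stat(path1, &st1) == -1 || stat(path2, &st2) == -1) {
        perror("stat"); exit(EXIT_FAILURE);
    }
    printf("same UTS NS? %s\n",
           (st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino) ? "yes" : "no");
    return 0;
}
```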
-
programs can use system calls to work with NSs:
- clone(2): create a new child process in new NS(s); similar to fork(), but you can specify NS flags
- unshare(2): create new NS(s) and move the caller into it/them
- setns(2): move calling process to another existing NS instance
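A minimal sketch of setns(2) (a hypothetical join-and-exec utility, similar in spirit to nsenter(1); needs privilege in the user NS owning the target NS): open a /proc/PID/ns/* file, join that NS, then exec a command inside it.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Usage (hypothetical): ./join_ns /proc/<PID>/ns/<file> <command> [args...] */
int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s /proc/PID/ns/FILE cmd [args...]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    int fd = open(argv[1], O_RDONLY);    /* fd now refers to a NS instance */
    if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }

    if (setns(fd, 0) == -1) {            /* 0: join whatever NS type fd refers to */
        perror("setns"); exit(EXIT_FAILURE);
    }

    execvp(argv[2], &argv[2]);           /* run the command inside the joined NS */
    perror("execvp");
    exit(EXIT_FAILURE);
}
```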
-
shell commands for similar purposes:
- unshare(1): create new NS(s) and execute a command in the NS(s)
- things like the hostname are copied from the parent NS at first, so they start out the same but can then be modified independently
- nsenter(1): enter existing NS(s) and execute a command
-
creating NSs of any type other than user NSs requires privilege
- CAP_SYS_ADMIN
-
important use of namespaces: implementing lightweight virtualization (AKA containers)
- virtualization == isolation of processes
-
traditional virtualization: hypervisors
- processes are isolated by running in separate guest kernels that sit on top of the host kernel
- isolation is "all or nothing"
- no halfway option for processes
- a lot of overhead, new kernel for each new isolated process
-
isolation/virtualization via namespaces (containers)
- much cheaper in resources
- permits partial isolation of processes running on a single kernel
- isolation per global resource
- more work to implement
- each global resource must be refactored inside kernel to support isolation (required changes can be extensive)
- mainline-kernel-based container systems are more recent