[doc] Host network device ordering #6387

Merged
342 changes: 342 additions & 0 deletions doc/content/xcp-networkd/host-network-device-ordering-on-networkd.md
---
title: Host Network Device Ordering on Networkd
description: How host network device ordering works on networkd.
---

Purpose
-------

One of the Toolstack's functions is to maintain a pool of hosts. A pool can be
constructed by joining a host into an existing pool. One challenge in this
process is determining which pool-wide network a network device on the joining
host should connect to.

At first glance, this could be resolved by specifying a mapping between an
individual network device and a pool-wide network. However, this approach
would be burdensome for administrators when managing many hosts. It would be
more efficient if the Toolstack could determine this automatically.

To achieve this, the Toolstack components on two hosts need to independently
work out consistent identifications for the host network devices and connect
the network devices with the same identification to the same pool-wide network.
The identifications on a host can be considered as an order, with each network
device assigned a unique position in the order as its identification. Network
devices with the same position will connect to the same network.


The assumption
--------------

Why can the Toolstack components on two hosts independently work out an expected
order without any communication? This is possible only under the assumption that
the hosts have identical hardware, firmware, and software, and that network
devices are plugged into them in the same way. For example, an administrator
will always plug the network devices into the same PCI slot position on multiple
hosts if they want these network devices to connect to the same network.

The ordering is considered consistent if the positions of such network devices
(plugged into the same PCI slot position) in the generated orders are the same.


The biosdevname
---------------
In particular, when the assumption above holds, a consistent initial order can be
worked out on multiple hosts independently with the help of `biosdevname`. The
`all_ethN` policy of the `biosdevname` utility generates a device order in which
embedded devices come first, followed by PCI cards in ascending slot order, with
ports in ascending PCI bus/device/function order, breadth-first. Since the hosts
are identical, the orders generated by `biosdevname` are consistent across the
hosts.

The following is an example of `biosdevname` output. The initial order can be
derived from the `BIOS device` field.

```
# biosdevname --policy all_ethN -d -x
BIOS device: eth0
Kernel name: enp5s0
Permanent MAC: 00:02:C9:ED:FD:F0
Assigned MAC : 00:02:C9:ED:FD:F0
Bus Info: 0000:05:00.0
...

BIOS device: eth1
Kernel name: enp5s1
Permanent MAC: 00:02:C9:ED:FD:F1
Assigned MAC : 00:02:C9:ED:FD:F1
Bus Info: 0000:05:01.0
...
```

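However, the `BIOS device` of a particular network device may change with the
addition or removal of devices. For example, once another device appears at
`0000:04:00.0`, the two devices shown above move from `eth0` and `eth1` to
`eth1` and `eth2`: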

```
# biosdevname --policy all_ethN -d -x
BIOS device: eth0
Kernel name: enp4s0
Permanent MAC: EC:F4:BB:E6:D7:BB
Assigned MAC : EC:F4:BB:E6:D7:BB
Bus Info: 0000:04:00.0
...

BIOS device: eth1
Kernel name: enp5s0
Permanent MAC: 00:02:C9:ED:FD:F0
Assigned MAC : 00:02:C9:ED:FD:F0
Bus Info: 0000:05:00.0
...

BIOS device: eth2
Kernel name: enp5s1
Permanent MAC: 00:02:C9:ED:FD:F1
Assigned MAC : 00:02:C9:ED:FD:F1
Bus Info: 0000:05:01.0
...
```

Therefore, the order derived from these values is used solely for determining
the initial order and the order of newly added devices.
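
As a rough illustration, the initial order could be read from the `BIOS device`
field along the following lines. This is a sketch only, not the xcp-networkd
implementation: the function names `parse_biosdevname` and `initial_order` are
hypothetical, and using the `N` of `ethN` directly as the position is an
assumption.

```python
def parse_biosdevname(text):
    """Split `biosdevname -d` style output into one dict per device."""
    devices, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current device record.
            if current:
                devices.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that are not "key: value"
            current[key.strip()] = value.strip()
    if current:
        devices.append(current)
    return devices


def initial_order(devices):
    """Return (mac, pci, position) tuples sorted by position.

    Assumption: the N in `ethN` is used directly as the position.
    """
    entries = [
        (d["Permanent MAC"], d["Bus Info"], int(d["BIOS device"][len("eth"):]))
        for d in devices
    ]
    return sorted(entries, key=lambda e: e[2])


# Records deliberately listed out of order to show the sorting.
SAMPLE = """\
BIOS device: eth1
Kernel name: enp5s1
Permanent MAC: 00:02:C9:ED:FD:F1
Assigned MAC : 00:02:C9:ED:FD:F1
Bus Info: 0000:05:01.0

BIOS device: eth0
Kernel name: enp5s0
Permanent MAC: 00:02:C9:ED:FD:F0
Assigned MAC : 00:02:C9:ED:FD:F0
Bus Info: 0000:05:00.0
"""

for entry in initial_order(parse_biosdevname(SAMPLE)):
    print(entry)
```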

Principles
-----------
* Initially, the order is aligned with PCI slots. This makes the connection
between cabling and order predictable: the network devices in identical PCI
slots have the same position. The rationale is that PCI slots are more
predictable than MAC addresses and correspond to physical locations.

* Once an order has been established, it should be kept as stable as possible
despite changes to MAC addresses or PCI addresses. The rationale is that the
assumption becomes less likely to hold as the hosts undergo updates and
maintenance. Therefore, keeping the order stable is the best choice for
automatic ordering.

Notation
--------

```
mac:pci:position
!mac:pci:position
```

A network device is characterised by

* MAC address, which is unique.
* PCI slot, which is not unique: multiple network devices can share a PCI
slot. PCI addresses correspond to hardware PCI slots and thus are physically
observable.
* position, the position assigned to this network device by xcp-networkd. At any
given time, no position is assigned twice but the sequence of positions may have
holes.
* The `!mac:pci:position` notation indicates that this position was previously
used but is currently free because the device it was assigned to was removed.

On a Linux system, MAC and PCI addresses have specific formats. However, for
simplicity, symbolic names are used here: MAC addresses use lowercase letters,
PCI addresses use uppercase letters, and positions use numbers.
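
The symbolic notation can be read mechanically. The sketch below shows one way
to parse it; the `Entry` type and the `parse_entry` and `parse_order` names are
hypothetical and only serve this illustration. It handles the symbolic form
only, because real MAC addresses contain colons themselves.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    mac: str
    pci: str
    position: Optional[int]  # None when no position has been assigned yet
    present: bool = True     # False for the `!mac:pci:position` form


def parse_entry(token):
    """Parse one symbolic token such as `a:A`, `a:A:0`, `a::0` or `!c:C:1`."""
    present = not token.startswith("!")
    parts = token.lstrip("!").split(":")
    mac, pci = parts[0], parts[1]
    position = int(parts[2]) if len(parts) > 2 else None
    return Entry(mac, pci, position, present)


def parse_order(line):
    """Parse a whitespace-separated list of tokens."""
    return [parse_entry(token) for token in line.split()]


print(parse_order("a:A:0 !c:C:1 b:D:2"))
```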

Scenarios
---------

### The initial order

As mentioned above, `biosdevname` can be used to generate consistent orders
for the network devices on multiple hosts.

```
current input: a:A b:D c:C
initial order: a:A:0 c:C:1 b:D:2
```

This only works if the assumption of identical hardware, firmware, software, and
network device placement holds. The assumption is expected to hold for the
majority of use cases.

Otherwise, the order can be generated from a user's configuration. The user can
specify the order explicitly for individual hosts. However, administrators would
prefer to avoid this as much as possible when managing many hosts.

```
user spec: a::0 c::1 b::2
current input: a:A b:D c:C
initial order: a:A:0 c:C:1 b:D:2
```
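
A minimal sketch of applying such a user specification, assuming the
specification maps MAC addresses to positions as in the `a::0` entries above.
The function name `order_from_user_spec` and the dictionary representation are
hypothetical.

```python
def order_from_user_spec(spec, devices):
    """Build an order from a user specification.

    spec:    dict mapping MAC address to position (assumed representation).
    devices: list of (mac, pci) pairs currently present.
    Devices that the specification does not mention are left out here.
    """
    ordered = [(mac, pci, spec[mac]) for mac, pci in devices if mac in spec]
    return sorted(ordered, key=lambda entry: entry[2])


spec = {"a": 0, "c": 1, "b": 2}
devices = [("a", "A"), ("b", "D"), ("c", "C")]
print(order_from_user_spec(spec, devices))
```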

### Keep the order as stable as possible

Once an initial order is created on an individual host, it should be kept as
stable as possible across host boot-ups and at runtime. For example, unless
there are hardware changes, the position of a network device in the initial
order should remain the same regardless of how many times the host is rebooted.

To achieve this, the initial order should be saved persistently on the host's
local storage so it can be referenced in subsequent orderings. When performing
another ordering after the initial order has been saved, the position of a
currently unordered network device should be determined by finding its position
in the last saved order. The MAC address of the network device is a reliable
attribute for this purpose, as it is considered unique for each network device
globally.

Therefore, the network devices in the saved order should have their MAC
addresses saved together, effectively mapping each position to a MAC address.
When performing an ordering, the stable position can be found by searching the
last saved order using the MAC address.

```
last order: a:A:0 c:C:1 b:D:2
current input: a:A b:D c:C
new order: a:A:0 c:C:1 b:D:2
```

Name labels of the network devices are not considered reliable enough to
identify particular devices. For example, if the name labels are determined by
the PCI address via systemd, and a firmware update changes the PCI addresses of
the network devices, the name labels will also change.

The PCI addresses are not considered reliable either. They may change due to
firmware updates, firmware setting changes, or even plugging/unplugging other
devices.

```
last order: a:A:0 c:C:1 b:D:2
current input: a:A b:B c:E
new order: a:A:0 c:E:1 b:B:2
```
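
The MAC lookup described above can be sketched as follows. This is an
illustration under simplifying assumptions, not the actual implementation: the
name `keep_positions` is hypothetical, and devices whose MAC address is unknown
are simply returned for later handling.

```python
def keep_positions(last_order, current):
    """Reuse saved positions for devices whose MAC address is known.

    last_order: list of (mac, pci, position) from the saved order.
    current:    list of (mac, pci) currently reported.
    Returns (ordered, unknown): devices matched by MAC with their saved
    position and their current PCI address, and devices not found by MAC.
    """
    position_by_mac = {mac: pos for mac, _pci, pos in last_order}
    ordered = [
        (mac, pci, position_by_mac[mac])
        for mac, pci in current
        if mac in position_by_mac
    ]
    unknown = [(mac, pci) for mac, pci in current if mac not in position_by_mac]
    return sorted(ordered, key=lambda entry: entry[2]), unknown


last_order = [("a", "A", 0), ("c", "C", 1), ("b", "D", 2)]
current = [("a", "A"), ("b", "B"), ("c", "E")]  # PCI addresses of b and c changed
print(keep_positions(last_order, current))
```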

### Replacement

However, what happens when the MAC address of an unordered network device cannot
be found in the last saved order? There are two possible scenarios:

1. It's a newly added network device since the last ordering.
2. It's a new device that replaces an existing network device.

Replacement is a supported scenario, as an administrator might replace a broken
network device with a new one.

This can be recognized by comparing the PCI address where the network device is
located. Therefore, the PCI address of each network device should also be saved
in the order. In this case, searching the PCI address in the order results in
one of the following:

1. Not found: This means the PCI address was not occupied during the last
ordering, indicating a newly added network device.
2. Found with a MAC address, but another device with this MAC address is still
present in the system: This suggests that the PCI address of an existing
network device (with the same MAC address) has changed since the last ordering.
This may be caused by a device move or by another change such as a firmware
update. In this case, the current unordered network device is considered newly
added.

```
last order: a:A:0 c:C:1 b:D:2
current input: a:A b:B c:C d:D
new order: a:A:0 c:C:1 b:B:2 d:D:3
```

3. Found with a MAC address, and no current devices have this MAC address: This
indicates that a new network device has replaced the old one in the same PCI
slot.
The replacing network device should be assigned the same position as the
replaced one.

```
last order: a:A:0 c:C:1 b:D:2
current input: a:A c:C d:D
new order: a:A:0 c:C:1 d:D:2
```
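
The three outcomes above can be expressed as a small decision function. The
sketch assumes the device's MAC address has already been looked up and was not
found in the last order; the name `classify` and the returned labels are
hypothetical.

```python
def classify(device, last_order, current):
    """Decide whether an unknown device is new or a replacement.

    device:     (mac, pci) whose MAC address is not in the last order.
    last_order: list of (mac, pci, position) from the saved order.
    current:    list of (mac, pci) currently reported.
    Returns ("replacement", position) or ("new", None).
    """
    _mac, pci = device
    current_macs = {mac for mac, _ in current}
    for old_mac, old_pci, position in last_order:
        # Outcome 3: the PCI address was occupied by a device that is gone.
        if old_pci == pci and old_mac not in current_macs:
            return ("replacement", position)
    # Outcome 1 (PCI address not found) or outcome 2 (its previous
    # occupant is still present elsewhere): treat the device as new.
    return ("new", None)


last_order = [("a", "A", 0), ("c", "C", 1), ("b", "D", 2)]
print(classify(("d", "D"), last_order, [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")]))
print(classify(("d", "D"), last_order, [("a", "A"), ("c", "C"), ("d", "D")]))
```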

### Removed devices

A network device may have been removed or unplugged since the last ordering. Its
position, MAC address, and PCI address are saved for future reference, and its
position will be reserved. This means there may be a gap in the order: a
position that was previously assigned to a network device is now vacant because
the device has been removed.

```
last order: a:A:0 c:C:1 b:D:2
current input: a:A b:D
new order: a:A:0 !c:C:1 b:D:2
```
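
Reserving the position of a removed device can be sketched as below. The names
`mark_removed` and `render` are hypothetical, and the sketch assumes that no
device has been replaced, so every missing MAC address means a removal.

```python
def mark_removed(last_order, current):
    """Flag entries of the last order whose device is no longer present.

    last_order: list of (mac, pci, position).
    current:    list of (mac, pci) currently reported.
    Returns (mac, pci, position, present) tuples; removed devices keep
    their saved MAC address, PCI address and position.
    """
    current_pci = dict(current)  # mac -> PCI address reported now
    return [
        (mac, current_pci.get(mac, pci), position, mac in current_pci)
        for mac, pci, position in last_order
    ]


def render(order):
    """Render an order in the notation used here, with `!` for removed."""
    return " ".join(
        ("" if present else "!") + f"{mac}:{pci}:{position}"
        for mac, pci, position, present in order
    )


last_order = [("a", "A", 0), ("c", "C", 1), ("b", "D", 2)]
print(render(mark_removed(last_order, [("a", "A"), ("b", "D")])))
```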

### Newly added devices

As long as `the assumption` holds, newly added devices since the last ordering
can be assigned positions consistently across multiple hosts. Newly added
devices will not be assigned the positions reserved for removed devices.

```
last order: a:A:0 !c:C:1 d:D:2
current input: a:A d:D e:E
new order: a:A:0 !c:C:1 d:D:2 e:E:3
```
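
One way to keep reserved positions untouched is to number new devices after
the highest position ever used, as in this sketch. The name `append_new` is
hypothetical, and the new devices are assumed to arrive already sorted in
`biosdevname` order.

```python
def append_new(order, new_devices):
    """Assign fresh positions to newly added devices.

    order:       list of (mac, pci, position, present), including
                 reserved entries of removed devices (present=False).
    new_devices: list of (mac, pci), assumed sorted in biosdevname order.
    """
    # Reserved positions count as used, so they are never handed out again.
    next_position = max((pos for _, _, pos, _ in order), default=-1) + 1
    added = [
        (mac, pci, next_position + offset, True)
        for offset, (mac, pci) in enumerate(new_devices)
    ]
    return order + added


order = [("a", "A", 0, True), ("c", "C", 1, False), ("d", "D", 2, True)]
print(append_new(order, [("e", "E")]))
```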

### Removed and then added back

It is a supported scenario for a removed device to be plugged back in,
regardless of whether it is in the same PCI slot or not. This can be recognized
by searching for the device in the saved removed devices using its MAC address.
The reserved position will be reassigned to the device when it is added back.

```
last order: a:A:0 !c:C:1 d:D:2
current input: a:A c:F d:D e:E
new order: a:A:0 c:F:1 d:D:2 e:E:3
```
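
Handing a reserved position back to a returning device might look like the
following sketch. The name `restore_returned` is hypothetical; devices that are
not in the saved order, such as `e` in the example above, are left for the
handling of newly added devices.

```python
def restore_returned(order, current):
    """Give reserved positions back to devices that were plugged in again.

    order:   list of (mac, pci, position, present) from the saved order.
    current: list of (mac, pci) currently reported.
    A removed entry whose MAC address shows up again becomes present and
    takes the PCI address where the device is found now.
    """
    current_pci = dict(current)  # mac -> PCI address reported now
    restored = []
    for mac, pci, position, present in order:
        if not present and mac in current_pci:
            restored.append((mac, current_pci[mac], position, True))
        else:
            restored.append((mac, pci, position, present))
    return restored


order = [("a", "A", 0, True), ("c", "C", 1, False), ("d", "D", 2, True)]
current = [("a", "A"), ("c", "F"), ("d", "D"), ("e", "E")]
print(restore_returned(order, current))
```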

### Multinic functions

The multinic function is a special kind of network device. When this type of
physical device is plugged into a PCI slot, multiple network devices are
reported at a single PCI address. Additionally, the number of reported network
devices may change due to driver updates.

```
current input: a:A b:A c:A d:A
initial order: a:A:0 b:A:1 c:A:2 d:A:3
```

As long as `the assumption` holds, the initial order of these devices can be
generated automatically and kept stable by using MAC addresses to identify
individual devices. However, `biosdevname` cannot reliably generate an order for
all devices reported at one PCI address. For devices located at the same PCI
address, their MAC addresses are used to generate the initial order.

```
last order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5
current input: a:A b:A c:A d:A e:A f:A m:M n:N
new order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5 e:A:6 f:A:7
```

For example, suppose `biosdevname` generates an order for a multinic function
and other non-multinic devices. Within this order, the N devices of the
multinic function with MAC addresses mac[1], ..., mac[N] are assigned positions
pos[1], ..., pos[N] correspondingly. `biosdevname` cannot ensure that the device
with mac[1] is always assigned position pos[1]. Instead, it ensures that the
entire set of positions pos[1], ..., pos[N] remains stable for the devices of
the multinic function. Therefore, to ensure the order follows the MAC address
order, the devices of the multinic function need to be sorted by their MAC
addresses within the set of positions.

```
last order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4
current input: e:A f:A g:A h:A m:M
new order: e:A:0 f:A:1 g:A:2 h:A:3 m:M:4
```
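
Sorting the devices of a multinic function by MAC address within their set of
positions could be sketched as follows. The name `sort_within_pci` is
hypothetical. MAC addresses are compared as plain strings, which works for the
symbolic names used here; real addresses would first need a consistent
letter case.

```python
from collections import defaultdict


def sort_within_pci(order):
    """Reassign positions among devices that share a PCI address.

    order: list of (mac, pci, position) as generated initially.
    The set of positions held by each PCI address stays the same; inside
    that set, positions ascend with the MAC address.
    """
    groups = defaultdict(list)
    for mac, pci, position in order:
        groups[pci].append((mac, position))

    result = []
    for pci, members in groups.items():
        macs = sorted(mac for mac, _ in members)
        positions = sorted(position for _, position in members)
        result.extend((mac, pci, pos) for mac, pos in zip(macs, positions))
    return sorted(result, key=lambda entry: entry[2])


# The multinic function at A holds positions 0, 1, 3 and 4 in arbitrary order.
unsorted = [("b", "A", 0), ("a", "A", 1), ("m", "M", 2), ("d", "A", 3), ("c", "A", 4)]
print(sort_within_pci(unsorted))
```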

Rare cases that cannot be handled automatically
-----------------------------------------------

In summary, to keep the order stable, the auto-generated order needs to be saved
for the next ordering. When performing an automatic ordering for the current
network devices, either the MAC address or the PCI address is used to recognize
the device that was assigned the same position in the last ordering. If neither
the MAC address nor the PCI address can be used to find a position from the last
ordering, the device is considered newly added and is assigned a new position.
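
The summary above can be put together in one sketch. It is an illustration
under stated assumptions rather than the xcp-networkd code: the name `reorder`
is hypothetical, the current devices are assumed to be listed in `biosdevname`
order, and only devices that were present in the last order are treated as
replaceable so that reserved positions stay reserved.

```python
def reorder(last_order, current):
    """Compute a new order from the saved order and the current devices.

    last_order: list of (mac, pci, position, present).
    current:    list of (mac, pci), assumed to be in biosdevname order.
    Returns (mac, pci, position, present) tuples sorted by position.
    """
    current_macs = {mac for mac, _ in current}
    position_by_mac = {mac: pos for mac, _, pos, _ in last_order}
    result = {}  # position -> (mac, pci, present)

    # Step 1: a known MAC address keeps its saved position.
    pending = []
    for mac, pci in current:
        if mac in position_by_mac:
            result[position_by_mac[mac]] = (mac, pci, True)
        else:
            pending.append((mac, pci))

    # Step 2: an unknown MAC address at a PCI address whose previous
    # occupant has disappeared is a replacement and inherits the position.
    new_devices = []
    for mac, pci in pending:
        slot = next(
            (
                pos
                for old_mac, old_pci, pos, was_present in last_order
                if was_present
                and old_pci == pci
                and old_mac not in current_macs
                and pos not in result
            ),
            None,
        )
        if slot is None:
            new_devices.append((mac, pci))
        else:
            result[slot] = (mac, pci, True)

    # Step 3: positions of devices that are gone stay reserved.
    for mac, pci, pos, _ in last_order:
        result.setdefault(pos, (mac, pci, False))

    # Step 4: everything else is new and goes after all used positions.
    next_position = max(result, default=-1) + 1
    for mac, pci in new_devices:
        result[next_position] = (mac, pci, True)
        next_position += 1

    return [(mac, pci, pos, present) for pos, (mac, pci, present) in sorted(result.items())]


last_order = [("a", "A", 0, True), ("c", "C", 1, True), ("b", "D", 2, True)]
print(reorder(last_order, [("a", "A"), ("c", "C"), ("d", "D")]))
```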

However, following this sorting logic, the ordering result may not always be as
expected. In practice, this can be caused by various rare cases, such as
switching an existing network device to connect to another network, performing
firmware updates, changing firmware settings, or plugging/unplugging network
devices. It is not worth complicating the entire function for these rare cases.
Instead, a user's configuration that specifies the order explicitly can be used
to handle these rare scenarios.