Commit 92945bd

Mike Snitzer authored and Anna Schumaker committed
nfs: add Documentation/filesystems/nfs/localio.rst

This document gives an overview of the LOCALIO auxiliary RPC protocol
added to the Linux NFS client and server to allow them to reliably
handshake to determine if they are on the same host. Once an NFS client
and server handshake as "local", the client will bypass the network RPC
protocol for read, write and commit operations. Due to this XDR and RPC
bypass, these operations will operate faster.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>

1 parent 56bcd0f commit 92945bd

Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
===========
NFS LOCALIO
===========

Overview
========

The LOCALIO auxiliary RPC protocol allows the Linux NFS client and
server to reliably handshake to determine if they are on the same
host. Select "NFS client and server support for LOCALIO auxiliary
protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel
config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled).

Once an NFS client and server handshake as "local", the client will
bypass the network RPC protocol for read, write and commit operations.
Because XDR and RPC are bypassed, these operations complete faster.

The LOCALIO auxiliary protocol's implementation, which uses the same
connection as NFS traffic, follows the pattern established by the NFS
ACL protocol extension.

The LOCALIO auxiliary protocol is needed to allow robust discovery of
clients local to their servers. A private implementation that preceded
this LOCALIO protocol attempted a fragile match of the sockaddr network
address against all local network interfaces. Unlike the LOCALIO
protocol, that sockaddr-based matching didn't handle use of iptables or
containers.

The robust handshake between local client and server is just the
beginning. The ultimate use case this locality makes possible is that
the client can open files and issue reads, writes and commits directly
to the server without having to go over the network. The requirement is
to perform these loopback NFS operations as efficiently as possible.
This is particularly useful for container use cases (e.g. Kubernetes)
where it is possible to run an IO job local to the server.

The performance advantage realized from LOCALIO's ability to bypass
XDR and RPC for reads, writes and commits can be extreme, e.g.:

  fio for 20 secs with directio, qd of 8, 16 libaio threads:
  - With LOCALIO:
    4K read:    IOPS=979k,  BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec)
    4K write:   IOPS=165k,  BW=646MiB/s (678MB/s)(12.6GiB/20002msec)
    128K read:  IOPS=402k,  BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec)
    128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec)

  - Without LOCALIO:
    4K read:    IOPS=79.2k, BW=309MiB/s (324MB/s)(6188MiB/20003msec)
    4K write:   IOPS=59.8k, BW=234MiB/s (245MB/s)(4671MiB/20002msec)
    128K read:  IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec)
    128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec)

  fio for 20 secs with directio, qd of 8, 1 libaio thread:
  - With LOCALIO:
    4K read:    IOPS=230k,  BW=898MiB/s (941MB/s)(17.5GiB/20001msec)
    4K write:   IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec)
    128K read:  IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec)
    128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec)

  - Without LOCALIO:
    4K read:    IOPS=77.1k, BW=301MiB/s (316MB/s)(6022MiB/20001msec)
    4K write:   IOPS=32.8k, BW=128MiB/s (135MB/s)(2566MiB/20001msec)
    128K read:  IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec)
    128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec)

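To put a number on "extreme", the 16-thread IOPS figures above can be
turned into speedup ratios. The following is a minimal userspace C
sketch, not kernel code: the struct and function names are illustrative
only, and the values are copied from the fio output above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * One row of the fio comparison. IOPS are in thousands and are copied
 * from the 16-thread results above. All names here are illustrative.
 */
struct fio_row {
	const char *label;
	double with_localio_kiops;
	double without_localio_kiops;
};

const struct fio_row rows_16_threads[] = {
	{ "4K read",    979.0, 79.2 },
	{ "4K write",   165.0, 59.8 },
	{ "128K read",  402.0, 33.9 },
	{ "128K write",  11.5, 11.5 },
};

const size_t rows_16_threads_len =
	sizeof(rows_16_threads) / sizeof(rows_16_threads[0]);

/* Ratio of LOCALIO IOPS to non-LOCALIO IOPS for one workload. */
double localio_speedup(const struct fio_row *row)
{
	assert(row != NULL && row->without_localio_kiops > 0.0);
	return row->with_localio_kiops / row->without_localio_kiops;
}
```

With these inputs the ratios come out to roughly 12.4x for 4K reads,
2.8x for 4K writes, 11.9x for 128K reads and 1.0x for 128K writes, so
in this run reads benefit far more than writes and 128K write
throughput is unchanged.
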
RPC
===

The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
RPC method that allows the Linux NFS client to verify the local Linux
NFS server can see the nonce (single-use UUID) the client generated and
made available in nfs_common. This protocol isn't part of an IETF
standard, nor does it need to be, considering it is a Linux-to-Linux
auxiliary RPC protocol that amounts to an implementation detail.

The UUID_IS_LOCAL method encodes the client-generated uuid_t in terms of
the fixed UUID_SIZE (16 bytes). The fixed-size opaque encode and decode
XDR methods are used instead of the less efficient variable-sized
methods.

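The difference between the two XDR opaque encodings can be sketched in
userspace C as follows. This is an illustration of the wire layouts
defined by XDR (RFC 4506), not the kernel's encoder: the function names
are hypothetical and a plain byte buffer stands in for an xdr_stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UUID_SIZE 16

/* XDR pads opaque data to a 4-byte multiple; 16 bytes needs no padding. */
_Static_assert(UUID_SIZE % 4 == 0, "UUID_SIZE must be 4-byte aligned");

/*
 * Fixed-size opaque: both sides already know the length, so only the
 * bytes themselves are written. Returns bytes written, or 0 if the
 * buffer is too small.
 */
size_t sketch_encode_uuid_fixed(uint8_t *buf, size_t buflen,
				const uint8_t uuid[UUID_SIZE])
{
	assert(buf != NULL && uuid != NULL);
	if (buflen < UUID_SIZE)
		return 0;
	memcpy(buf, uuid, UUID_SIZE);
	return UUID_SIZE;
}

/*
 * Variable-size opaque: a 4-byte big-endian length word precedes the
 * bytes, so the receiver has to decode and validate that length first.
 */
size_t sketch_encode_uuid_variable(uint8_t *buf, size_t buflen,
				   const uint8_t uuid[UUID_SIZE])
{
	assert(buf != NULL && uuid != NULL);
	if (buflen < 4 + UUID_SIZE)
		return 0;
	buf[0] = 0;
	buf[1] = 0;
	buf[2] = 0;
	buf[3] = UUID_SIZE;
	memcpy(buf + 4, uuid, UUID_SIZE);
	return 4 + UUID_SIZE;
}
```

In this sketch the fixed-size form puts 16 bytes on the wire and the
variable-size form puts 20, and the decoder for the fixed form has no
length field to parse or bounds-check.
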
The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ):

  Linux Kernel Organization       400122  nfslocalio

The LOCALIO protocol spec in rpcgen syntax is:

  /* raw RFC 9562 UUID */
  #define UUID_SIZE 16
  typedef u8 uuid_t<UUID_SIZE>;

  program NFS_LOCALIO_PROGRAM {
      version LOCALIO_V1 {
          void
              NULL(void) = 0;

          void
              UUID_IS_LOCAL(uuid_t) = 1;
      } = 1;
  } = 400122;

LOCALIO uses the same transport connection as NFS traffic. As such,
LOCALIO is not registered with rpcbind.

NFS Common and Client/Server Handshake
======================================

fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client
to generate a nonce (single-use UUID) and an associated short-lived
nfs_uuid_t struct, and to register it with nfs_common for subsequent
lookup and verification by the NFS server. If the nonce matches, the NFS
server populates members in the nfs_uuid_t struct. The NFS client then
uses nfs_common to transfer the nfs_uuid_t from nfs_common's uuids_list
to the nn->nfsd_serv clients_list. See:
fs/nfs/localio.c:nfs_local_probe()

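The register, lookup and retire steps of the handshake described above
can be modelled in userspace C roughly as follows. This is a simplified
sketch under stated assumptions, not the nfs_common implementation:
every name in it is hypothetical, a fixed array stands in for the
kernel's lists, an int stands in for the server state the real
nfs_uuid_t points at, and there is no locking or RCU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UUID_SIZE   16
#define MAX_PENDING 8

/* Hypothetical stand-in for a registered nonce and its server fields. */
struct sketch_nonce {
	uint8_t uuid[UUID_SIZE];
	bool in_use;
	bool matched;   /* set by the server side on a successful lookup */
	int server_ns;  /* stand-in for server state such as 'net' */
};

/* Shared registry visible to both the client and server sides. */
static struct sketch_nonce pending[MAX_PENDING];

/* Client side: publish a freshly generated nonce before sending the RPC. */
struct sketch_nonce *sketch_client_register(const uint8_t uuid[UUID_SIZE])
{
	assert(uuid != NULL);
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (!pending[i].in_use) {
			memcpy(pending[i].uuid, uuid, UUID_SIZE);
			pending[i].in_use = true;
			pending[i].matched = false;
			pending[i].server_ns = -1;
			return &pending[i];
		}
	}
	return NULL; /* registry full */
}

/*
 * Server side: handler for a UUID_IS_LOCAL-style request. Only a server
 * sharing the registry (same kernel) can find the nonce; on a match it
 * fills in the server-owned fields.
 */
bool sketch_server_uuid_is_local(const uint8_t uuid[UUID_SIZE], int server_ns)
{
	assert(uuid != NULL);
	for (size_t i = 0; i < MAX_PENDING; i++) {
		struct sketch_nonce *n = &pending[i];

		if (n->in_use && !n->matched &&
		    memcmp(n->uuid, uuid, UUID_SIZE) == 0) {
			n->matched = true;
			n->server_ns = server_ns;
			return true;
		}
	}
	return false;
}

/* Client side: after the RPC completes, retire the single-use nonce. */
bool sketch_client_finish(struct sketch_nonce *n)
{
	bool is_local;

	assert(n != NULL);
	is_local = n->matched;
	n->in_use = false;
	return is_local;
}
```

A remote server never sees the registry, so its lookup fails and the
client keeps using the ordinary NFS wire protocol. Because the entry is
retired after one handshake, replaying the same UUID later does not
match.
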
nfs_common's nfs_uuids list is the basis for LOCALIO enablement. As such,
it has members that point to nfsd memory for direct use by the client
(e.g. 'net' is the server's network namespace; through it the client can
access nn->nfsd_serv with proper RCU read access). It is this client
and server synchronization that enables advanced usage and lifetime of
objects to span from the host kernel's nfsd to per-container knfsd
instances that are connected to NFS clients running on the same local
host.

NFS Client issues IO instead of Server
======================================

Because LOCALIO is focused on protocol bypass to achieve improved IO
performance, alternatives to the traditional NFS wire protocol (SUNRPC
with XDR) must be provided to access the backing filesystem.

See fs/nfs/localio.c:nfs_local_open_fh() and
fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
focused use of select NFS server objects to allow a client local to a
server to open a file pointer without needing to go over the network.

The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
both the associated nfsd network namespace and nn->nfsd_serv under RCU
protection. If nfsd_open_local_fh() finds that the client no longer sees
valid nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO
to nfs_local_open_fh() and the client will try to reestablish the
LOCALIO resources needed by calling nfs_local_probe() again. This
recovery is needed if/when an nfsd instance running in a container
reboots while a LOCALIO client is connected to it.

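The -ENXIO recovery path described above can be illustrated with a
small userspace C model. It is a sketch under stated assumptions, not
the kernel code: the structs and functions are hypothetical stand-ins
for nfs_local_probe(), nfsd_open_local_fh() and nfs_local_open_fh(), a
generation counter stands in for the validity of the server's struct
net and nn->nfsd_serv, and the failed open is simply reported to the
caller after re-probing (how the real client handles the failed request
itself is not covered here).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical server: 'generation' changes whenever nfsd restarts. */
struct sketch_server {
	bool up;
	int generation;
};

/* Hypothetical client view of the server captured at probe time. */
struct sketch_client {
	bool localio_enabled;
	int seen_generation;
	int probes;
};

/* Stand-in for nfs_local_probe(): redo the handshake, refresh state. */
bool sketch_probe(struct sketch_client *c, const struct sketch_server *s)
{
	assert(c != NULL && s != NULL);
	c->probes++;
	c->localio_enabled = s->up;
	if (s->up)
		c->seen_generation = s->generation;
	return s->up;
}

/* Stand-in for nfsd_open_local_fh(): reject a stale client view. */
int sketch_server_open(const struct sketch_client *c,
		       const struct sketch_server *s)
{
	assert(c != NULL && s != NULL);
	if (!s->up || c->seen_generation != s->generation)
		return -ENXIO;
	return 0;
}

/*
 * Stand-in for nfs_local_open_fh(): on -ENXIO, re-probe so that later
 * opens can use LOCALIO again, and report the failure for this one.
 */
int sketch_client_open(struct sketch_client *c, const struct sketch_server *s)
{
	int ret;

	assert(c != NULL && s != NULL);
	if (!c->localio_enabled)
		return -ENXIO;
	ret = sketch_server_open(c, s);
	if (ret == -ENXIO)
		sketch_probe(c, s);
	return ret;
}
```

In this model, bumping the server's generation (a container restart)
makes the next open fail with -ENXIO and triggers one extra probe, after
which opens succeed again.
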
Once the client has an open nfsd_file pointer it will issue reads,
writes and commits directly to the underlying local filesystem (normally
done by the NFS server). As such, for these operations, the NFS client
is issuing IO to the underlying local filesystem that it is sharing with
the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and
fs/nfs/localio.c:nfs_local_commit().

Security
========

LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka
AUTH_SYS) is used.

Care is taken to ensure the same NFS security mechanisms are used
(authentication, etc.) regardless of whether LOCALIO or regular NFS
access is used. The auth_domain established as part of the traditional
NFS client access to the NFS server is also used for LOCALIO.

Relative to containers, LOCALIO gives the client access to the network
namespace the server has. This is required to allow the client to access
the server's per-namespace nfsd_net struct. With traditional NFS, the
client is afforded this same level of access (albeit in terms of the NFS
protocol via SUNRPC). No other namespaces (user, mount, etc.) have been
altered or purposely extended from the server to the client.

Testing
=======

The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write
and commit access have proven stable against various test scenarios:

- Client and server both on the same host.

- All permutations of client and server support enablement for both
  local and remote client and server.

- Testing against NFS storage products that don't support the LOCALIO
  protocol was also performed.

- Client on host, server within a container (for both v3 and v4.2).
  The container testing was in terms of podman-managed containers and
  includes a successful container stop/restart scenario.

- Formalizing these test scenarios in terms of existing test
  infrastructure is ongoing. Initial regular coverage is provided in
  terms of ktest running xfstests against a LOCALIO-enabled NFS loopback
  mount configuration, and includes lockdep and KASAN coverage, see:
  https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next
  https://github.com/koverstreet/ktest

- Various kdevops testing (in terms of "Chuck's BuildBot") has been
  performed to regularly verify the LOCALIO changes haven't caused any
  regressions to non-LOCALIO NFS use cases.

- All of Hammerspace's various sanity tests pass with LOCALIO enabled
  (this includes numerous pNFS and flexfiles tests).
