
Commit 65965d9

Merge tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs (and fscache) updates from Gao Xiang:
 "After working on it on the mailing list for more than half a year, we finally got the 'erofs over fscache' feature into shape. Hopefully it can bring more possibilities to the communities.

  The story mainly started from a new project we called "RAFS v6" [1] for the Nydus image service almost a year ago, which enhances EROFS into a new form of one bootstrap (which includes metadata representing the whole fs tree) + several data-deduplicated content-addressable blobs (actually treated as multiple devices). Each blob can represent one container image layer, though not quite exactly, since new data may fully exist in previous blobs, in which case no new blob needs to be introduced.

  It is actually not a new idea (at least on my side it's much like a simplified casync [2] for now) and has many benefits over per-file blobs or some other existing approaches, since typically each RAFS v6 image only has dozens of device blobs instead of thousands of per-file blobs. It's easy to sign with user keys as a golden image, transfer untouched with minimal overhead over the network, keep conveniently in some type of storage, and run with (optional) runtime verification, without involving too many irrelevant features across the system beyond EROFS itself. At least that's our final goal and we keep working on it. There was also a good summary of this approach from the casync author [3].

  Regardless of further optimizations, this work was almost done in the previous Linux release cycles. In this round, we'd like to introduce on-demand load for EROFS with the fscache/cachefiles infrastructure, considering the following advantages:

   - Introduce a new file-based backend to EROFS. Although each image only contains dozens of blobs, on a densely-deployed runC host, for example, there could still be massive numbers of blobs on a machine, which is messy if each blob is treated as a device. In contrast, fscache and cachefiles are really great interfaces for making them work.

   - Introduce on-demand load to fscache and EROFS. Previously, fscache was mainly used to cache network-like filesystems; now it can support on-demand downloading for local fses too, with the exact local on-disk format. It has many advantages, which have been described in the latest patchset cover letter [4]. Most importantly, the cached data is still stored in the original local on-disk format, so it is still the data signed with private keys, only possibly partially available. Users can fully trust it while running. Later, users can also back up cachefiles easily to another machine.

   - A more reliable on-demand approach in principle. After the data is all available locally, the user daemon no longer needs to be online in some use cases, which helps daemon crash recovery (filesystems can still be in service) and hot upgrade (the user daemon can be upgraded more frequently as new features or protocols are introduced).

   - Other formats can also be converted to the EROFS filesystem format over the internet on the fly with the new on-demand load feature and then mounted. That is entirely possible as long as such archive format metadata can be fetched in advance, as with stargz.

  In addition, although our current target user is the Nydus image service [5], later it can be used for other use cases like on-demand system booting, etc. As for the fscache on-demand load feature itself, strictly speaking, it can be used by other local fses too. Later we could promote most of the code to the iomap infrastructure and also enhance it for the read-write case if other local fses are interested.

  Thanks to David Howells for taking so much time and patience on this these months, many thanks with great respect here again! Thanks to Jeffle for working on this feature and Xin Yin from Bytedance for the asynchronous I/O implementation, as well as Zichen Tian, Jia Zhu, and Yan Song for testing, much appreciated.

  We're also exploring more possibilities of fscache cache management over FSDAX for secure containers, and working on more improvements and useful features for fscache, cachefiles, and on-demand load. In addition to "erofs over fscache", NFS export and idmapped mounts were also completed in this cycle for container use cases as well.

  Summary:

   - Add erofs on-demand load support over fscache

   - Support NFS export for erofs

   - Support idmapped mounts for erofs

   - Don't prompt for risk any more when using big pcluster

   - Fix buffer copy overflow of ztailpacking feature

   - Several minor cleanups"

[1] https://lore.kernel.org/r/20210730194625.93856-1-hsiangkao@linux.alibaba.com
[2] https://github.com/systemd/casync
[3] http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html
[4] https://lore.kernel.org/r/20220509074028.74954-1-jefflexu@linux.alibaba.com
[5] https://github.com/dragonflyoss/image-service

* tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (29 commits)
  erofs: scan devices from device table
  erofs: change to use asynchronous io for fscache readpage/readahead
  erofs: add 'fsid' mount option
  erofs: implement fscache-based data readahead
  erofs: implement fscache-based data read for inline layout
  erofs: implement fscache-based data read for non-inline layout
  erofs: implement fscache-based metadata read
  erofs: register fscache context for extra data blobs
  erofs: register fscache context for primary data blob
  erofs: add erofs_fscache_read_folios() helper
  erofs: add anonymous inode caching metadata for data blobs
  erofs: add fscache context helper functions
  erofs: register fscache volume
  erofs: add fscache mode check helper
  erofs: make erofs_map_blocks() generally available
  cachefiles: document on-demand read mode
  cachefiles: add tracepoints for on-demand read mode
  cachefiles: enable on-demand read mode
  cachefiles: implement on-demand read
  cachefiles: notify the user daemon when withdrawing cookie
  ...
2 parents 850f603 + ba73ead commit 65965d9

File tree

24 files changed: +1997, -164 lines


Documentation/filesystems/caching/cachefiles.rst

Lines changed: 178 additions & 0 deletions
@@ -28,6 +28,7 @@ Cache on Already Mounted Filesystem
     (*) Debugging.

     (*) On-demand Read.
Overview
@@ -482,3 +483,180 @@ the control file. For example::
	echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug

will turn on all function entry debugging.

On-demand Read
==============

When working in its original mode, CacheFiles serves as a local cache for a
remote networking fs - while in on-demand read mode, CacheFiles can boost the
scenario where on-demand read semantics are needed, e.g. container image
distribution.

The essential difference between these two modes is seen when a cache miss
occurs: In the original mode, the netfs will fetch the data from the remote
server and then write it to the cache file; in on-demand read mode, fetching
the data and writing it into the cache is delegated to a user daemon.

``CONFIG_CACHEFILES_ONDEMAND`` should be enabled to support on-demand read mode.


Protocol Communication
----------------------

The on-demand read mode uses a simple protocol for communication between kernel
and user daemon. The protocol can be modeled as::

	kernel --[request]--> user daemon --[reply]--> kernel

CacheFiles will send requests to the user daemon when needed. The user daemon
should poll the devnode ('/dev/cachefiles') to check if there's a pending
request to be processed. A POLLIN event will be returned when there's a pending
request.

The user daemon then reads the devnode to fetch a request to process. It should
be noted that each read only gets one request. When it has finished processing
the request, the user daemon should write the reply to the devnode.

Each request starts with a message header of the form::

	struct cachefiles_msg {
		__u32 msg_id;
		__u32 opcode;
		__u32 len;
		__u32 object_id;
		__u8  data[];
	};

where:

 * ``msg_id`` is a unique ID identifying this request among all pending
   requests.

 * ``opcode`` indicates the type of this request.

 * ``object_id`` is a unique ID identifying the cache file operated on.

 * ``data`` indicates the payload of this request.

 * ``len`` indicates the whole length of this request, including the
   header and following type-specific payload.


Turning on On-demand Mode
-------------------------

An optional parameter becomes available to the "bind" command::

	bind [ondemand]

When the "bind" command is given no argument, it defaults to the original mode.
When it is given the "ondemand" argument, i.e. "bind ondemand", on-demand read
mode will be enabled.

The OPEN Request
----------------

When the netfs opens a cache file for the first time, a request with the
CACHEFILES_OP_OPEN opcode, a.k.a. an OPEN request, will be sent to the user
daemon. The payload format is of the form::

	struct cachefiles_open {
		__u32 volume_key_size;
		__u32 cookie_key_size;
		__u32 fd;
		__u32 flags;
		__u8  data[];
	};

where:

 * ``data`` contains the volume_key followed directly by the cookie_key.
   The volume key is a NUL-terminated string; the cookie key is binary
   data.

 * ``volume_key_size`` indicates the size of the volume key in bytes.

 * ``cookie_key_size`` indicates the size of the cookie key in bytes.

 * ``fd`` indicates an anonymous fd referring to the cache file, through
   which the user daemon can perform write/llseek file operations on the
   cache file.

The user daemon can use the given (volume_key, cookie_key) pair to distinguish
the requested cache file. With the given anonymous fd, the user daemon can
fetch the data and write it to the cache file in the background, even before
the kernel has triggered a cache miss.

Note that each cache file has a unique object_id, while it may have multiple
anonymous fds. The user daemon may duplicate anonymous fds from the initial
anonymous fd indicated by the @fd field through dup(). Thus each object_id can
be mapped to multiple anonymous fds, and the user daemon itself needs to
maintain the mapping.

When implementing a user daemon, please be careful of RLIMIT_NOFILE,
``/proc/sys/fs/nr_open`` and ``/proc/sys/fs/file-max``. Typically these needn't
be huge since they're related to the number of open device blobs rather than
the open files of each individual filesystem.

The user daemon should reply to the OPEN request by issuing a "copen" (complete
open) command on the devnode::

	copen <msg_id>,<cache_size>

where:

 * ``msg_id`` must match the msg_id field of the OPEN request.

 * When >= 0, ``cache_size`` indicates the size of the cache file;
   when < 0, ``cache_size`` indicates any error code encountered by the
   user daemon.


The CLOSE Request
-----------------

When a cookie is withdrawn, a CLOSE request (opcode CACHEFILES_OP_CLOSE) will
be sent to the user daemon. This tells the user daemon to close all anonymous
fds associated with the given object_id. The CLOSE request has no extra
payload, and shouldn't be replied to.


The READ Request
----------------

When a cache miss is encountered in on-demand read mode, CacheFiles will send a
READ request (opcode CACHEFILES_OP_READ) to the user daemon. This tells the
user daemon to fetch the contents of the requested file range. The payload is
of the form::

	struct cachefiles_read {
		__u64 off;
		__u64 len;
	};

where:

 * ``off`` indicates the starting offset of the requested file range.

 * ``len`` indicates the length of the requested file range.

When it receives a READ request, the user daemon should fetch the requested
data and write it to the cache file identified by object_id.

When it has finished processing the READ request, the user daemon should reply
by using the CACHEFILES_IOC_READ_COMPLETE ioctl on one of the anonymous fds
associated with the object_id given in the READ request. The ioctl is of the
form::

	ioctl(fd, CACHEFILES_IOC_READ_COMPLETE, msg_id);

where:

 * ``fd`` is one of the anonymous fds associated with the object_id
   given.

 * ``msg_id`` must match the msg_id field of the READ request.

fs/cachefiles/Kconfig

Lines changed: 12 additions & 0 deletions
@@ -26,3 +26,15 @@ config CACHEFILES_ERROR_INJECTION
	help
	  This permits error injection to be enabled in cachefiles whilst a
	  cache is in service.

config CACHEFILES_ONDEMAND
	bool "Support for on-demand read"
	depends on CACHEFILES
	default n
	help
	  This permits userspace to enable the cachefiles on-demand read mode.
	  In this mode, when a cache miss occurs, responsibility for fetching
	  the data lies with the cachefiles backend instead of with the netfs
	  and is delegated to userspace.

	  If unsure, say N.

fs/cachefiles/Makefile

Lines changed: 1 addition & 0 deletions
@@ -16,5 +16,6 @@ cachefiles-y := \
	xattr.o

cachefiles-$(CONFIG_CACHEFILES_ERROR_INJECTION) += error_inject.o
cachefiles-$(CONFIG_CACHEFILES_ONDEMAND) += ondemand.o

obj-$(CONFIG_CACHEFILES) := cachefiles.o
