Commit 069569e
Link to issues #25 and #26 from design.md
1 parent c560637


design.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
 
 `light-speed-io` (or "LSIO", for short) will be a Rust library crate for loading and processing many chunks of files, as fast as the storage system will allow. **The aim is to allow users to load and process on the order of 1 million 4 kB chunks per second from a single local SSD**.
 
+**UPDATE (2024-01-23): THE DESIGN IS LIKELY TO CHANGE A LOT! SPECIFICALLY, MY PLAN IS TO SIMPLIFY LSIO SO THAT IT IS ONLY RESPONSIBLE FOR I/O (NOT FOR PROCESSING CHUNKS). USERS WILL STILL BE ABLE TO INTERLEAVE I/O WITH PROCESSING BECAUSE LSIO WILL RETURN A Rust `Stream` (AKA `AsyncIterator`) OF CHUNKS (see [this GitHub comment](https://github.com/JackKelly/light-speed-io/issues/25#issuecomment-1900536618)). AFTER BUILDING AN MVP OF LSIO, I PLAN TO BUILD A SECOND CRATE WHICH MAKES IT EASY TO APPLY AN ARBITRARY PROCESSING FUNCTION TO A STREAM, IN PARALLEL ACROSS CPU CORES. (See [this comment](https://github.com/JackKelly/light-speed-io/issues/26#issuecomment-1902182033))**
+
 Why aim for 1 million chunks per second? See [this spreadsheet](https://docs.google.com/spreadsheets/d/1DSNeU--dDlNSFyOrHhejXvTl9tEWvUAJYl-YavUdkmo/edit#gid=0) for an ML training use-case that comfortably requires hundreds of thousands of chunks per second.
 
 But, wait, isn't it inefficient to load tiny chunks? [Dask recommends chunk sizes between 100 MB and 1 GB](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)! But modern SSDs are turning the tables: they can sustain over 1 million input/output operations per second. And cloud storage looks like it is speeding up (for example, see the recent announcement of [Amazon S3 Express One Zone](https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/); and there may be [ways to get high performance from existing cloud storage buckets](https://github.com/JackKelly/light-speed-io/issues/10), too). One reason that Dask recommends large chunk sizes is that Dask's scheduler takes on the order of 1 ms to plan each task; LSIO's per-chunk overhead should be much lower (see below).
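
To make the stream-based design described in the UPDATE above more concrete, here is a minimal, hypothetical sketch of how a downstream consumer might interleave I/O with parallel per-chunk processing. None of these names are LSIO's actual API: `Chunk`, `get_chunks`, and the choice of `futures` + `tokio` are assumptions purely for illustration; the real design is being discussed in issues #25 and #26.

```rust
// Hypothetical sketch (not LSIO's real API): an I/O layer yields chunks as an
// async `Stream`, and a consumer applies a CPU-bound processing function to
// each chunk in parallel while further I/O continues.
// Assumed dependencies: `futures`, `tokio` (features "macros", "rt-multi-thread").
use futures::stream::{self, StreamExt};

/// Illustrative stand-in for a chunk of bytes returned by the I/O layer.
struct Chunk {
    offset: u64,
    data: Vec<u8>,
}

/// Stand-in for the I/O layer: yield chunks lazily as a `Stream` rather than
/// returning a fully materialised `Vec`, so processing can overlap with I/O.
fn get_chunks(n_chunks: u64) -> impl futures::Stream<Item = Chunk> {
    stream::iter((0..n_chunks).map(|i| Chunk {
        offset: i * 4096,
        data: vec![0u8; 4096],
    }))
}

#[tokio::main]
async fn main() {
    let n_workers = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4);

    // Push per-chunk processing onto blocking worker threads and keep up to
    // `n_workers` tasks in flight, so processing runs across CPU cores while
    // the stream keeps delivering chunks.
    let results: Vec<usize> = get_chunks(1_000)
        .map(|chunk| {
            tokio::task::spawn_blocking(move || {
                // Arbitrary "processing" placeholder: just measure the chunk.
                chunk.offset as usize + chunk.data.len()
            })
        })
        .buffer_unordered(n_workers)
        .map(|res| res.expect("processing task panicked"))
        .collect()
        .await;

    println!("processed {} chunks", results.len());
}
```

The key design point, per the UPDATE, is that the I/O crate only hands back a `Stream` of chunks; how to fan the processing of that stream out across CPU cores would be left to a separate crate.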
