Commit b89f7bc

Merge pull request #8778 from shirady/chunked-content-decoder-docs
Docs | Add `ChunkedContentDecoder` Documentation with State Machine Diagram
2 parents 8668b3c + ea9d1e8 commit b89f7bc

2 files changed

+203
-0
lines changed

docs/design/ChunkedContentDecoder.md

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
1+
# Chunked Content Decoder
2+
3+
HTTP Chunked Encoding is a streaming data transfer mechanism that breaks down the data stream into a series of non-overlapping segments called "chunks".
4+
Each chunk is sent with its own size header, which tells the receiver how much data to expect in that chunk.
5+
To indicate that the data is being sent in chunks a header of `Transfer-Encoding: chunked` is included.
6+
Source: [HTTP Chunked Encoding](https://www.ioriver.io/terms/http-chunked-encoding)
7+
8+
The `ChunkedContentDecoder` class is a [Transform stream](https://nodejs.org/api/stream.html#class-streamtransform) that receives a chunk-encoded stream and outputs only the data. It strips everything else from the output: the chunk sizes, the chunk headers including the optional `chunk-signature` extension, the trailers, and the CR NL delimiters.
9+
10+
## Basic Encoding Structure:
11+
### Chunks (Without Optional Extension and Trailers)
12+
Each chunk consists of two parts:
13+
1. a header
14+
2. the actual data.
15+
The header is a hexadecimal number that indicates the size of the chunk in bytes, followed by a carriage return (CR) and a line feed (NL).
16+
The data that follows this header is exactly the size specified in the header.
17+
After the data, another carriage return and line feed signify the end of the chunk.
18+
Source: [HTTP Chunked Encoding](https://www.ioriver.io/terms/http-chunked-encoding)
19+
20+
```
21+
<size of data in hex>\r\n
22+
<data>\r\n
23+
...
24+
the end of the content (last chunk):
25+
0\r\n
26+
\r\n
27+
```
28+
29+
Example:
30+
```
31+
7\r\n
32+
Mozilla\r\n
33+
11\r\n
34+
Developer Network\r\n
35+
0\r\n
36+
\r\n
37+
```
38+
Source of the example: [Mozilla Transfer-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding)
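The framing above can be sketched in a few lines of JavaScript. This is only an illustration of the basic structure (no extensions, no trailers); `encodeChunked` is a hypothetical helper and is not part of the NooBaa code.

```javascript
// Hypothetical helper (an assumption for illustration, not NooBaa code):
// frame each piece as "<hex size>\r\n<data>\r\n", then append the
// zero-size last chunk and the final empty line.
function encodeChunked(pieces) {
    const parts = [];
    for (const piece of pieces) {
        const data = Buffer.from(piece);
        // a zero size is reserved for the last chunk, so skip empty pieces
        if (data.length === 0) continue;
        parts.push(Buffer.from(data.length.toString(16) + '\r\n'), data, Buffer.from('\r\n'));
    }
    parts.push(Buffer.from('0\r\n\r\n'));
    return Buffer.concat(parts);
}

const encoded = encodeChunked(['Mozilla', 'Developer Network']);
console.log(JSON.stringify(encoded.toString()));
// "7\r\nMozilla\r\n11\r\nDeveloper Network\r\n0\r\n\r\n"
```

Note that the size is written in hex, so the 17 bytes of "Developer Network" appear as `11`.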
39+
40+
41+
### Chunks With Optional Extension and Trailers
42+
(combined with example)
43+
```
44+
1fff;chunk-signature=1a2b\r\n - chunk header (optional extension)
45+
<1fff bytes of data>\r\n - chunk data
46+
2fff;chunk-signature=1a2b\r\n - chunk header (optional extension)
47+
<2fff bytes of data>\r\n - chunk data
48+
0\r\n - last chunk
49+
<trailer>\r\n - optional trailer
50+
<trailer>\r\n - optional trailer
51+
\r\n - end of content
52+
```
53+
Notes:
54+
- `1fff` and `2fff` are examples of the size in hex
55+
- trailer example: `x-amz-checksum-crc32:uOMGCw==\r\n` (the key names the checksum algorithm, here `crc32`; the value is base64-encoded; `\r\n` is the CR NL that ends the trailer)
56+
57+
More info in [Wikipedia](https://en.wikipedia.org/wiki/Chunked_transfer_encoding)
58+
And also in [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1)
59+
60+
### Chunk Extension (chunk-signature)
61+
In HTTP there is an option for adding chunk extensions, immediately following the chunk size.
62+
```
63+
chunk-ext = *( ";" chunk-ext-name [ "=" chunk-ext-val ] )
64+
```
65+
Source: [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1.1)
66+
67+
You can see in AWS SigV4 that when AWS uses the chunk extension as a chunk signature it uses it with the following structure:
68+
```
69+
string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n
70+
```
71+
Source: [AWS Documentation Defining the Chunk Body](https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html)
72+
73+
The chunk extension appears in the following lines of the example encoding structure:
74+
```
75+
2fff;chunk-signature=1a2b\r\n - chunk header (optional extension)
76+
<2fff bytes of data>\r\n - chunk data
77+
```
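As a rough sketch of how such a header line could be split into its size and its extensions, consider the hypothetical helper below. `parseChunkHeader` is an assumption made for illustration; it takes the header line without the trailing CR NL and does not handle quoted extension values, which the RFC grammar permits.

```javascript
// Hypothetical helper (an assumption for illustration, not NooBaa code).
// Input: a chunk header line without the trailing CR NL,
// e.g. "2fff;chunk-signature=1a2b".
function parseChunkHeader(line) {
    const [size_hex, ...ext_items] = line.split(';');
    if (!/^[0-9a-fA-F]+$/.test(size_hex.trim())) {
        throw new Error(`invalid chunk size: ${size_hex}`);
    }
    const size = parseInt(size_hex, 16);
    const extensions = {};
    for (const item of ext_items) {
        // each extension is "name" or "name=value"
        const eq = item.indexOf('=');
        const name = eq === -1 ? item : item.slice(0, eq);
        const value = eq === -1 ? '' : item.slice(eq + 1);
        extensions[name.trim()] = value.trim();
    }
    return { size, extensions };
}

const header = parseChunkHeader('2fff;chunk-signature=1a2b');
console.log(header.size, header.extensions['chunk-signature']); // 12287 1a2b
```

A decoder that only needs the data can ignore everything after the first `;` and use just the size.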
78+
79+
80+
### Trailers
81+
In HTTP there is an option for a trailer. A trailer allows the sender to include additional fields at the end of a chunked message in order to supply metadata that might be dynamically generated while the message body is sent.
82+
Source: [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1.2)
83+
84+
85+
The trailer name is declared in a request header, and the trailer itself (a key-value pair) is sent after the chunked body. An HTTP body can carry zero or more trailers.
86+
Source: [Stack Overflow](https://stackoverflow.com/questions/5590791/http-chunked-encoding-need-an-example-of-trailer-mentioned-in-spec)
87+
88+
Amazon S3 supports chunked uploads that use `aws-chunked` content encoding for `PutObject` and `UploadPart` requests with trailing checksums.
89+
90+
The `x-amz-trailer` request header declares the name of the trailing header in the request. For a trailing checksum, the `x-amz-trailer` value starts with the `x-amz-checksum-` prefix and ends with the algorithm name. The following `x-amz-trailer` values are currently supported:
91+
- x-amz-checksum-crc32
92+
- x-amz-checksum-crc32c
93+
- x-amz-checksum-crc64nvme
94+
- x-amz-checksum-sha1
95+
- x-amz-checksum-sha256
96+
97+
Source: [AWS Trailing Checksum Documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#trailing-checksums)
98+
99+
Trailers are the following lines in the example encoding structure:
100+
```
101+
<trailer>\r\n - optional trailer
102+
<trailer>\r\n - optional trailer
103+
```
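A trailer line has the same `name:value` shape as an HTTP header, so it can be split on the first colon. The sketch below shows one way to do that; `parseTrailer` is a hypothetical helper used only for illustration and takes the line without the trailing CR NL.

```javascript
// Hypothetical helper (an assumption for illustration, not NooBaa code).
// Input: one trailer line without the trailing CR NL.
function parseTrailer(line) {
    // split on the first colon only, the value may contain more colons
    const colon = line.indexOf(':');
    if (colon <= 0) throw new Error(`malformed trailer: ${line}`);
    return {
        name: line.slice(0, colon).trim().toLowerCase(),
        value: line.slice(colon + 1).trim(),
    };
}

const trailer = parseTrailer('x-amz-checksum-crc32:uOMGCw==');
console.log(trailer.name, trailer.value); // x-amz-checksum-crc32 uOMGCw==
```

The base64 padding characters (`=`) are part of the value, which is why the line is not split on `=`.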
104+
105+
## State Machine
106+
The chunked content arrives in buffers. Each buffer is parsed in a **loop**, so that a single buffer can hold multiple chunks and a buffer can end in the middle of a chunk.
107+
108+
The `ChunkedContentDecoder` uses a state machine:
109+
Each transition depends on the current state and the buffer content.
110+
The state machine has the following states:
111+
1. **STATE_READ_CHUNK_HEADER** - read the chunk header until CR and parse it.
112+
2. **STATE_WAIT_NL_HEADER** - wait for NL after the chunk header.
113+
3. **STATE_SEND_DATA** - send chunk data to the stream until chunk size bytes sent.
114+
4. **STATE_WAIT_CR_DATA** - wait for CR after the chunk data.
115+
5. **STATE_WAIT_NL_DATA** - wait for NL after the chunk data.
116+
6. **STATE_READ_TRAILER** - read optional trailer until CR and save it.
117+
7. **STATE_WAIT_NL_TRAILER** - wait for NL after non empty trailer.
118+
8. **STATE_WAIT_NL_END** - wait for NL after the last empty trailer.
119+
9. **STATE_CONTENT_END** - the stream is done.
120+
10. **STATE_ERROR** - an error occurred.
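To make the states more concrete, here is a minimal, simplified sketch of a parser built on the same state names. It is not the code in `src/util/chunked_content_decoder.js`: the class name `SketchChunkDecoder` and its fields are hypothetical, it collects output in arrays instead of pushing to a stream, and it always consumes the NL after a chunk header in `STATE_WAIT_NL_HEADER` (including after the zero-size last chunk), which is an assumption of the sketch.

```javascript
// Hypothetical, simplified sketch; NOT the NooBaa implementation.
const CR = 0x0d;
const NL = 0x0a;

class SketchChunkDecoder {
    constructor() {
        this.state = 'STATE_READ_CHUNK_HEADER';
        this.header = '';       // chunk header text collected so far
        this.trailer = '';      // current trailer line collected so far
        this.remaining = 0;     // data bytes still expected in the current chunk
        this.last_chunk = false;
        this.data = [];         // decoded data pieces (Buffers)
        this.trailers = [];     // saved trailer lines
    }

    // Parse one incoming buffer; called once per buffer as buffers arrive.
    parse(buf) {
        let i = 0;
        while (i < buf.length) {
            const byte = buf[i];
            switch (this.state) {
                case 'STATE_READ_CHUNK_HEADER':
                    if (byte === CR) {
                        // the size is the hex number before any ';' extension
                        const size = parseInt(this.header.split(';')[0], 16);
                        if (!(size >= 0)) return this.fail();
                        this.remaining = size;
                        this.last_chunk = size === 0;
                        this.header = '';
                        this.state = 'STATE_WAIT_NL_HEADER';
                    } else {
                        this.header += String.fromCharCode(byte);
                    }
                    i += 1;
                    break;
                case 'STATE_WAIT_NL_HEADER':
                    if (byte !== NL) return this.fail();
                    this.state = this.last_chunk ? 'STATE_READ_TRAILER' : 'STATE_SEND_DATA';
                    i += 1;
                    break;
                case 'STATE_SEND_DATA': {
                    // take as many data bytes as this buffer can supply
                    const n = Math.min(this.remaining, buf.length - i);
                    this.data.push(buf.subarray(i, i + n));
                    this.remaining -= n;
                    i += n;
                    if (this.remaining === 0) this.state = 'STATE_WAIT_CR_DATA';
                    break;
                }
                case 'STATE_WAIT_CR_DATA':
                    if (byte !== CR) return this.fail();
                    this.state = 'STATE_WAIT_NL_DATA';
                    i += 1;
                    break;
                case 'STATE_WAIT_NL_DATA':
                    if (byte !== NL) return this.fail();
                    this.state = 'STATE_READ_CHUNK_HEADER';
                    i += 1;
                    break;
                case 'STATE_READ_TRAILER':
                    if (byte === CR) {
                        if (this.trailer === '') {
                            this.state = 'STATE_WAIT_NL_END';
                        } else {
                            this.trailers.push(this.trailer);
                            this.trailer = '';
                            this.state = 'STATE_WAIT_NL_TRAILER';
                        }
                    } else {
                        this.trailer += String.fromCharCode(byte);
                    }
                    i += 1;
                    break;
                case 'STATE_WAIT_NL_TRAILER':
                    if (byte !== NL) return this.fail();
                    this.state = 'STATE_READ_TRAILER';
                    i += 1;
                    break;
                case 'STATE_WAIT_NL_END':
                    if (byte !== NL) return this.fail();
                    this.state = 'STATE_CONTENT_END';
                    i += 1;
                    break;
                default:
                    // STATE_CONTENT_END or STATE_ERROR: no more input is accepted
                    return this.fail();
            }
        }
        return true;
    }

    fail() {
        this.state = 'STATE_ERROR';
        return false;
    }
}
```

Because all progress is kept in the object's fields, a call to `parse` can stop anywhere (mid-header, mid-data, between CR and NL) and the next call continues from the same state.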
121+
122+
The following diagram describes the changes of the state machine:
123+
![State Machine Diagram](https://github.com/user-attachments/assets/727faf34-887a-4ad8-814c-134585618d8b)
124+
125+
#### Dry Run Example
126+
An updated AWS SDK client sends a `PutObject` request with the body: "body for example".
127+
On the server side we get the following buffers (showing as strings for readability):
128+
1. "10\r\n" <- 10 hex in decimal is 16 (this is the length of the data "body for example")
129+
2. "body for example" <- the data (we want to save as content and pipe it)
130+
3. "\r\n0\r\n" <- CR NL of the data and "0\r\n" as completion chunk (end of data - the final object chunk)
131+
4. "x-amz-checksum-crc32:uOMGCw==\r\n"
132+
5. "\r\n"
133+
134+
In this example the parser is called 5 times. The whole stream has 1 data chunk with its header, 1 trailer, and no chunk-signature.
135+
In this example each part of the encoding arrives in its own buffer, but this is not guaranteed: a chunk might be split across several buffers.
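The numbers in this dry run can be checked with a few lines of JavaScript. The sketch below only inspects the bytes on the wire and does not use the decoder; `splitEvery` is a hypothetical helper that shows the same bytes arriving with different buffer boundaries.

```javascript
// The five buffers of the dry run, shown as strings.
const buffers = ['10\r\n', 'body for example', '\r\n0\r\n', 'x-amz-checksum-crc32:uOMGCw==\r\n', '\r\n'];

// The chunk header "10" is hex, and matches the byte length of the data.
const declared_size = parseInt('10', 16);
const actual_size = Buffer.byteLength('body for example');
console.log(declared_size, actual_size); // 16 16

// Hypothetical helper (an assumption for illustration): cut the same wire
// bytes into fixed-size pieces, ignoring chunk boundaries.
function splitEvery(str, n) {
    const pieces = [];
    for (let i = 0; i < str.length; i += n) pieces.push(str.slice(i, i + n));
    return pieces;
}

const wire = buffers.join('');
const resplit = splitEvery(wire, 7);
// Same content, different buffer boundaries: the parser must handle both.
console.log(resplit.join('') === wire); // true
```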
136+
137+
### Notice
138+
Currently, we haven’t implemented checksum validation on the server side. So if the request contains `x-amz-trailer: x-amz-checksum-crc32` and the trailing chunk has the header name `x-amz-checksum-sha1` (instead of `x-amz-checksum-crc32`), the request would not fail.
139+
Example source: [AWS Trailer Chunks Documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#trailing-checksums)
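For illustration only, a name check of the kind that is not performed today might look like the sketch below. `trailerNameMatches` is a hypothetical function, and treating the `x-amz-trailer` value as a comma-separated list of names is an assumption of this sketch. It compares names only and does not verify the checksum value.

```javascript
// Hypothetical check (an assumption for illustration, not NooBaa code):
// does the name of a received trailer line appear in the value of the
// x-amz-trailer request header?
function trailerNameMatches(x_amz_trailer_value, trailer_line) {
    const declared_names = x_amz_trailer_value
        .split(',')
        .map(name => name.trim().toLowerCase());
    const received_name = trailer_line.split(':')[0].trim().toLowerCase();
    return declared_names.includes(received_name);
}

console.log(trailerNameMatches('x-amz-checksum-crc32', 'x-amz-checksum-crc32:uOMGCw==')); // true
console.log(trailerNameMatches('x-amz-checksum-crc32', 'x-amz-checksum-sha1:abc=')); // false
```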
140+
141+
## Policy Change in AWS
142+
In the past AWS supported data integrity checks as an opt-in feature, and later made them the default.
143+
Source: [Data Integrity Protections for Amazon S3](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html)
144+
145+
Useful links to the announcements posted for AWS clients:
146+
1. [AWS CLI](https://github.com/aws/aws-cli/issues/9214)
147+
2. [AWS SDK JS V3](https://github.com/aws/aws-sdk-js-v3/issues/6810)
148+
3. [AWS SDK GO V2](https://github.com/aws/aws-sdk-go-v2/discussions/2960)
149+
150+
They also posted a full table of the SDKs and versions in which this change was implemented: [Compatibility with AWS SDKs](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html)
151+
It was also announced in [AWS blog](https://aws.amazon.com/blogs/aws/introducing-default-data-integrity-protections-for-new-objects-in-amazon-s3/)
152+
153+
#### Workaround
154+
In the past the mentioned state machine did not include states for trailers. A request with a trailer in its body would reach `STATE_ERROR`, because the previous state machine expected the body to end with:
155+
```
156+
0\r\n
157+
\r\n
158+
```
159+
When using an updated AWS SDK client directly against a NooBaa version that predates the mentioned change, please run the AWS client (CLI / SDK) with the following environment variables:
160+
```
161+
AWS_RESPONSE_CHECKSUM_VALIDATION=WHEN_REQUIRED
162+
AWS_REQUEST_CHECKSUM_CALCULATION=WHEN_REQUIRED
163+
```
164+
Source: [Data Integrity Protections for Amazon S3](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html)
165+
166+
## Code References:
167+
### Files:
168+
- `src/util/chunked_content_decoder.js` - the class `ChunkedContentDecoder`
169+
- `src/test/unit_tests/jest_tests/test_chunked_content_decoder.test.js` - unit test of the class, please run with `npx jest test_chunked_content_decoder.test.js`
170+
- `ChunkedContentDecoder_State_Machine.md` - the diagram source (kept as a file rather than only as a link, in case we need to modify it).
171+
172+
### Related PRs
173+
1. https://github.com/noobaa/noobaa-core/pull/5397 which created the stream transformer `ChunkedContentDecoder` (the original state machine was built with the states that handle the optional extension of aws-chunk)
174+
2. https://github.com/noobaa/noobaa-core/pull/8753 - mainly added the trailers to the `ChunkedContentDecoder` state machine.
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
```mermaid
2+
---
3+
config:
4+
theme: dark
5+
look: classic
6+
layout: elk
7+
---
8+
flowchart TD
9+
A -- not CR, append string --> A["STATE_READ_CHUNK_HEADER <br> read the chunk header until CR and parse it"]
10+
A -. CR, header parse problem .-> J["STATE_ERROR <br> an error occurred"]
11+
A -- "CR, chunk_size!=0" --> B["STATE_WAIT_NL_HEADER <br> wait for NL after the chunk header"]
12+
A -- "CR, chunk_size==0" --> F["STATE_READ_TRAILER <br> read optional trailer until CR and save it"]
13+
C -- data --> C["STATE_SEND_DATA <br> send chunk data to the stream until chunk size bytes sent"]
14+
C -- done size bytes --> D["STATE_WAIT_CR_DATA <br> wait for CR after the chunk data"]
15+
F -- not CR, append string --> F
16+
F -- CR, keep trailer --> G["STATE_WAIT_NL_TRAILER <br> wait for NL after non empty trailer"]
17+
F -- CR, empty trailer --> H["STATE_WAIT_NL_END <br> wait for NL after the last empty trailer"]
18+
D -- CR --> E["STATE_WAIT_NL_DATA <br> wait for NL after the chunk data"]
19+
H -- NL --> I["STATE_CONTENT_END <br> the stream is done"]
20+
B -- NL --> C
21+
E -- NL --> A
22+
G -- NL --> F
23+
B -. not NL .-> J
24+
E -. not NL .-> J
25+
G -. not NL .-> J
26+
H -. not NL .-> J
27+
D -. not CR .-> J
28+
I -. any .-> J
29+
```
