| 1 | +# Chunked Content Decoder |
| 2 | + |
| 3 | +HTTP Chunked Encoding is a streaming data transfer mechanism that breaks down the data stream into a series of non-overlapping segments called "chunks". |
| 4 | +Each chunk is sent with its own size header, which tells the receiver how much data to expect in that chunk. |
| 5 | +To indicate that the data is sent in chunks, the message includes the header `Transfer-Encoding: chunked`. |
| 6 | +Source: [HTTP Chunked Encoding](https://www.ioriver.io/terms/http-chunked-encoding) |
| 7 | + |
| 8 | +The `ChunkedContentDecoder` class is a [Transform stream](https://nodejs.org/api/stream.html#class-streamtransform) that takes a chunk-encoded stream as input and emits only the data: it strips the chunk sizes, the optional chunk extensions (such as `chunk-signature`), the trailers, and the CR NL delimiters. |
| 9 | + |
| 10 | +## Basic Encoding Structure: |
| 11 | +### Chunks (Without Optional Extension and Trailers) |
| 12 | +Each chunk consists of two parts: |
| 13 | +1. a header |
| 14 | +2. the actual data. |
| 15 | +The header is a hexadecimal number that indicates the size of the chunk data in bytes, followed by a carriage return (CR) and a line feed (LF, referred to as NL in this document). |
| 16 | +The data that follows the header is exactly the size specified in the header. |
| 17 | +After the data, another CR NL pair marks the end of the chunk. |
| 18 | +Source: [HTTP Chunked Encoding](https://www.ioriver.io/terms/http-chunked-encoding) |
| 19 | + |
| 20 | +``` |
| 21 | +<size of the data in hex>\r\n |
| 22 | +<data>\r\n |
| 23 | +... |
| 24 | +the end of the content (a last chunk of size 0): |
| 25 | +0\r\n |
| 26 | +\r\n |
| 27 | +``` |
| 28 | + |
| 29 | +Example: |
| 30 | +``` |
| 31 | +7\r\n |
| 32 | +Mozilla\r\n |
| 33 | +11\r\n |
| 34 | +Developer Network\r\n |
| 35 | +0\r\n |
| 36 | +\r\n |
| 37 | +``` |
| 38 | +Source of the example: [Mozilla Transfer-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding) |
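
The framing above can be reproduced with a few lines of Node.js. This is a minimal sketch, not code from the NooBaa repository: `encodeChunked` is a hypothetical helper that only covers the basic structure (no extensions, no trailers).

```javascript
'use strict';

// Hypothetical helper (not part of NooBaa): encodes an array of strings or
// Buffers as a chunked body without chunk extensions or trailers.
function encodeChunked(parts) {
    const pieces = [];
    for (const part of parts) {
        const data = Buffer.from(part);
        // A size of 0 is reserved for the last chunk, so empty parts are skipped.
        if (data.length === 0) continue;
        pieces.push(Buffer.from(data.length.toString(16) + '\r\n'), data, Buffer.from('\r\n'));
    }
    // Last chunk ("0\r\n") followed by the empty line that ends the content.
    pieces.push(Buffer.from('0\r\n\r\n'));
    return Buffer.concat(pieces);
}

// Rebuilds the Mozilla example; JSON.stringify makes the CR NL bytes visible.
const encoded = encodeChunked(['Mozilla', 'Developer Network']);
console.log(JSON.stringify(encoded.toString()));
```

Note that the size is written in hex: "Developer Network" is 17 bytes long, which is why its header is `11`.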
| 39 | + |
| 40 | + |
| 41 | +### Chunks With Optional Extension and Trailers |
| 42 | +(combined with example) |
| 43 | +``` |
| 44 | +1fff;chunk-signature=1a2b\r\n - chunk header (optional extension) |
| 45 | +<1fff bytes of data>\r\n - chunk data |
| 46 | +2fff;chunk-signature=1a2b\r\n - chunk header (optional extension) |
| 47 | +<2fff bytes of data>\r\n - chunk data |
| 48 | +0\r\n - last chunk |
| 49 | +<trailer>\r\n - optional trailer |
| 50 | +<trailer>\r\n - optional trailer |
| 51 | +\r\n - end of content |
| 52 | +``` |
| 53 | +Notes: |
| 54 | +- `1fff` and `2fff` are examples of the size in hex |
| 55 | +- trailer example: `x-amz-checksum-crc32:uOMGCw==\r\n` (the key is the trailer name, which ends with the algorithm `crc32`; the value is the checksum in base64; `\r\n` is the CR NL that ends the trailer) |
| 56 | + |
| 57 | +More info in [Wikipedia](https://en.wikipedia.org/wiki/Chunked_transfer_encoding) |
| 58 | +And also in [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1) |
| 59 | + |
| 60 | +### Chunk Extension (chunk-signature) |
| 61 | +In HTTP there is an option for adding chunk extensions, immediately following the chunk size. |
| 62 | +``` |
| 63 | +chunk-ext = *( ";" chunk-ext-name [ "=" chunk-ext-val ] ) |
| 64 | +``` |
| 65 | +Source: [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1.1) |
| 66 | + |
| 67 | +In AWS SigV4 streaming uploads, the chunk extension carries the chunk signature, and each chunk has the following structure: |
| 68 | +``` |
| 69 | +string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n |
| 70 | +``` |
| 71 | +Source: [AWS Documentation Defining the Chunk Body](https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html) |
| 72 | + |
| 73 | +The chunk extension appears in the following lines of the example encoding structure: |
| 74 | +``` |
| 75 | +2fff;chunk-signature=1a2b\r\n - chunk header (optional extension) |
| 76 | +<2fff bytes of data>\r\n - chunk data |
| 77 | +``` |
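
As an illustration of the header format, the sketch below splits a chunk header line into its size and its extensions. `parseChunkHeader` is a hypothetical helper, not the NooBaa implementation, and it is simplified: it assumes extension values are plain tokens (the RFC grammar also allows quoted strings, which may contain `;`).

```javascript
'use strict';

// Hypothetical helper: parses one chunk header line, given without the
// trailing CR NL, into the chunk size and a map of chunk extensions.
function parseChunkHeader(line) {
    const [size_part, ...ext_parts] = line.split(';');
    const size_hex = size_part.trim();
    if (!/^[0-9a-fA-F]+$/.test(size_hex)) {
        throw new Error(`invalid chunk size: ${size_hex}`);
    }
    const extensions = {};
    for (const ext of ext_parts) {
        // chunk-ext-name [ "=" chunk-ext-val ]
        const eq = ext.indexOf('=');
        const name = (eq === -1 ? ext : ext.slice(0, eq)).trim();
        const value = eq === -1 ? '' : ext.slice(eq + 1).trim();
        if (name) extensions[name] = value;
    }
    return { size: parseInt(size_hex, 16), extensions };
}

console.log(parseChunkHeader('2fff;chunk-signature=1a2b'));
```

A decoder that only needs the data can ignore `extensions` and use `size` alone.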
| 78 | + |
| 79 | + |
| 80 | +### Trailers |
| 81 | +In HTTP there is an option for a trailer. A trailer allows the sender to include additional fields at the end of a chunked message in order to supply metadata that might be dynamically generated while the message body is sent. |
| 82 | +Source: [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1.2) |
| 83 | + |
| 84 | + |
| 85 | +The name of the trailer is announced in a request header, and the trailer itself (a key-value pair) is sent after the chunked body. An HTTP body can carry zero or more trailers. |
| 86 | +Source: [Stack Overflow](https://stackoverflow.com/questions/5590791/http-chunked-encoding-need-an-example-of-trailer-mentioned-in-spec) |
| 87 | + |
| 88 | +Amazon S3 supports chunked uploads that use `aws-chunked` content encoding for `PutObject` and `UploadPart` requests with trailing checksums. |
| 89 | + |
| 90 | +When a request has the header `x-amz-trailer`, its value is the name of the trailing header in the request. For trailing checksums, the `x-amz-trailer` value starts with the `x-amz-checksum-` prefix and ends with the algorithm name. The following `x-amz-trailer` values are currently supported: |
| 91 | +- x-amz-checksum-crc32 |
| 92 | +- x-amz-checksum-crc32c |
| 93 | +- x-amz-checksum-crc64nvme |
| 94 | +- x-amz-checksum-sha1 |
| 95 | +- x-amz-checksum-sha256 |
| 96 | + |
| 97 | +Source: [AWS Trailing Checksum Documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#trailing-checksums) |
| 98 | + |
| 99 | +Trailers are the following lines in the example encoding structure: |
| 100 | +``` |
| 101 | +<trailer>\r\n - optional trailer |
| 102 | +<trailer>\r\n - optional trailer |
| 103 | +``` |
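
To make the trailer format concrete, here is a minimal sketch that splits a trailer line into its name and value and checks the name against the list of supported checksum trailers above. `parseTrailer` and `SUPPORTED_CHECKSUM_TRAILERS` are hypothetical names used for illustration only; the sketch does not verify the checksum itself.

```javascript
'use strict';

// The x-amz-trailer values listed above.
const SUPPORTED_CHECKSUM_TRAILERS = new Set([
    'x-amz-checksum-crc32',
    'x-amz-checksum-crc32c',
    'x-amz-checksum-crc64nvme',
    'x-amz-checksum-sha1',
    'x-amz-checksum-sha256',
]);

// Hypothetical helper: parses one trailer line, given without the trailing CR NL.
function parseTrailer(line) {
    const sep = line.indexOf(':');
    if (sep <= 0) throw new Error(`invalid trailer: ${line}`);
    const name = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    return { name, value, is_checksum: SUPPORTED_CHECKSUM_TRAILERS.has(name) };
}

const trailer = parseTrailer('x-amz-checksum-crc32:uOMGCw==');
console.log(trailer.name, trailer.value, trailer.is_checksum);
// The value is base64; for crc32 it decodes to a 4-byte checksum.
console.log(Buffer.from(trailer.value, 'base64').length);
```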
| 104 | + |
| 105 | +## State Machine |
| 106 | +The chunks arrive in buffers, and each buffer is parsed in a **loop** to handle multiple chunks in the same buffer and to handle the case where a buffer ends in the middle of a chunk. |
| 107 | + |
| 108 | +The `ChunkedContentDecoder` uses a state machine. |
| 109 | +Each transition depends on the current state and on the buffer content. |
| 110 | +The states are: |
| 111 | +1. **STATE_READ_CHUNK_HEADER** - read the chunk header until CR and parse it. |
| 112 | +2. **STATE_WAIT_NL_HEADER** - wait for NL after the chunk header. |
| 113 | +3. **STATE_SEND_DATA** - send chunk data to the stream until chunk size bytes sent. |
| 114 | +4. **STATE_WAIT_CR_DATA** - wait for CR after the chunk data. |
| 115 | +5. **STATE_WAIT_NL_DATA** - wait for NL after the chunk data. |
| 116 | +6. **STATE_READ_TRAILER** - read optional trailer until CR and save it. |
| 117 | +7. **STATE_WAIT_NL_TRAILER** - wait for NL after a non-empty trailer. |
| 118 | +8. **STATE_WAIT_NL_END** - wait for NL after the last empty trailer. |
| 119 | +9. **STATE_CONTENT_END** - the stream is done. |
| 120 | +10. **STATE_ERROR** - an error occurred. |
| 121 | + |
| 122 | +The following diagram describes the changes of the state machine: |
| 123 | + |
| 124 | + |
| 125 | +#### Dry Run Example |
| 126 | +An updated AWS SDK client sends a `PutObject` request with the body "body for example". |
| 127 | +On the server side we get the following buffers (shown as strings for readability): |
| 128 | +1. "10\r\n" <- 10 hex in decimal is 16 (this is the length of the data "body for example") |
| 129 | +2. "body for example" <- the data (we want to save as content and pipe it) |
| 130 | +3. "\r\n0\r\n" <- CR NL of the data and "0\r\n" as completion chunk (end of data - the final object chunk) |
| 131 | +4. "x-amz-checksum-crc32:uOMGCw==\r\n" |
| 132 | +5. "\r\n" |
| 133 | + |
| 134 | +In this example there are 5 calls to the parser. The whole stream has 1 data chunk with its header, 1 trailer, and no chunk-signature. |
| 135 | +Although in this example the buffers happen to split at convenient boundaries, that is not guaranteed: a chunk might be split across several buffers. |
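
The states and the dry run above can be sketched as a small parser. This is a simplified illustration, not the code in `src/util/chunked_content_decoder.js`: `ChunkedDecoderSketch` is a hypothetical class that reuses the state names from the list, collects the data in memory instead of pushing it to a stream, and its transition details (for example, going from the last chunk header straight to `STATE_READ_TRAILER`) are assumptions.

```javascript
'use strict';

const CR = 0x0d;
const NL = 0x0a;

// Hypothetical sketch, not the NooBaa class: it reuses the state names from the
// list above, but the transition details are assumptions.
class ChunkedDecoderSketch {
    constructor() {
        this.state = 'STATE_READ_CHUNK_HEADER';
        this.line = ''; // chunk header or trailer collected so far
        this.remaining = 0; // chunk data bytes still expected
        this.last_chunk = false;
        this.data = []; // decoded data buffers
        this.trailers = []; // raw trailer lines
    }

    // Moves to `next` if `byte` is the expected delimiter, otherwise to STATE_ERROR.
    expect(byte, expected, next) {
        this.state = byte === expected ? next : 'STATE_ERROR';
    }

    // Parses one buffer in a loop; returns false if the machine reached STATE_ERROR.
    parse(buf) {
        let i = 0;
        while (i < buf.length && this.state !== 'STATE_ERROR' && this.state !== 'STATE_CONTENT_END') {
            if (this.state === 'STATE_SEND_DATA') {
                // Take as much chunk data as this buffer holds.
                const n = Math.min(this.remaining, buf.length - i);
                this.data.push(buf.subarray(i, i + n));
                this.remaining -= n;
                i += n;
                if (this.remaining === 0) this.state = 'STATE_WAIT_CR_DATA';
                continue;
            }
            const byte = buf[i];
            i += 1;
            switch (this.state) {
                case 'STATE_READ_CHUNK_HEADER': {
                    if (byte !== CR) {
                        this.line += String.fromCharCode(byte);
                        break;
                    }
                    // Keep the hex size, drop any extension such as ";chunk-signature=...".
                    const size_hex = this.line.split(';')[0].trim();
                    this.line = '';
                    this.remaining = parseInt(size_hex, 16);
                    this.last_chunk = this.remaining === 0;
                    this.state = /^[0-9a-fA-F]+$/.test(size_hex) ? 'STATE_WAIT_NL_HEADER' : 'STATE_ERROR';
                    break;
                }
                case 'STATE_WAIT_NL_HEADER':
                    this.expect(byte, NL, this.last_chunk ? 'STATE_READ_TRAILER' : 'STATE_SEND_DATA');
                    break;
                case 'STATE_WAIT_CR_DATA':
                    this.expect(byte, CR, 'STATE_WAIT_NL_DATA');
                    break;
                case 'STATE_WAIT_NL_DATA':
                    this.expect(byte, NL, 'STATE_READ_CHUNK_HEADER');
                    break;
                case 'STATE_READ_TRAILER':
                    if (byte !== CR) {
                        this.line += String.fromCharCode(byte);
                    } else if (this.line === '') {
                        this.state = 'STATE_WAIT_NL_END'; // empty line: end of content
                    } else {
                        this.trailers.push(this.line);
                        this.line = '';
                        this.state = 'STATE_WAIT_NL_TRAILER';
                    }
                    break;
                case 'STATE_WAIT_NL_TRAILER':
                    this.expect(byte, NL, 'STATE_READ_TRAILER');
                    break;
                case 'STATE_WAIT_NL_END':
                    this.expect(byte, NL, 'STATE_CONTENT_END');
                    break;
                default:
                    this.state = 'STATE_ERROR';
            }
        }
        return this.state !== 'STATE_ERROR';
    }
}

// The dry-run buffers from above, fed one parse call at a time.
const decoder = new ChunkedDecoderSketch();
for (const s of ['10\r\n', 'body for example', '\r\n0\r\n', 'x-amz-checksum-crc32:uOMGCw==\r\n', '\r\n']) {
    decoder.parse(Buffer.from(s));
}
console.log(decoder.state, Buffer.concat(decoder.data).toString(), decoder.trailers);
```

Because the state is kept on the instance between `parse` calls, the same input can be split at any byte and still decode to the same data, which is the property the loop is meant to provide.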
| 136 | + |
| 137 | +### Notice |
| 138 | +Currently, we haven’t implemented checksum validation on the server side, so if the request contains `x-amz-trailer: x-amz-checksum-crc32` and the trailing chunk has the header name `x-amz-checksum-sha1` (instead of `x-amz-checksum-crc32`), the request would not fail. |
| 139 | +Example source: [AWS Trailer Chunks Documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#trailing-checksums) |
| 140 | + |
| 141 | +## Policy Change in AWS |
| 142 | +In the past, AWS clients supported data integrity checks as an opt-in; newer client versions enable them by default. |
| 143 | +Source: [Data Integrity Protections for Amazon S3](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html) |
| 144 | + |
| 145 | +Useful links to the announcements posted in the AWS client repositories: |
| 146 | +1. [AWS CLI](https://github.com/aws/aws-cli/issues/9214) |
| 147 | +2. [AWS SDK JS V3](https://github.com/aws/aws-sdk-js-v3/issues/6810) |
| 148 | +3. [AWS SDK GO V2](https://github.com/aws/aws-sdk-go-v2/discussions/2960) |
| 149 | + |
| 150 | +AWS also published a full table of the SDKs and versions that implement this change: [Compatibility with AWS SDKs](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html) |
| 151 | +The change was also announced in the [AWS blog](https://aws.amazon.com/blogs/aws/introducing-default-data-integrity-protections-for-new-objects-in-amazon-s3/) |
| 152 | + |
| 153 | +#### Workaround |
| 154 | +In the past, the state machine did not include states for trailers. If a request had a trailer in its body, the decoder would reach `STATE_ERROR`, because the previous state machine expected the body to end with: |
| 155 | +``` |
| 156 | +0\r\n |
| 157 | +\r\n |
| 158 | +``` |
| 159 | +When using an updated AWS SDK client directly against a NooBaa version that predates this change, please run the AWS client (CLI / SDK) with the following environment variables: |
| 160 | +``` |
| 161 | +AWS_RESPONSE_CHECKSUM_VALIDATION=WHEN_REQUIRED |
| 162 | +AWS_REQUEST_CHECKSUM_CALCULATION=WHEN_REQUIRED |
| 163 | +``` |
| 164 | +Source: [Data Integrity Protections for Amazon S3](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html) |
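
When the AWS CLI is launched from a Node.js script, the same two variables can be passed through the child process environment. The sketch below is illustrative only: `withChecksumWorkaround` is a hypothetical helper, and the endpoint and bucket in the usage comment are placeholders.

```javascript
'use strict';

// Hypothetical helper: returns a copy of `base_env` with the two workaround
// variables set, leaving the original environment object untouched.
function withChecksumWorkaround(base_env = process.env) {
    return {
        ...base_env,
        AWS_REQUEST_CHECKSUM_CALCULATION: 'WHEN_REQUIRED',
        AWS_RESPONSE_CHECKSUM_VALIDATION: 'WHEN_REQUIRED',
    };
}

// Usage (assumes the `aws` CLI is installed; endpoint and bucket are placeholders):
// const { spawn } = require('child_process');
// spawn('aws', ['--endpoint-url', 'https://noobaa.example', 's3', 'cp', 'file.txt', 's3://bucket/'],
//     { env: withChecksumWorkaround(), stdio: 'inherit' });
console.log(withChecksumWorkaround({}).AWS_REQUEST_CHECKSUM_CALCULATION);
```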
| 165 | + |
| 166 | +## Code References: |
| 167 | +### Files: |
| 168 | +- `src/util/chunked_content_decoder.js` - the class `ChunkedContentDecoder` |
| 169 | +- `src/test/unit_tests/jest_tests/test_chunked_content_decoder.test.js` - unit test of the class, please run with `npx jest test_chunked_content_decoder.test.js` |
| 170 | +- `ChunkedContentDecoder_State_Machine.md` - the diagram (not as link, in case we need to modify it). |
| 171 | + |
| 172 | +### Related PRs |
| 173 | +1. https://github.com/noobaa/noobaa-core/pull/5397 which created the stream transformer `ChunkedContentDecoder` (the original state machine was built with the states that handle the optional extension of aws-chunked) |
| 174 | +2. https://github.com/noobaa/noobaa-core/pull/8753 - mainly added the trailers to the `ChunkedContentDecoder` state machine. |