Python: Fix header field values character encoding.

ac000 · ac000 · commit 5cfad9cc0bb3 · 2023-11-09T17:53:09.000Z
On GitHub, @RomainMou reported an issue whereby HTTP header field values where being incorrectly reported as non-ascii by the Python .isacii() method. For example, using the following test application def application(environ, start_response): t = environ['HTTP_ASCIITEST'] t = "'" + t + "'" + " (" + str(len(t)) + ")" if t.isascii(): t = t + " [ascii]" else: t = t + " [non-ascii]" resp = t + "\n\n" start_response("200 OK", [("Content-Type", "text/plain")]) return (bytes(resp, 'latin1')) You would see the following $ curl -H "ASCIITEST: $" http://localhost:8080/ '$' (1) [non-ascii] '$' has an ASCII code of 0x24 (36). The initial idea was to adjust the second parameter to the PyUnicode_New() call from 255 to 127. This unfortunately had the opposite effect. $ curl -H "ASCIITEST: $" http://localhost:8080/ '$' (1) [ascii] Good. However... $ curl -H "ASCIITEST: £" http://localhost:8080/ '£' (2) [ascii] Not good. Let's take a closer look at this. '£' is not in basic ASCII, but is in extended ASCII with a value of 0xA3 (163). Its UTF-8 encoding is 0xC2 0xA3, hence the length of 2 bytes above. $ strace -s 256 -e sendto,recvfrom curl -H "ASCIITEST: £" http://localhost:8080/ sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: \302\243\r\n\r\n", 92, MSG_NOSIGNAL, NULL, 0) = 92 recvfrom(5, "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nServer: Unit/1.30.0\r\nDate: Mon, 22 May 2023 12:44:11 GMT\r\nTransfer-Encoding: chunked\r\n\r\n12\r\n'\302\243' (2) [ascii]\n\n\r\n0\r\n\r\n", 102400, 0, NULL, NULL) = 160 '£' (2) [ascii] So we can see curl sent it UTF-8 encoded '\302\243\' which is C octal escaped UTF-8 for 0xC2 0xA3, and we got the same back. But it should not be marked as ASCII. When doing PyUnicode_New(size, 127) it sets the buffer as ASCII. So we need to use another function and that function would appear to be PyUnicode_DecodeCharmap() Which creates an Unicode object with the correct ascii/non-ascii properties based on the character encoding. With this function we now get $ curl -H "ASCIITEST: $" http://localhost:8080/ '$' (1) [ascii] $ curl -H "ASCIITEST: £" http://localhost:8080/ '£' (2) [non-ascii] and for good measure $ curl -H "ASCIITEST: $ £" http://localhost:8080/ '$ £' (4) [non-ascii] $ curl -H "ASCIITEST: $" -H "ASCIITEST: £" http://localhost:8080/ '$, £' (5) [non-ascii] PyUnicode_DecodeCharmap() does require having the full string upfront so we need to build up the potentially comma separated header field values string before invoking this function. I did not want to touch the Python 2.7 code (which may or may not even be affected by this) so kept these changes completely isolated from that, hence a slight duplication with the for () loop. Python 2.7 was sunset on January 1st 2020[0], so this code will hopefully just disappear soon anyway. I also purposefully didn't touch other code that may well have similar issues (such as the HTTP header field names) if we ever get issue reports about them, we'll deal with them then. [0]: <https://www.python.org/doc/sunset-python-2/> Link: <https://docs.python.org/3/c-api/unicode.html> Closes: <#868> Reported-by: RomainMou <https://github.com/RomainMou> Tested-by: RomainMou <https://github.com/RomainMou> Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
diff --git a/src/python/nxt_python_wsgi.c b/src/python/nxt_python_wsgi.c
@@ -866,10 +866,35 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
     src = nxt_unit_sptr_get(&f->value);
 
 #if PY_MAJOR_VERSION == 3
-    res = PyUnicode_New(vl, 255);
+    if (nxt_slow_path(n > 1)) {
+        char  *ptr;
+
+        p = nxt_unit_malloc(NULL, vl + 1);
+        if (nxt_slow_path(p == NULL)) {
+            return NULL;
+        }
+
+        ptr = p;
+        p = nxt_cpymem(p, src, f->value_length);
+
+        for (i = 1; i < n; i++) {
+            p = nxt_cpymem(p, ", ", 2);
+
+            src = nxt_unit_sptr_get(&f[i].value);
+            p = nxt_cpymem(p, src, f[i].value_length);
+        }
+        *p = '\0';
+
+        src = ptr;
+    }
+
+    res = PyUnicode_DecodeCharmap(src, vl, NULL, NULL);
+
+    if (nxt_slow_path(n > 1)) {
+        nxt_unit_free(NULL, src);
+    }
 #else
     res = PyString_FromStringAndSize(NULL, vl);
-#endif
 
     if (nxt_slow_path(res == NULL)) {
         return NULL;
@@ -885,6 +910,7 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
         src = nxt_unit_sptr_get(&f[i].value);
         p = nxt_cpymem(p, src, f[i].value_length);
     }
+#endif
 
     return res;
 }