Commit 5cfad9c

Python: Fix header field values character encoding.

On GitHub, @RomainMou reported an issue whereby HTTP header field values
were being incorrectly reported as non-ascii by the Python .isascii()
method.

For example, using the following test application

  def application(environ, start_response):
      t = environ['HTTP_ASCIITEST']

      t = "'" + t + "'" + " (" + str(len(t)) + ")"

      if t.isascii():
          t = t + " [ascii]"
      else:
          t = t + " [non-ascii]"

      resp = t + "\n\n"

      start_response("200 OK", [("Content-Type", "text/plain")])
      return (bytes(resp, 'latin1'))

You would see the following

  $ curl -H "ASCIITEST: $" http://localhost:8080/
  '$' (1) [non-ascii]

'$' has an ASCII code of 0x24 (36).

The initial idea was to adjust the second parameter to the
PyUnicode_New() call from 255 to 127. This unfortunately had the
opposite effect.

  $ curl -H "ASCIITEST: $" http://localhost:8080/
  '$' (1) [ascii]

Good. However...

  $ curl -H "ASCIITEST: £" http://localhost:8080/
  '£' (2) [ascii]

Not good. Let's take a closer look at this.

'£' is not in basic ASCII, but is in extended ASCII (ISO-8859-1) with a
value of 0xA3 (163). Its UTF-8 encoding is 0xC2 0xA3, hence the length
of 2 bytes above.

  $ strace -s 256 -e sendto,recvfrom curl -H "ASCIITEST: £" http://localhost:8080/
  sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: \302\243\r\n\r\n", 92, MSG_NOSIGNAL, NULL, 0) = 92
  recvfrom(5, "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nServer: Unit/1.30.0\r\nDate: Mon, 22 May 2023 12:44:11 GMT\r\nTransfer-Encoding: chunked\r\n\r\n12\r\n'\302\243' (2) [ascii]\n\n\r\n0\r\n\r\n", 102400, 0, NULL, NULL) = 160
  '£' (2) [ascii]

So we can see curl sent it UTF-8 encoded as '\302\243', which is C octal
escaped UTF-8 for 0xC2 0xA3, and we got the same back. But it should not
be marked as ASCII.

When doing PyUnicode_New(size, 127) it sets the buffer as ASCII.

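The byte-level picture above can be checked with a small Python sketch. It is not part of the commit; it assumes the header bytes reach the application decoded one byte per code point (which is what Latin-1 decoding does), and the variable names are purely illustrative.

```python
# Illustrative only: models the header bytes seen in the strace output.
pound_utf8 = "£".encode("utf-8")       # the bytes curl puts on the wire
assert pound_utf8 == b"\xc2\xa3"       # i.e. octal \302\243

# Assumption: one byte becomes one code point (Latin-1 decoding),
# so the two UTF-8 bytes give a string of length 2.
as_seen_by_app = pound_utf8.decode("latin-1")
print(len(as_seen_by_app), as_seen_by_app.isascii())

# '$' is a single byte below 0x80, so it is genuinely ASCII.
dollar = b"$".decode("latin-1")
print(hex(ord(dollar)), dollar.isascii())
```

Under that assumption the '£' value has length 2 and is not ASCII, while '$' has length 1 and is, which is the behaviour the test application is expected to report.
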
So we need to use another function, and that function would appear to be

  PyUnicode_DecodeCharmap()

which creates a Unicode object with the correct ascii/non-ascii
properties based on the character encoding.

With this function we now get

  $ curl -H "ASCIITEST: $" http://localhost:8080/
  '$' (1) [ascii]

  $ curl -H "ASCIITEST: £" http://localhost:8080/
  '£' (2) [non-ascii]

and for good measure

  $ curl -H "ASCIITEST: $ £" http://localhost:8080/
  '$ £' (4) [non-ascii]

  $ curl -H "ASCIITEST: $" -H "ASCIITEST: £" http://localhost:8080/
  '$, £' (5) [non-ascii]

PyUnicode_DecodeCharmap() does require having the full string upfront,
so we need to build up the potentially comma-separated header field
values string before invoking this function.

I did not want to touch the Python 2.7 code (which may or may not even
be affected by this), so I kept these changes completely isolated from
that, hence a slight duplication with the for () loop. Python 2.7 was
sunset on January 1st 2020[0], so this code will hopefully just
disappear soon anyway.

I also purposefully didn't touch other code that may well have similar
issues (such as the HTTP header field names). If we ever get issue
reports about them, we'll deal with them then.

[0]: <https://www.python.org/doc/sunset-python-2/>

Link: <https://docs.python.org/3/c-api/unicode.html>
Closes: <#868>
Reported-by: RomainMou <https://github.com/RomainMou>
Tested-by: RomainMou <https://github.com/RomainMou>
Signed-off-by: Andrew Clayton <a.clayton@nginx.com>

1 parent dd0c53a commit 5cfad9c

1 file changed: 28 additions, 2 deletions
src/python/nxt_python_wsgi.c

Lines changed: 28 additions & 2 deletions
@@ -866,10 +866,35 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
     src = nxt_unit_sptr_get(&f->value);
 
 #if PY_MAJOR_VERSION == 3
-    res = PyUnicode_New(vl, 255);
+    if (nxt_slow_path(n > 1)) {
+        char *ptr;
+
+        p = nxt_unit_malloc(NULL, vl + 1);
+        if (nxt_slow_path(p == NULL)) {
+            return NULL;
+        }
+
+        ptr = p;
+        p = nxt_cpymem(p, src, f->value_length);
+
+        for (i = 1; i < n; i++) {
+            p = nxt_cpymem(p, ", ", 2);
+
+            src = nxt_unit_sptr_get(&f[i].value);
+            p = nxt_cpymem(p, src, f[i].value_length);
+        }
+        *p = '\0';
+
+        src = ptr;
+    }
+
+    res = PyUnicode_DecodeCharmap(src, vl, NULL, NULL);
+
+    if (nxt_slow_path(n > 1)) {
+        nxt_unit_free(NULL, src);
+    }
 #else
     res = PyString_FromStringAndSize(NULL, vl);
-#endif
 
     if (nxt_slow_path(res == NULL)) {
         return NULL;
@@ -885,6 +910,7 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
         src = nxt_unit_sptr_get(&f[i].value);
         p = nxt_cpymem(p, src, f[i].value_length);
     }
+#endif
 
     return res;
 }
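
As a rough Python-level model of the Python 3 branch in the diff above: the values of a repeated header are first joined with ", " into one buffer, and only then decoded in a single step. This is a sketch, not Unit's code. `field_value` is a hypothetical name, and `bytes.decode("latin-1")` stands in for `PyUnicode_DecodeCharmap()` with a NULL mapping, which the C API documentation describes as applying Latin-1 decoding.

```python
def field_value(values):
    """Join raw header values with ", " and decode them in one step.

    `values` is a list of bytes objects, one per occurrence of the header.
    Hypothetical helper for illustration only.
    """
    raw = b", ".join(values)
    # Stand-in for PyUnicode_DecodeCharmap(src, vl, NULL, NULL):
    # each byte maps to the code point with the same value.
    return raw.decode("latin-1")


single = field_value([b"$"])
multi = field_value([b"$", "£".encode("utf-8")])

print(len(single), single.isascii())
print(len(multi), multi.isascii())
```

Decoding the joined buffer in one call lets the resulting string carry the correct ASCII or non-ASCII property for the whole value, which matches the lengths and flags reported in the commit message for the single and repeated header cases.
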
