Skip to content

Commit ed787e6

Browse files
authored
Use cURL-based HTTP client in diff server (#579)
This is effectively a revival of #293 (from @vbanos). The cURL HTTP client in Tornado is both faster and supports more funny business that various HTTP servers do. This might also improve some of the weird memory & SSL errors we run into in production every so often. Last time around, we had some concerns about whether `CURLOPT_MAXFILESIZE` would be good enough to restrict the size of responses and mitigate memory issues. Happily, we now have tests around this and it does. To continue using the simple HTTP client, set the `USE_SIMPLE_HTTP_CLIENT` environment variable. As long as the cURL client works well, though, we’ll probably eventually drop support for the simple one.
1 parent 555dd0d commit ed787e6

File tree

6 files changed

+157
-45
lines changed

6 files changed

+157
-45
lines changed

.env.example

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -29,6 +29,10 @@ export ACCESS_CONTROL_ALLOW_ORIGIN_HEADER="*"
2929
# Maximum diffable body size, in bytes.
3030
export DIFFER_MAX_BODY_SIZE='10485760' # 10 MB
3131

32+
# Use Tornado's "simple" HTTP client to get diffable content. By default, the
33+
# diff server uses a cURL-based client, which is faster and more robust.
34+
# export USE_SIMPLE_HTTP_CLIENT='true'
35+
3236
# The diff server does not normally validate SSL certificates when requesting
3337
# pages to diff. If this is set to "true", diff requests will fail if upstream
3438
# https:// requests have invalid certificates.

Dockerfile

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,8 @@ FROM python:3.7-slim
33
LABEL maintainer="enviroDGI@gmail.com"
44

55
RUN apt-get update && apt-get install -y --no-install-recommends \
6-
git gcc g++ pkg-config libxml2-dev libxslt-dev libz-dev
6+
git gcc g++ pkg-config libxml2-dev libxslt-dev libz-dev \
7+
libssl-dev openssl libcurl4-openssl-dev
78

89
# Set the working directory to /app
910
WORKDIR /app

README.md

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -41,19 +41,20 @@ Legacy projects that may be revisited:
4141
privileges to install or use it, and it won't interfere with any other
4242
installations of Python already on your system.)
4343

44-
2. Install libxml2 and libxslt. (This package uses lxml, which requires your system to have the libxml2 and libxslt libraries.)
44+
2. Install libxml2, libxslt, and openssl. (This package uses lxml, which requires your system to have the libxml2 and libxslt libraries, and pycurl, which requires libcurl [built-in on MacOS] and openssl.)
4545

4646
On MacOS, use Homebrew:
4747

4848
```sh
4949
brew install libxml2
5050
brew install libxslt
51+
brew install openssl
5152
```
5253

5354
On Debian Linux:
5455

5556
```sh
56-
apt-get install libxml2-dev libxslt-dev
57+
apt-get install libxml2-dev libxslt-dev libssl-dev openssl libcurl4-openssl-dev
5758
```
5859

5960
On other systems, the packages might have slightly different names.
@@ -65,6 +66,14 @@ Legacy projects that may be revisited:
6566
python setup.py develop
6667
```
6768

69+
**On MacOS,** you may need additional configuration to use the Homebrew
70+
openssl. Try the following:
71+
72+
```sh
73+
PYCURL_SSL_LIBRARY=openssl LDFLAGS="-L/usr/local/opt/openssl/lib" CPPFLAGS="-I/usr/local/opt/openssl/include" pip install -r requirements.txt
74+
python setup.py develop
75+
```
76+
6877
4. Copy the script `.env.example` to `.env` and supply any local configuration
6978
info you need. (Only some of the package's functionality requires this.)
7079
Apply the configuration:

requirements.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ git+https://github.com/anastasia/htmldiffer@develop
77
git+https://github.com/danielballan/htmltreediff@customize
88
html5-parser ~=0.4.9 --no-binary lxml
99
lxml ~=4.5.2
10+
pycurl ~=7.43
1011
PyPDF2 ~=1.26.0
1112
sentry-sdk ~=0.17.2
1213
requests ~=2.24.0

web_monitoring/diff_server/server.py

Lines changed: 83 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -8,11 +8,14 @@
88
import logging
99
import mimetypes
1010
import os
11+
import pycurl
1112
import re
1213
import cchardet
1314
import sentry_sdk
1415
import signal
1516
import sys
17+
from tornado.curl_httpclient import CurlAsyncHTTPClient, CurlError
18+
import tornado.simple_httpclient
1619
import tornado.httpclient
1720
import tornado.ioloop
1821
import tornado.web
@@ -82,11 +85,37 @@
8285
print('DIFFER_MAX_BODY_SIZE must be an integer', file=sys.stderr)
8386
sys.exit(1)
8487

85-
# TODO: we'd like to support CurlAsyncHTTPClient, but we need to figure out how
86-
# to enforce something like max_body_size with it.
87-
tornado.httpclient.AsyncHTTPClient.configure(None,
88+
89+
class LimitedCurlAsyncHTTPClient(CurlAsyncHTTPClient):
90+
"""
91+
A customized version of Tornado's CurlAsyncHTTPClient that adds support for
92+
maximum response body sizes. The API is the same as that for Tornado's
93+
SimpleAsyncHTTPClient: set ``max_body_size`` to an integer representing the
94+
maximum number of bytes in a response body.
95+
"""
96+
def initialize(self, max_clients=10, defaults=None, max_body_size=None):
97+
self.max_body_size = max_body_size
98+
defaults = defaults or {}
99+
defaults['prepare_curl_callback'] = self.prepare_curl
100+
super().initialize(max_clients=max_clients, defaults=defaults)
101+
102+
def prepare_curl(self, curl):
103+
if self.max_body_size:
104+
# NOTE: cURL's docs suggest this doesn't work if the server doesn't
105+
# send a Content-Length header, but it seems to do just fine in
106+
# tests. ¯\_(ツ)_/¯
107+
curl.setopt(pycurl.MAXFILESIZE, self.max_body_size)
108+
109+
110+
HTTP_CLIENT = LimitedCurlAsyncHTTPClient
111+
if os.getenv('USE_SIMPLE_HTTP_CLIENT'):
112+
HTTP_CLIENT = None
113+
tornado.httpclient.AsyncHTTPClient.configure(HTTP_CLIENT,
88114
max_body_size=MAX_BODY_SIZE)
89-
client = tornado.httpclient.AsyncHTTPClient()
115+
116+
117+
def get_http_client():
118+
return tornado.httpclient.AsyncHTTPClient()
90119

91120

92121
class PublicError(tornado.web.HTTPError):
@@ -329,6 +358,12 @@ async def fetch_diffable_content(self, url, expected_hash, query_params):
329358
with open(url[7:], 'rb') as f:
330359
body = f.read()
331360
response = MockResponse(url, body)
361+
# Only support HTTP(S) URLs.
362+
elif not url.startswith('http://') and not url.startswith('https://'):
363+
raise PublicError(400,
364+
f'URL must use HTTP or HTTPS protocol: "{url}"',
365+
'Invalid URL for upstream content',
366+
extra={'url': url})
332367
else:
333368
# Include request headers defined by the query param
334369
# `pass_headers=HEADER_NAMES` in the upstream request. This is
@@ -344,15 +379,19 @@ async def fetch_diffable_content(self, url, expected_hash, query_params):
344379
headers[header_key] = header_value
345380

346381
try:
382+
client = get_http_client()
347383
response = await client.fetch(url, headers=headers,
348384
validate_cert=VALIDATE_TARGET_CERTIFICATES)
349385
except ValueError as error:
350386
raise PublicError(400, str(error))
387+
# Only raised by the simple client and not by the cURL client.
351388
except OSError as error:
352389
raise PublicError(502,
353390
f'Could not fetch "{url}": {error}',
354391
'Could not fetch upstream content',
355392
extra={'url': url, 'cause': str(error)})
393+
394+
# --- SIMPLE CLIENT ERRORS ----------------------------------------
356395
except tornado.simple_httpclient.HTTPTimeoutError:
357396
raise PublicError(504,
358397
f'Timed out while fetching "{url}"',
@@ -371,6 +410,46 @@ async def fetch_diffable_content(self, url, expected_hash, query_params):
371410
'Connection closed while fetching upstream',
372411
extra={'url': url,
373412
'max_size': client.max_body_size})
413+
414+
# --- CURL CLIENT ERRORS ------------------------------------------
415+
except CurlError as error:
416+
# Documentation for cURL error codes:
417+
# https://curl.haxx.se/libcurl/c/libcurl-errors.html
418+
# PyCurl has constants named `E_*` vs. libcurl's `CURLE_*`
419+
if error.errno == pycurl.E_URL_MALFORMAT:
420+
raise PublicError(400,
421+
str(error),
422+
'Invalid URL for cURL',
423+
extra={'url': url})
424+
# TODO: raise a nicer error from LimitedCurlAsyncHTTPClient
425+
elif error.errno == pycurl.E_FILESIZE_EXCEEDED:
426+
raise PublicError(502,
427+
f'Upstream response too big for "{url}"'
428+
f'(max: {client.max_body_size} bytes)',
429+
'Upstream content too big',
430+
extra={'url': url,
431+
'max_size': client.max_body_size})
432+
elif (error.errno == pycurl.E_COULDNT_RESOLVE_PROXY
433+
or error.errno == pycurl.E_COULDNT_CONNECT
434+
or error.errno == 8 # E_WEIRD_SERVER_REPLY
435+
or error.errno == pycurl.E_REMOTE_ACCESS_DENIED
436+
or error.errno == pycurl.E_HTTP2):
437+
raise PublicError(502,
438+
f'Could not fetch "{url}": {error}',
439+
'Could not fetch upstream content',
440+
extra={'url': url, 'cause': str(error)})
441+
elif error.errno == pycurl.E_OPERATION_TIMEDOUT:
442+
raise PublicError(504,
443+
f'Timed out while fetching "{url}"',
444+
'Could not fetch upstream content',
445+
extra={'url': url})
446+
else:
447+
raise PublicError(502,
448+
f'Unknown error fetching "{url}"',
449+
f'Unknown error fetching upstream content: {error}',
450+
extra={'url': url})
451+
452+
# --- COMMON ERRORS SUPPORTED BY ALL CLIENTS ----------------------
374453
except tornado.httpclient.HTTPError as error:
375454
# If the response is actually coming from a web archive,
376455
# allow error codes. The Memento-Datetime header indicates

web_monitoring/tests/test_diffing_server_exc_handling.py

Lines changed: 56 additions & 38 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1+
import asyncio
12
import json
23
import os
34
import unittest
@@ -17,6 +18,17 @@
1718
from io import BytesIO
1819

1920

21+
def patch_http_client(**kwargs):
22+
"""
23+
Create HTTP clients in the diffing server with the specified parameters
24+
during this patch. Can be a function decorator or context manager.
25+
"""
26+
def get_client():
27+
return tornado.httpclient.AsyncHTTPClient(force_instance=True,
28+
**kwargs)
29+
return patch.object(df, 'get_http_client', get_client)
30+
31+
2032
class DiffingServerTestCase(AsyncHTTPTestCase):
2133

2234
def get_app(self):
@@ -88,7 +100,7 @@ class DiffingServerFetchTest(DiffingServerTestCase):
88100

89101
def test_pass_headers(self):
90102
mock = MockAsyncHttpClient()
91-
with patch.object(df, 'client', wraps=mock):
103+
with patch.object(df, 'get_http_client', return_value=mock):
92104
mock.respond_to(r'/a$')
93105
mock.respond_to(r'/b$')
94106

@@ -175,6 +187,18 @@ def test_not_reachable_url_b(self):
175187
self.assertEqual(response.code, 502)
176188
self.assertFalse(response.headers.get('Etag'))
177189

190+
@patch_http_client(defaults=dict(request_timeout=0.5))
191+
def test_timeout_upstream(self):
192+
async def responder(handler):
193+
await asyncio.sleep(1)
194+
handler.write('HELLO!'.encode('utf-8'))
195+
196+
with SimpleHttpServer(responder) as server:
197+
response = self.fetch('/html_source_dmp?'
198+
f'a={server.url("/whatever1")}&'
199+
f'b={server.url("/whatever2")}')
200+
assert response.code == 504
201+
178202
def test_missing_params_caller_func(self):
179203
response = self.fetch('http://example.org/')
180204
with self.assertRaises(KeyError):
@@ -196,7 +220,7 @@ def test_accepts_errors_from_web_archives(self):
196220
we proceed with diffing.
197221
"""
198222
mock = MockAsyncHttpClient()
199-
with patch.object(df, 'client', wraps=mock):
223+
with patch.object(df, 'get_http_client', return_value=mock):
200224
mock.respond_to(r'/error$', code=404, headers={'Memento-Datetime': 'Tue Sep 25 2018 03:38:50'})
201225
mock.respond_to(r'/success$')
202226

@@ -307,34 +331,31 @@ def test_validates_bad_hashes(self):
307331

308332

309333
class DiffingServerResponseSizeTest(DiffingServerTestCase):
334+
@patch_http_client(max_body_size=100 * 1024)
310335
def test_succeeds_if_response_is_small_enough(self):
311336
async def responder(handler):
312337
text = (80 * 1024) * 'x'
313338
handler.write(text.encode('utf-8'))
314339

315-
client = tornado.httpclient.AsyncHTTPClient(force_instance=True,
316-
max_body_size=100 * 1024)
317-
with patch.object(df, 'client', client):
318-
with SimpleHttpServer(responder) as server:
319-
response = self.fetch('/html_source_dmp?'
320-
f'a={server.url("/whatever1")}&'
321-
f'b={server.url("/whatever2")}')
322-
assert response.code == 200
340+
with SimpleHttpServer(responder) as server:
341+
response = self.fetch('/html_source_dmp?'
342+
f'a={server.url("/whatever1")}&'
343+
f'b={server.url("/whatever2")}')
344+
assert response.code == 200
323345

346+
@patch_http_client(max_body_size=100 * 1024)
324347
def test_stops_if_response_is_too_big(self):
325348
async def responder(handler):
326349
text = (110 * 1024) * 'x'
327350
handler.write(text.encode('utf-8'))
328351

329-
client = tornado.httpclient.AsyncHTTPClient(force_instance=True,
330-
max_body_size=100 * 1024)
331-
with patch.object(df, 'client', client):
332-
with SimpleHttpServer(responder) as server:
333-
response = self.fetch('/html_source_dmp?'
334-
f'a={server.url("/whatever1")}&'
335-
f'b={server.url("/whatever2")}')
336-
assert response.code == 502
352+
with SimpleHttpServer(responder) as server:
353+
response = self.fetch('/html_source_dmp?'
354+
f'a={server.url("/whatever1")}&'
355+
f'b={server.url("/whatever2")}')
356+
assert response.code == 502
337357

358+
@patch_http_client(max_body_size=100 * 1024)
338359
def test_stops_reading_early_when_content_length_is_a_lie(self):
339360
async def responder(handler):
340361
# Tornado tries to be careful and prevent us from sending more
@@ -347,26 +368,23 @@ async def responder(handler):
347368
text = (110 * 1024) * 'x'
348369
handler.write(text.encode('utf-8'))
349370

350-
client = tornado.httpclient.AsyncHTTPClient(force_instance=True,
351-
max_body_size=100 * 1024)
352-
with patch.object(df, 'client', client):
353-
with SimpleHttpServer(responder) as server:
354-
response = self.fetch('/html_source_dmp?'
355-
f'a={server.url("/whatever1")}&'
356-
f'b={server.url("/whatever2")}')
357-
# Even though the response was longer than the max_body_size,
358-
# the client should have stopped reading when it hit the number
359-
# of bytes set in the Content-Length, header, which is less
360-
# than the client's limit. So what should actually happen is a
361-
# successful diff of the first <Content-Length> bytes of the
362-
# responses.
363-
assert response.code == 200
364-
# The responses should be the same, and should only be
365-
# <Content-Length> bytes long (all our chars are basic ASCII
366-
# in this case, so len(bytes) == len(characters)).
367-
result = json.loads(response.body)
368-
assert result['change_count'] == 0
369-
assert len(result['diff'][0][1]) == 1024
371+
with SimpleHttpServer(responder) as server:
372+
response = self.fetch('/html_source_dmp?'
373+
f'a={server.url("/whatever1")}&'
374+
f'b={server.url("/whatever2")}')
375+
# Even though the response was longer than the max_body_size,
376+
# the client should have stopped reading when it hit the number
377+
# of bytes set in the Content-Length, header, which is less
378+
# than the client's limit. So what should actually happen is a
379+
# successful diff of the first <Content-Length> bytes of the
380+
# responses.
381+
assert response.code == 200
382+
# The responses should be the same, and should only be
383+
# <Content-Length> bytes long (all our chars are basic ASCII
384+
# in this case, so len(bytes) == len(characters)).
385+
result = json.loads(response.body)
386+
assert result['change_count'] == 0
387+
assert len(result['diff'][0][1]) == 1024
370388

371389

372390
def mock_diffing_method(c_body):

0 commit comments

Comments
 (0)