Commit ae11b22

Add post on web-scraping with Puppeteer
Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
1 parent 4fe261c commit ae11b22

5 files changed

+367
-0
lines changed

---
title: "Web scraping that just works with OpenFaaS with Puppeteer"
description: "Learn how to scrape webpages using Puppeteer and Serverless Functions built with OpenFaaS."
date: 2020-10-28
image: /images/2020-puppeteer-scraping/puppeteer.jpg
categories:
- automation
- scraping
- nodejs
- chrome
author_staff_member: alex
dark_background: true

---

Learn how to scrape webpages using Puppeteer and Serverless Functions built with OpenFaaS.

## Introduction to web testing and scraping

In this post I'll introduce you to Puppeteer and show you how to use it to automate and scrape websites using OpenFaaS functions.

There are two main reasons you may want to automate a web browser:
* to run compliance and end-to-end tests against your application
* to gather information from a webpage which doesn't have an API available

When testing an application, there are numerous options and these fall into two categories: tests against rendered webpages, running with JavaScript and a real browser, and text-based tests which can only parse static HTML. As you may imagine, loading a full web-browser in memory is a heavy-weight task. In a previous position I worked heavily with [Selenium](https://www.selenium.dev), which has language bindings for C#, Java, Python, Ruby and other languages. Whilst our team tried to implement most of our tests in the unit-testing layer, there were instances where automated web tests added value, and meant that the QA team could be involved in the development cycle by writing User Acceptance Tests (UATs) before the developers had started coding.

Selenium is still popular in the industry, and it inspired the [W3C WebDriver specification](https://www.w3.org/TR/webdriver/) that browsers can implement to make testing easier.

The other use-case is not to test websites, but to extract information from them when an API is not available, or does not have the endpoints required. In some instances, you see a mixture of both use-cases. For instance, a company may file tax documents through a webpage using automated web-browsers when that particular jurisdiction doesn't provide an API.

### Kicking the tires with AWS Lambda

I recently learned of a friend who offers a search for Trademarks through his SaaS product, and for that purpose he chose a more modern alternative to Selenium called Puppeteer. In fact, if you search Stack Overflow or Google for "scraping and Lambda" you will likely see "Puppeteer" mentioned along with "headless-chrome." I was curious to try out Puppeteer with AWS Lambda, and the path was less than ideal, with friction at almost every step of the way.

* The popular [chrome-aws-lambda](https://github.com/alixaxel/chrome-aws-lambda) npm module is over 40MB in size because it ships a static binary, meaning it can't be uploaded directly as a regular Lambda zip file, or as a Lambda layer
* The zip file needs to be uploaded through a separate AWS S3 bucket in the same region as the function
* The layer can then be referenced from your function
* Local testing is very difficult, and there are many Stack Overflow questions about getting the right combination of npm modules

I am sure that this can be done, and is being run at scale. It could be quite compelling for small businesses if they don't spend too much time fighting the above, and can stay within the free-tier.

![AWS Lambda screenshot](/images/2020-puppeteer-scraping/lambda.png)

> Getting the title of a simple webpage - 15.5s

That said, OpenFaaS can run anywhere, even on a 5-10 USD VPS, and because OpenFaaS uses containers, it got me thinking.

### Is there another way?

I wanted to see if the experience would be any better with OpenFaaS, so I set out to get Puppeteer working with it. This isn't the first time I've tried, it's something that I've come back to from time to time. Today, things seem even easier with a pre-compiled headless Chrome browser being available from [buildkite.com](https://buildkite.com).

Typical tasks involve logging into a portal and taking screenshots. Anecdotally, when I ran a simple test to navigate to a blog and take a screenshot, this took 15.5s in AWS Lambda, but only 1.6s running locally within OpenFaaS on my laptop. I was also able to build and test the function locally, the same way as in the cloud.


## Walkthrough

We'll now walk through the steps to set up a function with Node.js and Puppeteer, so that you can adapt an example and try out any existing tests that you may have running on AWS Lambda.

### Deploy OpenFaaS

What are the features we can leverage from OpenFaaS?

* Extend the function's timeout to whatever we want
* Run the invocation asynchronously, and in parallel
* Get an HTTP callback with the result when done, such as a screenshot or test result in JSON
* Limit concurrency with the `max_inflight` environment variable in our `stack.yml` file to prevent overloading the container
* Trigger the invocations from cron, or events like Kafka and NATS
* Get rate, error and duration (RED) metrics from Prometheus, and view them in Grafana

You can deploy OpenFaaS to Kubernetes or on a small VM using the faasd project. The faasd project doesn't require Kubernetes and uses the containerd project instead.

* Docs: [Deploy OpenFaaS](https://docs.openfaas.com/deployment/)

For the impatient, our arkade tool can get you up and running in less than 5 minutes:

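```bash
curl -sLS https://get-arkade.dev | sh
sudo mv arkade /usr/local/bin/

arkade get kind
arkade get kubectl
arkade get faas-cli

# arkade downloads tools to $HOME/.arkade/bin, so add that to your PATH
export PATH=$PATH:$HOME/.arkade/bin/

# Create a local cluster with kind, unless you already have one
kind create cluster

arkade install openfaas
```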

The `arkade info openfaas` command will print out everything you need to log in and get a connection to your OpenFaaS gateway UI.

### Create a function with the puppeteer-node12 template

```bash
# Set to your Docker Hub account or registry address
export OPENFAAS_PREFIX=alexellis2

faas-cli template pull https://github.com/alexellis/openfaas-puppeteer-template
faas-cli new --lang puppeteer-node12 scrape-title --prefix $OPENFAAS_PREFIX
```

Let's get the title of a webpage passed in via a JSON HTTP body, then return the result as JSON.

Now edit `./scrape-title/handler.js`:

```javascript
'use strict'
const puppeteer = require('puppeteer')

module.exports = async (event, context) => {
  // Flags commonly needed to run Chrome inside a container
  const browser = await puppeteer.launch({
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage'
    ]
  })

  const page = await browser.newPage()

  // Use the URI from the JSON body when given, otherwise a default
  let uri = "https://inlets.dev/blog/"
  if (event.body && event.body.uri) {
    uri = event.body.uri
  }

  await page.goto(uri)

  const title = await page.title()

  // Wait for Chrome to exit before returning the result
  await browser.close()
  return context
    .status(200)
    .succeed({"title": title})
}
```
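
The handler above trusts whatever `uri` arrives in the request body. As a minimal sketch of how that input could be checked first, the helper below only accepts absolute `http` and `https` URLs and otherwise falls back to a default. The name `resolveUri` is hypothetical and not part of the template, and I'm assuming the body may arrive either already parsed or as a raw JSON string.

```javascript
'use strict'

// Hypothetical helper (not part of the template): choose the page to visit
// from the request body, falling back to a default when the input is unusable.
function resolveUri(body, fallback) {
  // Assumption: the body is either an already-parsed object or a JSON string
  let parsed = body
  if (typeof body === 'string') {
    try {
      parsed = JSON.parse(body)
    } catch (err) {
      return fallback
    }
  }

  if (!parsed || typeof parsed.uri !== 'string') {
    return fallback
  }

  try {
    const url = new URL(parsed.uri)
    // Only allow web pages, not file:// or other schemes
    if (url.protocol === 'http:' || url.protocol === 'https:') {
      return url.href
    }
  } catch (err) {
    // Not a valid absolute URL, fall through to the default
  }
  return fallback
}

console.log(resolveUri({ uri: 'https://example.com' }, 'https://inlets.dev/blog/'))
console.log(resolveUri({ uri: 'file:///etc/passwd' }, 'https://inlets.dev/blog/'))
```

Inside the handler, this could replace the `if` statement with something like `const uri = resolveUri(event.body, "https://inlets.dev/blog/")`.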

### Deploy and test the scrape-title function

Deploy the `scrape-title` function to OpenFaaS.

```bash
faas-cli up -f scrape-title.yml
```

You can run `faas-cli describe FUNCTION` to get a synchronous or asynchronous URL for use with `curl` along with whether the function is ready for invocations. The `faas-cli` can also be used to invoke functions and we'll do that below.

```bash
faas-cli describe scrape-title

Name:                scrape-title
Status:              Not Ready
Replicas:            1
Available replicas:  0
Invocations:         0
Image:               alexellis2/scrape-title:latest
Function process:    node index.js
URL:                 http://127.0.0.1:8080/function/scrape-title
Async URL:           http://127.0.0.1:8080/async-function/scrape-title
```

Try invoking the function synchronously:

```bash
echo '{"uri": "https://inlets.dev/blog"}' | faas-cli invoke scrape-title \
  --header "Content-type=application/json"

{"title":"Inlets PRO – Inlets – The Cloud Native Tunnel"}
```

Timing the request with `time curl` showed it to be over 10 times faster than my test with AWS Lambda with 256MB RAM allocated.

```bash
time curl http://127.0.0.1:8080/function/scrape-title --data-binary '{"uri": "https://example.com"}' --header "Content-type: application/json"
{"title":"Example Domain"}
real    0m0.727s
user    0m0.004s
sys     0m0.004s
```

Alternatively, run the function asynchronously:

```bash
echo '{"uri": "https://inlets.dev/blog"}' | faas-cli invoke scrape-title \
  --async \
  --header "Content-type=application/json"

Function submitted asynchronously.
```

Run it asynchronously and post the response to another service like [requestbin](https://requestbin.com) or another function:

```bash
echo '{"uri": "https://inlets.dev/blog"}' | faas-cli invoke scrape-title \
  --async \
  --header "Content-type=application/json" \
  --header "X-Callback-Url=https://enthao98x79id.x.pipedream.net"

Function submitted asynchronously.
```

![RequestBin example](/images/2020-puppeteer-scraping/callback.png)

> Example of a result posted back to RequestBin

Each invocation has a unique `X-Call-Id` header, which can be used for tracing and connecting requests to [asynchronous responses](https://docs.openfaas.com/reference/async/).
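
To make the idea of connecting requests to responses more concrete, here is a rough sketch of how a callback receiver might match an incoming result to the job that produced it. The `trackJob` and `handleCallback` names and the in-memory `Map` are my own illustration rather than anything provided by OpenFaaS, and I'm assuming that the call ID was noted when the job was submitted and that headers arrive as a Node.js-style object with lower-cased names.

```javascript
'use strict'

// Hypothetical in-memory store of submitted jobs, keyed by call ID.
// A real service would likely keep this somewhere durable.
const pending = new Map()

// Record a job once the asynchronous invocation has been accepted
function trackJob(callId, uri) {
  pending.set(callId, { uri, submittedAt: Date.now() })
}

// Match a callback to the job that produced it.
// Assumption: `headers` uses lower-cased names, as Node.js's http module does.
function handleCallback(headers, body) {
  const callId = headers['x-call-id']
  if (!callId || !pending.has(callId)) {
    return { matched: false }
  }
  const job = pending.get(callId)
  pending.delete(callId)
  return { matched: true, uri: job.uri, result: JSON.parse(body) }
}

trackJob('abc-123', 'https://inlets.dev/blog')
console.log(handleCallback({ 'x-call-id': 'abc-123' }, '{"title":"Example"}'))
```

Deleting the entry once it has been matched means a repeated or unknown call ID is reported as unmatched instead of being processed twice.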

### Take a screenshot and return it as a PNG file

One of the limitations of AWS Lambda is that it can only return a JSON response. Whilst there may be good reasons for this approach, OpenFaaS allows a binary input and response for functions.

Let's try taking a screenshot of the page, and capturing it to a file.

```bash
# Set to your Docker Hub account or registry address
export OPENFAAS_PREFIX=alexellis2

faas-cli new --lang puppeteer-node12 screenshot-page --prefix $OPENFAAS_PREFIX
```

Edit `./screenshot-page/handler.js`:

```javascript
'use strict'
const puppeteer = require('puppeteer')
const fs = require('fs').promises

module.exports = async (event, context) => {
  // Flags commonly needed to run Chrome inside a container
  const browser = await puppeteer.launch({
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage'
    ]
  })

  const browserVersion = await browser.version()
  console.log(`Started ${browserVersion}`)

  const page = await browser.newPage()

  // Use the URI from the JSON body when given, otherwise a default
  let uri = "https://inlets.dev/blog/"
  if (event.body && event.body.uri) {
    uri = event.body.uri
  }

  const response = await page.goto(uri)
  console.log("OK", "for", uri, response.ok())

  // Save the screenshot to a temporary file, then read it back
  await page.screenshot({ path: `/tmp/page.png` })
  const data = await fs.readFile("/tmp/page.png")

  // Wait for Chrome to exit before returning the result
  await browser.close()
  return context
    .status(200)
    .headers({"Content-type": "application/octet-stream"})
    .succeed(data)
}
```
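
If `page.goto` throws, for instance on a timeout, the handler above never reaches `browser.close()` and the Chrome process may be left running in the container. One possible way to guard against that is a small wrapper using `try`/`finally`. This is only a sketch: `withBrowser` is a hypothetical helper, and it takes the launch function as a parameter so that it can be tried out with a stand-in object instead of a real browser.

```javascript
'use strict'

// Run `work` with a browser and always close it afterwards, even when
// `work` throws. `launch` is any function returning a promise for an
// object with an async close() method, such as () => puppeteer.launch(opts).
async function withBrowser(launch, work) {
  const browser = await launch()
  try {
    return await work(browser)
  } finally {
    await browser.close()
  }
}

// Stand-in browser so the sketch runs without Chrome installed
const fakeLaunch = async () => ({
  closed: false,
  async close() { this.closed = true }
})

withBrowser(fakeLaunch, async (browser) => 'done')
  .then((result) => console.log(result))
```

In a handler, the body that navigates and takes the screenshot would go inside the `work` callback, and the value it returns would be passed on to `context.succeed()`.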

Now deploy the function as before:

```bash
faas-cli up -f screenshot-page.yml
```

Invoke the function, and capture the response to a file:

```bash
echo '{"uri": "https://inlets.dev/blog"}' | \
  faas-cli invoke screenshot-page \
  --header "Content-type=application/json" > screenshot.png
```

Now open `screenshot.png` and check the result.
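
If the function fails, the file may contain an error message instead of an image. As a quick, optional sanity check, every PNG file starts with the same eight-byte signature, so a few lines of Node.js can tell the two apart. The `looksLikePng` name is hypothetical and this is only a sketch: it checks the signature and does not validate the rest of the file.

```javascript
'use strict'

// The eight bytes that every PNG file starts with
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])

// Returns true when the buffer begins with the PNG signature
function looksLikePng(data) {
  return data.length >= PNG_SIGNATURE.length &&
    data.subarray(0, PNG_SIGNATURE.length).equals(PNG_SIGNATURE)
}

// With a real file: looksLikePng(require('fs').readFileSync('screenshot.png'))
console.log(looksLikePng(Buffer.from('{"error":"timeout"}')))
console.log(looksLikePng(Buffer.concat([PNG_SIGNATURE, Buffer.from('rest of file')])))
```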

### Deploy a Grafana dashboard

We can observe the RED metrics from our functions using the built-in Prometheus UI, or we can deploy Grafana and access the OpenFaaS dashboard.

```bash
kubectl -n openfaas run \
  --image=stefanprodan/faas-grafana:4.6.3 \
  --port=3000 \
  grafana
```

```bash
kubectl port-forward pod/grafana 3000:3000 -n openfaas
```

Access the UI at http://127.0.0.1:3000 and log in with admin/admin.

![Grafana dashboard and metrics](/images/2020-puppeteer-scraping/grafana.png)

See also: [OpenFaaS Metrics](https://docs.openfaas.com/architecture/metrics/)

### Hardening

If you'd like to limit how many browsers can open at once, you can set `max_inflight` within the function's deployment file:

```yaml
version: 1.0
provider:
  name: openfaas
  gateway: http://127.0.0.1:8080
functions:
  scrape-title:
    lang: puppeteer-node12
    handler: ./scrape-title
    image: alexellis2/scrape-title:latest
    environment:
      max_inflight: 1
```
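
To give a feel for what an in-flight limit means in practice, here is a conceptual sketch in plain Node.js. It is not the watchdog's actual code: `limitInflight` is a hypothetical wrapper, and it assumes that requests over the limit are turned away with an HTTP 429 status instead of being queued.

```javascript
'use strict'

// Conceptual illustration only: this is not the watchdog's actual code.
// It wraps an async handler so that at most `maxInflight` calls run at
// once, and turns the rest away with a 429 status.
function limitInflight(maxInflight, handler) {
  let inflight = 0
  return async function limited(request) {
    if (inflight >= maxInflight) {
      return { status: 429, body: 'Too many requests in flight' }
    }
    inflight++
    try {
      return { status: 200, body: await handler(request) }
    } finally {
      inflight--
    }
  }
}

// A slow stand-in for a handler that opens a browser
const slow = () => new Promise((resolve) => setTimeout(() => resolve('ok'), 50))
const limited = limitInflight(1, slow)

Promise.all([limited('a'), limited('b')])
  .then((results) => console.log(results.map((r) => r.status)))
```

With a limit of 1, the second call arrives while the first is still running and is rejected, which is why retries or asynchronous invocations tend to pair well with a low limit.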

A separate queue can also be configured in OpenFaaS for web-scraping with a set level of parallelism that you prefer.

See also: [Async docs](https://docs.openfaas.com/reference/async/#asynchronous-functions)

You can also set a hard limit on memory if you wish:

```yaml
version: 1.0
provider:
  name: openfaas
  gateway: http://127.0.0.1:8080
functions:
  scrape-title:
    lang: puppeteer-node12
    handler: ./scrape-title
    image: alexellis2/scrape-title:latest
    limits:
      memory: 256Mi
```

See also: [memory limits](https://docs.openfaas.com/reference/yaml/#function-memorycpu-limits)

### Long timeouts

Whilst a timeout value is required, this number can be as large as you like.

See also: [Featured Tutorial: Expanded timeouts in OpenFaaS](https://docs.openfaas.com/tutorials/expanded-timeouts/)

### Getting triggered

If you want to trigger the function periodically, for instance to generate a weekly or daily report, then you can use a cron syntax.

Users of NATS or Kafka can also trigger functions directly from events.

See also: [OpenFaaS triggers](https://docs.openfaas.com/reference/triggers/)

## Wrapping up

You now have the tools you need to deploy automated tests and web-scraping code using Puppeteer. Since OpenFaaS can leverage Kubernetes, you can use auto-scaling pools of nodes and much longer timeouts than are typically available with cloud-based functions products. OpenFaaS plays well with others such as NATS, which powers asynchronous invocations, Prometheus to collect metrics, and Grafana to observe throughput and duration and share the status of the system with others in the team.

The pre-compiled versions of Chrome included with docker-puppeteer and chrome-aws-lambda will not run on a Raspberry Pi or ARM64 machine, although there is a possibility that they can be rebuilt. For speedy web-scraping from a Raspberry Pi or ARM64 server, you could consider other options such as [scrapy](https://scrapy.org).

Ultimately, I am going to be biased here, but I found the experience of getting Puppeteer to work with OpenFaaS much simpler than with AWS Lambda, and think you should give it a shot.

Find out more:

* [buildkite/docker-puppeteer](https://github.com/buildkite/docker-puppeteer)
* [alexellis/openfaas-puppeteer-template](https://github.com/alexellis/openfaas-puppeteer-template)
* [chrome-aws-lambda](https://github.com/alixaxel/chrome-aws-lambda)