|
| 1 | +--- |
| 2 | +title: "Web scraping that just works with OpenFaaS and Puppeteer" |
| 3 | +description: "Learn how to scrape webpages using Puppeteer and Serverless Functions built with OpenFaaS." |
| 4 | +date: 2020-10-28 |
| 5 | +image: /images/2020-puppeteer-scraping/puppeteer.jpg |
| 6 | +categories: |
| 7 | + - automation |
| 8 | + - scraping |
| 9 | + - nodejs |
| 10 | + - chrome |
| 11 | +author_staff_member: alex |
| 12 | +dark_background: true |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +Learn how to scrape webpages using Puppeteer and Serverless Functions built with OpenFaaS. |
| 17 | + |
| 18 | +## Introduction to web testing and scraping |
| 19 | + |
| 20 | +In this post I'll introduce you to Puppeteer and show you how to use it to automate and scrape websites using OpenFaaS functions. |
| 21 | + |
| 22 | +There are two main reasons you may want to automate a web browser: |
| 23 | +* to run compliance and end-to-end tests against your application |
| 24 | +* to gather information from a webpage which doesn't have an API available |
| 25 | + |
| 26 | +When testing an application, there are numerous options, and these fall into two categories: rendered webpages, running with JavaScript and a real browser, and text-based tests which can only parse static HTML. As you may imagine, loading a full web-browser in memory is a heavy-weight task. In a previous position I worked heavily with [Selenium](https://www.selenium.dev), which has language bindings for C#, Java, Python, Ruby and other languages. Whilst our team tried to implement most of our tests in the unit-testing layer, there were instances where automated web tests added value, and meant that the QA team could be involved in the development cycle by writing User Acceptance Tests (UATs) before the developers had started coding. |
| 27 | + |
| 28 | +Selenium is still popular in the industry, and it inspired the [W3C Working Draft of a WebDriver API](https://www.w3.org/TR/webdriver/) that browsers can implement to make testing easier. |
| 29 | + |
| 30 | +The other use-case is not to test websites, but to extract information from them when an API is not available, or does not have the endpoints required. In some instances, you see a mixture of both use-cases. For instance, a company may file tax documents through a web-page using automated web-browsers, when that particular jurisdiction doesn't provide an API. |
| 31 | + |
| 32 | +### Kicking the tires with AWS Lambda |
| 33 | + |
| 34 | +I learned more recently of a friend who offers a search for Trademarks through his SaaS product, and for that purpose he chose a more modern alternative to Selenium called Puppeteer. In fact if you search StackOverflow or Google for "scraping and Lambda" you will likely see "Puppeteer" mentioned along with "headless-chrome." I was curious to try out Puppeteer with AWS Lambda, and the path was less than ideal, with friction at almost every step of the way. |
| 35 | + |
| 36 | +* The popular [chrome-aws-lambda](https://github.com/alixaxel/chrome-aws-lambda) npm module is over 40MB in size because it ships a static binary, meaning it can't be uploaded as a regular Lambda zip file, or as a Lambda layer |
| 37 | +* The zip file needs to be uploaded through a separate AWS S3 bucket in the same region as the function |
| 38 | +* The layer can then be referenced from your function. |
| 39 | +* Local testing is very difficult, and there are many StackOverflow issues about getting the right combination of npm modules |
| 40 | + |
| 41 | +I am sure that this can be done, and is being run at scale. It could be quite compelling for small businesses if they don't spend too much time fighting the above, and can stay within the free-tier. |
| 42 | + |
| 43 | + |
| 44 | + |
| 45 | +> Getting the title of a simple webpage - 15.5s |
| 46 | +
|
| 47 | +That said, OpenFaaS can run anywhere, even on a 5-10 USD VPS, and because it uses containers, it got me thinking. |
| 48 | + |
| 49 | +### Is there another way? |
| 50 | + |
| 51 | +I wanted to see whether the experience would be any better with OpenFaaS, so I set out to get Puppeteer working with it. This isn't the first time I've been there; it's something that I've come back to from time to time. Today, things seem even easier, with a pre-compiled headless Chrome browser being available from [buildkite.com](https://buildkite.com). |
| 52 | + |
| 53 | +Typical tasks involve logging into a portal and taking screenshots. Anecdotally, when I ran a simple test to navigate to a blog and take a screenshot, this took 15.5s in AWS Lambda, but only 1.6s running locally within OpenFaaS on my laptop. I was also able to build and test the function locally, the same way as in the cloud. |
| 54 | + |
| 55 | + |
| 56 | +## Walkthrough |
| 57 | + |
| 58 | +We'll now walk through the steps to set up a function with Node.js and Puppeteer, so that you can adapt an example and try out your existing tests that you may have running on AWS Lambda. |
| 59 | + |
| 60 | +### Deploy OpenFaaS |
| 61 | + |
| 62 | +What are the features we can leverage from OpenFaaS? |
| 63 | + |
| 64 | +* Extend the function's timeout to whatever we want |
| 65 | +* Run the invocation asynchronously, and in parallel |
| 66 | +* Get a HTTP callback with the result when done, such as a screenshot or test result in JSON |
| 67 | +* Limit concurrency with `max_inflight` environment variable in our `stack.yml` file to prevent overloading the container |
| 68 | +* Trigger the invocations from cron, or events like Kafka and NATS |
| 69 | +* Get rate, error and duration (RED) metrics from Prometheus, and view them in Grafana |
| 70 | + |
| 71 | +You can deploy OpenFaaS to Kubernetes or on a small VM using the faasd project. The faasd project doesn't require Kubernetes and uses the containerd project. |
| 72 | + |
| 73 | +* Docs: [Deploy OpenFaaS](https://docs.openfaas.com/deployment/) |
| 74 | + |
| 75 | +For the impatient, our arkade tool can get you up and running in less than 5 minutes: |
| 76 | + |
| 77 | +```bash |
| 78 | +curl -sLS https://get-arkade.dev | sh |
| 79 | +sudo mv arkade /usr/local/bin/ |
| 80 | + |
| 81 | +arkade get kind |
| 82 | +arkade get kubectl |
| 83 | +arkade get faas-cli |
| 84 | + |
| 85 | +arkade install openfaas |
| 86 | +``` |
| 87 | + |
| 88 | +The `arkade info openfaas` command will print out everything you need to log in and get a connection to your OpenFaaS gateway UI. |
| 89 | + |
| 90 | +### Create a function with the puppeteer-node12 template |
| 91 | + |
| 92 | +```bash |
| 93 | +# Set to your Docker Hub account or registry address |
| 94 | +export OPENFAAS_PREFIX=alexellis2 |
| 95 | + |
| 96 | +faas-cli template pull https://github.com/alexellis/openfaas-puppeteer-template |
| 97 | +faas-cli new --lang puppeteer-node12 scrape-title --prefix $OPENFAAS_PREFIX |
| 98 | +``` |
| 99 | + |
| 100 | +Let's get the title of a webpage passed in via a JSON HTTP body, then return the result as JSON. |
| 101 | + |
| 102 | +Now edit `./scrape-title/handler.js` |
| 103 | + |
| 104 | +```javascript |
| 105 | +'use strict' |
| 106 | +const assert = require('assert') |
| 107 | +const puppeteer = require('puppeteer') |
| 108 | + |
| 109 | +module.exports = async (event, context) => { |
| 110 | + let browser = await puppeteer.launch({ |
| 111 | + args: [ |
| 112 | + '--no-sandbox', |
| 113 | + '--disable-setuid-sandbox', |
| 114 | + '--disable-dev-shm-usage' |
| 115 | + ] |
| 116 | + }) |
| 117 | + |
| 118 | + const browserVersion = await browser.version() |
| 119 | + |
| 120 | + let page = await browser.newPage() |
| 121 | + let uri = "https://inlets.dev/blog/" |
| 122 | + if(event.body && event.body.uri) { |
| 123 | + uri = event.body.uri |
| 124 | + } |
| 125 | + |
| 126 | + const response = await page.goto(uri) |
| 127 | + |
| 128 | + let title = await page.title() |
| 129 | + |
| 130 | + await browser.close() |
| 131 | + return context |
| 132 | + .status(200) |
| 133 | + .succeed({"title": title}) |
| 134 | +} |
| 135 | +``` |
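The handler above passes whatever `uri` arrives in the request body straight to `page.goto()`. As a minimal sketch, assuming you only want to visit `http` and `https` pages, a small helper could validate the value first. The name `resolveUri` and its fallback behaviour are illustrative assumptions and are not part of the puppeteer-node12 template.

```javascript
'use strict'

// Hypothetical helper (not part of the template): pick the URI to visit.
// Falls back to a default when the body has no usable http(s) "uri" field.
function resolveUri(body, fallback) {
  if (!body || typeof body.uri !== 'string') {
    return fallback
  }

  let parsed
  try {
    // The WHATWG URL constructor throws a TypeError on malformed input
    parsed = new URL(body.uri)
  } catch (err) {
    return fallback
  }

  // Reject schemes such as file:// that a scraper should not open
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return fallback
  }

  return parsed.href
}
```

Inside the handler you could then write `const uri = resolveUri(event.body, "https://inlets.dev/blog/")` in place of the `if` statement.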
| 136 | + |
| 137 | +### Deploy and test the scrape-title function |
| 138 | + |
| 139 | +Deploy the `scrape-title` function to OpenFaaS. |
| 140 | + |
| 141 | +```bash |
| 142 | +faas-cli up -f scrape-title.yml |
| 143 | +``` |
| 144 | + |
| 145 | +You can run `faas-cli describe FUNCTION` to get a synchronous or asynchronous URL for use with `curl` along with whether the function is ready for invocations. The `faas-cli` can also be used to invoke functions and we'll do that below. |
| 146 | + |
| 147 | +```bash |
| 148 | +faas-cli describe scrape-title |
| 149 | + |
| 150 | +Name: scrape-title |
| 151 | +Status: Not Ready |
| 152 | +Replicas: 1 |
| 153 | +Available replicas: 0 |
| 154 | +Invocations: 0 |
| 155 | +Image: alexellis2/scrape-title:latest |
| 156 | +Function process: node index.js |
| 157 | +URL: http://127.0.0.1:8080/function/scrape-title |
| 158 | +Async URL: http://127.0.0.1:8080/async-function/scrape-title |
| 159 | +``` |
| 160 | + |
| 161 | +Try invoking the function synchronously: |
| 162 | + |
| 163 | +```bash |
| 164 | +echo '{"uri": "https://inlets.dev/blog"}' | faas-cli invoke scrape-title \ |
| 165 | + --header "Content-type=application/json" |
| 166 | + |
| 167 | +{"title":"Inlets PRO – Inlets – The Cloud Native Tunnel"} |
| 168 | +``` |
| 169 | + |
| 170 | +Running with `time curl` was 10 times faster than my test with AWS Lambda with 256MB RAM allocated. |
| 171 | + |
| 172 | +```bash |
| 173 | +time curl http://127.0.0.1:8080/function/scrape-title --data-binary '{"uri": "https://example.com"}' --header "Content-type: application/json" |
| 174 | +{"title":"Example Domain"} |
| 175 | +real 0m0.727s |
| 176 | +user 0m0.004s |
| 177 | +sys 0m0.004s |
| 178 | +``` |
| 179 | + |
| 180 | +Alternatively run async: |
| 181 | + |
| 182 | +```bash |
| 183 | +echo '{"uri": "https://inlets.dev/blog"}' | faas-cli invoke scrape-title \ |
| 184 | + --async \ |
| 185 | + --header "Content-type=application/json" |
| 186 | + |
| 187 | +Function submitted asynchronously. |
| 188 | +``` |
| 189 | + |
| 190 | +Run async, post the response to another service like [requestbin](https://requestbin.com) or another function: |
| 191 | + |
| 192 | +```bash |
| 193 | +echo '{"uri": "https://inlets.dev/blog"}' | faas-cli invoke scrape-title \ |
| 194 | + --async \ |
| 195 | + --header "Content-type=application/json" \ |
| 196 | + --header "X-Callback-Url=https://enthao98x79id.x.pipedream.net" |
| 197 | + |
| 198 | +Function submitted asynchronously. |
| 199 | +``` |
| 200 | + |
| 201 | + |
| 202 | + |
| 203 | +> Example of a result posted back to RequestBin |
| 204 | +
|
| 205 | +Each invocation has a unique `X-Call-Id` header, which can be used for tracing and connecting requests to [asynchronous responses](https://docs.openfaas.com/reference/async/). |
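To connect a callback to the request that caused it, a receiver can key results on the `X-Call-Id` header. The following is a minimal sketch, assuming you noted the call ID when submitting the invocation and run your own receiver in place of RequestBin. The names `pending`, `expectCall` and `handleCallback` are hypothetical. Node.js lowercases incoming header names, hence `x-call-id`.

```javascript
'use strict'

// Hypothetical in-memory store: call ID -> { uri, result }
const pending = new Map()

// Remember which URI was submitted under a given call ID
function expectCall(callId, uri) {
  pending.set(callId, { uri, result: null })
}

// Match a callback to its original request using the X-Call-Id header.
// `headers` follows Node.js conventions, so header names are lowercase.
function handleCallback(headers, body) {
  const callId = headers['x-call-id']
  const entry = pending.get(callId)
  if (!entry) {
    return null // unknown or missing call ID
  }
  entry.result = JSON.parse(body)
  return entry
}
```

In an `http.createServer` handler you would pass `req.headers` and the collected request body to `handleCallback`.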
| 206 | + |
| 207 | +### Take a screenshot and return it as a PNG file |
| 208 | + |
| 209 | +One of the limitations of AWS Lambda is that it can only return a JSON response. Whilst there may be good reasons for this approach, OpenFaaS allows a binary input and response for functions. |
| 210 | + |
| 211 | +Let's try taking a screenshot of the page, and capturing it to a file. |
| 212 | + |
| 213 | +```bash |
| 214 | +# Set to your Docker Hub account or registry address |
| 215 | +export OPENFAAS_PREFIX=alexellis2 |
| 216 | + |
| 217 | +faas-cli new --lang puppeteer-node12 screenshot-page --prefix $OPENFAAS_PREFIX |
| 218 | +``` |
| 219 | + |
| 220 | +Edit `./screenshot-page/handler.js` |
| 221 | + |
| 222 | +```javascript |
| 223 | +'use strict' |
| 224 | +const assert = require('assert') |
| 225 | +const puppeteer = require('puppeteer') |
| 226 | +const fs = require('fs').promises |
| 227 | + |
| 228 | +module.exports = async (event, context) => { |
| 229 | + let browser = await puppeteer.launch({ |
| 230 | + args: [ |
| 231 | + '--no-sandbox', |
| 232 | + '--disable-setuid-sandbox', |
| 233 | + '--disable-dev-shm-usage' |
| 234 | + ] |
| 235 | + }) |
| 236 | + |
| 237 | + const browserVersion = await browser.version() |
| 238 | + console.log(`Started ${browserVersion}`) |
| 239 | + let page = await browser.newPage() |
| 240 | + let uri = "https://inlets.dev/blog/" |
| 241 | + if(event.body && event.body.uri) { |
| 242 | + uri = event.body.uri |
| 243 | + } |
| 244 | + |
| 245 | + const response = await page.goto(uri) |
| 246 | + console.log("OK","for",uri,response.ok()) |
| 247 | + |
| 248 | + let title = await page.title() |
| 249 | + const result = { |
| 250 | + "title": title |
| 251 | + } |
| 252 | + await page.screenshot({ path: `/tmp/page.png` }) |
| 253 | + |
| 254 | + let data = await fs.readFile("/tmp/page.png") |
| 255 | + |
| 256 | + await browser.close() |
| 257 | + return context |
| 258 | + .status(200) |
| 259 | + .headers({"Content-type": "application/octet-stream"}) |
| 260 | + .succeed(data) |
| 261 | +} |
| 262 | +``` |
| 263 | + |
| 264 | +Now deploy the function as before: |
| 265 | + |
| 266 | +```bash |
| 267 | +faas-cli up -f screenshot-page.yml |
| 268 | +``` |
| 269 | + |
| 270 | +Invoke the function, and capture the response to a file: |
| 271 | + |
| 272 | +```bash |
| 273 | +echo '{"uri": "https://inlets.dev/blog"}' | \ |
| 274 | +faas-cli invoke screenshot-page \ |
| 275 | + --header "Content-type=application/json" > screenshot.png |
| 276 | +``` |
| 277 | + |
| 278 | +Now open `screenshot.png` and check the result. |
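If you would rather check the file programmatically than open it, every PNG starts with the same eight-byte signature. Here is a minimal sketch, where `isPng` is a hypothetical helper name:

```javascript
'use strict'

// The fixed 8-byte signature that begins every PNG file
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])

// Hypothetical helper: true when the buffer starts with the PNG signature
function isPng(data) {
  return Buffer.isBuffer(data) &&
    data.length >= PNG_SIGNATURE.length &&
    data.subarray(0, PNG_SIGNATURE.length).equals(PNG_SIGNATURE)
}
```

For example, `isPng(require('fs').readFileSync('screenshot.png'))` should be `true` when the function returned an image, and `false` if an error message was written to the file instead.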
| 279 | + |
| 280 | +### Deploy a Grafana dashboard |
| 281 | + |
| 282 | +We can observe the RED metrics from our functions using the built-in Prometheus UI, or we can deploy Grafana and access the OpenFaaS dashboard. |
| 283 | + |
| 284 | +```bash |
| 285 | +kubectl -n openfaas run \ |
| 286 | + --image=stefanprodan/faas-grafana:4.6.3 \ |
| 287 | + --port=3000 \ |
| 288 | + grafana |
| 289 | +``` |
| 290 | + |
| 291 | +```bash |
| 292 | +kubectl port-forward pod/grafana 3000:3000 -n openfaas |
| 293 | +``` |
| 294 | + |
| 295 | +Access the UI at http://127.0.0.1:3000 and log in with admin/admin. |
| 296 | + |
| 297 | + |
| 298 | + |
| 299 | +See also: [OpenFaaS Metrics](https://docs.openfaas.com/architecture/metrics/) |
| 300 | + |
| 301 | +### Hardening |
| 302 | + |
| 303 | +If you'd like to limit how many browsers can open at once, you can set `max_inflight` within the function's deployment file: |
| 304 | + |
| 305 | +```yaml |
| 306 | +version: 1.0 |
| 307 | +provider: |
| 308 | +  name: openfaas |
| 309 | +  gateway: http://127.0.0.1:8080 |
| 310 | +functions: |
| 311 | +  scrape-title: |
| 312 | +    lang: puppeteer-node12 |
| 313 | +    handler: ./scrape-title |
| 314 | +    image: alexellis2/scrape-title:latest |
| 315 | +    environment: |
| 316 | +      max_inflight: 1 |
| 317 | +``` |
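To illustrate the idea behind `max_inflight`, here is a minimal sketch of an in-flight counter that refuses work once a limit is reached. It is not the OpenFaaS watchdog's implementation. `createLimiter`, `tryAcquire` and `release` are hypothetical names used only for this example.

```javascript
'use strict'

// Hypothetical limiter: allows at most `max` requests in flight at once
function createLimiter(max) {
  let inflight = 0
  return {
    // Returns true and reserves a slot, or false when the limit is reached
    tryAcquire() {
      if (inflight >= max) {
        return false
      }
      inflight++
      return true
    },
    // Frees a slot once a request has finished
    release() {
      if (inflight > 0) {
        inflight--
      }
    },
    // Reports how many requests currently hold a slot
    inflight: () => inflight
  }
}
```

In this sketch, a caller that receives `false` has to retry later, which is the trade-off for never running more than `max` browsers at once.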
| 318 | +
|
| 319 | +A separate queue can also be configured in OpenFaaS for web-scraping with a set level of parallelism that you prefer. |
| 320 | +
|
| 321 | +See also: [Async docs](https://docs.openfaas.com/reference/async/#asynchronous-functions) |
| 322 | +
|
| 323 | +You can also set a hard limit on memory if you wish: |
| 324 | +
|
| 325 | +```yaml |
| 326 | +version: 1.0 |
| 327 | +provider: |
| 328 | +  name: openfaas |
| 329 | +  gateway: http://127.0.0.1:8080 |
| 330 | +functions: |
| 331 | +  scrape-title: |
| 332 | +    lang: puppeteer-node12 |
| 333 | +    handler: ./scrape-title |
| 334 | +    image: alexellis2/scrape-title:latest |
| 335 | +    limits: |
| 336 | +      memory: 256Mi |
| 337 | +``` |
| 338 | +
|
| 339 | +See also: [memory limits](https://docs.openfaas.com/reference/yaml/#function-memorycpu-limits) |
| 340 | +
|
| 341 | +### Long timeouts |
| 342 | +
|
| 343 | +Whilst a timeout value is required, this number can be as large as you like. |
| 344 | +
|
| 345 | +See also: [Featured Tutorial: Expanded timeouts in OpenFaaS](https://docs.openfaas.com/tutorials/expanded-timeouts/) |
| 346 | +
|
| 347 | +### Getting triggered |
| 348 | +
|
| 349 | +If you want to trigger the function periodically, for instance to generate a weekly or daily report, then you can use a cron syntax. |
| 350 | +
|
| 351 | +Users of NATS or Kafka can also trigger functions directly from events. |
| 352 | +
|
| 353 | +See also: [OpenFaaS triggers](https://docs.openfaas.com/reference/triggers/) |
| 354 | +
|
| 355 | +## Wrapping up |
| 356 | +
|
| 357 | +You now have the tools you need to deploy automated tests and web-scraping code using Puppeteer. Since OpenFaaS can leverage Kubernetes, you can use auto-scaling pools of nodes and much longer timeouts than are typically available with cloud-based functions products. OpenFaaS plays well with others such as NATS which powers asynchronous invocations, Prometheus to collect metrics, and Grafana to observe throughput and duration and share the status of the system with others in the team. |
| 358 | +
|
| 359 | +The pre-compiled versions of Chrome included with docker-puppeteer and chrome-aws-lambda will not run on a Raspberry Pi or ARM64 machine, although it may be possible to rebuild them. For speedy web-scraping from a Raspberry Pi or ARM64 server, you could consider other options such as [scrapy](https://scrapy.org). |
| 360 | +
|
| 361 | +Ultimately, I am going to be biased here, but I found the experience of getting Puppeteer to work with OpenFaaS much simpler than with AWS Lambda, and think you should give it a shot. |
| 362 | +
|
| 363 | +Find out more: |
| 364 | +
|
| 365 | +* [buildkite/docker-puppeteer](https://github.com/buildkite/docker-puppeteer) |
| 366 | +* [alexellis/openfaas-puppeteer-template](https://github.com/alexellis/openfaas-puppeteer-template) |
| 367 | +* [chrome-aws-lambda](https://github.com/alixaxel/chrome-aws-lambda) |