A quick piece of code to run a scraper in the cloud. You can run it daily as a cron
job.
Builds a Docker container that mimics the AWS Lambda runtime environment so you can build and test your app locally, then uploads an AWS CloudFormation stack to AWS, creating a function and a layer. You must create a bucket to hook it up to and also add a CloudWatch Events (EventBridge) trigger to run it daily.
Uses the tiny "html_table_parser" module to convert an HTML table to a list of Python lists. pandas (what I'd usually use) is too large to fit within the 250 MB size limit for the AWS layer once the essential binaries (Selenium, the webdriver, Chromium) are added.
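To check whether a built layer stays under the 250 MB limit before uploading, something like the following could be used. It is a minimal sketch, assuming the limit applies to the unzipped contents and equals 250 * 1024 * 1024 bytes. The function names are made up for illustration, and the ZIP files you pass in would be whatever your build produces.

```python
import io
import zipfile

# Assumed value of the 250 MB unzipped limit, in bytes.
LAYER_LIMIT_BYTES = 250 * 1024 * 1024


def unzipped_size(zip_source):
    """Return the total uncompressed size, in bytes, of every entry in a ZIP.

    zip_source may be a file path or a seekable file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return sum(info.file_size for info in archive.infolist())


def fits_in_layer(*zip_sources, limit=LAYER_LIMIT_BYTES):
    """True if the combined unzipped size of all archives is within the limit."""
    return sum(unzipped_size(src) for src in zip_sources) <= limit


if __name__ == "__main__":
    # Tiny in-memory archive so the sketch runs without any build output.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("bin/placeholder", b"x" * 1000)
    print(unzipped_size(demo), fits_in_layer(demo))
```

Passing both the function ZIP and the layer ZIP to fits_in_layer gives a combined total, which is the number that matters if you attach more than one layer.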
- Python 3.7
- Chromium 86.0.4240.0
- Chromedriver 86.0.4240.22.0
- Selenium 3.14
- Selenium
- Chromium
- pygsheets module
- google sheet JSON authorization file
- time module
- datetime module (yes both are needed)
- this HTML table parsing module
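
The table-parsing dependency turns an HTML table into nested Python lists. As a rough illustration of that technique (not the html_table_parser module's own code or API), here is a standard-library sketch. The class name TableToLists and the helper parse_tables are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableToLists(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table seen so far
        self._row = None   # row currently being filled, if any
        self._cell = None  # text fragments of the current cell, if any

    def _close_cell(self):
        # Flush the open cell (if any) into the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        # Flush the open row (if any) into the current table.
        self._close_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._close_row()
            self.tables.append([])
        elif tag == "tr":
            self._close_row()
            if self.tables:
                self._row = []
        elif tag in ("td", "th"):
            self._close_cell()
            if self._row is not None:
                self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def parse_tables(html):
    """Return every table in the HTML string as a list of lists of strings."""
    parser = TableToLists()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == "__main__":
    page = "<table><tr><th>Name</th><th>Qty</th></tr><tr><td> apples </td><td>3</td></tr></table>"
    print(parse_tables(page))  # [[['Name', 'Qty'], ['apples', '3']]]
```

In the Lambda function, the string passed in would be the page source returned by the webdriver.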
- Adapt the file /src/lambda_function.py for your target URI
- Add your modules/bin dependencies to the requirements.txt file
- Run the make commands:
  - make docker-build + make docker-run to build and test your local Docker container
  - make lambda-layer-build + make lambda-function-build to build your ZIP files of source code and layer requirements
  - make BUCKET=your_bucket_name create-stack to upload the stack to your specified AWS bucket
- To run create-stack you will need to have the AWS command line tools installed:
  - change the install location in the makefile; the path is currently set as /Users/your_username/bin/aws
  - log in to the AWS CLI with aws configure and input your AWS access key ID, secret access key, region and output format (which should be JSON in all caps)
  - now run the create-stack command and it will upload the stack to the given bucket (as long as your AWS account has access to that bucket)
- AWS settings:
  - EventBridge rule is cron(0 13 * * ? *), which runs daily at 13:00 UTC. That is 8am CDT/9am EDT during daylight saving time, and 7am CST/8am EST otherwise
  - Layer should hook up automatically, but it is always worth checking. You can add additional layers, but the 250 MB cap applies to the function and all of its layers combined
  - You can tailor the memory and timeout to your needs, though with pygsheets it's always good to allow a little extra time so it can talk to the Google API
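
The EventBridge cron expression is evaluated in UTC, so the local run time shifts by an hour with daylight saving. A minimal sketch for checking what a given UTC hour means locally, assuming fixed offsets for US Central and Eastern time (the helper name local_run_times is hypothetical):

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets, so the sketch needs no time zone database.
OFFSETS = {
    "CST": timezone(timedelta(hours=-6), "CST"),
    "CDT": timezone(timedelta(hours=-5), "CDT"),
    "EST": timezone(timedelta(hours=-5), "EST"),
    "EDT": timezone(timedelta(hours=-4), "EDT"),
}


def local_run_times(hour_utc=13, minute_utc=0):
    """Map each zone abbreviation to the local HH:MM of a daily UTC run time."""
    # The calendar date is arbitrary; only the time of day matters here.
    run = datetime(2021, 1, 1, hour_utc, minute_utc, tzinfo=timezone.utc)
    return {name: run.astimezone(tz).strftime("%H:%M") for name, tz in OFFSETS.items()}


if __name__ == "__main__":
    print(local_run_times())
    # {'CST': '07:00', 'CDT': '08:00', 'EST': '08:00', 'EDT': '09:00'}
```

If the scraper has to run at the same local time all year, the hour field of the rule would need changing when the clocks change.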
This package is licenced under the GPL v3. See the file LICENSE.
It is a fork of Vittorio Nardone's pychromeless scraper that just makes an image of the result (link).