This project provides a local development environment for AWS Glue using VS Code Dev Containers.
- Docker Desktop
- Visual Studio Code
- VS Code Dev Containers extension (formerly Remote - Containers)
- AWS CLI
Follow the official guide, Install Docker Engine on Ubuntu, then allow your user to run Docker without sudo:
# Create docker group if it doesn't exist
sudo groupadd docker
# Add current user to docker group
sudo usermod -aG docker $USER
# NOTE: Log out and log back in for group changes to take effect
# After logging back in, verify Docker works without sudo:
docker run hello-world
For more information, see Visual Studio Code on Linux.
VS Code is officially distributed as a Snap package in the Snap Store.
You can install it by running:
sudo snap install --classic code # or code-insiders
Once installed, the Snap daemon takes care of automatically updating VS Code in the background. You get an in-product update notification whenever a new update is available.
- Open VS Code
- Press Ctrl+Shift+X to open Extensions
- Search and install:
  - Dev Containers (formerly Remote - Containers)
  - Python
  - Jupyter
To install the AWS CLI, follow the guide Installing or updating to the latest version of the AWS CLI.
# Configure AWS credentials
aws configure
# Enter your:
# - AWS Access Key ID
# - AWS Secret Access Key
# - Default region
# - Default output format
Ensure your system meets these minimum requirements:
- 8GB RAM (16GB recommended)
- 20GB free disk space
- 4 CPU cores (8 recommended)
To check system resources:
# Check RAM
free -h
# Check disk space
df -h
# Check CPU cores
nproc
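The same checks can be scripted. The sketch below compares a machine against the minimums listed above. It assumes Linux (it reads /proc/meminfo), and the function and constant names are illustrative, not part of this project.

```python
import os
import shutil

# Minimums taken from the requirements list above.
MIN_RAM_GB = 8
MIN_DISK_GB = 20
MIN_CPU_CORES = 4


def total_ram_gb(meminfo_path="/proc/meminfo"):
    """Return total RAM in GiB, read from the MemTotal line (Linux only)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                kib = int(line.split()[1])  # the value is reported in kB
                return kib / (1024 ** 2)
    raise RuntimeError("MemTotal not found in " + meminfo_path)


def evaluate(ram_gb, free_disk_gb, cpu_cores):
    """Return a dict mapping each resource to True if it meets the minimum."""
    return {
        "ram": ram_gb >= MIN_RAM_GB,
        "disk": free_disk_gb >= MIN_DISK_GB,
        "cpu": cpu_cores >= MIN_CPU_CORES,
    }


if __name__ == "__main__":
    free_disk_gb = shutil.disk_usage("/").free / (1024 ** 3)
    results = evaluate(total_ram_gb(), free_disk_gb, os.cpu_count() or 0)
    for name, ok in results.items():
        print(f"{name}: {'OK' if ok else 'below minimum'}")
```

MemTotal excludes memory reserved by the kernel and firmware, so a nominal 8GB machine may report slightly less than 8 GiB. Treat a near miss on RAM as a pass.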
After completing these steps, proceed to the "Getting Started" section.
- Clone this repository
- Open the repository in VS Code
- When prompted, click "Reopen in Container" or run the "Dev Containers: Reopen in Container" command (named "Remote-Containers: Reopen in Container" in older versions of the extension)
- Wait for the container to build and start (this may take a few minutes)
- AWS Glue version 4.0 environment
- Apache Spark 3.3.0-amzn-1
- JupyterLab interface
- Python development tools
- Pre-configured VS Code settings
- Spark UI: 4040
- Spark History Server: 18080
- Livy: 8998
- JupyterLab: 8889
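To see which of these services are reachable once the container is up, a small TCP probe is usually enough. The sketch below assumes the ports are forwarded to localhost on the machine where it runs; the helper names are hypothetical.

```python
import socket

# Ports listed above; the service names are only labels for the report.
SERVICE_PORTS = {
    "Spark UI": 4040,
    "Spark History Server": 18080,
    "Livy": 8998,
    "JupyterLab": 8889,
}


def is_port_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


def check_services(ports=SERVICE_PORTS):
    """Return a dict mapping each service name to its reachability."""
    return {name: is_port_open(port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, reachable in check_services().items():
        print(f"{name}: {'reachable' if reachable else 'not reachable'}")
```

The Spark UI is served by a running Spark application, so port 4040 is expected to be closed while no Spark session is active.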
- Use the integrated terminal to run Glue jobs
- Debug Python scripts with VS Code's debugging tools
- Access AWS services using your local credentials
- Configure AWS credentials in your local ~/.aws/credentials file
- The container will automatically mount these credentials
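Before opening the container, it can help to confirm that the credentials file has the expected entries. This is a minimal sketch that inspects only the file itself; it assumes a profile named "default", and the function name is hypothetical. Credentials supplied through environment variables or SSO are not detected by it.

```python
import configparser
from pathlib import Path

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credential_keys(path=Path.home() / ".aws" / "credentials",
                            profile="default"):
    """Return the required keys missing from the given profile.

    If the file or the profile does not exist, every required key is
    reported as missing.
    """
    parser = configparser.ConfigParser()
    parser.read(path)  # a nonexistent file simply yields no sections
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(profile, key)]


if __name__ == "__main__":
    missing = missing_credential_keys()
    print("credentials look complete" if not missing else f"missing: {missing}")
```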
See test.ipynb for an example of:
- Setting up a Spark session
- Creating a DynamicFrame
- Basic data transformation
The project includes pytest examples for testing PySpark code. To run the tests:
pytest
The tests/test_spark_transformations.py file demonstrates:
- Setting up PySpark test fixtures
- Testing DataFrame transformations
- Testing DataFrame filters and aggregations
- Best practices for unit testing with PySpark
Example test:
def test_filter_by_department(spark, sample_data):
    filtered_df = sample_data.filter(sample_data.department == "Engineering")
    assert filtered_df.count() == 2
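The example test relies on two fixtures, spark and sample_data, whose definitions are not shown here. A minimal sketch of what they might look like follows; the rows, the column names other than department, and the session settings are assumptions chosen so that exactly two rows belong to Engineering. The project's real fixtures may differ.

```python
import pytest


@pytest.fixture(scope="session")
def spark():
    """Create one local SparkSession shared by the whole test run."""
    # Imported inside the fixture so test collection does not start Spark.
    from pyspark.sql import SparkSession

    session = (
        SparkSession.builder
        .master("local[1]")
        .appName("pytest-pyspark")
        .getOrCreate()
    )
    yield session
    session.stop()


@pytest.fixture
def sample_data(spark):
    """Hypothetical employee rows; two of them are in Engineering."""
    rows = [
        ("Alice", "Engineering", 100),
        ("Bob", "Engineering", 90),
        ("Carol", "Sales", 80),
    ]
    return spark.createDataFrame(rows, ["name", "department", "salary"])
```

Fixtures like these typically live in tests/conftest.py so that every test module can request them by name. Session scope avoids paying the Spark startup cost once per test.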
This project uses pre-commit hooks to ensure code quality and consistency. A pre-commit hook is a script, part of Git's hooks system, that runs automatically before a commit is completed, so checks run before any change is recorded.
The pre-commit configuration consists of two main files:
- .pre-commit-config.yaml: Defines the hooks and their sources
- pyproject.toml: Contains Python-specific tool configurations
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0  # Version of pre-commit-hooks to use
    hooks:
      - id: trailing-whitespace  # Removes trailing whitespace
      - id: end-of-file-fixer  # Ensures files end with newline
      # ...other hooks...
  - repo: https://github.com/psf/black
    rev: 23.9.1  # Version of Black to use
    hooks:
      - id: black  # Python code formatter
[tool.isort]
profile = "black" # Makes isort compatible with Black
skip_glob = ["migrations/*"] # Directories to skip
[tool.codespell]
skip = ["*.json","*.yaml"] # Files to skip
builtin = "clear,rare,informal,usage,code,names" # Dictionary types to use
ignore-words-list = "dne,iam,IAM,jupyter,master,thead" # Words to ignore
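A quick way to confirm that the pyproject.toml fragment above is valid TOML is to parse it. The sketch below uses tomllib, which is in the standard library from Python 3.11, and falls back to the tomli backport that appears in this environment's package list. The helper name configured_tools is hypothetical.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # backport exposing the same API

# The tool configuration shown above, with comments removed.
PYPROJECT_FRAGMENT = """
[tool.isort]
profile = "black"
skip_glob = ["migrations/*"]

[tool.codespell]
skip = ["*.json","*.yaml"]
builtin = "clear,rare,informal,usage,code,names"
ignore-words-list = "dne,iam,IAM,jupyter,master,thead"
"""


def configured_tools(toml_text):
    """Return the sorted names of the [tool.*] tables in a TOML document."""
    return sorted(tomllib.loads(toml_text).get("tool", {}))


print(configured_tools(PYPROJECT_FRAGMENT))  # prints ['codespell', 'isort']
```

A syntax error in the file surfaces as a tomllib.TOMLDecodeError, which points at the offending line before any hook runs.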
- trailing-whitespace: Removes trailing whitespace
- end-of-file-fixer: Ensures files end with a newline
- black: Python code formatter
- isort: Sorts Python imports
- pydocstyle: Checks Python docstring formatting
- detect-private-key: Prevents committing private keys
- check-added-large-files: Prevents large files from being committed
- check-yaml: Validates YAML syntax
- yamllint: Lints YAML files
- yamlfmt: Formats YAML files
- codespell: Checks for common misspellings
- Install pre-commit: pip install pre-commit
- Install the pre-commit hooks: pre-commit install
- Hooks run automatically on git commit
- Run manually on all files: pre-commit run --all-files
- Run a specific hook: pre-commit run <hook-id>
- If JupyterLab doesn't start, check the logs in /tmp/postStart.out
- Ensure Docker has sufficient resources allocated
- Verify AWS credentials are properly configured
This environment includes two Python versions:
Available packages:
Package Version
------------------------- --------------
aiobotocore 2.4.1
aiohappyeyeballs 2.4.3
aiohttp 3.8.3
aioitertools 0.11.0
aiosignal 1.3.1
anyio 4.6.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
asttokens 2.4.1
async-lru 2.0.4
async-timeout 4.0.2
asynctest 0.13.0
attrs 22.2.0
autovizwidget 0.21.0
avro-python3 1.10.2
babel 2.16.0
beautifulsoup4 4.12.3
bleach 6.1.0
boto 2.49.0
boto3 1.24.70
botocore 1.27.59
certifi 2021.5.30
cffi 1.17.1
chardet 3.0.4
charset-normalizer 2.1.1
click 8.1.3
comm 0.2.2
cryptography 43.0.1
cycler 0.10.0
Cython 0.29.32
debugpy 1.8.6
decorator 5.1.1
defusedxml 0.7.1
docutils 0.17.1
enum34 1.1.10
exceptiongroup 1.2.2
executing 2.1.0
fastjsonschema 2.20.0
fqdn 1.5.1
frozenlist 1.3.3
fsspec 2021.8.1
gssapi 1.9.0
h11 0.14.0
hdijupyterutils 0.21.0
httpcore 1.0.6
httpx 0.27.2
idna 2.10
importlib-metadata 5.0.0
iniconfig 2.0.0
ipykernel 6.29.5
ipython 8.28.0
ipywidgets 8.1.5
isoduration 20.11.0
jedi 0.19.1
Jinja2 3.1.4
jmespath 0.10.0
joblib 1.0.1
json5 0.9.25
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2023.12.1
jupyter 1.1.1
jupyter_client 8.6.3
jupyter-console 6.6.3
jupyter_core 5.7.2
jupyter-events 0.10.0
jupyter-lsp 2.2.5
jupyter_server 2.14.2
jupyter_server_terminals 0.5.3
jupyterlab 4.2.5
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
jupyterlab_widgets 3.0.13
kaleido 0.2.1
kiwisolver 1.4.4
krb5 0.7.0
MarkupSafe 3.0.0
matplotlib 3.4.3
matplotlib-inline 0.1.7
mistune 3.0.2
mpmath 1.2.1
multidict 6.0.4
nbclient 0.10.0
nbconvert 7.16.4
nbformat 5.10.4
nest-asyncio 1.6.0
nltk 3.7
notebook 7.2.2
notebook_shim 0.2.4
numpy 1.23.5
overrides 7.7.0
packaging 23.0
pandas 1.5.1
pandocfilters 1.5.1
parso 0.8.4
patsy 0.5.1
pexpect 4.9.0
Pillow 9.4.0
pip 23.0.1
platformdirs 4.3.6
plotly 5.16.0
pluggy 1.5.0
pmdarima 2.0.1
prometheus_client 0.21.0
prompt_toolkit 3.0.48
psutil 6.0.0
ptvsd 4.3.2
ptyprocess 0.7.0
pure_eval 0.2.3
pyarrow 10.0.0
pycparser 2.22
pydevd 2.5.0
Pygments 2.18.0
pyhocon 0.3.58
PyMySQL 1.0.2
pyparsing 2.4.7
pyspnego 0.11.1
pytest 8.3.3
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2021.1
PyYAML 6.0.1
pyzmq 26.2.0
referencing 0.35.1
regex 2022.10.31
requests 2.23.0
requests-kerberos 0.15.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rpds-py 0.20.0
s3fs 2022.11.0
s3transfer 0.6.0
scikit-learn 1.1.3
scipy 1.9.3
seaborn 0.12.2
Send2Trash 1.8.3
setuptools 49.1.3
six 1.16.0
sniffio 1.3.1
soupsieve 2.6
sparkmagic 0.21.0
stack-data 0.6.3
statsmodels 0.13.5
subprocess32 3.5.4
sympy 1.8
tbats 1.1.0
tenacity 9.0.0
terminado 0.18.1
threadpoolctl 3.1.0
tinycss2 1.3.0
tomli 2.0.2
tornado 6.4.1
tqdm 4.64.1
traitlets 5.14.3
types-python-dateutil 2.9.0.20241003
typing_extensions 4.12.2
uri-template 1.3.0
urllib3 1.25.11
wcwidth 0.2.13
webcolors 24.8.0
webencodings 0.5.1
websocket-client 1.8.0
wheel 0.37.0
widgetsnbextension 4.0.13
wrapt 1.14.1
yarl 1.8.2
zipp 3.10.0
Available packages:
Package Version
------------------ -----------
backcall 0.2.0
debugpy 1.7.0
decorator 5.1.1
entrypoints 0.4
exceptiongroup 1.2.2
importlib-metadata 6.7.0
iniconfig 2.0.0
ipykernel 6.16.2
ipython 7.34.0
jedi 0.19.2
jupyter-client 7.4.9
jupyter-core 4.12.0
matplotlib-inline 0.1.6
nest-asyncio 1.6.0
numpy 1.21.6
packaging 24.0
pandas 1.3.5
parso 0.8.4
pexpect 4.9.0
pickleshare 0.7.5
pip 20.2.2
pluggy 1.2.0
prompt-toolkit 3.0.48
psutil 6.1.1
ptyprocess 0.7.0
pygments 2.17.2
pytest 7.4.4
python-dateutil 2.9.0.post0
pytz 2024.2
pyzmq 26.2.0
setuptools 49.1.3
six 1.17.0
tomli 2.0.1
tornado 6.2
traitlets 5.9.0
typing-extensions 4.7.1
wcwidth 0.2.13
zipp 3.15.0
Here are some helpful related projects for AWS Glue development:
- Pytest for AWS Glue - Examples of pytest implementation for AWS Glue jobs
- glue-libs-devcontainer - Docker development container for AWS Glue
- glue_libs_sso - AWS Glue libraries with SSO support