A list of tools and whatnot under the umbrella of Data Engineering
- AWS resources:
- S3
- EC2
- RDS
- RedShift
- EMR
- Kinesis
- Athena
- Lambda
- VPC
- Glue
- SageMaker
- AWS tools:
- Boto3 (Python)
- AWS CLI
- Cron, Airflow, Oozie, Luigi, and/or AWS Step Functions
- Scheduling/Workflows: Airflow, Oozie, Luigi, Cron, and/or AWS Step Functions
- Spark
- Data Transformation: Pandas, Dask
- ML Pipelines: Numpy, Scikit-Learn
- Python
- Bash
- SQL
- Optional: Scala (for Spark), Java (for Spark, Kafka, or Storm)
- Fact-Dimensional Warehouses
- Slowly Changing Dimensions
- Star Schema, Snowflake Schema
- Index Tuning
- Query Tuning
- Transactional Processing: Lock and Block
- OLTP vs OLAP
- Lambda Architecture
- Kappa Architecture
- Batch
- Mini-Batch
- Streaming
- Click and Drag (e.g. Looker)
- SQL Based (e.g. Tableau, Looker, Mode, Periscope)
- SQL/Python/R based (e.g. Jupyter, Mode)
- RedShift
- BigQuery
- Snowflake
- RDBMS (e.g. AWS RDS, Google Cloud SQL)
- Docker
- CI/CD (CircleCI, TravisCI, or Jenkins)
- Pytest (or Unittest)
- Tox
- AWS CLI
- Bash (Awk, Grep, Sed)
- Click
- Argparse
- Python-Fire
A quick preview of some favorite tools and frameworks:
- Python
  - Airflow
  - Spark
  - Dask
  - Pandas
  - Boto3 (AWS SDK)
  - Flask
  - Pyramid
  - Scikit-Learn
  - TensorFlow
  - Apache Arrow - PyArrow
- Jupyter Notebooks
  - JupyterHub
- DockerHub
- Databases
  - Postgres
  - RedShift
  - Presto
  - MapD (the database)
  - ElasticSearch (index/search engine)
- Business Intelligence Tools
  - MapD Database, Charting, Rendering Engine by OmniSci
  - SuperSet
  - ELK Stack (Kibana)
  - Bokeh
  - Plotly
  - Dash
- Data Visualization Projects
If you're going to be a Python developer of any kind, e.g. Data Scientist, Web Developer, and/or Data Engineer, there are a few things you should set up from the beginning. Many of these will help keep sensitive information (e.g. passwords) out of your code repos so you don't accidentally commit it to GitHub. They also help make your workflows much faster.
Some potential things you should consider doing first:
- Install Miniconda (latest version)
- Set up environment variables
- Customize your Bash terminal
- Aliases for bash, awk, grep, sed, git commands
- Dotfiles
- Turbo charge your Mac development environment
I recommend installing the latest stable version of Miniconda; as of this post that was Python 3.x. You can still set up conda environments for Python 2.x, and we'll get to that later. If you have any problems with installing, read the documentation, run a web search (e.g. Google, DuckDuckGo, Brave Search), or search for help on YouTube. You got this.
In Python you're going to need an environment manager and a package manager. Conda is both. You can read more about Conda here: Conda: Myths and Misconceptions, Conda - Package Manager, Conda - Environment Manager. My personal recommendation is to use Conda to manage your environments, but NOT to manage your package installation. The main reason is that the Conda channels (e.g. defaults and conda-forge) don't have all the packages PyPI does (where pip installs from). So for package management, I recommend using pip.
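When you mix Conda environments with pip, it helps to confirm which environment a script is actually running in. The following is a minimal sketch, assuming Conda's activation has set the CONDA_DEFAULT_ENV variable; the function name describe_environment is made up for illustration.

```python
import os
import sys


def describe_environment():
    """Return basic facts about the interpreter running this script."""
    return {
        # Name of the active Conda environment, or None outside Conda.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Root directory of the environment the interpreter belongs to.
        "prefix": sys.prefix,
        # Path to the Python executable itself.
        "executable": sys.executable,
        # Interpreter version as a dotted string.
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the prefix printed here is not the environment you expected, pip is probably installing into a different environment as well.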
- Add environment variables to your .bash_profile or your .aws config files
- e.g. DB credentials, AWS credentials
For example, update your .bash_profile:
export DB_USER="my_username"
export DB_PASSWORD="my_password"
Then you can access them in your Python scripts like so.
import os
db_user = os.environ.get("DB_USER")
db_pass = os.environ.get("DB_PASSWORD")
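If a variable is missing, os.environ.get quietly returns None, which tends to surface later as a confusing connection error. One option is a small helper that fails early. This is only a sketch: require_env is a hypothetical name, and DB_USER and DB_PASSWORD simply follow the export lines above.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


if __name__ == "__main__":
    # Placeholder values for demonstration only; normally the shell provides them.
    os.environ.setdefault("DB_USER", "my_username")
    os.environ.setdefault("DB_PASSWORD", "my_password")

    db_user = require_env("DB_USER")
    db_pass = require_env("DB_PASSWORD")
    print(f"Connecting as {db_user}")
```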
Other configuration values and secrets should be stored in one of the following:
- .bashrc
- AWS - .aws/credentials
- SSH - .ssh - here you will have your id_rsa private and public SSH key files
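The .aws/credentials file is a plain INI file with one section per profile. Boto3 and the AWS CLI read it on their own, so you rarely need to parse it yourself; the sketch below only illustrates the layout. The key values, the "work" profile, and the load_profile helper are all made up for the example.

```python
import configparser

# Made-up sample in the same INI layout as ~/.aws/credentials.
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = EXAMPLEKEYID
aws_secret_access_key = exampleSecretKey

[work]
aws_access_key_id = WORKKEYID
aws_secret_access_key = workSecretKey
"""


def load_profile(text, profile="default"):
    """Parse credentials-style INI text and return one profile as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(profile):
        raise KeyError(f"No profile named {profile!r}")
    return dict(parser[profile])


if __name__ == "__main__":
    creds = load_profile(SAMPLE_CREDENTIALS, "work")
    # Print only the key names so no secret ends up in the terminal history.
    print(sorted(creds))
```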
So what's the difference between your .bash_profile and your .bashrc file? StackExchange has a great answer for that here, but simply put: .bash_profile is read by login shells, and .bashrc is read by interactive non-login shells. On a Mac, each new terminal window starts a login shell, so your .bash_profile is loaded and you have access to all of those environment variables immediately. On most Linux desktops a new terminal window is a non-login shell, so it reads .bashrc instead. If you start another bash from that command line, the new shell loads .bashrc. Other programs you launch there, e.g. python, python3, or ksh, don't read either file; they simply inherit whatever variables the parent shell exported.
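Whichever startup file sets a variable, a program started from that shell only sees it if the shell exported it. The sketch below imitates that by launching a child Python process with an explicit environment; read_in_child and DEMO_VAR are names made up for this illustration.

```python
import os
import subprocess
import sys

# One-line program for the child: print the variable named in its first argument.
CHILD_CODE = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], '<unset>'))"


def read_in_child(name, extra_env):
    """Start a child Python process and return what it sees for `name`."""
    child_env = dict(os.environ)  # copy of the parent's environment
    child_env.update(extra_env)   # plus anything we "export" for the child
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, name],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(read_in_child("DEMO_VAR", {"DEMO_VAR": "hello"}))
    print(read_in_child("DEMO_VAR_NOT_SET", {}))
```

The child never reads a startup file here; it only receives the environment handed to it, which is the same way Python picks up variables exported by your shell.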
If you want to keep some environment variables separate, you can always put a statement into your .bash_profile to load variables from another file on login. You can do this with any file written in shell syntax. Note that .aws/credentials is an INI file rather than a shell script, so it can't be sourced this way; the AWS CLI and Boto3 read it directly.
For example, put this in your .bash_profile and it will load all the environment variables from your .bashrc file:
if [ -f ~/.bashrc ]; then
source ~/.bashrc
fi
You may use whatever version/source control you like. There are two main flavors, Subversion and Git. As of this writing, Git has 3 main hosting services: GitHub, Bitbucket, and GitLab.
Then you're going to need to generate an SSH key so you can SSH from your host (e.g. your laptop or an EC2 instance) to your Git repository (e.g. GitHub).
There are some instructions on how to do this here: Generating a new SSH key and adding it to the ssh-agent
If you use multiple GitHub accounts, you have to do some extra configuration to make it work. There are a few suggestions here: SSHing into Multiple Github Accounts
For example, your ~/.ssh folder may have multiple keys:
~/.ssh/id_rsa
~/.ssh/id_rsa_home
~/.ssh/id_rsa_work
~/.ssh/id_rsa_aws
You will have to add all these keys to your SSH agent, e.g.
ssh-add ~/.ssh/id_rsa
ssh-add ~/.ssh/id_rsa_home
ssh-add ~/.ssh/id_rsa_work
ssh-add ~/.ssh/id_rsa_aws
Now, to use each key for a specific task, you'll have to pass the -i (identity file) flag with the matching key, e.g.
ssh -i ~/.ssh/id_rsa_aws ubuntu@aws-sdf-adfs-s112312.com
If you need to add SSH keys to an EC2 instance, you can find instructions for that here:
------- Not yet organized
brew update
## fetches the newest version of Homebrew and its package definitions
brew upgrade
## installs upgrades for any previously installed packages
- Getting Started with DotFiles
- Donne Martin's DotFile
- Awesome DotFiles
- Mathias Bynens DotFiles
- https://dotfiles.github.io/
- https://blog.flowblok.id.au/2013-02/shell-startup-scripts.html
- https://stackoverflow.com/questions/67699/how-to-clone-all-remote-branches-in-git?rq=1
- https://stackoverflow.com/questions/2003505/how-do-i-delete-a-git-branch-both-locally-and-remotely?rq=1
- https://stackoverflow.com/questions/61212/how-to-remove-local-untracked-files-from-the-current-git-working-tree?rq=1
- https://stackoverflow.com/questions/1628088/reset-local-repository-branch-to-be-just-like-remote-repository-head?noredirect=1&lq=1
- https://stackoverflow.com/questions/32056324/there-is-no-tracking-information-for-the-current-branch