Skip to content

Commit 0a0ec7c

Browse files
committed
+ README.md
1 parent b9aacda commit 0a0ec7c

File tree

1 file changed

+216
-0
lines changed

1 file changed

+216
-0
lines changed

README.md

Lines changed: 216 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,216 @@
1+
2+
# Docker Big Data Cluster
3+
4+
A ready to go Big Data cluster (Hadoop + Hadoop Streaming + Spark + PySpark) with Docker and Docker Swarm!
5+
6+
7+
## Why?
8+
9+
Although today you can find several repositories ready to deploy a Spark or Hadoop cluster, they all run into the same problem: they do not work when deployed on Docker Swarm due to several issues ranging from the definition of the worker nodes to connection problems with Docker network interfaces.
10+
11+
This repository seeks to solve the problem by offering a functional alternative, both a toy cluster to deploy on a single machine, as well as a real cluster that works on multiple nodes that conform a Docker Swarm cluster.
12+
13+
14+
## Features
15+
16+
This repository is inspired by and uses several scripts taken from [Rubenafo's repo][rubenafo-repo] and [Sdesilva26's repo][sdesilva26-repo], however there are several changes introduced; the API is simpler, there is more documentation about usage and some extra features:
17+
18+
- ✅ Ready to deploy in a Docker Swarm cluster: all the networking and port configuration issues have been fixed so you can scale your cluster to as many worker nodes as you need.
19+
- ⚡️ Hadoop, HDFS, Spark, Scala and PySpark ready to use: all the tools are available inside the container globally so you don't have to fight with environment variables and executable paths.
20+
- 🌟 New technology: our image offers Hadoop 3.2.2, Spark 3.1.1 and Python 3.8.5!
21+
- ⚙️ Less configuration: we have removed some settings to keep the minimum possible configuration, this way you prevent errors, unexpected behaviors and get the freedom to set parameters via environment variables and have an agile development that does not require rebuilding the Docker image.
22+
- 🐍 Python dependencies: we include the most used Python dependencies like Pandas, Numpy and Scipy to be able to work on datasets and perform mathematical operations (you can remove them if you don't need them!)
23+
24+
25+
## Running toy cluster
26+
27+
You have two ways to run a cluster on a single machine:
28+
29+
- Use `toy-cluster.sh` script...
30+
- Or `docker-compose.yml` file
31+
32+
33+
### Using toy-cluster.sh script
34+
35+
The script has the following commands:
36+
37+
- deploy: create a new Docker network, containers (a master and 3 workers) and start these last
38+
- start: start the existing containers
39+
- stop: stop the running containers
40+
- remove: remove all the created containers
41+
- info:: useful URLs
42+
43+
So, if you want to try your new cluster run `./toy-cluster.sh deploy` to create a network, containers and format namenode HDFS (note that this script will start the containers too). To stop, start again or remove just run the `stop`, `start` or `remove` respectively.
44+
45+
Use `./toy-cluster.sh info` to see the URLs to check Hadoop and Spark clusters status.
46+
47+
48+
### Using docker-compose.yml file
49+
50+
The `docker-compose.yml` file has the same structure than `toy-cluster.sh` script except for the use of volumes to preserve HDFS data.
51+
52+
- To start the cluster run: `docker-compose up -d`
53+
- To stop the cluster run: `docker-compose down`
54+
55+
**Important:** the use of `./toy-cluster.sh info` works with this! So you can get the useful cluster URLs.
56+
57+
58+
## Running a real cluster in Docker Swarm
59+
60+
Here is the important stuff, there are some minors steps to do to make it work: first of all you need a Docker Swarm cluster:
61+
62+
1. Start the cluster in your master node: `docker swarm init`.
63+
1. Generate a token for the workers to be added ([official doc][swarm-docs]): `docker swarm join-token worker`. It will print on screen a token in a command that must be executed in all the workers to be added.
64+
1. Run the command generated in the previous step in all workers node: `docker swarm join: --token <token generated> <HOST>:<PORT>`
65+
66+
You have your Docker Swarm cluster! Now you have to label all the nodes to indicate which one will be the *master* and *workers*. On master node run:
67+
68+
1. List all cluster nodes to get their ID: `docker node ls`
69+
1. Label master node as master: `docker node update --label-add role=master <MASTER NODE ID>`
70+
1. **For every** worker ID node run: `docker node update --label-add role=worker <WORKER NODE ID>`
71+
72+
Create needed network and volumes:
73+
74+
```
75+
docker network create -d overlay cluster_net_swarm
76+
docker volume create --name=hdsf_master_data_swarm
77+
docker volume create --name=hdsf_master_checkpoint_data_swarm
78+
docker volume create --name=hdsf_worker_data_swarm
79+
```
80+
81+
Now it is time to select a tag of the Docker image. The default is latest but it is not recommended to use it in production. After choose one, set it version on `docker-compose_cluster.yml` and the command below.
82+
83+
Only for the first time, you need to format the namenode information directory. **Do not execute this command when you are in production with valid data stored as you will lose all your data stored in the HDFS**:
84+
85+
`docker container run -v hdsf_master_data_swarm:/home/hadoop/data/nameNode jwaresolutions/big-data-cluster:<tag> /usr/local/hadoop/bin/hadoop namenode -format`
86+
87+
Now you are ready to deploy your production cluster!
88+
89+
`docker stack deploy -c docker-compose_cluster.yml big-data-cluster`
90+
91+
<!-- TODO: add ports -->
92+
93+
## Usage
94+
95+
Finally you can use your cluster! Like the toy cluster, you have available some useful URLs:
96+
97+
- \<MASTER IP>:8088 -> Hadoop panel
98+
- \<MASTER IP>:8080 -> Spark panel
99+
- \<MASTER IP>:9870 -> HDFS panel
100+
101+
Enter the master node:
102+
103+
`docker container exec -it <MASTER CONTAINER ID> bash`
104+
105+
106+
### HDFS
107+
108+
You can store files in the Hadoop Distributed File System:
109+
110+
```
111+
echo "test" > test.txt
112+
hdfs dfs -copyFromLocal ./test.txt /test.txt
113+
```
114+
115+
If you check in a worker node that the file is visible in the entire cluster:
116+
117+
`hdfs dfs -ls /`
118+
119+
<!-- ### TODO: add Hadoop -->
120+
121+
### Spark and PySpark
122+
123+
1. You can initiate a PySpark console: `pyspark --master spark://master-node:7077`
124+
1. Now, for example, read a file and count lines:
125+
126+
```python
127+
lines = sc.textFile('hdfs://master-node:9000/test.txt')
128+
lines_count = lines.count()
129+
print(f'Line count -> {lines_count}')
130+
```
131+
1. Or you can submit an script:
132+
1. Make the script:
133+
134+
```python
135+
from pyspark import SparkContext
136+
import random
137+
138+
NUM_SAMPLES = 1000
139+
140+
sc = SparkContext("spark://master-node:7077", "Pi Estimation")
141+
142+
143+
def inside(p):
144+
x, y = random.random(), random.random()
145+
return x*x + y*y < 1
146+
147+
count = sc.parallelize(range(0, NUM_SAMPLES)) \
148+
.filter(inside).count()
149+
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
150+
```
151+
152+
153+
2. Submit it: `spark-submit your-script.py`
154+
155+
156+
## Going further
157+
158+
159+
### Expand number of workers
160+
161+
Adding workers to cluster is easy:
162+
163+
1. Add a worker to your Swarm cluster as explained in [Running a real cluster in Docker Swarm](##running-a-real-cluster-in-Docker-Swarm) and label it with `role=worker`.
164+
1. Increment the number of replicas in `docker-compose_cluster.yml` for `worker` service.
165+
1. Deploy the stack again with `docker stack deploy -c docker-compose_cluster.yml big-data-cluster` (restart is no required).
166+
167+
168+
### Add files/folder inside cluster
169+
170+
In both `docker-compose.yml` (toy cluster) and `docker-compose_cluster.yml` (real cluster) there is a commented line in `volumes` section. Just uncomment it and set the the file/folder in host file and the destination inside master node in cluster! For more information read [official documentation][volumes-docs] about `volumes` setting in Docker Compose.
171+
172+
173+
### Add Python dependencies
174+
175+
1. Add the dependency to the `requirements.txt` file.
176+
1. Build the image again.
177+
178+
## Frequent problems
179+
180+
181+
### Connection refused error
182+
183+
Sometimes it throws a *Connection refused* error when run a HDFS command or try to access to DFS from Hadoop/Spark. There is [official documentation][connection-refused-docs] about this problem. The solution that works for this repository was running the commands listed in [this Stack Overflow answer][connection-refused-answer]. That is why you need to format the namenode directory the first time you are deploying the real cluster (see [Running a real cluster in Docker Swarm](##running-a-real-cluster-in-docker-swarm))
184+
185+
186+
### Port 9870 is not working
187+
188+
This problem means that Namenode is now running in master node, is associated with [Connection refused for HDFS](###connection-refused-for-hdfs) problem and has the same solution. Once Namenode is running the port should be working correctly.
189+
190+
191+
## Contributing
192+
193+
Any kind of help is welcome and appreciated! If you find a bug please submit an issue or make a PR:
194+
195+
1. Fork this repo.
196+
1. Create a branch where you will develop some changes.
197+
1. Make a PR.
198+
199+
There are some TODOs to complete:
200+
201+
- [ ] Find a way to prevent *Connection refused* error to avoid format the namenode information directory
202+
- [ ] Add examples for Hadoop
203+
- [ ] Add examples for Hadoop Streaming
204+
- [ ] Add examples for Spark Streaming
205+
206+
207+
<!--
208+
TODO: agregar explicacion de error de los puertos agregando referencias a https://docs.docker.com/engine/swarm/ingress/#bypass-the-routing-mesh
209+
-->
210+
211+
[rubenafo-repo]: https://github.com/rubenafo/docker-spark-cluster
212+
[sdesilva26-repo]: https://github.com/sdesilva26/docker-spark
213+
[swarm-docs]: https://docs.docker.com/engine/swarm/join-nodes/
214+
[volumes-docs]: https://docs.docker.com/compose/compose-file/compose-file-v3/#volumes
215+
[connection-refused-docs]: https://cwiki.apache.org/confluence/display/HADOOP2/ConnectionRefused
216+
[connection-refused-answer]: https://stackoverflow.com/a/42281292/7058363

0 commit comments

Comments
 (0)