Tweet Crawler is a command-line utility that automatically collects tweet data from the Twitter Streaming API. Using this program, you can collect data about specific user-defined topics on Twitter (e.g. "programming", "qatar", "world cup", etc.), tweets from specific locations (e.g. San Francisco, New York, etc.), or just listen to a sample of all the tweets in a given language (e.g. English, Arabic, etc.).
In addition to displaying the collected tweets on the command-line, this utility also stores the JSON for these tweets within a database (MongoDB).
Installing and running Tweet Crawler on your machine should be a pretty straightforward process. In order to do so, you first need to install suitable versions of Docker and Docker Compose.
Making use of Docker (and Docker Compose) greatly simplifies the Tweet Crawler setup process and will make it easier for you to run this program on the platform of your choice.
If you already have up-to-date versions of these programs, feel free to skip straight to the Install and Run section.
We've included links below to installation instructions for Docker Community Edition on three popular platforms: Windows, Mac, and Ubuntu (Linux).
After following these instructions, you will be able to directly run the commands in the Install and Run section.
Windows:
- Visit the Install Docker for Windows page
- Follow the instructions listed in the section called Install Docker for Windows Desktop App

Mac:
- Visit the Install Docker for Mac page
- After downloading the .dmg file, follow the instructions listed in the section called Install and Run Docker for Mac

Ubuntu (Linux):
- Visit the Get Docker CE for Ubuntu page
- Follow the instructions listed under the Install Docker CE section
- To install Docker Compose, visit the Install Docker Compose page
- Follow the instructions under the Install Compose section
- Follow the Linux post-install instructions
If you've fulfilled the above prerequisites, it's very simple to download and run Tweet Crawler from the command line (PowerShell, Terminal, etc.):
Clone the project repository from GitHub:
cd ~
git clone https://github.com/qcrisw/tweet-crawler.git
cd tweet-crawler
Before you can run Tweet Crawler, you need to provide valid credentials to access the Twitter API.
To set up these credentials, follow the instructions here.
Once you've got these in hand, run the following command:
cp env-sample .env
Edit the .env file by inserting your Twitter app credentials and save the file. Also, if you want to use the crawler with a proxy server, specify the proxy IP address within the .env file.
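As a rough illustration of how these credentials typically reach the crawler, the sketch below reads Twitter credentials from environment variables (which is how values in a .env file are usually exposed inside a container) and builds a Tweepy OAuth handler. The variable names are hypothetical placeholders, not necessarily the names used by this project; check env-sample for the real ones.

```python
import os

import tweepy

# Hypothetical variable names -- check env-sample for the ones this project actually uses.
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

# Tweepy 3.x-style OAuth 1a handler for the Twitter API.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
```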
To crawl tweets about one or more specific topics, run:
docker-compose run pycrawler track <keyword1> <keyword2> ...
...where each <keyword> is a topic you're interested in streaming from Twitter.
To sample tweets in a given language, run:
docker-compose run pycrawler sample <language-code>
...where <language-code> is an optional value drawn from the list of Twitter - Supported Languages. If you leave this value out, the crawler will sample tweets in all languages.
To crawl tweets from one or more geographic areas, run:
docker-compose run pycrawler geo <bound-box-1> <bound-box-2> ...
...where each <bound-box> is a space-separated list of four coordinates, specified in the following order (the sketch after the examples below shows the same ordering in code):
- longitude of left edge
- latitude of bottom edge
- longitude of right edge
- latitude of top edge
For example, the following will crawl all tweets from the United States (minus Alaska & Hawaii):
docker-compose run pycrawler geo -125.2 25.6 -66.9 49.6
In addition to a single area, it's possible to crawl tweets from multiple, disjoint areas by specifying multiple bounding boxes.
For example, the following will crawl all tweets from South Korea and the United States (minus Alaska & Hawaii):
docker-compose run pycrawler geo 123.7 32.7 131.1 39.0 -125.2 25.6 -66.9 49.6
For best results, you can use a tool like BoundingBox, with output mode set to CSV, to fine-tune the selected bounding box.
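To make the coordinate ordering concrete, here is a small, self-contained Python sketch (illustrative only; the real crawler may assemble its request differently) that flattens one or more (west, south, east, north) boxes into the single flat coordinate list that the Streaming API's location filter expects:

```python
# Each box follows the same order as the geo command:
# (longitude of left edge, latitude of bottom edge,
#  longitude of right edge, latitude of top edge)
US_MAINLAND = (-125.2, 25.6, -66.9, 49.6)
SOUTH_KOREA = (123.7, 32.7, 131.1, 39.0)

def flatten_boxes(*boxes):
    """Flatten bounding boxes into one flat list of coordinates."""
    return [coord for box in boxes for coord in box]

print(flatten_boxes(SOUTH_KOREA, US_MAINLAND))
# -> [123.7, 32.7, 131.1, 39.0, -125.2, 25.6, -66.9, 49.6]
```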
If you are only interested in using Tweet Crawler, then you can simply ignore this section.
However, if you would like to modify the source code and/or contribute to the project, reading this section is essential to building the source code on your development machine.
At a high level, the Tweet Crawler development process consists of three main steps, which are repeated many times:
- Update the source code with new changes
- Build Docker images for the updated version of the project
- Run the project to test the updated codebase
The source code can be updated using any IDE or editor of your choice.
We've already covered how to run Tweet Crawler in the Running the Tweet Crawler section.
As such, we will only give detailed instructions on how to build the Docker images required to run a development version of the Tweet Crawler.
- Add the build key to the pycrawler service in docker-compose.yaml:

      pycrawler:
        build: .
        image: qcrisw/pycrawler:latest
        <rest of service definition>

- Update the image key for the pycrawler service in docker-compose.yaml:

      pycrawler:
        build: .
        image: <your-org-name-here>/pycrawler:latest
        <rest of service definition>

- Clone the qcrisw/task-worker repository:

      git clone https://github.com/qcrisw/task-worker.git

- Add the build key to the rqworker service in docker-compose.yaml:

      rqworker:
        build: ./task-worker
        image: qcrisw/rqworker:latest
        <rest of service definition>

- Update the image key for the rqworker service in docker-compose.yaml:

      rqworker:
        build: ./task-worker
        image: <your-org-name-here>/rqworker:latest
        <rest of service definition>

- Run docker-compose build from the project directory
- Wait until the build process is complete
At this point, you can run the Tweet Crawler project using the newly built Docker images, as outlined in the Running the Tweet Crawler section.
If you want to re-build the project after updating the source code, simply re-run the docker-compose build command from the main project directory.
- How do I add new types of tasks to the message queue?
  - Define a new task function in the mq/tasks.py file (an illustrative task sketch follows this FAQ)
  - Add any new dependencies to requirements.txt
  - Run cp mq/tasks.py task-worker/mq/tasks.py from the main project directory
  - Run cp requirements.txt task-worker/requirements.txt from the main project directory
- Why does Docker still run my old code even after I update the source code?
  - By default, docker-compose run does not rebuild the Docker images, so the containers keep using the previously built code
  - In order to run the updated code, you need to run docker-compose build followed by docker-compose run
- Where can I find the collected tweet data?
  - The collected tweet data is stored as a set of "daily collections" named using "Year_Month_Day" format (UTC) in MongoDB
  - To access the tweet objects stored in these collections, use the following steps:
    - Run docker-compose ps to see the full list of running containers
    - Find the name of the running mongo container (e.g. tweet-crawler_mongo_1)
    - Run docker exec -it <name-of-mongo-container> bash, followed by mongo social-analytics
    - You can use the Mongo Shell Command Reference to query & access the data contained in MongoDB (a PyMongo query sketch follows this FAQ)
- How can I export the raw data outside MongoDB?
  - Run docker-compose ps to see the full list of running containers on the host machine
  - Find the name of the running mongo container (e.g. tweet-crawler_mongo_1)
  - Run docker exec -it <name-of-mongo-container> bash
  - Run mongoexport --db=social_analytics --collection=<daily-collection-name> --out=<output-file-name>
  - Wait until the export process is completed
  - Run exit to leave the mongo container
  - From the host machine, run docker container cp tweet-crawler_mongo_1:<output-file-name> <output-file-name>
  - At this point, a dump of the raw tweet data is present on your host machine with the given <output-file-name>
- How can I start up MongoDB without starting up the crawler?
  - To stop all running containers in the Tweet Crawler project, run docker-compose down
  - To start up only MongoDB, run docker-compose run --detach mongo
- Why can't I see the logs of rqworker containers using docker logs?
  - Due to certain implementation issues, the logs of each rqworker are not accessible via docker logs
  - However, you can still access the rqworker logs using the following steps:
    - Run docker-compose ps to list all running containers
    - Find the name of a running rqworker container (e.g. tweet-crawler_rqworker_1)
    - Run docker exec -it <name-of-rqworker-container> bash
    - Run cat logs.txt to view the logs generated by the worker
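As a companion to the FAQ entry on adding new task types, here is a minimal, hypothetical example of what a task function in mq/tasks.py could look like. RQ tasks are ordinary Python functions that the workers import and execute, so any plain function defined in that module can be enqueued; the function name and tweet fields below are illustrative and not part of the existing codebase.

```python
# mq/tasks.py -- illustrative example only, not part of the existing codebase

def count_hashtags(tweet):
    """Return the number of hashtags in a tweet's JSON dictionary."""
    entities = tweet.get("entities", {})
    return len(entities.get("hashtags", []))
```

And as a companion to the FAQ entry on finding the collected data, the following PyMongo sketch queries one of the daily collections directly from Python. It assumes the MongoDB container's port 27017 is reachable from wherever you run the script and that the database is named social_analytics; adjust the host, database, and collection names to match your setup.

```python
from pymongo import MongoClient

# Adjust the host/port if your mongo container is published differently.
client = MongoClient("localhost", 27017)
db = client["social_analytics"]

collection = db["2019_01_15"]  # daily collections are named Year_Month_Day (UTC)
for tweet in collection.find({"lang": "en"}).limit(5):
    print(tweet.get("text"))
```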
- Tweepy - Twitter SDK for Python
- PyMongo - Python driver for MongoDB
- MongoDB - Database for long-term storage of tweets
- RQ - Python-based Redis Queue library
- Redis - In-memory DB for storing tweets
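To show how these pieces can fit together, here is a compressed, hypothetical sketch of the overall flow: Tweepy receives tweets from the Streaming API, RQ pushes a storage task onto a Redis-backed queue for each tweet, and a worker later writes the raw JSON to MongoDB. This is a simplification that uses assumed service and database names and the Tweepy 3.x StreamListener API; the real modules, queue names, and database layout may differ.

```python
from datetime import datetime

import tweepy
from pymongo import MongoClient
from redis import Redis
from rq import Queue

# Redis-backed task queue ("redis" is an assumed docker-compose service name).
queue = Queue(connection=Redis(host="redis"))

def store_tweet(tweet_json):
    """Task executed by an RQ worker: persist one raw tweet into MongoDB."""
    db = MongoClient("mongo", 27017)["social_analytics"]       # assumed host/database names
    daily_collection = datetime.utcnow().strftime("%Y_%m_%d")  # Year_Month_Day (UTC)
    db[daily_collection].insert_one(tweet_json)

class CrawlerListener(tweepy.StreamListener):
    def on_status(self, status):
        # Hand the raw tweet JSON off to the queue instead of writing it inline,
        # so the stream loop never blocks on database I/O.
        queue.enqueue(store_tweet, status._json)

# `auth` would be the OAuth handler built from the .env credentials, e.g.:
# stream = tweepy.Stream(auth=auth, listener=CrawlerListener())
# stream.filter(track=["world cup"], languages=["en"])
```

Queuing the write through RQ keeps the streaming connection responsive even when MongoDB is slow, which is presumably why the project pairs Redis/RQ with MongoDB for long-term storage.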
This project is licensed under the MIT License - see the LICENSE.md file for details