Grounded tracking of anything in real time using natural-language queries, built on Grounding DINO, Grounding DINO 1.5, and Ollama, which provides easy access to a variety of LLMs.
Make sure you have Anaconda installed. If not, you can download it from the Anaconda website.
Prepare your environment:
- Create the conda environment with Python 3.10 and activate it:

conda create -n dino-llama python=3.10 -y
conda activate dino-llama
Install PyTorch by following the official installation instructions for your specific system.
Install the remaining dependencies from the repository:

git clone https://github.com/Jshulgach/Grounded-SAM-2-Stream.git
cd Grounded-SAM-2-Stream
pip install -e .
Prepare the LLM model:
- Install Ollama for your OS (Linux, macOS, or Windows) by following the instructions in the Ollama README.
- Create the LLM model with the customized prompt:

ollama create dino_llama -f <path/to/repo>/llm/Modelfile
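The Modelfile shipped in `llm/` defines the base model and the customized prompt; use that file, not this one. As a rough, hypothetical sketch of how an Ollama Modelfile of this kind is structured (the base model, parameter value, and prompt text below are illustrative assumptions, not the repository's actual contents):

```
# Illustrative only -- the real Modelfile lives in the repo's llm/ directory.
FROM llama3                    # assumed base model
PARAMETER temperature 0.2      # assumed sampling setting
SYSTEM """Rewrite the user's description of an object as a short
noun-phrase prompt suitable for Grounding DINO."""
```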
Optionally, set up CUDA for GPU usage with Grounding DINO. The CUDA Toolkit can be downloaded from the NVIDIA website.
A demo is included that handles a single video directory (or a device number for a webcam). The DINOStream application creates a server and listens for messages from a client:
python dino_stream.py 0
Open a separate terminal and send a message to the server. The default server address is `localhost` and the default port is `15555`; both can be changed in the `dino_stream.py` file:
import socket

# Connect to the DINOStream server (default address and port)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('localhost', 15555))
# Send a natural-language query describing the object to track
s.send(b'I am looking for something blue that squirts water')
s.close()
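The snippet above only sends a query. Below is a self-contained round-trip sketch: it spins up a stand-in echo server on a free port so it runs without `dino_stream.py`, and whether (and in what format) the real server replies is an assumption, not something the repository documents here.

```python
import socket
import threading

HOST = "localhost"  # default server address used by dino_stream.py

def send_query(text, host, port):
    """Send a natural-language query and return the server's reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect((host, port))
        s.sendall(text.encode())
        return s.recv(1024).decode()

# Stand-in server so this sketch is runnable on its own; the real server
# listens on port 15555 and is started with `python dino_stream.py 0`.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind((HOST, 0))            # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

def _echo_once():
    conn, _ = srv.accept()
    query = conn.recv(1024)
    conn.sendall(b"received: " + query)  # reply format is an assumption
    conn.close()

threading.Thread(target=_echo_once, daemon=True).start()
reply = send_query("I am looking for something blue that squirts water", HOST, port)
print(reply)
srv.close()
```

When talking to the real server, drop the stand-in server block and call `send_query(text, "localhost", 15555)` directly.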
To run the multi-camera streaming demo, use the following command, which opens multiple webcams (assuming more than one is connected):
python multi_stream.py