- Node.js
  - Required version: Node.js 18.x or higher recommended
  - Download from nodejs.org
- Yarn Package Manager
  - Required since the project uses a yarn.lock file
  - Install it with:

    ```bash
    npm install -g yarn
    ```
- Ollama with Llama3.2 3B and nomic-embed-text
  - Required for running the RAG pipeline
  - Download the latest version from ollama.ai and follow the installation instructions for your operating system
  - Make sure Ollama is installed properly by running the following in your terminal:

    ```bash
    ollama -v
    ```

  - Download Llama3.2 3B and nomic-embed-text:

    ```bash
    ollama pull llama3.2
    ollama pull nomic-embed-text
    ```
- Git
  - Download from git-scm.com
- Clone the project repository to your local machine:

  ```bash
  git clone [repository-url]
  cd [project-directory]
  ```
- Install dependencies:

  ```bash
  yarn
  ```
- Fill out the environment variables if needed:

  ```bash
  NEXT_PUBLIC_USE_OPENAI=false # set this to 'true' to use OpenAI
  NEXT_PUBLIC_OPENAI_API_KEY= # also set this if you want to use OpenAI
  ```
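These flags arrive as plain strings from the `.env` file, so the provider switch has to compare text rather than booleans. A minimal sketch of how such a switch could read them (the function name and validation logic are illustrative assumptions, not the app's actual code):

```typescript
// Sketch: pick a model provider from the env flags above.
// pickProvider is a hypothetical helper, not part of the app's code.
type Provider = "openai" | "ollama";

function pickProvider(env: Record<string, string | undefined>): Provider {
  // Values from .env files are strings, so compare against "true" explicitly.
  const useOpenAI = env.NEXT_PUBLIC_USE_OPENAI === "true";
  if (useOpenAI && !env.NEXT_PUBLIC_OPENAI_API_KEY) {
    throw new Error(
      "NEXT_PUBLIC_OPENAI_API_KEY must be set when NEXT_PUBLIC_USE_OPENAI=true"
    );
  }
  // Default to the local Ollama models when the flag is unset or "false".
  return useOpenAI ? "openai" : "ollama";
}
```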
- Run the development server:

  ```bash
  yarn dev
  ```
- Start Ollama by running the following in your terminal:

  ```bash
  ollama serve
  ```
The application should now be ready for use. Visit http://localhost:3000 to view the application.
Feel free to jump right in and try it out with the included use-this-pdf.pdf file from a cool class I took.
Note: the first message will take a bit of time to generate when you use Ollama, but subsequent messages will be much faster.
The table below shows responses from Llama3.2 3B.
Question | Answer |
---|---|
What are Crista's Five Laws of Performant Software? | Crista Videira Lopes' Five Laws of Performant Software include: 1. Programming language ≪ Programmers' awareness of performance. 2. Performance improvements = log(controlled experiments). 3. N*bad != good (No amount of nodes, cores, memory, etc., will save you from poorly written code). The other two laws are not explicitly mentioned in the provided context, but they can be inferred based on general knowledge about Crista's Five Laws of Performant Software. If you need more information, please let me know and I'll try to find it for you. |
What is the maximum latency recommended for systems that need to respond instantaneously? | According to the context, 100ms (0.1 seconds) is the maximum recommended latency for systems that purport to respond instantaneously, as stated by Jakob Nielsen in his article "Response times: The 3 important limits" from January 1993. |
How does the document explain the difference between bandwidth and latency? | The document explains the difference between bandwidth and latency through an example of various communications technologies, such as high latency and high bandwidth. It states that while both can be present in a system, they serve different purposes: "We will focus on completing the items (doing useful work), not on transmitting information..." This implies that bandwidth is about transmitting information, whereas latency is about the time it takes to complete tasks or respond to users. |
According to the lecture, why is parallelization not always the answer to performance problems? | According to the context, parallelization is not always the answer because of "dependencies" and "coordination overhead". Specifically, a task can't start processing until it knows what it's supposed to process, and combining results from multiple tasks can be difficult. |
What was the specific problem mentioned in the anecdote about report generation taking three hours, and how was it solved? | According to the context, the anecdote mentions a report generation task that took three hours, but was actually only running for 30 minutes. The server had timed out the task due to reaching its time limit. |
Here's a quick demo of using tokto and an overview of its features.
Before I explain the choices I made, I want to list the stretch goals that influenced my decisions:
- I wanted to build a chatbot with great UX/UI
- I wanted features for debugging and tracing the RAG pipeline
- I wanted the whole chatbot to run locally
I was motivated to choose those stretch goals because I am interested in the observability and interpretability of RAG pipelines, and I wanted to build the chatbot I wished I had back in university when the WiFi didn't work 🙁
I measured frameworks based on the following criteria:
- Ease of use: How fast can you iterate and build using the framework?
- Effectiveness: Can it help me build a high-quality RAG pipeline and a good chatbot?
I picked the above criteria because they are important for building products.
I decided to use Langchain because it has a lot of integrations and utilities for quickly building a chatbot, especially a fully local one. I was able to use it to build a RAG pipeline that runs on a web worker and to integrate it with Ollama quickly and reliably. While LlamaIndex has great parsing and indexing capabilities, Langchain makes up for it with a much more robust and flexible framework and a larger library of integrations.
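At a high level, the pipeline boils down to retrieve → format context → generate. A stubbed sketch of that shape (in the real app these steps are wired to Langchain and Ollama inside a web worker; the function and parameter names here are illustrative):

```typescript
// Sketch of the RAG flow: retrieve docs, build a prompt, generate an answer.
// retrieve() and generate() are stand-ins for the Langchain/Ollama calls.
interface Doc {
  pageContent: string;
}

async function answerQuestion(
  question: string,
  retrieve: (q: string) => Promise<Doc[]>,
  generate: (prompt: string) => Promise<string>
): Promise<string> {
  const docs = await retrieve(question);
  // Number the chunks so the debug panel (and the model) can reference them.
  const context = docs.map((d, i) => `[${i + 1}] ${d.pageContent}`).join("\n");
  const prompt = `Answer using only this context:\n${context}\n\nQuestion: ${question}`;
  return generate(prompt);
}
```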
I measured embedding models based on the following criteria:
- Speed: how fast are inferences for the embeddings?
- Ease of use: how easy is it to use the embeddings?
- Evaluation score average: how well does the embedding perform on average across all tasks?
- Cost: how much does it cost to use the embeddings?
Once again, I picked the above metrics because they are important for building products. Speed is needed for a seamless experience, and a high average score is important for getting relevant retrieval results. Lastly, cost and ease of use are important for building a product that is affordable and usable for students.
When evaluating embedding models, I consulted the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard). I decided on nomic-embed-text-v1 because it's one of the best open-source local embedding models, with a great average score across all tasks. It is a bit slower than the other options, but still fast enough for a chatbot. Lastly, it is easy to host and use with Ollama.
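Under the hood, retrieval over these embeddings is just nearest-neighbor search by cosine similarity between the query vector and each chunk vector. A minimal sketch, with hand-made vectors standing in for nomic-embed-text outputs:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the ids of the k chunks most similar to the query embedding.
function topK(
  query: number[],
  chunks: { id: string; vec: number[] }[],
  k: number
): string[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.vec) - cosine(query, x.vec))
    .slice(0, k)
    .map((c) => c.id);
}
```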
For model providers, I measured them based on the following criteria:
- Cost: how much does it cost to run the model?
- Ease of use: can I run the model locally easily?
- Performance: how well does the model learn from long and short contexts?
- Context window size: how many tokens can the model handle?
Cost and ease of use are important for building a product that is affordable for students, especially since they would be hosting the model on their own computers. Performance and context window size are important for giving relevant and accurate answers to questions about users' short assignments or long notes.
Since it's important to have a model that is easy to run locally and cost-effective, I looked at open-source models that were significantly smaller, in the 1B-5B parameter range. I ended up choosing Llama3.2 3B, as it is very lightweight and generally performed better on evaluations than offerings like Gemma 2 2B IT and Phi-3.5-mini IT. It also has a context window of 128k tokens, which is great for both short and long contexts.
For a backup model, I picked GPT-4o, since it is fairly cheap at $2.50 per 1M tokens, performs very well on retrieval evaluations, and also has a 128k-token context window.
In the end, I was able to build a chatbot that runs sufficiently fast locally and gives relevant answers. The debug panel was also very useful for tracing the retrieved documents and the generated query, and for quickly swapping out the system prompt. Picking Langchain was a great choice as well, since it simplified tedious tasks like managing chat history and integrating with external tools like Ollama, letting me focus on creating a good chatbot.
The WebPDF parser I used was not good at handling documents with sophisticated formatting or images, so it was difficult to extract certain details from research papers. Since I was building a locally running RAG pipeline, I was limited to the WebPDF parser (after researching alternatives). To compensate, I improved the parsed PDF content by configuring text splitting with separators and adding spacing to the parsed text, for better clarity and to deal with the rich formatting.
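The separator-based splitting works roughly like this: try coarse separators (paragraph breaks) first, and fall back to finer ones until every chunk fits under a size limit. A simplified sketch of the idea (the app's actual splitter and configuration differ):

```typescript
// Split text on the first separator that keeps pieces under maxLen,
// recursing with finer separators for pieces that are still too long.
function splitText(
  text: string,
  maxLen: number,
  separators: string[] = ["\n\n", "\n", " "]
): string[] {
  if (text.length <= maxLen) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: hard-cut the text into maxLen-sized pieces.
    const out: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) out.push(text.slice(i, i + maxLen));
    return out;
  }
  return text
    .split(sep)
    .filter((piece) => piece.length > 0)
    .flatMap((piece) => splitText(piece, maxLen, rest));
}
```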
At first, I wanted to retrieve documents only when necessary. However, the model did not give me relevant answers. I debugged by checking the traced-retrieved-documents panel and realized that the retrieval tool was never being called. As a solution, I decided to retrieve documents for every message, since self-hosting made this cost-effective. Overall, it slowed inference, but it allowed me to get relevant results.
The model gave answers but struggled with some specific details. After researching simple techniques, I improved the answers by rewriting the user's question into a better query for retrieving relevant context. I also prompt-engineered the system prompt, adding better formatting to the context and rules that make responses more concise and relevant.
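The rewrite step can be sketched as a small prompt that asks the model for a standalone search query before retrieval runs. The prompt wording below is illustrative, not the app's exact prompt:

```typescript
// Build the prompt used to rewrite a chat question into a standalone
// retrieval query. The wording is an illustrative assumption.
function buildRewritePrompt(history: string[], question: string): string {
  return [
    "Rewrite the user's question as a standalone search query.",
    "Use the chat history only to resolve references like 'it' or 'that'.",
    "",
    "Chat history:",
    ...history.map((m) => `- ${m}`),
    "",
    `Question: ${question}`,
    "Standalone query:",
  ].join("\n");
}
```

The model's completion of this prompt, rather than the raw chat message, is then embedded and used for retrieval.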
- I would like to improve the readability and maintainability of the codebase by adding E2E tests and refactoring states into custom hooks and contexts.
- I would like to use a better PDF parser, potentially the Llama parser.
- I would like to use more advanced RAG techniques, specifically hybrid search (using both semantic and keyword search) with a reordering technique to improve the retrieval results.
- I would like to improve the UI by adding a button to cancel a sent message, a chat history panel, and a way to reset the chat without refreshing the page.
- I also want to add a voice chat feature, since it's a feature I use a lot in my daily life.
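For the hybrid-search item above, one common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion. A sketch of the idea (this is a plan, not code from the app; k = 60 is the conventional default constant):

```typescript
// Reciprocal rank fusion: merge a semantic ranking and a keyword ranking.
// Each document scores sum(1 / (k + rank)) across the lists it appears in,
// so documents ranked well by both retrievers rise to the top.
function fuseRankings(semantic: string[], keyword: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [semantic, keyword]) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```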