This project uses Large Language Models (LLMs) and prompt engineering to build a research assistant that answers your questions based on what is available in the vector database stored on your Deep Lake account.
- All files can only run after installing all dependencies in the `environment.yml` file
- The `notebook` folder contains the Jupyter notebook file for testing the project as a whole and for experimenting
- The `vector_database_creation.py` file is for creating the vector database resource for the LLM
- The `rag_research_assistant_main.py` file is the main driver code for the research assistant
- The initial data resources for the database creation can be found in the `research_articles.zip` file
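As background on what the two scripts do, the indexing step amounts to: split the articles into chunks, embed each chunk as a vector, and store the vectors so semantically similar text can be retrieved later. Here is a minimal, self-contained sketch of that idea (a toy bag-of-words embedding and in-memory store stand in for the real Cohere embeddings and Deep Lake database, so every name below is illustrative):

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """Keeps (embedding, chunk) pairs and returns the chunks closest to a query."""
    def __init__(self):
        self.rows = []

    def add(self, chunk: str) -> None:
        self.rows.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 1) -> list[str]:
        scored = sorted(self.rows, key=lambda row: cosine(row[0], embed(query)), reverse=True)
        return [chunk for _, chunk in scored[:k]]

store = ToyVectorStore()
store.add("Logistic regression was the best classifier for child development myths")
store.add("Transformers dominate natural language processing benchmarks")
print(store.search("which classifier performed best?"))
```

The real pipeline replaces `embed` with Cohere embeddings and `ToyVectorStore` with a Deep Lake dataset, but the retrieval logic is the same shape.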
Before installing the dependencies in the `environment.yml` file, kindly do the following first:
- Download and install Anaconda
- Once Conda is installed, open your CMD and run the following command: `C:/Users/your_system_name/anaconda3/Scripts/activate`
- You should see something like `(anaconda3) C:\Users\your_system_name\Desktop>` as an output in your CMD. NB: Do not close the CMD terminal; it will be needed later on
- Sign up for Cohere
- Once your account is created, navigate to API keys in your profile and create a Trial Cohere API key. BE SURE TO COPY IT
- Sign up for Activeloop (your vector database)
- Once your account is created, navigate to API tokens in your profile and create your API token. BE SURE TO COPY IT
- Sign up for Hugging Face (access to models)
- Once your account is created, navigate to Access Tokens and create an access token of `read` only. BE SURE TO COPY YOUR ACCESS TOKEN
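The three keys you just copied will later be pasted into the project's `.env` file, which is just `KEY=value` lines. Here is a minimal stdlib sketch of loading such a file; the real project likely uses the `python-dotenv` package, and the variable names shown are assumptions, not necessarily the ones the code expects:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=value lines into os.environ.
    (python-dotenv does this more robustly in practice.)"""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Example .env content (names are illustrative):
#   COHERE_API_KEY=...
#   ACTIVELOOP_TOKEN=...
#   HUGGINGFACEHUB_API_TOKEN=...
```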
- Navigate to your Desktop and create a new folder called `research_assistant`, then paste the `environment.yml` file into the folder
- On your CMD, navigate into the `research_assistant` folder using `cd research_assistant`
- Run `conda env create -f environment.yml -p ../research_assistant/rag` on your CMD
- Run `conda env list` on your CMD to list all environments created using Anaconda
- Run `conda activate C:\Users\your_system_name\Desktop\research_assistant\rag` on your CMD to activate the environment
- You should see something like `(rag) C:\Users\your_system_name\Desktop\research_assistant>` as an output in your CMD
- Run `conda list` on your CMD to check if all dependencies have been installed
- Paste all your tokens in the `.env` file
- Activate your conda environment as previously shown: from the `(anaconda3) C:\Users\your_system_name\Desktop>` prompt, run `conda activate C:\Users\your_system_name\Desktop\research_assistant\rag`
- Navigate to the folder of your project, `research_assistant`, using `cd research_assistant`
- Navigate to the folder of your project, `vector_base_creation`, using `cd vector_base_creation`
- Run `python vector_database_creation.py` to create your vector database
- Navigate to the folder of your project, `rag_research_assistant`, using `cd rag_research_assistant`
- Run `python rag_research_assistant_main.py` to run your research assistant.
For example prompts, refer to prompts.md
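Under the hood, the assistant's query step boils down to: retrieve the most relevant chunks, prepend them as context, and send the combined prompt to the LLM. A sketch of that prompt assembly follows; the template, metadata keys, and sample data are illustrative and not taken from the actual code:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a RAG prompt: retrieved context first, then the question.
    (Template and metadata keys are assumptions for illustration.)"""
    context = "\n\n".join(
        f"[{c['title']}, p.{c['page']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using only the context below, and cite your sources.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

chunks = [
    {"title": "Analysis of child development facts and myths", "page": 1,
     "text": "Logistic Regression achieved a 90% accuracy rate."},
]
prompt = build_prompt("What was the best classifier?", chunks)
print(prompt)
```

The string returned here is what would be sent to the Cohere model; the sources embedded in the context are what allow the assistant to cite pages in its answer.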
When `vector_database_creation.py` is run, the following output is given when all criteria are met.
When `rag_research_assistant_main.py` is run, the following example output is given:
**Question**

Why did Mehedi Tajrian analyse child development and what was the best classifier?

**Answer**
Mehedi Tajrian analyzed child development due to:
- The rapid spread of misinformation online complicating accurate decision-making, especially for parents.
- The lack of research into distinguishing myths and facts about child development using text mining and classification models.
- The potential risks of inaccurate information on child treatment and development.
- To provide valuable insights for making informed decisions, thus aiding parents in handling misinformation.
- To shed light on myths around child development and aid in making informed decisions. These include several stages, including data pre-processing through text mining techniques, and analysis with six traditional machine learning classifiers and one deep learning model using two feature extraction techniques.
- The best performing classifier is the Logistic Regression (LR) model with a 90% accuracy rate. The model also stands out for its speed and efficiency, with very low testing times per statement, and demonstrated robust performance on both k-fold and leave-one-out cross-validation.
Source(s):
- Title: Analysis of child development facts and myths using text mining techniques and classification models, Page: 1
- Title: Analysis of child development facts and myths using text mining techniques and classification models, Page: 15
- Title: Analysis of child development facts and myths using text mining techniques and classification models, Page: 2
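A `Source(s):` block like the one above can be produced by deduplicating the metadata of the retrieved chunks. An illustrative sketch (the `title`/`page` keys are assumptions about how the metadata is stored):

```python
def format_sources(chunks: list[dict]) -> str:
    """Render a 'Source(s):' block from retrieved-chunk metadata,
    skipping duplicate (title, page) pairs."""
    seen, lines = set(), ["Source(s):"]
    for c in chunks:
        key = (c["title"], c["page"])
        if key not in seen:
            seen.add(key)
            lines.append(f"- Title: {c['title']}, Page: {c['page']}")
    return "\n".join(lines)

print(format_sources([
    {"title": "Analysis of child development facts and myths", "page": 1},
    {"title": "Analysis of child development facts and myths", "page": 1},
    {"title": "Analysis of child development facts and myths", "page": 15},
]))
```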
Here is the Publication on;
I did experience an issue with the Hugging Face platform, but it was solved thanks to the open-source community! Highly grateful to you all!
Happy prompting and may the RAG be with you young JEDI!