A modular, pluggable data ingestion and processing pipeline for Retrieval-Augmented Generation (RAG) systems.
Collects, cleans, chunks, and vectorizes unstructured content from PDFs, text files, manual entries, and web sources.
- 📂 Ingests content from:
  - PDFs, `.txt`, and `.md` files
  - Hardcoded/manual knowledge entries
  - Websites crawled from a configurable base URL
- 🧹 Cleans & normalizes text using regex, HTML tag removal, etc. (see the sketch after this list)
- 🔪 Smart chunking by:
  - Character size
  - Sentence
  - Paragraph
- 🧠 Embeds and stores documents in a FAISS vector DB using HuggingFace Sentence Transformers
- ⚙️ Fully configurable via `config.py`
- 💻 CLI support for scripted runs
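The cleaning and chunking steps map onto `utils/cleaner.py` and `utils/chunker.py`; the sketch below is a minimal illustration of the three chunking strategies above, with hypothetical function names rather than the module's actual API:

```python
import re

# Illustrative cleaning rules; the real ones live in utils/cleaner.py.
def clean_text(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)   # drop leftover HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()

# One chunker per strategy from the feature list (names are hypothetical).
def chunk_by_chars(text: str, size: int = 500) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_by_sentence(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_by_paragraph(text: str) -> list[str]:
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```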
| Layer | Tool/Library |
|---|---|
| Embeddings | `sentence-transformers` (MiniLM) |
| Vector DB | FAISS |
| Parsing | `PyPDF2`, `requests`, `BeautifulSoup` |
| Framework | LangChain, Python |
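As a rough picture of how these layers fit together, here is a minimal, self-contained sketch using LangChain's community integrations; it is not the repo's actual `storage.py` code, and the MiniLM model name is assumed from the table above:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Embedding model assumed from the table; config.py may specify another.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

chunks = ["RAG pairs retrieval with generation.", "FAISS indexes dense vectors."]
store = FAISS.from_texts(chunks, embeddings)
store.save_local("vector_store")  # mirrors the vector_store/ output directory

# Reload and run a similarity search against the stored chunks.
store = FAISS.load_local(
    "vector_store", embeddings, allow_dangerous_deserialization=True
)
print(store.similarity_search("What does FAISS do?", k=1))
```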
```
RAG_data_collector_module/
├── main.py                   # CLI runner
├── __init__.py               # Package bootstrap
├── collector.py              # Document collection logic
├── config.py                 # Central config
├── storage.py                # JSON & FAISS storage
├── RG_LLM_Integration.ipynb  # Llama, LoRA fine-tuning
├── sources/                  # Ingestion modules
│   ├── files.py              # PDF, text file input
│   ├── robots.py             # robots.txt handling
│   ├── manual.py             # Hardcoded content
│   └── web.py                # Web scraping
├── utils/                    # Preprocessing
│   ├── cleaner.py            # Cleaning rules
│   └── chunker.py            # Chunking strategies
└── vector_store/             # Output vector DB
```
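Since everything is driven by `config.py` but its fields aren't listed here, the following shape is purely illustrative (every name below is an assumption, not the file's actual contents):

```python
# config.py -- illustrative shape only; actual field names may differ.
BASE_URL = "https://example.com"    # root URL for the web scraper
CHUNK_STRATEGY = "paragraph"        # "character" | "sentence" | "paragraph"
CHUNK_SIZE = 500                    # characters per chunk for character chunking
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
VECTOR_STORE_DIR = "vector_store"   # FAISS output directory
```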
```bash
git clone https://github.com/AmanS2501/RAG_data_collector_module.git
cd RAG_data_collector_module
conda create -n chatbotenv python=3.10
conda activate chatbotenv
pip install -r requirements.txt
cd ..   # the module runs as a package, so invoke it from the parent (root) folder
```
```bash
python -m RAG_data_collector_module
```

Optional flags:

- `--collect-only`: skip vector storing
- `--no-chunk`: skip chunking
- `--clean`: remove old files and vectors
- `--stats`: print collected metadata
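For example, a run that collects documents and prints metadata without touching the vector store might look like this (whether these flags combine this way depends on `main.py`'s argument parsing, which isn't shown here):

```bash
python -m RAG_data_collector_module --collect-only --stats
```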