MetaScraper

MetaScraper is a web application that allows users to upload a CSV file containing a list of URLs. The system asynchronously scrapes each URL to extract meta tags—such as title, description, and keywords—and stores the results in a PostgreSQL database. The application is built with FastAPI, utilizes Celery for background task processing, Redis as a message broker, and Docker for containerization. Also it is deployed and managed using Kubernetes, ensuring scalability and high availability.

Features

User Authentication: Secure login system to ensure only authenticated users can access the functionalities.
CSV Upload: Users upload a list of URLs for scraping.
Asynchronous Processing: URLs are scraped in the background, allowing users to continue with other tasks.
Progress Tracking: Check the status of ongoing scraping tasks.
Results Viewing: Retrieve and view the extracted metadata for each URL.

Architecture

The application is composed of the following components:

FastAPI: Serves as the web framework for building the API.
PostgreSQL: Stores user data, urls, uploaded files, and the extracted metadata.
Redis: Acts as a message broker for Celery tasks.
Celery: Manages asynchronous background tasks for scraping URLs.
Docker: Containerizes the application for consistent and reproducible deployments.
Kubernetes: For deploying and managing the application efficiently.

Getting Started

Installation

Clone the Repository:

git clone https://github.com/youngroma/MetaScraper.git
cd MetaScraper

Environment Variables:
- fill '.env' the file with environment variables
- fill 'k8s/secrets.yaml' file (in base64 format) with environment variables (for kubernetes)
Replace variables with your desired credentials.
Build and Start the Application:

Use Docker Compose to build and start all services:
```
docker-compose up --build
```
This command will start the following containers:
- The FastAPI application.
- The PostgreSQL database.
- The Redis message broker.
- The Celery worker for background tasks.
Apply Database Migrations:

Once the containers are running, apply the database migrations:
```
docker-compose exec web alembic upgrade head
```
This ensures that the database schema is up-to-date.
Deploying with Kubernetes
- Apply the Kubernetes manifests from the k8s/ directory:
```
kubectl apply -f k8s/
```
- Verify the deployment:
```
kubectl get pods -n metascraper
```

Usage

Access the Application:

The FastAPI application will be available at http://localhost:8000.
Uploading a CSV File:

To upload a CSV file containing URLs:
- Authenticate using your credentials.
- Upload CVS file (for example with using Postmnan)
Using Postman for API Testing

Importing the Postman Collection

In this repository you have Postman collection 'MetaScraper.postman_collection.json' :
- Open Postman.
- Click Import (top-left corner).
- Select "File", then choose MetaScraper.postman_collection.json.
- Click Import to add the collection.
- Before making any API requests, you need to log in and obtain an authentication token.
  - In the response, copy the "access_token".
  - Go to the Authorization tab in Postman - Select Bearer Token - Paste the copied token into the field.
- Now you can use the predefined API requests to interact with the system.
- Create a CSV file with URLs and upload it!
Checking Scraping Progress:

After uploading, you can check the status of your scraping task by accessing the /status/{task_id} endpoint, where {task_id} is the ID returned upon uploading the CSV.
Viewing Extracted Metadata:

Once scraping is complete, retrieve the extracted metadata through the /results/{task_id} endpoint.
Monitoring and Metrics (Prometheus + Grafana)

MetaScraper integrates Prometheus for collecting metrics and Grafana for visualizing them. This setup enables real-time monitoring of system performance and health.
- Prometheus is configured to scrape metrics from FastAPI, Redis, and PostgreSQL. It collects relevant data on system performance, including request counts, processing times, and health status.
- Grafana is used to visualize the metrics collected by Prometheus. It allows you to create dashboards for monitoring the performance of the FastAPI application, Redis, and PostgreSQL, providing insights into system health and response times.
- Metrics Endpoint The application exposes a /metrics endpoint that provides real-time metrics for Prometheus. This endpoint collects data such as request counts, response times, and system performance, which can be scraped by Prometheus for monitoring and visualization in Grafana.

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
app		app
k8s		k8s
tests		tests
worker		worker
.env		.env
Dockerfile		Dockerfile
MetaScraper.postman_collection.json		MetaScraper.postman_collection.json
README.md		README.md
docker-compose.yml		docker-compose.yml
main.py		main.py
postgres.conf		postgres.conf
prometheus.yml		prometheus.yml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

MetaScraper

Features

Architecture

Getting Started

Installation

Usage

About

Uh oh!

Releases

Packages

Uh oh!

Languages

youngroma/MetaScraper

Folders and files

Latest commit

History

Repository files navigation

MetaScraper

Features

Architecture

Getting Started

Installation

Usage

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages