Verification Engine: AI-Powered Technology Stack Analysis

This Script is a Python-based tool designed to analyze web pages and PDF documents for the presence of specific keywords. It extracts relevant text, determines if the keyword is used in a meaningful context, and provides an explanation for its findings.

Key Technologies & Libraries

The Problem It Solves

Manually verifying a company's technology stack is a time-consuming and often inaccurate process. This engine was built to automate and enhance this process, providing reliable, data-driven insights with minimal human effort.

Challenge (The Manual Way)	Solution (Verification Engine)
High Time & Effort	Automates the analysis of hundreds of URLs, saving countless hours of manual work.
Low Accuracy	Uses generative AI to understand context, distinguishing real usage from casual mentions or irrelevant matches.
Difficult to Scale	Processes large CSV files in batches, making it possible to analyze entire market segments.
Unstructured Data	Outputs clean, structured data in CSV and JSON formats, ready for immediate analysis and reporting.

Key Features

Multi-Format Content Analysis: Seamlessly processes both live websites (HTML) and PDF documents, extracting text from complex layouts.
AI-Powered Contextual Verification: Leverages Google's Gemini large language model to move beyond simple keyword matches. It analyzes the surrounding text to determine if a technology is being used operationally, mentioned in passing, or used in a different context.
Intelligent Date Extraction: Employs a multi-layered heuristic approach to identify the publication or modification date of content by searching URL patterns, metadata, and document text.
Scalable & Resilient Processing: Built to handle large datasets from a CSV input. A robust checkpointing system saves progress to both CSV and JSON formats, allowing you to resume lengthy analysis tasks without data loss.
Structured Data Output: Generates clean, analysis-ready output in both CSV and JSON formats, detailing the company, domain, keyword, verification status, and a clear explanation from the AI model.

Getting Started

Follow these instructions to get a local copy up and running.

Prerequisites

Python 3.9 or higher
An active Google Gemini API key.

Installation & Configuration

Clone the repository:

git clone https://github.com/Krishna9588/Verification_Engine.git
cd Verification_Engine

Install dependencies using a Virtual Environment:

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

# Install required libraries from the requirements file
pip install -r requirements.txt

Configure API Keys: Open the explain_url.py file and add your Google Gemini API keys to the API_KEYS list. The script includes a key rotation feature to cycle through multiple keys if needed.
```
# In explain_url.py
API_KEYS = ["YOUR_API_KEY_1",
            "YOUR_API_KEY_2",
            # Add more keys if you have them
           ]
```

Usage

The Verification Engine is designed to be run from the command line and uses a CSV file for batch processing.

1. Prepare the Input CSV

Create a CSV file (e.g., input/my_companies.csv) with the following columns:

company_name: The name of the company (optional, can be derived from the URL).
domain: The company's primary domain (e.g., example.com).
keyword: The technology or software keyword to search for (e.g., AWS, VMware).
company_url: The specific URL of the page or PDF to analyze.

Example input.csv:

company_name,domain,keyword,company_url
Birlasoft,birlasoft.com,AWS,https://www.birlasoft.com/services/enterprise-products/aws
Example,"example.com",Glue,https://www.example.com/sustainability-report.pdf

2. Run the Engine

Execute the main script from the project's root directory. The script will guide you through the process.

python main_working_json.py

3. Review the Output

Upon completion, the script will generate:

A CSV file in the results_csv/ directory.
A JSON file in the results_json/ directory.
Checkpoint files in the checkpoint/ and checkpoint_json/ directories, which log each result as it's processed.

Contributing

Contributions are welcome! If you have suggestions for improving the engine, please feel free to fork the repository, make your changes, and submit a pull request. You can also open an issue to report bugs or suggest new features.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.idea		.idea
__pycache__		__pycache__
checkpoint		checkpoint
checkpoint_json		checkpoint_json
extract		extract
input		input
results_csv		results_csv
results_json		results_json
Readme.md		Readme.md
explain.py		explain.py
explain_s.py		explain_s.py
explain_url.py		explain_url.py
info.py		info.py
kimi.py		kimi.py
kimi_2.py		kimi_2.py
main_working_json.py		main_working_json.py
main_working_test.py		main_working_test.py
prompts		prompts
requirements.txt		requirements.txt
test.py		test.py
test_api.py		test_api.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Verification Engine: AI-Powered Technology Stack Analysis

Key Technologies & Libraries

The Problem It Solves

Key Features

Getting Started

Prerequisites

Installation & Configuration

Usage

1. Prepare the Input CSV

2. Run the Engine

3. Review the Output

Contributing

About

Uh oh!

Releases

Packages

Languages

Krishna9588/Verification_Engine

Folders and files

Latest commit

History

Repository files navigation

Verification Engine: AI-Powered Technology Stack Analysis

Key Technologies & Libraries

The Problem It Solves

Key Features

Getting Started

Prerequisites

Installation & Configuration

Usage

1. Prepare the Input CSV

2. Run the Engine

3. Review the Output

Contributing

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages