
Synthetic Data Generator (LangChain)

A LangChain-powered synthetic data generator for government and high-security environments: generate realistic, contextually aware datasets for POCs and MVPs without touching sensitive data, backed by built-in PII protection, Azure OpenAI integration, and enterprise-grade security.

Overview

The Challenge: The Data Access Paradox

In government agencies and high-security organizations, developers face a critical chicken-and-egg problem: you need data to build proof-of-concepts (POCs), proof-of-value (POVs), and minimum viable products (MVPs), but you can't access the real data due to security, privacy, and clearance restrictions.

This creates an impossible situation:

  • Security teams won't release sensitive data for development
  • Privacy regulations (GDPR, HIPAA, government classifications) prevent data sharing
  • Clearance requirements may exclude developers from accessing classified datasets
  • Development timelines can't wait for lengthy security approval processes
  • Innovation stalls without realistic data to test and validate solutions

The Solution: Contextually-Aware Synthetic Data Generation

The Synthetic Data Generator breaks this deadlock by creating realistic, contextually-aware synthetic datasets that preserve the statistical properties and semantic relationships of original data while ensuring complete privacy and security compliance.

Key Value Propositions:

  • Accelerate Development: Build POCs and MVPs without waiting for data access approvals
  • Maintain Privacy: Generate synthetic data that contains no actual PII or sensitive information
  • Preserve Context: Maintain semantic relationships and statistical properties of original datasets
  • Enable Innovation: Allow developers to work with realistic data regardless of clearance level
  • Compliance Ready: Built-in PII protection ensures regulatory compliance from day one

Technical Excellence

The Synthetic Data Generator combines the LangChain framework, document chunking and embedding techniques, and Azure OpenAI to generate new content that resembles the original data in both content and context, while providing enterprise-grade security and privacy protection.

Features

  • Load original data from various sources using LangChain document loaders
  • Process data through intelligent chunking and embedding with LangChain
  • Store embeddings in vector databases (Azure Cognitive Search, FAISS)
  • Generate synthetic content using Azure OpenAI integration
  • Support for multiple search methods with LangChain vector stores
  • Built-in PII protection using Presidio analyzer

Azure Products Used

  • Azure OpenAI Service: For embeddings and synthetic content generation
  • Azure Cognitive Search: For vector search capabilities (optional)

Workflow Architecture

Custom Embeddings Approach

This codebase implements a custom embeddings pipeline with manual orchestration:

1. Custom Embeddings Generation

  • Uses the Embedder class with Azure OpenAI embeddings
  • Gives manual control over embedding model selection and parameters
  • Supports flexible, configurable embedding strategies (see the sketch below)
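
A minimal sketch of this step, assuming the langchain-openai package and the deployment names from the .env example later in this README (the repo's Embedder class presumably wraps something similar):

import os
from langchain_openai import AzureOpenAIEmbeddings

# Embedding client pointed at the Azure OpenAI deployment configured in .env
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    azure_deployment=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT"),
)

vector = embeddings.embed_query("A single sentence to embed")
print(len(vector))  # embedding dimensionality, e.g. 1536 for text-embedding-ada-002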

2. Manual Pipeline Orchestration

Documents are processed through separate, coordinated steps:

Load → Preprocess → Chunk → Embed → Store

Step-by-Step Workflow:

  1. Load: Multi-format document loading via LangChain loaders
  2. Preprocess: Text cleaning and PII protection
  3. Chunk: Intelligent text splitting with RecursiveCharacterTextSplitter
  4. Embed: Vector generation using Azure OpenAI
  5. Store: Vector storage in Azure Cognitive Search or FAISS
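
Wired together with plain LangChain components, the five steps look roughly like this (a sketch, not the repo's actual pipeline.py; embeddings comes from the snippet above and "data/sample.txt" is a hypothetical input file):

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# 1. Load
documents = TextLoader("data/sample.txt").load()

# 2. Preprocess: PII anonymization happens here (see PII Protection Features below)

# 3. Chunk, using the CHUNK_SIZE / CHUNK_OVERLAP defaults from .env
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# 4 + 5. Embed and store; FAISS.from_documents embeds the chunks internally
vector_store = FAISS.from_documents(chunks, embeddings)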

3. Separate Vector Store Management

  • LangChainVectorStore handles vector operations independently
  • Unified interface for Azure Search and FAISS backends
  • Manual embedding injection into vector storage
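
The "manual embedding injection" bullet corresponds to LangChain's ability to build a store from precomputed vectors; a minimal sketch, reusing embeddings and chunks from the snippets above:

# Precompute vectors outside the store, then inject them
texts = [doc.page_content for doc in chunks]
vectors = embeddings.embed_documents(texts)

vector_store = FAISS.from_embeddings(
    text_embeddings=list(zip(texts, vectors)),
    embedding=embeddings,  # still required so future queries can be embedded
)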

Pipeline Flow Diagram

┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│   DataLoader   │───▶│  Preprocessor  │───▶│    Chunker     │
│  (LangChain)   │    │(PII Protection)│    │  (LangChain)   │
└────────────────┘    └────────────────┘    └────────────────┘
                                                     │
┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│  Vector Store  │◀───│    Embedder    │◀───│   Documents    │
│  (LangChain)   │    │ (Azure OpenAI) │    │   (Chunked)    │
└────────────────┘    └────────────────┘    └────────────────┘
                                                     │
┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│     Output     │◀───│   Generator    │◀───│     Search     │
│  (Synthetic)   │    │ (Azure OpenAI) │    │  (Similarity)  │
└────────────────┘    └────────────────┘    └────────────────┘
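
The Search step in the diagram maps onto LangChain's standard similarity-search API; for example, with the FAISS store built above (the query string is illustrative):

# Retrieve the stored chunks most similar to a seed query; these become
# the grounding context for synthetic generation
similar_docs = vector_store.similarity_search("benefits enrollment records", k=4)
for doc in similar_docs:
    print(doc.page_content[:80])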

PII Protection Features

Built-in PII Safeguards

  • Pre-processing Anonymization: PII detection and anonymization using Presidio
  • Configurable Protection: Customizable PII entities and protection levels
  • Generation Validation: Post-generation PII detection and filtering

PII Protection Layers

  1. Input Layer: Document content anonymization using Presidio
  2. Storage Layer: Vector embeddings contain no readable PII
  3. Output Layer: Generated content validation and cleaning

Supported PII Types

  • Personal names (PERSON)
  • Email addresses (EMAIL_ADDRESS)
  • Phone numbers (PHONE_NUMBER)
  • Social Security Numbers (US_SSN)
  • Credit card numbers (CREDIT_CARD)
  • Custom patterns via configuration
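
The entity names above are standard Presidio recognizers. A minimal sketch of the input-layer anonymization, assuming the presidio-analyzer and presidio-anonymizer packages (the repo's PIIProtector presumably wraps the same engines):

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Contact Jane Doe at jane.doe@example.com or 212-555-0147."

# Detect PII spans in the raw text
analyzer = AnalyzerEngine()
findings = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    language="en",
)

# Replace each detected span with a placeholder such as <PERSON>
anonymizer = AnonymizerEngine()
result = anonymizer.anonymize(text=text, analyzer_results=findings)
print(result.text)  # Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.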

Project Structure

synthetic-data-generator/
├── src/
│   ├── main.py
│   ├── config/
│   │   ├── __init__.py
│   │   └── settings.py
│   ├── data_processing/
│   │   ├── __init__.py
│   │   ├── pipeline.py        # Manual orchestration
│   │   ├── chunker.py         # LangChain text splitting
│   │   ├── embedder.py        # Azure OpenAI embeddings
│   │   ├── loader.py          # LangChain document loaders
│   │   ├── preprocessor.py    # Text cleaning & PII protection
│   │   └── generator.py       # Synthetic data generation
│   ├── vector_store/
│   │   ├── __init__.py
│   │   ├── base_vector_store.py
│   │   ├── langchain_vector_store.py  # Unified vector interface
│   │   └── vector_store_factory.py    # Vector store creation
│   └── utils/
│       ├── __init__.py
│       └── pii_protector.py   # PII detection and anonymization
├── .env.example
├── requirements.txt
└── README.md

Technology Stack

  • LangChain: Core framework for document processing, embeddings, and LLM integration
  • Azure OpenAI: For embeddings and synthetic data generation
  • Azure Cognitive Search: Vector database for semantic search (optional)
  • FAISS: Alternative vector database for local development
  • Presidio: PII detection and anonymization
  • Python: Core programming language

Installation

1. Clone the Repository

git clone <repository-url>
cd synthetic-data-generator

2. Create Virtual Environment (Recommended)

python -m venv venv

# On Windows
venv\Scripts\activate

# On macOS/Linux
source venv/bin/activate

3. Install Dependencies

# Install all Python dependencies
pip install -r requirements.txt

4. Set Up Environment Variables

Copy the .env.example file and configure your Azure services:

cp .env.example .env

Edit .env with your Azure credentials:

# Azure OpenAI Configuration
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_OPENAI_API_VERSION=2023-05-15
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-35-turbo
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002

# Azure Cognitive Search (Optional)
AZURE_SEARCH_ENDPOINT=https://your-search-service.search.windows.net
AZURE_SEARCH_API_KEY=your_search_admin_key
AZURE_SEARCH_INDEX_NAME=documents

# Vector Store Configuration
VECTOR_STORE_TYPE=azure_search  # or 'faiss' for local development
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

Usage

Quick Start

# Run the main application
python src/main.py

Using Individual Components

1. Document Loading

from src.data_processing.loader import DataLoader

loader = DataLoader()
documents = loader.load_documents("path/to/your/documents")

2. Data Processing Pipeline

from src.data_processing.pipeline import DataProcessingPipeline

pipeline = DataProcessingPipeline()
results = pipeline.process_documents(documents)

3. Synthetic Data Generation

from src.data_processing.generator import SyntheticDataGenerator

generator = SyntheticDataGenerator()
synthetic_data = generator.generate_synthetic_data(
    context="Your context here",
    num_samples=10
)
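
Under the hood, generation of this kind typically pairs retrieved context with a prompt template and an Azure OpenAI chat model. A hedged sketch of that pattern in plain LangChain (the actual SyntheticDataGenerator may differ):

import os
from langchain.prompts import PromptTemplate
from langchain_openai import AzureChatOpenAI

# Chat model bound to the deployment configured in .env; the endpoint and key
# are read from AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY
llm = AzureChatOpenAI(
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

prompt = PromptTemplate.from_template(
    "Using the following context, write {num_samples} realistic but "
    "entirely synthetic records in the same style.\n\nContext:\n{context}"
)

response = llm.invoke(prompt.format(context="Your context here", num_samples=10))
print(response.content)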

Core Components

Data Processing (LangChain-Powered)

  • Loader: Multi-format document loading (PDF, CSV, JSON, TXT, DOCX)
  • Preprocessor: Text cleaning and PII protection using Presidio
  • Chunker: Intelligent text splitting with overlap using RecursiveCharacterTextSplitter
  • Embedder: Azure OpenAI embeddings integration
  • Generator: Synthetic data generation using Azure OpenAI with prompt templates
  • Pipeline: End-to-end orchestration with manual control

Vector Store (LangChain Integration)

  • Base Vector Store: Abstract interface for vector operations
  • LangChain Vector Store: Unified interface for Azure Search and FAISS
  • Vector Store Factory: Factory pattern for creating vector store instances
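
A factory along these lines dispatches on the VECTOR_STORE_TYPE setting; a minimal sketch (the repo's vector_store_factory.py presumably differs in detail, and the AzureSearch keyword arguments follow the langchain-community integration):

import os
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.azuresearch import AzureSearch

def create_vector_store(documents, embeddings):
    """Build the configured vector store backend from chunked documents."""
    store_type = os.getenv("VECTOR_STORE_TYPE", "faiss")
    if store_type == "azure_search":
        return AzureSearch.from_documents(
            documents,
            embeddings,
            azure_search_endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
            azure_search_key=os.getenv("AZURE_SEARCH_API_KEY"),
            index_name=os.getenv("AZURE_SEARCH_INDEX_NAME", "documents"),
        )
    return FAISS.from_documents(documents, embeddings)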

PII Protection

  • PIIProtector: Presidio-based PII detection and anonymization
  • Configurable Entities: Support for multiple PII types
  • Anonymization: Replace PII with placeholders or remove entirely

Configuration

Configuration settings are managed through the config/settings.py module, which loads environment variables and provides centralized configuration management.
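
A minimal version of such a module might look like this (the names are illustrative, not necessarily the repo's):

# src/config/settings.py (illustrative sketch)
import os
from dotenv import load_dotenv

load_dotenv()  # read .env once at import time

AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
VECTOR_STORE_TYPE = os.getenv("VECTOR_STORE_TYPE", "faiss")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))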

Key Benefits of LangChain Integration

  • Unified Interface: Consistent API across different vector stores and LLMs
  • Rich Ecosystem: Access to multiple document loaders and embedding models
  • Azure Integration: Native support for Azure OpenAI and Azure Cognitive Search
  • Prompt Management: Structured prompt templates for consistent generation
  • Flexibility: Easy switching between different backends

Troubleshooting

Azure OpenAI Connection Issues

If you encounter connection errors:

# Verify your environment variables are set
python -c "
import os
from dotenv import load_dotenv
load_dotenv()
print('Endpoint:', os.getenv('AZURE_OPENAI_ENDPOINT'))
print('API Key:', 'Set' if os.getenv('AZURE_OPENAI_API_KEY') else 'Not Set')
"

Import Path Issues

If you encounter module import errors:

# Run from the project root directory
cd path/to/synthetic-data-generator
python src/main.py

Missing Dependencies

If you encounter missing package errors:

# Reinstall all requirements
pip install -r requirements.txt --force-reinstall

Contributing

Contributions are welcome! Please submit a pull request or open an issue for any enhancements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for details.
