Skip to content

Extract2MD is a powerful and versatile AI-enabled client-side JavaScript library for extracting text from PDF files and converting it into Markdown.

License

Notifications You must be signed in to change notification settings

hashangit/Extract2MD

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

32 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Extract2MD - Enhanced PDF to Markdown Converter

NPM Version License Downloads

Sponsor on Patreon

A powerful client-side JavaScript library for converting PDFs to Markdown with multiple extraction methods and optional LLM enhancement. Now with scenario-specific methods for different use cases.

Extract2MD

πŸš€ Quick Start

Extract2MD now offers 5 distinct scenarios for different conversion needs:

import Extract2MDConverter from 'extract2md';

// Scenario 1: Quick conversion only
const markdown1 = await Extract2MDConverter.quickConvertOnly(pdfFile);

// Scenario 2: High accuracy OCR conversion only  
const markdown2 = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile);

// Scenario 3: Quick conversion + LLM enhancement
const markdown3 = await Extract2MDConverter.quickConvertWithLLM(pdfFile);

// Scenario 4: High accuracy conversion + LLM enhancement
const markdown4 = await Extract2MDConverter.highAccuracyConvertWithLLM(pdfFile);

// Scenario 5: Combined extraction + LLM enhancement (most comprehensive)
const markdown5 = await Extract2MDConverter.combinedConvertWithLLM(pdfFile);

πŸ“‹ Scenarios Explained

Scenario 1: Quick Convert Only

  • Use case: Fast conversion when PDF has selectable text
  • Method: quickConvertOnly(pdfFile, config?)
  • Tech: PDF.js text extraction only
  • Output: Basic markdown formatting

Scenario 2: High Accuracy Convert Only

  • Use case: PDFs with images, scanned documents, complex layouts
  • Method: highAccuracyConvertOnly(pdfFile, config?)
  • Tech: Tesseract.js OCR
  • Output: Markdown from OCR extraction

Scenario 3: Quick Convert + LLM

  • Use case: Fast extraction with AI enhancement for better formatting
  • Method: quickConvertWithLLM(pdfFile, config?)
  • Tech: PDF.js + WebLLM
  • Output: AI-enhanced markdown with improved structure and clarity

Scenario 4: High Accuracy + LLM

  • Use case: OCR extraction with AI enhancement
  • Method: highAccuracyConvertWithLLM(pdfFile, config?)
  • Tech: Tesseract.js OCR + WebLLM
  • Output: AI-enhanced markdown from OCR

Scenario 5: Combined + LLM (Recommended)

  • Use case: Most comprehensive conversion using both extraction methods
  • Method: combinedConvertWithLLM(pdfFile, config?)
  • Tech: PDF.js + Tesseract.js + WebLLM with specialized prompts
  • Output: Best possible markdown leveraging strengths of both extraction methods

βš™οΈ Configuration

Create a configuration object or JSON file to customize behavior:

const config = {
  // PDF.js Worker
  pdfJsWorkerSrc: "../pdf.worker.min.mjs",
  
  // Tesseract OCR Settings
  tesseract: {
    workerPath: "./tesseract-worker.min.js",
    corePath: "./tesseract-core.wasm.js", 
    langPath: "./lang-data/",
    language: "eng",
    options: {}
  },
  
  // LLM Configuration
  webllm: {
    model: "Qwen3-0.6B-q4f16_1-MLC",
    // Optional: Custom model
    customModel: {
      model: "https://huggingface.co/mlc-ai/your-model/resolve/main/",
      model_id: "YourModel-ID",
      model_lib: "https://example.com/your-model.wasm",
      required_features: ["shader-f16"],
      overrides: { conv_template: "qwen" }
    },
    options: {
      temperature: 0.7,
      maxTokens: 4096
    }
  },
  
  // System Prompt Customizations
  systemPrompts: {
    singleExtraction: "Focus on preserving code examples exactly.",
    combinedExtraction: "Pay attention to tables and diagrams from OCR."
  },
  
  // Processing Options
  processing: {
    splitPascalCase: false,
    pdfRenderScale: 2.5,
    postProcessRules: [
      { find: /\bAPI\b/g, replace: "API" }
    ]
  },
  
  // Progress Tracking
  progressCallback: (progress) => {
    console.log(`${progress.stage}: ${progress.message}`);
    if (progress.currentPage) {
      console.log(`Page ${progress.currentPage}/${progress.totalPages}`);
    }
  }
};

// Use configuration
const markdown = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);

πŸ”§ Advanced Usage

Using Individual Components

import { 
  WebLLMEngine, 
  OutputParser, 
  SystemPrompts,
  ConfigValidator 
} from 'extract2md';

// Validate configuration
const validatedConfig = ConfigValidator.validate(userConfig);

// Initialize WebLLM engine
const engine = new WebLLMEngine(validatedConfig);
await engine.initialize();

// Generate text
const result = await engine.generate("Your prompt here");

// Parse output
const parser = new OutputParser();
const cleanMarkdown = parser.parse(result);

Custom System Prompts

The library uses different system prompts for different scenarios:

// For scenarios 3 & 4 (single extraction)
const singlePrompt = SystemPrompts.getSingleExtractionPrompt(
  "Additional instruction: Preserve all technical terms."
);

// For scenario 5 (combined extraction) 
const combinedPrompt = SystemPrompts.getCombinedExtractionPrompt(
  "Focus on creating comprehensive documentation."
);

Configuration from JSON

import { ConfigValidator } from 'extract2md';

// Load from JSON string
const config = ConfigValidator.fromJSON(configJsonString);

// Use with any scenario
const result = await Extract2MDConverter.quickConvertWithLLM(pdfFile, config);

🎯 Error Handling & Progress Tracking

const config = {
  progressCallback: (progress) => {
    switch (progress.stage) {
      case 'scenario_5_start':
        console.log('Starting combined conversion...');
        break;
      case 'webllm_load_progress':
        console.log(`Loading model: ${progress.progress}%`);
        break;
      case 'ocr_page_process':
        console.log(`OCR: ${progress.currentPage}/${progress.totalPages}`);
        break;
      case 'webllm_generate_start':
        console.log('AI enhancement in progress...');
        break;
      case 'scenario_5_complete':
        console.log('Conversion completed!');
        break;
      default:
        console.log(`${progress.stage}: ${progress.message}`);
    }
    
    if (progress.error) {
      console.error('Error:', progress.error);
    }
  }
};

try {
  const result = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
  console.log('Success:', result);
} catch (error) {
  console.error('Conversion failed:', error.message);
}

πŸ”„ Migration from Legacy API

If you're using the old API, you can still access it:

import { LegacyExtract2MDConverter } from 'extract2md';

// Old way
const converter = new LegacyExtract2MDConverter(options);
const quick = await converter.quickConvert(pdfFile);
const ocr = await converter.highAccuracyConvert(pdfFile);
const enhanced = await converter.llmRewrite(text);

// New way (recommended)
const quick = await Extract2MDConverter.quickConvertOnly(pdfFile, config);
const ocr = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile, config);
const enhanced = await Extract2MDConverter.quickConvertWithLLM(pdfFile, config);

🌟 Features

  • 5 Scenario-Specific Methods: Choose the right approach for your use case
  • WebLLM Integration: Client-side AI enhancement with Qwen models
  • Custom Model Support: Use your own trained models
  • Advanced Output Parsing: Automatic removal of thinking tags and formatting
  • Comprehensive Configuration: Fine-tune every aspect of the conversion
  • Progress Tracking: Real-time updates for UI integration
  • TypeScript Support: Full type definitions included
  • Backwards Compatible: Legacy API still available

πŸ“š TypeScript Support

Full TypeScript definitions are included:

import Extract2MDConverter, { 
  Extract2MDConfig, 
  ProgressReport,
  CustomModelConfig 
} from 'extract2md';

const config: Extract2MDConfig = {
  webllm: {
    model: "Qwen3-0.6B-q4f16_1-MLC",
    options: {
      temperature: 0.7,
      maxTokens: 4096
    }
  },
  progressCallback: (progress: ProgressReport) => {
    console.log(progress.stage, progress.message);
  }
};

const result: string = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);

πŸ—οΈ Installation & Deployment

NPM Installation

npm install extract2md

CDN Usage

<script src="https://unpkg.com/extract2md@2.0.0/dist/assets/extract2md.umd.js"></script>
<script>
    // Available as global Extract2MD
    const result = await Extract2MD.Extract2MDConverter.quickConvertOnly(pdfFile);
</script>

Worker Files Configuration

The package requires worker files for PDF.js and Tesseract.js. These are automatically copied during build:

// Default worker paths (adjust for your deployment)
const config = {
    pdfJsWorkerSrc: "/pdf.worker.min.mjs",
    tesseract: {
        workerPath: "/tesseract-worker.min.js",
        corePath: "/tesseract-core.wasm.js"
    }
};

Bundle Size Considerations

  • Total Size: ~11 MB (includes OCR and PDF processing)
  • PDF.js: ~950 KB
  • Tesseract.js: ~4.5 MB
  • WebLLM: Variable (model-dependent)

Use lazy loading and code splitting for production deployments.

πŸ“š Documentation

πŸ“„ License

MIT License - see LICENSE file for details.

🀝 Contributing

Contributions welcome! Please read the contributing guidelines before submitting PRs.

πŸ› Issues

Report issues on the GitHub Issues page.

About

Extract2MD is a powerful and versatile AI-enabled client-side JavaScript library for extracting text from PDF files and converting it into Markdown.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •