A powerful client-side JavaScript library for converting PDFs to Markdown with multiple extraction methods and optional LLM enhancement. Now with scenario-specific methods for different use cases.
Extract2MD now offers 5 distinct scenarios for different conversion needs:
import Extract2MDConverter from 'extract2md';
// Scenario 1: Quick conversion only
const markdown1 = await Extract2MDConverter.quickConvertOnly(pdfFile);
// Scenario 2: High accuracy OCR conversion only
const markdown2 = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile);
// Scenario 3: Quick conversion + LLM enhancement
const markdown3 = await Extract2MDConverter.quickConvertWithLLM(pdfFile);
// Scenario 4: High accuracy conversion + LLM enhancement
const markdown4 = await Extract2MDConverter.highAccuracyConvertWithLLM(pdfFile);
// Scenario 5: Combined extraction + LLM enhancement (most comprehensive)
const markdown5 = await Extract2MDConverter.combinedConvertWithLLM(pdfFile);
- Use case: Fast conversion when PDF has selectable text
- Method:
quickConvertOnly(pdfFile, config?)
- Tech: PDF.js text extraction only
- Output: Basic markdown formatting
- Use case: PDFs with images, scanned documents, complex layouts
- Method:
highAccuracyConvertOnly(pdfFile, config?)
- Tech: Tesseract.js OCR
- Output: Markdown from OCR extraction
- Use case: Fast extraction with AI enhancement for better formatting
- Method:
quickConvertWithLLM(pdfFile, config?)
- Tech: PDF.js + WebLLM
- Output: AI-enhanced markdown with improved structure and clarity
- Use case: OCR extraction with AI enhancement
- Method:
highAccuracyConvertWithLLM(pdfFile, config?)
- Tech: Tesseract.js OCR + WebLLM
- Output: AI-enhanced markdown from OCR
- Use case: Most comprehensive conversion using both extraction methods
- Method:
combinedConvertWithLLM(pdfFile, config?)
- Tech: PDF.js + Tesseract.js + WebLLM with specialized prompts
- Output: Best possible markdown leveraging strengths of both extraction methods
Create a configuration object or JSON file to customize behavior:
const config = {
// PDF.js Worker
pdfJsWorkerSrc: "../pdf.worker.min.mjs",
// Tesseract OCR Settings
tesseract: {
workerPath: "./tesseract-worker.min.js",
corePath: "./tesseract-core.wasm.js",
langPath: "./lang-data/",
language: "eng",
options: {}
},
// LLM Configuration
webllm: {
model: "Qwen3-0.6B-q4f16_1-MLC",
// Optional: Custom model
customModel: {
model: "https://huggingface.co/mlc-ai/your-model/resolve/main/",
model_id: "YourModel-ID",
model_lib: "https://example.com/your-model.wasm",
required_features: ["shader-f16"],
overrides: { conv_template: "qwen" }
},
options: {
temperature: 0.7,
maxTokens: 4096
}
},
// System Prompt Customizations
systemPrompts: {
singleExtraction: "Focus on preserving code examples exactly.",
combinedExtraction: "Pay attention to tables and diagrams from OCR."
},
// Processing Options
processing: {
splitPascalCase: false,
pdfRenderScale: 2.5,
postProcessRules: [
{ find: /\bAPI\b/g, replace: "API" }
]
},
// Progress Tracking
progressCallback: (progress) => {
console.log(`${progress.stage}: ${progress.message}`);
if (progress.currentPage) {
console.log(`Page ${progress.currentPage}/${progress.totalPages}`);
}
}
};
// Use configuration
const markdown = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
import {
WebLLMEngine,
OutputParser,
SystemPrompts,
ConfigValidator
} from 'extract2md';
// Validate configuration
const validatedConfig = ConfigValidator.validate(userConfig);
// Initialize WebLLM engine
const engine = new WebLLMEngine(validatedConfig);
await engine.initialize();
// Generate text
const result = await engine.generate("Your prompt here");
// Parse output
const parser = new OutputParser();
const cleanMarkdown = parser.parse(result);
The library uses different system prompts for different scenarios:
// For scenarios 3 & 4 (single extraction)
const singlePrompt = SystemPrompts.getSingleExtractionPrompt(
"Additional instruction: Preserve all technical terms."
);
// For scenario 5 (combined extraction)
const combinedPrompt = SystemPrompts.getCombinedExtractionPrompt(
"Focus on creating comprehensive documentation."
);
import { ConfigValidator } from 'extract2md';
// Load from JSON string
const config = ConfigValidator.fromJSON(configJsonString);
// Use with any scenario
const result = await Extract2MDConverter.quickConvertWithLLM(pdfFile, config);
const config = {
progressCallback: (progress) => {
switch (progress.stage) {
case 'scenario_5_start':
console.log('Starting combined conversion...');
break;
case 'webllm_load_progress':
console.log(`Loading model: ${progress.progress}%`);
break;
case 'ocr_page_process':
console.log(`OCR: ${progress.currentPage}/${progress.totalPages}`);
break;
case 'webllm_generate_start':
console.log('AI enhancement in progress...');
break;
case 'scenario_5_complete':
console.log('Conversion completed!');
break;
default:
console.log(`${progress.stage}: ${progress.message}`);
}
if (progress.error) {
console.error('Error:', progress.error);
}
}
};
try {
const result = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
console.log('Success:', result);
} catch (error) {
console.error('Conversion failed:', error.message);
}
If you're using the old API, you can still access it:
import { LegacyExtract2MDConverter } from 'extract2md';
// Old way
const converter = new LegacyExtract2MDConverter(options);
const quick = await converter.quickConvert(pdfFile);
const ocr = await converter.highAccuracyConvert(pdfFile);
const enhanced = await converter.llmRewrite(text);
// New way (recommended)
const quick = await Extract2MDConverter.quickConvertOnly(pdfFile, config);
const ocr = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile, config);
const enhanced = await Extract2MDConverter.quickConvertWithLLM(pdfFile, config);
- 5 Scenario-Specific Methods: Choose the right approach for your use case
- WebLLM Integration: Client-side AI enhancement with Qwen models
- Custom Model Support: Use your own trained models
- Advanced Output Parsing: Automatic removal of thinking tags and formatting
- Comprehensive Configuration: Fine-tune every aspect of the conversion
- Progress Tracking: Real-time updates for UI integration
- TypeScript Support: Full type definitions included
- Backwards Compatible: Legacy API still available
Full TypeScript definitions are included:
import Extract2MDConverter, {
Extract2MDConfig,
ProgressReport,
CustomModelConfig
} from 'extract2md';
const config: Extract2MDConfig = {
webllm: {
model: "Qwen3-0.6B-q4f16_1-MLC",
options: {
temperature: 0.7,
maxTokens: 4096
}
},
progressCallback: (progress: ProgressReport) => {
console.log(progress.stage, progress.message);
}
};
const result: string = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
npm install extract2md
<script src="https://unpkg.com/extract2md@2.0.0/dist/assets/extract2md.umd.js"></script>
<script>
// Available as global Extract2MD
const result = await Extract2MD.Extract2MDConverter.quickConvertOnly(pdfFile);
</script>
The package requires worker files for PDF.js and Tesseract.js. These are automatically copied during build:
// Default worker paths (adjust for your deployment)
const config = {
pdfJsWorkerSrc: "/pdf.worker.min.mjs",
tesseract: {
workerPath: "/tesseract-worker.min.js",
corePath: "/tesseract-core.wasm.js"
}
};
- Total Size: ~11 MB (includes OCR and PDF processing)
- PDF.js: ~950 KB
- Tesseract.js: ~4.5 MB
- WebLLM: Variable (model-dependent)
Use lazy loading and code splitting for production deployments.
- Migration Guide - Upgrade from legacy API
- Deployment Guide - Production deployment instructions
- Examples - Complete usage examples
- How To Run the Demo - Instructions on how to run the demo
- TypeScript Definitions - Full type definitions
MIT License - see LICENSE file for details.
Contributions welcome! Please read the contributing guidelines before submitting PRs.
Report issues on the GitHub Issues page.