Extract structured data from job postings using OpenAI's structured output capabilities.
- 🎯 Smart Extraction: Extract structured information from any job posting URL
- 🧠 AI-Powered: Uses OpenAI's GPT models with structured output for accurate parsing
- 📊 Comprehensive Data: Covers all job posting aspects including salary, skills, requirements
- 🌐 Advanced Scraping: Playwright-based scraping handles JavaScript-heavy sites
- 💾 File Storage: Save as markdown and JSON with SHA-1 hash filenames for deduplication
- 🔄 Reliability: Automatic retries with exponential backoff for robust operation
- 🔍 Observability: Detailed logging with Logfire integration
- 🐍 Modern Python: Full type hints and Python 3.8+ support
pip install data-job-parser
After installation, install Playwright browsers:
playwright install chromium
from data_job_parser import JobPostingParser
# Initialize with OpenAI API key
parser = JobPostingParser(api_key="your-openai-api-key")
# Parse a job posting
job_data = parser.parse("https://example.com/job-posting")
# Access structured data
print(f"Title: {job_data.title}")
print(f"Company: {job_data.company}")
print(f"Location: {job_data.location.city}, {job_data.location.country}")
print(f"Salary: {job_data.salary.min_amount}-{job_data.salary.max_amount} {job_data.salary.currency}")
print(f"Skills: {', '.join(job_data.required_skills)}")
import asyncio

async def main():
    # Parse and save both markdown and JSON
    job_data, markdown_path, json_path = await parser.parse_async(
        "https://jobs.pradagroup.com/job/Milan-Data-Engineer/1199629101/",
        save_markdown=True,
        save_json=True,
    )
    print(f"Markdown: {markdown_path}")
    print(f"JSON: {json_path}")

asyncio.run(main())
from data_job_parser import JobPostingParser
parser = JobPostingParser(api_key="your-api-key")
urls = ["https://job1.com", "https://job2.com", "https://job3.com"]
for url in urls:
    try:
        job_data, md_path, json_path = parser.parse(
            url,
            save_markdown=True,
            save_json=True
        )
        print(f"✅ {job_data.title} at {job_data.company}")
    except Exception as e:
        print(f"❌ Failed to parse {url}: {e}")
Create a .env file in your project root:
# Required
OPENAI_API_KEY=your-openai-api-key
# Optional - Logging
LOGFIRE_TOKEN=your-logfire-token
# Optional - Model Configuration
OPENAI_MODEL=gpt-4-turbo-preview
# Optional - Playwright Settings
PLAYWRIGHT_HEADLESS=true
PLAYWRIGHT_TIMEOUT=60000
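If these variables live in a .env file rather than the shell environment, they can be loaded before the parser is constructed. A minimal sketch, assuming the python-dotenv package is installed (the library may also load the file on its own):

import os

from dotenv import load_dotenv  # assumes python-dotenv is installed
from data_job_parser import JobPostingParser

load_dotenv()  # reads .env from the current working directory

parser = JobPostingParser(
    api_key=os.environ["OPENAI_API_KEY"],
    model=os.getenv("OPENAI_MODEL", "gpt-4-turbo-preview"),
)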
Option 1: Constructor parameter
parser = JobPostingParser(api_key="your-api-key")
Option 2: Environment Variable
export OPENAI_API_KEY="your-api-key"
parser = JobPostingParser() # Auto-loads from env
# Use different OpenAI model
parser = JobPostingParser(
    api_key="your-api-key",
    model="gpt-4o"  # or gpt-3.5-turbo, etc.
)
Files are saved with SHA-1 hash filenames to prevent duplicates:
data/
├── markdown/
│   └── a1b2c3d4e5f6789.md
└── json/
    └── a1b2c3d4e5f6789.json
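The hash in the filename can also be reproduced on your side for lookups or deduplication. A minimal sketch, assuming the filename is the hex SHA-1 digest of the job posting URL (the exact input the library hashes is an assumption here):

import hashlib

url = "https://example.com/job-posting"
# Hex SHA-1 digest of the URL; shown only to illustrate the naming scheme
filename = hashlib.sha1(url.encode("utf-8")).hexdigest()
print(f"data/markdown/{filename}.md")
print(f"data/json/{filename}.json")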
The parser extracts comprehensive job information:
Core Information
- Title, company, location, description
- Work type (full-time, part-time, contract)
- Work mode (remote, hybrid, on-site)
- Experience level required
Compensation & Benefits
- Salary range with currency
- Benefits and perks
- Stock options, bonuses
Skills & Requirements
- Required technical skills
- Preferred/nice-to-have skills
- Education requirements
- Years of experience needed
Additional Details
- Team size and department
- Application process
- Company culture information
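Because the extracted job posting is validated with Pydantic, the parsed object can be converted to plain Python or JSON for downstream use. A minimal sketch, assuming Pydantic v2 models and the field names from the quick-start example:

job_data = parser.parse("https://example.com/job-posting")

# Convert the validated model to a dict / JSON string
record = job_data.model_dump()
as_json = job_data.model_dump_json(indent=2)

print(record["title"], record["company"])
print(as_json)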
The parser includes robust error handling:
from data_job_parser import JobPostingParser
from data_job_parser.exceptions import ParsingError, ScrapingError
parser = JobPostingParser(api_key="your-api-key")
try:
    job_data = parser.parse("https://example.com/job")
except ScrapingError as e:
    print(f"Failed to scrape URL: {e}")
except ParsingError as e:
    print(f"Failed to parse content: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
# Clone repository
git clone https://github.com/mazzasaverio/data-job-parser.git
cd data-job-parser
# Install with uv (recommended)
uv sync --dev
# Or with pip
pip install -e ".[dev]"
# Run all tests
uv run pytest
# With coverage report
uv run pytest --cov=src/data_job_parser --cov-report=html
# Run specific test file
uv run pytest tests/test_parser.py -v
# Format code
uv run ruff format .
# Lint code
uv run ruff check .
# Type checking
uv run mypy src/
- Update version in both files:
  - src/data_job_parser/__init__.py
  - pyproject.toml
- Run quality checks:
  uv run pytest
  uv run ruff check .
  uv run mypy src/
- Commit and tag:
  git add .
  git commit -m "chore: bump version to X.Y.Z"
  git push origin main
  git tag vX.Y.Z
  git push origin vX.Y.Z
- Automated deployment: GitHub Actions will automatically:
  - Run tests
  - Build package
  - Publish to PyPI
  - Create GitHub release
We welcome contributions! Please follow these steps:
- Fork the repository
- Create feature branch:
git checkout -b feature/amazing-feature
- Make your changes with tests
- Run quality checks:
uv run pytest && uv run ruff check .
- Commit changes:
git commit -m 'feat: add amazing feature'
- Push branch:
git push origin feature/amazing-feature
- Open a Pull Request
- Write tests for new features
- Follow existing code style
- Update documentation as needed
- Use conventional commit messages
- Python: 3.8+
- OpenAI API Key: Required for parsing
- Internet Connection: For web scraping and API calls
See CHANGELOG.md for version history and changes.
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for structured output capabilities
- Playwright for robust web scraping
- Pydantic for data validation
- Logfire for observability
Made with ❤️ by Saverio Mazza