Entity resolution and fuzzy matching for comic book titles.
Comic Matcher is a specialized package for matching comic book titles across different formats and sources. It combines record linkage techniques from the recordlinkage toolkit with domain-specific optimizations for comic book naming conventions.
- Specialized comic book title parser
- Fuzzy matching with comic-specific optimizations
- Handling of series, volume, and issue numbers
- Support for X-Men and other special series cases
- Configurable blocking and comparison rules
- Pre-computed fuzzy hash support
- Robust handling of sequels, team-ups, and special editions
- Smart filtering to avoid common bad matches
# Install directly from PyPI
pip install comic-matcher
# Install from GitHub
pip install git+https://github.com/JoshCLWren/comic_matcher.git
# Install from the local directory (for development)
pip install -e .
# Or install required dependencies only
pip install -r requirements.txt
from comic_matcher import ComicMatcher
# Initialize the matcher
matcher = ComicMatcher()
# Example data
source_comics = [
{"title": "Uncanny X-Men", "issue": "142"},
{"title": "Amazing Spider-Man", "issue": "300"}
]
target_comics = [
{"title": "X-Men", "issue": "142"},
{"title": "Spider-Man", "issue": "300"}
]
# Find matches
matches = matcher.match(source_comics, target_comics)
# Print results
print(f"Found {len(matches)} matches")
print(matches)
# Match comics between two sources
comic-matcher match source_data.csv target_data.csv -o matches.csv
# Parse a comic title into components
comic-matcher parse "Uncanny X-Men (1963) #142"
# Get help
comic-matcher --help
# Find best match for a single comic
comic = {"title": "Uncanny X-Men", "issue": "142"}
candidates = [
{"title": "X-Men", "issue": "142"},
{"title": "X-Men", "issue": "143"},
{"title": "X-Force", "issue": "1"}
]
best_match = matcher.find_best_match(comic, candidates)
print(best_match)
from comic_matcher import ComicTitleParser
parser = ComicTitleParser()
parsed = parser.parse("Uncanny X-Men (1963) #142")
print(parsed)
Comic Matcher includes specialized handling for various complex comic title patterns:
# Will match same sequel number but not different sequels
matcher.find_best_match({"title": "Civil War II", "issue": "1"},
[{"title": "Civil War", "issue": "1"},
{"title": "Civil War II", "issue": "1"},
{"title": "Civil War III", "issue": "1"}])
# Properly handles team-up formats
matcher.find_best_match({"title": "Wolverine", "issue": "1"},
[{"title": "Wolverine/Doop", "issue": "1"}]) # Won't match
matcher.find_best_match({"title": "Wolverine/Doop", "issue": "1"},
[{"title": "Wolverine/Doop", "issue": "1"}]) # Will match
# Handles subtitle differences
matcher.find_best_match({"title": "X-Men: Phoenix", "issue": "1"},
[{"title": "X-Men: Legacy", "issue": "1"}]) # Won't match
# Distinguishes special editions
matcher.find_best_match({"title": "X-Men", "issue": "1"},
[{"title": "X-Men Annual", "issue": "1"}]) # Won't match
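The cases above can be pictured as a set of compatibility checks applied to a pair of titles before any fuzzy score is trusted. The following is a minimal sketch under stated assumptions, not the package's actual implementation: `split_sequel` and `titles_compatible` are hypothetical helper names, and the sequel pattern only recognizes a small set of trailing Roman numerals or one- to two-digit numbers.

```python
import re

# Illustrative subset of sequel markers; a real rule set would likely be larger.
ROMAN = {"II": 2, "III": 3, "IV": 4, "V": 5}
SEQUEL_RE = re.compile(r"^(?P<base>.+?)\s+(?P<num>II|III|IV|V|\d{1,2})$")


def split_sequel(title: str) -> tuple[str, int]:
    """Return (base title, sequel number); titles without a marker count as 1."""
    cleaned = title.strip()
    found = SEQUEL_RE.match(cleaned)
    if not found:
        return cleaned, 1
    num = found.group("num")
    return found.group("base"), ROMAN.get(num) or int(num)


def titles_compatible(a: str, b: str) -> bool:
    """Hypothetical pre-filter mirroring the four cases shown above."""
    base_a, seq_a = split_sequel(a)
    base_b, seq_b = split_sequel(b)
    # Same base title but different sequel numbers: reject.
    if base_a.lower() == base_b.lower() and seq_a != seq_b:
        return False
    # Team-up ("A/B") versus solo title: reject.
    if ("/" in a) != ("/" in b):
        return False
    # Same main title but different subtitles: reject.
    main_a, _, sub_a = a.partition(": ")
    main_b, _, sub_b = b.partition(": ")
    if main_a.lower() == main_b.lower() and sub_a.lower() != sub_b.lower():
        return False
    # Annual versus regular series: reject.
    if ("annual" in a.lower()) != ("annual" in b.lower()):
        return False
    return True


print(titles_compatible("Civil War II", "Civil War III"))
print(titles_compatible("Wolverine/Doop", "Wolverine/Doop"))
```

Rejecting pairs up front like this keeps a high string similarity (for example between "Civil War II" and "Civil War III") from producing a false match.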
The recommended way to set up your development environment is to use the provided Makefile:
# Clone the repository
git clone https://github.com/JoshCLWren/comic_matcher.git
cd comic_matcher
# Create a Python 3.12 virtual environment with pyenv
brew install pyenv virtualenv
make venv
# Install dev dependencies
make dev
This will create a pyenv virtual environment called comic_matcher_py312 using Python 3.12.
The project includes a Makefile with common development tasks:
# Create a Python 3.12 virtual environment with pyenv
make venv
# Install development dependencies
make dev
# Run tests
make test
# Run tests with coverage report
make test-cov
# Run tests with detailed output
make test-verbose
# Run linting with Ruff
make lint
# Format code with Ruff
make format
# Clean up temporary files
make clean
# Build the package
make build
# Check test coverage
make coverage
This workflow ensures a clean, isolated development environment and consistent code quality.
This project uses Ruff for linting and formatting. Ruff is a fast Python linter and formatter written in Rust that replaces several separate tools (flake8, black, isort, and others) with a single one.
To lint your code:
make lint
To automatically format your code:
make format
# Basic matching example
python examples/basic_matching.py
# Integration example
python examples/integration_example.py
The project includes a comprehensive test suite using pytest. The tests cover all major components:
- test_parser.py: Tests for the comic title parsing functionality
- test_matcher.py: Tests for the core matcher functionality
- test_utils.py: Tests for utility functions
- test_cli.py: Tests for the command-line interface
- test_bad_matches*.py: Specialized tests for known problematic match cases
- test_sequel_detection.py: Tests for sequel detection and handling
To run the tests:
# Run all tests
make test
# Run with coverage report
make test-cov
# Run tests with verbose output
make test-verbose
# Run specific test categories
pytest tests/test_bad_matches*.py -v
The tests use pytest fixtures defined in tests/conftest.py to provide sample data and common setup. This makes the tests more readable and maintainable.
The matching algorithm follows these steps:
- Parse and normalize titles using the specialized comic title parser
- Generate candidate pairs using recordlinkage blocking
- Compute similarity scores for titles and issue numbers
- Filter candidates based on domain-specific rules:
- Different sequel numbers (e.g., "Civil War II" vs "Civil War III")
- Team-up vs. solo titles (e.g., "Wolverine/Doop" vs "Wolverine")
- Titles with different subtitles (e.g., "X-Men: Phoenix" vs "X-Men: Legacy")
- Special edition differences (e.g., "X-Men Annual" vs "X-Men")
- Calculate weighted similarity with adjusted weights:
- Title: 35%
- Issue number: 45%
- Year: 10%
- Special edition type: 10%
- Apply threshold and return matches
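The blocking, scoring, and threshold steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the package's own code: only the four weights come from the list above. The function names, the choice of issue number as the blocking key, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as a stand-in for the fuzzy title scorer are all assumptions, and the domain-specific filtering step is left out for brevity.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Weights from the list above; everything else here is illustrative.
WEIGHTS = {"title": 0.35, "issue": 0.45, "year": 0.10, "special": 0.10}


def title_similarity(a: str, b: str) -> float:
    """Stand-in fuzzy score in [0, 1]; not the package's own scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(source: dict, target: dict) -> float:
    """Combine per-field similarities using the documented weights.

    Non-title fields are compared for exact equality, and a field missing
    on both sides counts as equal (an assumption of this sketch).
    """
    parts = {
        "title": title_similarity(source["title"], target["title"]),
        "issue": 1.0 if source.get("issue") == target.get("issue") else 0.0,
        "year": 1.0 if source.get("year") == target.get("year") else 0.0,
        "special": 1.0 if source.get("special") == target.get("special") else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


def candidate_pairs(sources: list[dict], targets: list[dict]):
    """Block on issue number so only records sharing an issue are compared."""
    by_issue = defaultdict(list)
    for target in targets:
        by_issue[target.get("issue")].append(target)
    for source in sources:
        for target in by_issue.get(source.get("issue"), []):
            yield source, target


def match_records(sources, targets, threshold=0.8):
    """Score every candidate pair and keep those at or above the threshold."""
    results = []
    for source, target in candidate_pairs(sources, targets):
        score = weighted_score(source, target)
        if score >= threshold:
            results.append((source["title"], target["title"], round(score, 3)))
    return results


sources = [{"title": "Uncanny X-Men", "issue": "142"}]
targets = [{"title": "X-Men", "issue": "142"}, {"title": "X-Force", "issue": "1"}]
print(match_records(sources, targets))
```

Because the issue number carries the largest weight, a pair with the same issue and a partially matching title can clear the threshold, which is why the rule-based filtering step matters in the full pipeline.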
The parser extracts and normalizes:
- Main title
- Volume information
- Publication year
- Special identifiers (Annual, One-Shot, etc.)
- Subtitles
- Issue numbers
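To make the listed components concrete, here is a small regex-based sketch of how a title such as "Uncanny X-Men (1963) #142" could be split into parts. It is not the `ComicTitleParser` implementation: the pattern, the `parse_title` function, the returned dictionary keys, and the short list of special markers are all hypothetical, and volume handling is omitted.

```python
import re

# Hypothetical pattern: "<title> (<year>) #<issue>", with year and issue optional.
TITLE_PATTERN = re.compile(
    r"^(?P<title>.*?)"                # title text (non-greedy)
    r"(?:\s*\((?P<year>\d{4})\))?"    # optional four-digit year in parentheses
    r"(?:\s*#(?P<issue>\d+))?"        # optional issue number after '#'
    r"\s*$"
)

# Illustrative subset of special identifiers.
SPECIAL_MARKERS = ("annual", "one-shot")


def parse_title(raw: str) -> dict:
    """Split a raw title string into a few normalized components."""
    found = TITLE_PATTERN.match(raw.strip())
    full_title = found.group("title").strip()
    # Treat text after the first ": " as the subtitle.
    main_title, _, subtitle = full_title.partition(": ")
    lowered = full_title.lower()
    special = next((m for m in SPECIAL_MARKERS if m in lowered), None)
    year = found.group("year")
    return {
        "main_title": main_title,
        "subtitle": subtitle or None,
        "year": int(year) if year else None,
        "issue": found.group("issue"),
        "special": special,
    }


print(parse_title("Uncanny X-Men (1963) #142"))
print(parse_title("X-Men: Phoenix #1"))
```

The issue number is kept as a string in this sketch so that it compares cleanly against string issue values like those in the examples earlier in this document.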
This project uses GitHub Actions for continuous integration and delivery:
- Python CI: Runs tests and linting on multiple Python versions
- Security Scan: Checks for security vulnerabilities in code and dependencies
- CodeQL Analysis: Performs advanced code quality and security analysis
- Dependency Review: Reviews dependencies in pull requests for vulnerabilities
- Dependency Update: Automatically updates dependencies weekly
- Build and Publish: Builds and publishes releases to PyPI
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create a virtual environment (make venv)
- Install development dependencies (make dev)
- Create your feature branch (git checkout -b feature/amazing-feature)
- Make your changes and run tests (make test)
- Format your code (make format)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
All pull requests are automatically tested using our CI workflows.
MIT