feat: Add comprehensive HTML file support with intelligent parsing and minimal dependencies
feat: Add HTML file support for instrument loading
Feature/html file support #117
Description
This PR introduces comprehensive HTML file support for instrument loading in Harmony, implementing HTML parsing functionality with BeautifulSoup integration, complete documentation, and robust testing. The implementation addresses maintainer feedback from PR #72 by providing a lightweight, dependency-friendly solution that maintains code quality standards.
Key Changes
- `src/harmony/parsing/html_parser.py`: Complete HTML parsing implementation with BeautifulSoup integration and graceful fallback to basic text extraction when dependencies are unavailable
- Updated `wrapper_all_parsers.py` to route HTML and HTM files to the new parser
- Added `html` and `htm` file types to the FileType enumeration
Technical Implementation
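The routing and enum changes listed under Key Changes can be sketched roughly as follows. This is an illustration only, not the code in this PR: `route_file` and `parse_html` are hypothetical names, the `pdf` member is included just for context, and whether Harmony's `FileType` subclasses `str` is an assumption.

```python
from enum import Enum
from pathlib import Path


class FileType(str, Enum):
    # Only `html` and `htm` come from this PR; `pdf` is shown for context.
    pdf: str = "pdf"
    html: str = "html"
    htm: str = "htm"


def parse_html(raw: str) -> str:
    # Hypothetical stand-in for the new HTML parser; the real parser
    # returns instruments rather than a status string.
    return f"html parser received {len(raw)} characters"


def route_file(filename: str, raw: str) -> str:
    """Pick a parser from the file extension, ignoring case."""
    extension = Path(filename).suffix.lower().lstrip(".")
    try:
        file_type = FileType(extension)
    except ValueError:
        raise ValueError(f"Unsupported file type: {filename}") from None
    # Both extensions are sent to the same parser.
    if file_type in (FileType.html, FileType.htm):
        return parse_html(raw)
    raise NotImplementedError(f"No parser sketched here for {file_type.value}")
```

Treating `.html` and `.htm` as separate enum members that share one parser keeps the extension lookup a plain value match.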
Addressing Review Feedback (PR #72)
This implementation specifically addresses the concerns raised in previous review feedback:
Fixes #37
Type of change
Testing
Comprehensive testing has been implemented covering:
All tests pass locally and maintain compatibility with existing functionality. The Harmony API remains unaffected by these changes.
Test Configuration
Checklist
`requirements.txt`, `pyproject.toml` and also in the `requirements.txt` in the API repo
Additional Notes
This implementation prioritizes simplicity and maintainability while providing robust HTML parsing capabilities. The optional dependency approach ensures that Harmony remains lightweight while offering enhanced functionality when BeautifulSoup is available.
🎯 Why HTML File Support Matters
HTML file support is a critical enhancement for Harmony's instrument loading capabilities. Many research questionnaires and surveys are distributed in HTML format, especially in web-based research, online assessments, and digital health studies. This feature bridges a significant gap in Harmony's parsing ecosystem, enabling researchers to directly import and harmonize questionnaire content from HTML sources without manual conversion.
🔧 Technical Approach & Implementation
This implementation introduces intelligent HTML parsing with a dual-tier architecture:
Core Architecture
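A minimal sketch of the dual-tier idea, assuming the optional dependency is detected at import time. The function names here are illustrative and are not the private helpers in `html_parser.py`; the regular expressions are one plausible way to do the basic tier, not necessarily the ones used in this PR.

```python
import html
import re

try:
    from bs4 import BeautifulSoup  # optional dependency
except ImportError:
    BeautifulSoup = None  # signals that only the basic tier is available

_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def extract_text_basic(markup: str) -> str:
    """Basic tier: drop script/style blocks, strip tags, decode entities."""
    without_blocks = _SCRIPT_STYLE.sub(" ", markup)
    without_tags = _TAG.sub(" ", without_blocks)
    # Decode entities only after the tags are gone, so "&lt;b&gt;" stays text.
    text = html.unescape(without_tags)
    return _WHITESPACE.sub(" ", text).strip()


def extract_text(markup: str) -> str:
    """Use BeautifulSoup when it is installed, otherwise the basic tier."""
    if BeautifulSoup is None:
        return extract_text_basic(markup)
    soup = BeautifulSoup(markup, "html.parser")
    for element in soup(["script", "style"]):
        element.decompose()  # remove non-content elements from the tree
    return _WHITESPACE.sub(" ", soup.get_text(separator=" ")).strip()
```

For simple markup both tiers should return the same text; the DOM-based tier is mainly more tolerant of malformed or deeply nested HTML.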
Question Detection Intelligence
The parser applies heuristics to the extracted text to identify questionnaire items:
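The specific rules are not listed here, so the following is only a guess at what such a check might look like. The starter words, the numbering pattern, and the `min_length` threshold are all assumptions made for illustration, and `is_likely_question` and `extract_questions` are hypothetical names.

```python
import re

# Assumed list of words that commonly open a questionnaire item.
QUESTION_STARTERS = (
    "how", "what", "when", "where", "why", "who", "which",
    "do", "does", "did", "are", "is", "have", "has", "can", "could", "would",
)
# Matches item numbering such as "1. ", "12) " or "Q3: ".
NUMBERED_PREFIX = re.compile(r"^\s*(?:q\s*)?\d+\s*[.):]\s*", re.IGNORECASE)


def is_likely_question(line: str, min_length: int = 10) -> bool:
    """Return True when a line looks like a questionnaire item."""
    text = NUMBERED_PREFIX.sub("", line).strip()
    if len(text) < min_length:
        return False  # too short: probably a menu entry or label
    if text.endswith("?"):
        return True
    first_word = text.split()[0].lower()
    return first_word in QUESTION_STARTERS


def extract_questions(text: str) -> list[str]:
    """Keep the lines that pass the check, with their numbering removed."""
    return [
        NUMBERED_PREFIX.sub("", line).strip()
        for line in text.splitlines()
        if is_likely_question(line)
    ]
```

A rule set like this is cheap and dependency-free, but it will miss statement-style items (for example Likert statements), so it is best read as a starting point.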
📋 Changes Made
1. New HTML Parser (`src/harmony/parsing/html_parser.py`) - 257 lines
- `convert_html_to_instruments()` - Main entry point for HTML processing
- `_extract_text_with_beautifulsoup()` - Enhanced HTML processing with DOM parsing
- `_extract_text_basic()` - Regex-based HTML tag removal with entity handling
- `_extract_questions_from_text()` - Intelligent questionnaire item identification
- `_is_likely_question()` - Advanced heuristics for question detection
2. Parser Integration (`src/harmony/parsing/wrapper_all_parsers.py`) - 21 changes
- `.html` and `.htm` files
- `_get_instruments_from_file()`
3. FileType Enum Extension (`src/harmony/schemas/enums/file_types.py`) - 8 changes
- `html: str = 'html'` and `htm: str = 'htm'` enumerations
🎯 Addressing Maintainer Feedback from PR #72
This implementation specifically resolves all concerns raised in previous review feedback:
✅ Minimal Dependencies
`requirements.txt` or `pyproject.toml`
✅ Code Quality Standards
✅ Maintainability & Testing
🧪 Testing Coverage
Comprehensive test scenarios implemented:
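One scenario worth covering is the fallback path when BeautifulSoup is not installed. The tests in this PR are not shown here, so the following is a hedged sketch of how such a test could be written: `extract_text` is a simplified, hypothetical stand-in that imports `bs4` lazily, and the test hides the package through `sys.modules`.

```python
import re
import sys
from unittest import mock


def extract_text(markup: str) -> tuple[str, str]:
    """Return (text, tier), where tier names the code path that ran."""
    try:
        from bs4 import BeautifulSoup  # imported lazily so tests can hide it
    except ImportError:
        text = re.sub(r"<[^>]+>", " ", markup)
        tier = "basic"
    else:
        text = BeautifulSoup(markup, "html.parser").get_text(separator=" ")
        tier = "beautifulsoup"
    return re.sub(r"\s+", " ", text).strip(), tier


def test_falls_back_when_bs4_is_missing():
    # A None entry in sys.modules makes `import bs4` raise ImportError,
    # which simulates an environment without the optional dependency.
    with mock.patch.dict(sys.modules, {"bs4": None}):
        text, tier = extract_text("<p>How are you?</p>")
    assert tier == "basic"
    assert text == "How are you?"
```

`mock.patch.dict` restores `sys.modules` when the block exits, so the test does not affect other tests that rely on BeautifulSoup being importable.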
Test Configuration:
🚀 Minimal Dependencies Highlight
This implementation achieves full functionality without introducing any mandatory dependencies:
- `requirements.txt`
- `pyproject.toml`
The dual-tier approach ensures Harmony remains lightweight while providing enhanced capabilities when optional dependencies are present.
📝 Quality Assurance Checklist
💡 Future Enhancements
Ready for extension:
This implementation provides a robust, lightweight, and maintainable solution for HTML file support in Harmony, addressing all previous feedback while maintaining the project's high standards for code quality and minimal dependencies.
Fixes #37
I'm available for any additional improvements or modifications based on reviewer feedback and am committed to iterating on this implementation to meet the project's high standards.