Experimental prototyping for extracting text data from archive files of the student newspaper, the Bucknellian.
The project is starting with some basic prototyping during the spring 2025 semester:
- Extract text from the PDF archive files provided by BU Archives
- Split the extracted text into issues. The PDF files are currently compiled by academic year or volume, so to facilitate digital analysis they need to be split into smaller chunks by issue date or number.
- Evaluate the extracted digital text to see if it is usable for text analysis. If the OCR quality is too poor, we may explore other OCR tools.
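
As a starting point for the splitting and evaluation steps above, here is a minimal sketch using only the Python standard library. It assumes the text has already been extracted from a PDF into a string. The issue-header pattern (`ISSUE_HEADER`), the function names, and the sample text are all hypothetical: the real masthead format in the archive files needs to be checked and the pattern adjusted. The quality score is only a rough heuristic (the share of words found in a known-word list), not an established OCR metric.

```python
import re

# ASSUMPTION: each issue starts with a line like "VOL. XXVI, NO. 1 ...".
# This pattern is a guess for illustration, not the confirmed masthead format.
ISSUE_HEADER = re.compile(
    r"^VOL(?:UME)?\.?\s+\S+,?\s+NO\.?\s+(\d+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_into_issues(text):
    """Split one volume's text into {issue_number: issue_text}.

    Text before the first header is dropped, and a repeated issue
    number overwrites the earlier entry.
    """
    matches = list(ISSUE_HEADER.finditer(text))
    issues = {}
    for i, match in enumerate(matches):
        # Each issue runs from its header to the start of the next header.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        issues[int(match.group(1))] = text[match.start():end].strip()
    return issues


def ocr_quality_score(text, vocabulary):
    """Return the fraction of alphabetic tokens found in `vocabulary`.

    `vocabulary` is any set of lowercase known words (hypothetical input,
    for example a word list loaded from a file).
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


if __name__ == "__main__":
    # Made-up sample text, only to show how the functions fit together.
    sample = (
        "VOL. XXVI, NO. 1  SEPTEMBER 15\n"
        "Students arrive on campus this week.\n"
        "VOL. XXVI, NO. 2  SEPTEMBER 22\n"
        "Stud3nts arr1ve 0n c4mpus th1s w33k.\n"
    )
    known_words = {"students", "arrive", "on", "campus", "this", "week"}
    for number, issue_text in split_into_issues(sample).items():
        print(number, round(ocr_quality_score(issue_text, known_words), 2))
```

A low score on a sample of issues would be one signal that the existing OCR text needs to be redone with a different tool. The threshold for "too low" is a judgment call the team would need to settle after looking at real output.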
Some project requirements and desires:
- Coding will be in Python, unless there is a compelling reason to change.
- Feel free to use open source packages where appropriate, but please document their use in the submitted code or in the Readme.md.
- Project will use GitHub to track code changes and team contributions.
- Students should start by cloning the main repository, then do their work on branches and push those branches.
- We will attempt to review code pushes on a weekly basis.