Experimental prototyping for extracting text data from archive files for the student newspaper - the Bucknellian.

BucknellDSC/bucknellian

bucknellian

Experimental prototyping for extracting text data from archive files for the student newspaper - the Bucknellian.

Goals for project in spring 2025

The project starts with basic prototyping during the spring 2025 semester:

  1. Extract text from the PDF archive files provided by BU Archives.
  2. Split the extracted text into issues. The PDF files are currently compiled by academic year or volume, so to facilitate digital analysis they need to be split into smaller chunks by issue date or number.
  3. Evaluate the extracted digital text to see whether it is usable for text analysis. If the OCR quality is too poor, we may explore other OCR tools.
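
The three goals above could be prototyped roughly as follows. This is a minimal sketch under several assumptions, not the project's chosen method: it assumes the third-party `pypdf` package for reading the text layer already embedded in the PDFs, it assumes each issue prints a date such as "October 12, 1956" near the top of its pages, and it uses a crude share-of-alphabetic-tokens score as a stand-in for OCR quality. The names `extract_pages`, `split_into_issues`, `wordlike_ratio`, and `DATE_RE` are hypothetical.

```python
import re

# Assumed masthead date format, e.g. "October 12, 1956". The real archive
# may print dates differently; adjust after inspecting the extracted text.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_RE = re.compile(rf"\b({MONTHS})\s+(\d{{1,2}}),\s+(\d{{4}})\b", re.IGNORECASE)


def extract_pages(pdf_path):
    """Return one text string per page (assumes the pypdf package is installed)."""
    from pypdf import PdfReader  # imported lazily so the rest runs without it
    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def split_into_issues(pages, header_lines=5):
    """Group consecutive pages into (date, [page_text, ...]) tuples.

    A page starts a new issue when a date is found in its first few lines
    and that date differs from the date of the issue being collected.
    """
    issues = []
    for text in pages:
        head = "\n".join(text.splitlines()[:header_lines])
        match = DATE_RE.search(head)
        date = None
        if match:
            # Normalize case and spacing so "OCTOBER  12, 1956" compares equal.
            date = f"{match.group(1).title()} {int(match.group(2))}, {match.group(3)}"
        if date and (not issues or issues[-1][0] != date):
            issues.append((date, [text]))
        elif issues:
            issues[-1][1].append(text)
        else:
            issues.append((None, [text]))  # pages before the first detected date
    return issues


def wordlike_ratio(text):
    """Crude OCR quality proxy: share of tokens that are purely alphabetic."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = sum(1 for t in tokens if t.strip(".,;:!?\"'()-").isalpha())
    return good / len(tokens)
```

A volume could then be processed with `split_into_issues(extract_pages("volume.pdf"))`, where `volume.pdf` is a placeholder file name, and each issue scored with `wordlike_ratio`. A low ratio only suggests that a page deserves a closer look; it does not replace reading a sample of the text by hand.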

Project requirements and preferences:

  • Coding will be in Python, unless there is a compelling reason to change.
  • Feel free to use open source packages where appropriate, but please document their use in the submitted code or in Readme.md.
  • The project will use GitHub to track code changes and team contributions.
  • Students should start by cloning the main repository, then work on and push to their own branches.
  • We will attempt to review code pushes on a weekly basis.
