This project analyzes educational texts from Option2 and Option1 perspectives in Northern Ireland, comparing content across different document types including textbooks, policy documents, and teacher interviews.
```
Northern_Ireland_Education_Text/
├── README.md
├── requirements.txt
├── scripts/
│   ├── config.py
│   ├── utils.py
│   ├── file_reader.py
│   └── main.py
├── data/
│   └── strand1/
│       ├── option2/
│       │   ├── Madden (2011) CCEA revision guide Chp 3. Changing Relationships.docx
│       │   ├── Doherty (2001) Northern Ireland since c.1960.docx
│       │   ├── TeacherA_option2.docx
│       │   └── ... (more textbooks and interviews)
│       ├── option1/
│       │   ├── Madden (2007) History for CCEA GCSE Revision Guide - Chapter 3.docx
│       │   ├── TeacherB_option1.docx
│       │   └── ... (more textbooks and interviews)
│       └── both/
│           ├── Reconciled_interviews/
│           │   ├── TeacherA_reconciled.docx
│           │   ├── TeacherB_reconciled.docx
│           │   └── ... (more interviews)
│           └── GCSE History (2017)-specification-Standard.docx
└── outputs/
    └── processed_text_data.csv
```
- `option2/`: All Option2 perspective documents (textbooks, teacher interviews, etc.)
- `option1/`: All Option1 perspective documents (textbooks, teacher interviews, etc.)
- `both/`: All shared/interview/policy documents (e.g., reconciled teacher interviews, policy documents)
- Textbooks: Educational materials by Madden, Doherty, Johnston
- Policy Documents: GCSE Planning Frameworks and specifications
- Combined Resources: Comprehensive resource collections
- Teacher Interviews: Teacher interview transcripts (can be under `option2/`, `option1/`, or `both/`)
- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Run the pipeline for reading and grouping raw data:

  ```shell
  python -m scripts.main
  ```

- Check outputs in the `outputs/` directory for processed data.
The pipeline includes URL processing capabilities for combined documents:
- Raw Content Fetching: Uses enhanced web scraping to fetch live content from URLs
- AI Knowledge-Based Fallback: When raw fetching fails, uses OpenAI to generate summaries based on training data
- Educational Focus: AI summaries focus on Northern Ireland education and history relevance
- Automatic URL detection: Extracts URLs from text using regex patterns
- Configurable limits: Control character limits and timeouts via `config.py`
- Error handling: Handles failed requests and network issues
- Rate limiting: Includes delays between requests to be respectful to servers
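The URL detection and rate-limited fetching steps above can be sketched roughly as follows. This is a minimal, stdlib-only sketch; the regex pattern, fetching logic, and helper names here are assumptions, and the project's actual implementation in `scripts/` may differ:

```python
import re
import time
from urllib.request import urlopen
from urllib.error import URLError

# Hypothetical URL regex; the pattern actually used by the pipeline may differ.
URL_PATTERN = re.compile(r"https?://[^\s)>\"']+")

def extract_urls(text):
    """Return the unique URLs found in a block of text, in document order."""
    seen, urls = set(), []
    for url in URL_PATTERN.findall(text):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

def fetch_with_rate_limit(urls, timeout=15, delay=1.0):
    """Fetch each URL, pausing between requests to be polite to servers."""
    results = {}
    for url in urls:
        try:
            with urlopen(url, timeout=timeout) as resp:
                results[url] = resp.read().decode("utf-8", errors="replace")
        except (URLError, ValueError):
            results[url] = None  # a failed fetch falls through to the AI fallback
        time.sleep(delay)
    return results
```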
URL processing can be configured in `scripts/config.py`:

```python
# URL processing parameters
FETCH_URLS = False            # Set to True to enable URL processing
MAX_URL_CHARS = 8000          # Maximum characters to extract from each URL
URL_TIMEOUT = 15              # Timeout for URL requests in seconds

# OpenAI fallback parameters
USE_OPENAI_FALLBACK = False   # Set to True to enable the AI fallback
OPENAI_MODEL = "gpt-4o-mini"  # OpenAI model to use for summarization
# OpenAI API key is loaded from the .env file or an environment variable
MAX_AI_SUMMARY_CHARS = 2000   # Maximum characters for AI-generated summaries
```
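As an illustration of how the two flags interact, the fetch-then-fallback flow might look like this. The helpers `try_fetch` and `ai_summarise` are invented stand-ins, not the project's actual functions:

```python
def try_fetch(text):
    """Stand-in for the raw web-scraping step (returns None on failure)."""
    return None  # pretend every fetch fails in this sketch

def ai_summarise(text):
    """Stand-in for the OpenAI knowledge-based summary step."""
    return "\n[AI-GENERATED SUMMARY FROM KNOWLEDGE BASE]\n..."

def process_urls(text, fetch_urls=False, use_openai_fallback=False):
    """Return (content, has_url_content, has_ai_summary) for one document."""
    if not fetch_urls:
        return text, False, False            # URL processing skipped entirely
    fetched = try_fetch(text)
    if fetched is not None:
        return text + fetched, True, False   # raw live web content
    if use_openai_fallback:
        return text + ai_summarise(text), True, True  # AI fallback used
    return text, False, False                # nothing could be added
```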
To use the AI fallback functionality:

- Install the required libraries:

  ```shell
  pip install openai python-dotenv
  ```

- Set your OpenAI API key in the `.env` file:

  ```
  # In your .env file
  OPENAI_API_KEY=your-api-key-here
  ```

- The system will then automatically:
  - try to fetch raw content from URLs using enhanced web scraping;
  - if raw fetching fails, use OpenAI to generate knowledge-based summaries;
  - focus summaries on Northern Ireland education and history relevance;
  - base summaries on the AI's training-data knowledge of the domain.
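The key lookup can be sketched as follows: python-dotenv (when installed) copies entries from `.env` into the process environment, after which the key reads like any other environment variable. This is a minimal sketch with an assumed helper name, not the project's exact code:

```python
import os

def load_api_key():
    """Return the OpenAI API key from .env (via python-dotenv) or the environment."""
    try:
        from dotenv import load_dotenv
        load_dotenv()          # no-op if there is no .env file
    except ImportError:
        pass                   # python-dotenv not installed; use the plain environment
    return os.getenv("OPENAI_API_KEY")
```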
Run the URL processing test:

```shell
python test_url_processing.py
```
The processed data CSV now includes two content tracking columns:

- `has_url_content`: Indicates whether content includes fetched web resources
  - `True`: Content includes fetched web resources from URLs found in the document
  - `False`: Content is from the original document only
- `has_ai_summary`: Indicates whether content includes AI-generated summaries
  - `True`: Content includes AI-generated summaries (knowledge-based)
  - `False`: Content is from raw URL fetching or original document only
The system uses clear labels to identify content sources:

- `[AI-GENERATED SUMMARY FROM KNOWLEDGE BASE]`: AI summary based on training data
- `--- URL Content {i}: {url} ---`: Raw live web content
- `--- AI SUMMARY {i}: {url} ---`: AI-generated summary
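Because these labels are embedded in the stored text, the two tracking flags can in principle be recovered from the content itself. An illustrative sketch, not the pipeline's actual implementation:

```python
def classify_content(content):
    """Derive (has_url_content, has_ai_summary) from the source labels."""
    has_ai = ("[AI-GENERATED SUMMARY FROM KNOWLEDGE BASE]" in content
              or "--- AI SUMMARY" in content)
    # AI summaries are only produced for URLs, so they also imply URL content.
    has_url = has_ai or "--- URL Content" in content
    return has_url, has_ai
```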
The output CSV includes two content tracking columns that work together:

| has_url_content | has_ai_summary | Content Type | Description |
|---|---|---|---|
| False | False | Original document content only | Pure text from the source document, no URL content |
| True | False | Raw URL content | Successfully fetched live web content from URLs |
| True | True | AI-generated URL content | AI summaries generated when raw URL fetching failed |
Note: When `has_ai_summary=True`, `has_url_content` is also `True`, since AI summaries are only generated for URLs found in the document.
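For example, rows whose URL content was fetched live (rather than AI-generated) can be selected from the CSV using only the standard library. The column names come from the table above; the sample rows are invented for illustration:

```python
import csv
import io

# Invented sample rows mimicking outputs/processed_text_data.csv;
# the real file contains more columns.
SAMPLE = """filename,has_url_content,has_ai_summary
doc_a.docx,False,False
doc_b.docx,True,False
doc_c.docx,True,True
"""

def rows_with_raw_url_content(csv_text):
    """Keep rows with live-fetched URL content (no AI fallback involved)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["filename"] for row in reader
            if row["has_url_content"] == "True" and row["has_ai_summary"] == "False"]
```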
To clean the processed text data (lowercasing, stop word removal, lemmatization, etc.), run:

```shell
python scripts/preprocess.py
```

This will create a cleaned data file at `outputs/cleaned_text_data.csv` with an additional column, `cleaned_content`, containing the processed text.
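The cleaning steps can be approximated as follows. This is a toy sketch with an invented stop-word list; `scripts/preprocess.py` presumably uses a fuller stop-word set and real lemmatization (e.g. via NLTK or spaCy):

```python
import re

# Tiny illustrative stop-word list; the real preprocessing script
# would use a much fuller set.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "was"}

def clean_text(text):
    """Lowercase, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)
```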