
This is a repository containing code for analyzing educational texts in Northern Ireland. CO-PI: Dr. Jing Xu, University of Washington


Northern Ireland Education Text Analysis

This project analyzes educational texts from Option2 and Option1 perspectives in Northern Ireland, comparing content across different document types including textbooks, policy documents, and teacher interviews.

Project Structure

Northern_Ireland_Education_Text/
├── README.md
├── requirements.txt
├── scripts/
│   ├── config.py
│   ├── utils.py
│   ├── file_reader.py
│   └── main.py
├── data/
│   └── strand1/
│       ├── option2/
│       │   ├── Madden (2011) CCEA revision guide Chp 3. Changing Relationships.docx
│       │   ├── Doherty (2001) Northern Ireland since c.1960.docx
│       │   ├── TeacherA_option2.docx
│       │   └── ... (more textbooks and interviews)
│       ├── option1/
│       │   ├── Madden (2007) History for CCEA GCSE Revision Guide - Chapter 3.docx
│       │   ├── TeacherB_option1.docx
│       │   └── ... (more textbooks and interviews)
│       └── both/
│           ├── Reconciled_interviews/
│           │   ├── TeacherA_reconciled.docx
│           │   ├── TeacherB_reconciled.docx
│           │   └── ... (more interviews)
│           └── GCSE History (2017)-specification-Standard.docx
└── outputs/
    └── processed_text_data.csv

Within data/strand1/:

  • option2/: All Option2 perspective documents (textbooks, teacher interviews, etc.)
  • option1/: All Option1 perspective documents (textbooks, teacher interviews, etc.)
  • both/: Documents shared across both perspectives (e.g., reconciled teacher interviews, policy documents)
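As an illustration, the grouping step can infer each document's group from its folder in the layout above. This is a minimal sketch; the helper name group_for is hypothetical and not taken from the repository's scripts/:

```python
from pathlib import Path

def group_for(path: Path) -> str:
    """Infer a document's group (option1 / option2 / both) from its folder.

    Assumes the data/strand1/<group>/... layout shown above.
    """
    for part in path.parts:
        if part in {"option1", "option2", "both"}:
            return part
    return "unknown"
```

Documents nested deeper (e.g., under both/Reconciled_interviews/) still resolve correctly because every path component is checked.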

Document Types

  • Textbooks: Educational materials by Madden, Doherty, and Johnston
  • Policy Documents: GCSE Planning Frameworks and specifications
  • Combined Resources: Comprehensive resource collections
  • Teacher Interviews: Teacher interview transcripts (can be under option2/, option1/, or both/)

Usage

  1. Install dependencies:

     pip install -r requirements.txt

  2. Run the pipeline for reading and grouping raw data:

     python -m scripts.main

  3. Check outputs in the outputs/ directory for processed data.

URL Processing with AI Fallback

The pipeline includes URL processing capabilities for combined documents:

  • Raw Content Fetching: Uses enhanced web scraping to fetch live content from URLs
  • AI Knowledge-Based Fallback: When raw fetching fails, uses OpenAI to generate summaries based on training data
  • Educational Focus: AI summaries focus on Northern Ireland education and history relevance
  • Automatic URL detection: Extracts URLs from text using regex patterns
  • Configurable limits: Control character limits and timeouts via config.py
  • Error handling: Handles failed requests and network issues
  • Rate limiting: Includes delays between requests to be respectful to servers
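A minimal sketch of how these pieces could fit together. The regex pattern, helper names, and delay value here are illustrative assumptions, not the repository's actual code in scripts/:

```python
import re
import time
from urllib.request import Request, urlopen

# Hypothetical URL pattern; the repo's actual regex may differ.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

def extract_urls(text):
    """Pull candidate URLs out of free text."""
    return URL_PATTERN.findall(text)

def fetch_url(url, timeout=15, max_chars=8000):
    """Fetch raw page content, truncated to max_chars; None on any failure."""
    try:
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")[:max_chars]
    except Exception:
        return None  # caller may fall back to the AI summarizer

def process_urls(text, delay=1.0):
    """Fetch each URL found in text, pausing between requests (rate limiting)."""
    results = {}
    for url in extract_urls(text):
        results[url] = fetch_url(url)
        time.sleep(delay)  # be respectful to servers
    return results
```

A None result signals that raw fetching failed, which is the trigger for the AI knowledge-based fallback described above.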

Configuration

URL processing can be configured in scripts/config.py:

# URL processing parameters
FETCH_URLS = False  # Set this to False to skip all URL processing
MAX_URL_CHARS = 8000  # Maximum characters to extract from each URL
URL_TIMEOUT = 15  # Timeout for URL requests in seconds

# OpenAI fallback parameters
USE_OPENAI_FALLBACK = False  # Set this to False to disable AI completely
OPENAI_MODEL = "gpt-4o-mini"  # OpenAI model to use for summarization
# OpenAI API key will be loaded from .env file or environment variable
MAX_AI_SUMMARY_CHARS = 2000  # Maximum characters for AI-generated summaries

AI Fallback Setup

To use the AI fallback functionality:

  1. Install the required libraries:

     pip install openai python-dotenv

  2. Set your OpenAI API key in the .env file:

     # In your .env file
     OPENAI_API_KEY=your-api-key-here

  3. The system will then automatically:
    • Try to fetch raw content from URLs using enhanced web scraping
    • If raw fetching fails, use OpenAI to generate knowledge-based summaries
    • Focus on Northern Ireland education and history relevance
    • Provide summaries based on AI's training data about the domain
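The fallback step might look roughly like the sketch below, which uses python-dotenv to load the key and the OpenAI v1 client. The function name and prompt wording are assumptions for illustration, not the repository's actual implementation:

```python
import os

def build_fallback_prompt(url):
    """Hypothetical prompt; the repo's real wording may differ."""
    return (
        f"Summarize what is known about the resource at {url}, "
        "focusing on its relevance to Northern Ireland education and history."
    )

def ai_summarize_url(url, model="gpt-4o-mini", max_chars=2000):
    """Generate a knowledge-based summary when raw URL fetching fails."""
    # Imports kept local so the rest of the pipeline runs without these packages.
    from dotenv import load_dotenv
    from openai import OpenAI

    load_dotenv()  # picks up OPENAI_API_KEY from .env
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_fallback_prompt(url)}],
    )
    return resp.choices[0].message.content[:max_chars]
```

Truncating to max_chars mirrors the MAX_AI_SUMMARY_CHARS limit in scripts/config.py.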

Testing (test file currently ignored)

Run the URL processing test:

python test_url_processing.py

Output Format

The processed data CSV now includes two content tracking columns:

  • has_url_content: Indicates whether content includes fetched web resources
    • True: Content includes fetched web resources from URLs found in the document
    • False: Content is from the original document only
  • has_ai_summary: Indicates whether content includes AI-generated summaries
    • True: Content includes AI-generated summaries (knowledge-based)
    • False: Content is from raw URL fetching or original document only

Content Labels

The system uses clear labels to identify content sources:

  • [AI-GENERATED SUMMARY FROM KNOWLEDGE BASE]: AI summary based on training data
  • --- URL Content {i}: {url} ---: Raw live web content
  • --- AI SUMMARY {i}: {url} ---: AI-generated summary
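The section labels above follow a regular shape, so they can be generated and recognized mechanically. A small sketch, assuming these helper names (they are illustrative, not the repo's API):

```python
import re

def url_content_label(i, url):
    """Label for raw live web content."""
    return f"--- URL Content {i}: {url} ---"

def ai_summary_label(i, url):
    """Label for an AI-generated summary."""
    return f"--- AI SUMMARY {i}: {url} ---"

# Matches either label form and captures kind, index, and URL.
LABEL_RE = re.compile(r"^--- (URL Content|AI SUMMARY) (\d+): (\S+) ---$")

def parse_label(line):
    """Return (kind, index, url) for a label line, else None."""
    m = LABEL_RE.match(line)
    if not m:
        return None
    kind = "ai" if m.group(1) == "AI SUMMARY" else "raw"
    return kind, int(m.group(2)), m.group(3)
```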

Content Flag Combinations

The output CSV includes two content tracking columns that work together:

| has_url_content | has_ai_summary | Content Type | Description |
|---|---|---|---|
| False | False | Original document content only | Pure text from the source document, no URL content |
| True | False | Raw URL content | Successfully fetched live web content from URLs |
| True | True | AI-generated URL content | AI summaries generated when raw URL fetching failed |
Note: When has_ai_summary=True, has_url_content is also True, since AI summaries are generated only for URLs whose raw fetch failed.
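When reading the CSV back, the two flags can be collapsed into a single content-type value. A sketch using only the standard library (the function names are illustrative; note that csv.DictReader yields flag values as the strings "True"/"False"):

```python
import csv

def load_rows(path):
    """Read the processed CSV into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def content_type(row):
    """Classify a row by its two content tracking flags."""
    has_url = row["has_url_content"] == "True"
    has_ai = row["has_ai_summary"] == "True"
    if has_ai:
        return "ai_summary"      # AI fallback was used for at least one URL
    if has_url:
        return "raw_url"         # live web content was fetched successfully
    return "original_only"       # pure text from the source document
```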

Data Cleaning

To clean the processed text data (lowercasing, stop word removal, lemmatization, etc.), run:

python scripts/preprocess.py

This will create a cleaned data file at outputs/cleaned_text_data.csv with an additional column cleaned_content containing the processed text.
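The core of such a cleaning step can be sketched as below. The stop-word set here is a tiny illustrative sample, and the real scripts/preprocess.py is assumed to use a full stop list and a lemmatizer (e.g., from NLTK or spaCy):

```python
import re

# Tiny illustrative stop-word set; the actual script presumably uses a full list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}

def clean_text(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)
```

Applying clean_text to each row's content column would populate the cleaned_content column described above.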
