This repository processes screen text data to extract linguistic features. The workflow consists of two main steps:
- Step 1: Preprocess screen text data using `run_screentext_preprocess_pipeline.py`
- Step 2: Generate linguistic features using `run.py`
Prerequisites:
- Python 3.6+
- Required Python packages (listed in `requirements.txt`)
- A conda environment (recommended: use the provided `scrtxt` environment)
Each participant folder must contain the following data tables in JSONL format:
| Table Name | Description | Required |
|---|---|---|
| `applications_foreground.jsonl` | Information about applications running in the foreground | ✅ |
| `screen.jsonl` | Screen state information (on/off events) | ✅ |
| `screentext.jsonl` | Text extracted from screens | ✅ |
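Because each table is stored as JSON Lines (one JSON object per line), records can be inspected with the standard library alone. The snippet below is only an illustration for sanity-checking your exports; it prints whatever fields happen to be present and does not assume any particular schema.

```python
import json

def peek_jsonl(path, n=3):
    """Print the fields of the first n records of a JSONL file (one JSON object per line)."""
    with open(path, "r", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if i >= n:
                break
            record = json.loads(line)      # each line is a complete JSON object
            print(sorted(record.keys()))   # show which fields this export actually contains

# Example (path is illustrative):
# peek_jsonl("participant_data/participant1/screentext.jsonl")
```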
If using the conda environment (recommended):

    conda activate scrtxt
    while read requirement; do conda install --yes $requirement || pip install $requirement; done < requirements.txt

If using pip:

    pip install -r requirements.txt
Before running the pipeline, ensure you have the following directory structure:
    .
    ├── participant_data/                      # Raw participant data (in JSONL format)
    │   ├── participant1/                      # Each participant has their own folder
    │   │   ├── applications_foreground.jsonl  # Required table
    │   │   ├── screen.jsonl                   # Required table
    │   │   └── screentext.jsonl               # Required table
    │   ├── participant2/
    │   └── ...
    ├── step1_data/                            # Will contain preprocessed data (created automatically)
    ├── step2_data/                            # Will contain feature data (created automatically)
    └── data_preprocessing/                    # Preprocessing scripts
Important requirements for your data:
- All raw participant data must be stored in the `participant_data` directory
- Each participant must have their own subfolder (e.g., `participant_data/participant1/`)
- Each participant folder must contain all required data tables: `applications_foreground.jsonl`, `screen.jsonl`, `screentext.jsonl`
- All data files must be in JSONL (JSON Lines) format
- The preprocessing pipeline will read these JSONL files and convert them into the required format for feature extraction
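Before running anything, it can help to confirm the layout with a small check script. The sketch below is not part of the repository; it only encodes the requirements listed above (one subfolder per participant containing the three required JSONL tables).

```python
from pathlib import Path

REQUIRED_TABLES = [
    "applications_foreground.jsonl",
    "screen.jsonl",
    "screentext.jsonl",
]

def check_participant_data(root="participant_data"):
    """Return {participant: [missing tables]} for folders that are incomplete."""
    problems = {}
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        missing = [t for t in REQUIRED_TABLES if not (folder / t).exists()]
        if missing:
            problems[folder.name] = missing
    return problems

if __name__ == "__main__":
    for participant, missing in check_participant_data().items():
        print(f"{participant}: missing {', '.join(missing)}")
```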
The first step is to preprocess the raw screen text data using `run_screentext_preprocess_pipeline.py`. This script performs multiple preprocessing steps:
- Generate app package pairs
- Clean screentext data
- Generate filtered system app transition files
- Add day IDs
- Calculate session metrics
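The pipeline's internals are not documented here, but the "Add day IDs" step can be pictured as assigning each record a day index from its timestamp in the participant's timezone (Australia/Melbourne by default). The sketch below is only an illustration of that idea, assuming millisecond UNIX timestamps and Python 3.9+ for `zoneinfo`; it is not the pipeline's actual code.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+; older interpreters could use pytz instead

def local_date(timestamp_ms, tz_name="Australia/Melbourne"):
    """Convert a millisecond UNIX timestamp to a calendar date in the given timezone."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=ZoneInfo(tz_name)).date()

def day_id(timestamp_ms, first_timestamp_ms, tz_name="Australia/Melbourne"):
    """0-based day index relative to the participant's first event, in local time."""
    return (local_date(timestamp_ms, tz_name) - local_date(first_timestamp_ms, tz_name)).days
```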
For processing a single participant:

    conda activate scrtxt
    python run_screentext_preprocess_pipeline.py --participant <participant_id> [--timezone <timezone>]

For processing all participants in parallel:

    conda activate scrtxt
    python run_screentext_preprocess_pipeline.py --all [--timezone <timezone>] [--workers <num>]
- `--participant`, `-p`: Participant ID to process (e.g., 1234)
- `--all`: Process data for all participants in parallel
- `--timezone`: Timezone for timestamp conversion (default: Australia/Melbourne)
- `--utc`: If set, overrides the timezone with UTC
- `--workers`: Number of worker threads for parallel processing (default: 75% of available CPU cores)
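The options above map naturally onto a standard `argparse` interface. The sketch below is a guess at how the flags and the 75%-of-cores default could be wired together, not the script's actual argument parser.

```python
import argparse
import os

def default_workers():
    """Roughly 75% of the available CPU cores, but never fewer than one worker."""
    return max(1, int((os.cpu_count() or 1) * 0.75))

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Preprocess raw screen text data")
    parser.add_argument("--participant", "-p", help="Participant ID to process (e.g. 1234)")
    parser.add_argument("--all", action="store_true", help="Process all participants in parallel")
    parser.add_argument("--timezone", default="Australia/Melbourne",
                        help="Timezone for timestamp conversion")
    parser.add_argument("--utc", action="store_true", help="Override the timezone with UTC")
    parser.add_argument("--workers", type=int, default=default_workers(),
                        help="Number of worker threads for parallel processing")
    args = parser.parse_args(argv)
    if args.utc:
        args.timezone = "UTC"  # --utc takes precedence over --timezone
    return args

if __name__ == "__main__":
    print(parse_args(["--all", "--workers", "4"]))
```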
After preprocessing, extract linguistic features using `run.py`. This script reads the cleaned data from `step1_data` and saves features to `step2_data`. Feature extraction is performed in parallel by default for multiple participants.
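Parallel extraction across participants can be pictured as a process pool mapped over the participant folders in `step1_data`. This is a simplified stand-in for what `run.py` does; the `extract_features_for` helper is hypothetical.

```python
import os
from multiprocessing import Pool
from pathlib import Path

def extract_features_for(participant_dir):
    """Placeholder for per-participant feature extraction (hypothetical helper)."""
    print(f"processing {participant_dir.name}")

def run_all(base_input_dir="step1_data", num_workers=None):
    """Fan the per-participant work out over a pool sized to ~75% of the CPU cores."""
    participants = [p for p in Path(base_input_dir).iterdir() if p.is_dir()]
    num_workers = num_workers or max(1, int((os.cpu_count() or 1) * 0.75))
    with Pool(processes=num_workers) as pool:
        pool.map(extract_features_for, participants)

if __name__ == "__main__":
    run_all()
```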
For processing a single participant:

    conda activate scrtxt
    python run.py --participant <participant_id>

For processing all participants in parallel:

    conda activate scrtxt
    python run.py
- `--base_input_dir`: Base directory containing preprocessed data (default: `step1_data`)
- `--base_output_dir`: Base directory for storing extracted features (default: `step2_data`)
- `--participant`, `-p`: Participant folder to process (if not specified, processes all)
- `--input_filename`: Input file name inside each participant folder (default: `clean_input.jsonl`)
- `--output_filename`: Output file name for extracted features (default: `linguistic_features.csv`)
- `--num_workers`: Number of worker processes to use (default: 75% of available CPU cores)
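Putting the defaults together, the per-participant flow is roughly "read `step1_data/<participant>/clean_input.jsonl`, compute features, write `step2_data/<participant>/linguistic_features.csv`". The sketch below follows that shape with deliberately trivial features (character and word counts); it is not the feature set the repository computes, and the `"text"` field name is an assumption about the cleaned records.

```python
import csv
import json
from pathlib import Path

def extract_features(participant,
                     base_input_dir="step1_data",
                     base_output_dir="step2_data",
                     input_filename="clean_input.jsonl",
                     output_filename="linguistic_features.csv"):
    """Read one participant's cleaned JSONL and write a CSV of (trivial) per-record features."""
    in_path = Path(base_input_dir) / participant / input_filename
    out_path = Path(base_output_dir) / participant / output_filename
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["record", "n_chars", "n_words"])   # stand-in feature columns
        for i, line in enumerate(src):
            text = json.loads(line).get("text", "")          # "text" field name is an assumption
            writer.writerow([i, len(text), len(text.split())])
```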
A complete example workflow:

    # Activate the conda environment
    conda activate scrtxt

    # Install dependencies (if not already installed)
    while read requirement; do conda install --yes $requirement || pip install $requirement; done < requirements.txt

    # Step 1: Preprocess data for all participants
    python run_screentext_preprocess_pipeline.py --all

    # Step 2: Extract features for all participants (using parallel processing)
    python run.py

    # Alternatively, process a single participant
    python run_screentext_preprocess_pipeline.py --participant 1234
    python run.py --participant 1234

    # Control the number of worker processes for parallel processing
    python run.py --num_workers 4
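If you prefer to drive the same workflow from Python (for example inside a job script), the two steps can be chained with `subprocess`. This is just a convenience wrapper around the commands above, not a feature of the repository, and it must still be run from the `scrtxt` environment.

```python
import subprocess
import sys

def run_pipeline(participant=None):
    """Run Step 1 and then Step 2, for one participant or for everyone."""
    step1 = [sys.executable, "run_screentext_preprocess_pipeline.py"]
    step2 = [sys.executable, "run.py"]
    if participant is not None:
        step1 += ["--participant", str(participant)]
        step2 += ["--participant", str(participant)]
    else:
        step1 += ["--all"]
    subprocess.run(step1, check=True)  # stop if preprocessing fails
    subprocess.run(step2, check=True)

if __name__ == "__main__":
    run_pipeline()        # all participants
    # run_pipeline(1234)  # or a single participant
```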
After running both steps:
- `step1_data/` will contain cleaned and preprocessed data for each participant
- `step2_data/` will contain CSV files with linguistic features for each participant
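Once Step 2 finishes, the per-participant CSVs under `step2_data/` can be combined for analysis. The sketch below assumes `pandas` is available (a common dependency, but not confirmed by this README) and uses the default output filename.

```python
from pathlib import Path

import pandas as pd  # assumption: pandas is installed in the scrtxt environment

def load_all_features(base_output_dir="step2_data", output_filename="linguistic_features.csv"):
    """Concatenate every participant's feature CSV, tagging rows with the participant ID."""
    frames = []
    for csv_path in sorted(Path(base_output_dir).glob(f"*/{output_filename}")):
        df = pd.read_csv(csv_path)
        df["participant"] = csv_path.parent.name
        frames.append(df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

if __name__ == "__main__":
    print(load_all_features().shape)
```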
Common issues and how to resolve them:
- Missing Data Tables: Ensure each participant folder contains all three required data tables: `applications_foreground.jsonl`, `screen.jsonl`, `screentext.jsonl`
- Environment Issues: Always run scripts in the proper conda environment (`conda activate scrtxt`)
- Parallel Processing: If you encounter memory issues during parallel processing, reduce the number of workers with `--workers` (Step 1) or `--num_workers` (Step 2)
- Processing Order: Ensure preprocessing (Step 1) has completed successfully before running feature extraction (Step 2)
The feature extraction process (Step 2) can be memory-intensive, especially when processing large input files:
- Segmentation Faults: If you encounter segmentation faults, it's likely due to memory limitations when processing very large text files. The code includes safeguards to handle large files by:
  - Processing text in smaller chunks
  - Limiting the amount of text analyzed for memory-intensive operations (NER, POS tagging)
  - Adding robust error handling to prevent crashes
- RAM Dependency: The maximum file size that can be processed depends on your system's available RAM:
  - 8GB RAM systems: May struggle with files larger than ~10MB
  - 16GB RAM systems: Should handle files up to ~30MB
  - 32GB+ RAM systems: Can process larger files more efficiently
- Adjusting Memory Usage: You can modify the following constants in `get_features.py` to adjust memory usage based on your system's capabilities:
  - `MAX_TEXT_CHUNK_SIZE`: Controls the maximum text size processed at once
  - `MAX_TOKENS_FOR_INTENSIVE_ANALYSIS`: Limits the token count for NLP operations
If you continue to experience memory issues with very large files, consider preprocessing the input files to split them into smaller chunks before running the feature extraction.
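As a concrete picture of the chunking idea (both the internal safeguards described above and the suggestion to split very large inputs yourself), the sketch below processes text in bounded chunks and caps how much is handed to the expensive analysis. The constant values and the `analyse` callback are illustrative only; the real limits are the constants in `get_features.py`.

```python
# Illustrative values only; the real limits live in get_features.py.
MAX_TEXT_CHUNK_SIZE = 100_000                # characters processed per chunk
MAX_TOKENS_FOR_INTENSIVE_ANALYSIS = 20_000   # cap for NER / POS-style passes

def iter_chunks(text, chunk_size=MAX_TEXT_CHUNK_SIZE):
    """Yield the text in fixed-size character chunks so memory use stays bounded."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]

def safe_analyse(text, analyse):
    """Run an expensive analysis chunk by chunk, truncating each chunk to the token cap."""
    results = []
    for chunk in iter_chunks(text):
        tokens = chunk.split()[:MAX_TOKENS_FOR_INTENSIVE_ANALYSIS]
        try:
            results.append(analyse(" ".join(tokens)))
        except MemoryError:
            continue  # skip a problematic chunk rather than crash the whole run
    return results
```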