- Prerequisites
- Installation
- Understanding the Data Structure
- Method 1: Using Parselmouth (Recommended)
- Normalisation Techniques
- DTW Analysis
- Results
- Advanced Features - Not yet implemented
- Troubleshooting
- Python 3.6 or higher
- Audio files in WAV format
- Metadata CSV file with file information
- Basic understanding of audio processing concepts
```bash
pip install praat-parselmouth pandas matplotlib numpy
```
Your directory structure should look like this:
```
project/
├── metadata.csv
├── pitch_contour_extractor.py
├── pitch_normalised_contour_extractor.py
├── audio_subset/
│   ├── student/
│   │   ├── 20220408145911.wav
│   │   ├── 20220407121244.wav
│   │   └── ...
│   └── teacher/
│       ├── 20220330100622.wav
│       ├── 20220329084512.wav
│       └── ...
└── output/
    ├── pitch_plots/
    ├── pitch_plots_normalised/
    ├── pitch_normalised_analysis_summary.csv
    └── pitch_data_extracted.csv
```
Your `metadata.csv` should have the following columns:
| s_file | t_file | s_bpm | t_bpm | s_scale | t_scale | ground_truth |
|--------|--------|-------|-------|---------|---------|--------------|
| 20220408145911.wav | 20220330100622.wav | 110 | 110 | G3 | A3 | `"[['0', '10.33', 'P'], ['3.919', '4.203', 'T']]"` |
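The `ground_truth` column appears to store a stringified list of `[start, end, label]` triples. Assuming that format (inferred from the sample row above), it can be parsed back into Python objects with `ast.literal_eval`:

```python
import ast

# Sample value from the ground_truth column (format assumed from the row above)
row_value = "[['0', '10.33', 'P'], ['3.919', '4.203', 'T']]"
segments = ast.literal_eval(row_value)  # -> list of [start, end, label] triples
for start, end, label in segments:
    print(float(start), float(end), label)
```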
First, define a helper that extracts the pitch contour from a single file:

```python
import parselmouth
import numpy as np

def extract_pitch_contour(audio_file_path, pitch_floor=75, pitch_ceiling=600, time_step=0.01):
    """
    Extract pitch contour from an audio file using Parselmouth/Praat.
    """
    try:
        # Load the sound file
        sound = parselmouth.Sound(audio_file_path)

        # Extract pitch using Praat's algorithm
        pitch = sound.to_pitch(time_step=time_step,
                               pitch_floor=pitch_floor,
                               pitch_ceiling=pitch_ceiling)

        # Get the pitch values (Hz) and frame times (s)
        pitch_values = pitch.selected_array['frequency']
        times = pitch.xs()
        return times, pitch_values
    except Exception as e:
        print(f"Error processing {audio_file_path}: {e}")
        return None, None
```
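For example, using one of the sample files from the directory layout above:

```python
times, pitch_values = extract_pitch_contour("audio_subset/student/20220408145911.wav")
if times is not None:
    print(f"{len(times)} frames, {np.count_nonzero(pitch_values)} voiced")
```

Unvoiced frames are returned as 0 Hz, which is why later steps filter or mask zeros.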
```python
import pandas as pd
import matplotlib.pyplot as plt
import os

def process_metadata_and_plot(csv_file_path):
    """
    Complete pipeline to process metadata CSV and create pitch plots.
    """
    # Read metadata
    df = pd.read_csv(csv_file_path)

    # Create output directory
    os.makedirs("pitch_plots", exist_ok=True)

    results = []
    for idx, row in df.iterrows():
        print(f"Processing pair {idx + 1}/{len(df)}")

        # Construct file paths
        student_file = f"audio_subset/student/{row['s_file']}"
        teacher_file = f"audio_subset/teacher/{row['t_file']}"

        # Extract pitch contours
        student_times, student_pitch = extract_pitch_contour(student_file)
        teacher_times, teacher_pitch = extract_pitch_contour(teacher_file)

        if student_times is not None and teacher_times is not None:
            # Create comparison plot
            create_comparison_plot(
                student_times, student_pitch, row['s_file'], row['s_bpm'], row['s_scale'],
                teacher_times, teacher_pitch, row['t_file'], row['t_bpm'], row['t_scale'],
                idx
            )
            # Store results for further analysis
            results.append({
                'index': idx,
                'student_file': row['s_file'],
                'teacher_file': row['t_file'],
                'student_times': student_times,
                'student_pitch': student_pitch,
                'teacher_times': teacher_times,
                'teacher_pitch': teacher_pitch
            })
    return results
```
```python
def create_comparison_plot(s_times, s_pitch, s_file, s_bpm, s_scale,
                           t_times, t_pitch, t_file, t_bpm, t_scale, index):
    """
    Create a comparison plot for student and teacher pitch contours.
    """
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

    # Clean pitch data (replace unvoiced 0 Hz frames with NaN so gaps are not drawn)
    s_pitch_clean = s_pitch.copy()
    s_pitch_clean[s_pitch_clean == 0] = np.nan
    t_pitch_clean = t_pitch.copy()
    t_pitch_clean[t_pitch_clean == 0] = np.nan

    # Plot student pitch contour
    ax1.plot(s_times, s_pitch_clean, 'b-', linewidth=1.5, label='Student')
    ax1.set_title(f"Student: {s_file} (BPM: {s_bpm}, Scale: {s_scale})")
    ax1.set_ylabel('Pitch (Hz)')
    ax1.grid(True, alpha=0.3)
    ax1.legend()

    # Plot teacher pitch contour
    ax2.plot(t_times, t_pitch_clean, 'r-', linewidth=1.5, label='Teacher')
    ax2.set_title(f"Teacher: {t_file} (BPM: {t_bpm}, Scale: {t_scale})")
    ax2.set_ylabel('Pitch (Hz)')
    ax2.set_xlabel('Time (s)')
    ax2.grid(True, alpha=0.3)
    ax2.legend()

    plt.tight_layout()

    # Save plot
    plot_filename = f"pitch_plots/comparison_{index:03d}.png"
    plt.savefig(plot_filename, dpi=300, bbox_inches='tight')
    plt.close()
    print(f"Saved: {plot_filename}")
```
Raw F0 values differ systematically between speakers, which is why normalisation is needed before comparison:

- Student (female): Typically 180-300 Hz range
- Teacher (male): Typically 80-180 Hz range
- Anatomical differences: Vocal fold size creates systematic frequency differences
- Comparison challenge: Raw values cannot be directly compared across speakers
**Z-score normalization:**

```
z = (frequency - speaker_mean) / speaker_std_dev
```

- Result: Mean = 0, standard deviation = 1
- Use when: Statistical analysis, machine learning
- Advantage: Both speakers on the same scale
- Caution: Unstable for flat contours (low standard deviation)
**Semitone conversion:**

```
semitones = 12 * log2(frequency / reference_frequency)
```

- Reference options: Speaker mean, 100 Hz, or a musical note
- Result: Perceptually meaningful intervals
- Use when: Musical analysis, cross-linguistic studies
- Advantage: Reflects human pitch perception
**Mean-centered Hz:**

```
centered = frequency - speaker_mean
```

- Result: Relative pitch movement in Hz
- Use when: Preserving Hz units is important
- Advantage: Simple and interpretable
- Best for: Student-teacher prosody comparison
**Pitch range normalization (POR):**

```
POR = (frequency - speaker_min) / (speaker_max - speaker_min)
```

- Result: Values between 0 and 1
- Use when: Comparing pitch range utilization
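For concreteness, here is a minimal sketch implementing all four methods on an array of voiced pitch values. The helper name and signature are illustrative, not part of the tutorial scripts:

```python
import numpy as np

def normalize_pitch(f0_hz, method="semitones", reference=None):
    """Normalize voiced pitch values in Hz (assumes unvoiced zeros are already removed)."""
    f0 = np.asarray(f0_hz, dtype=float)
    if method == "zscore":
        # Mean 0, standard deviation 1; unstable if the contour is flat
        return (f0 - f0.mean()) / f0.std()
    if method == "semitones":
        # Perceptual intervals relative to a reference (speaker mean by default)
        ref = reference if reference is not None else f0.mean()
        return 12 * np.log2(f0 / ref)
    if method == "centered":
        # Relative pitch movement, still in Hz
        return f0 - f0.mean()
    if method == "por":
        # Pitch range utilization, scaled to [0, 1]
        return (f0 - f0.min()) / (f0.max() - f0.min())
    raise ValueError(f"Unknown method: {method}")
```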
Recommended workflow:

- Extract raw pitch first using Praat/Parselmouth
- Apply normalization before plotting/analysis
- Choose a method based on your research goals:
  - Prosody comparison: mean-centered Hz or semitones
  - Statistical modeling: z-score normalization
  - Cross-speaker analysis: semitones relative to speaker mean
```python
# 1. Extract raw pitch (unnormalized)
pitch = sound.to_pitch()
pitch_values = pitch.selected_array['frequency']

# 2. Remove unvoiced frames (zeros)
voiced_frames = pitch_values[pitch_values > 0]

# 3. Apply normalization
speaker_mean = np.mean(voiced_frames)
normalized_pitch = voiced_frames - speaker_mean  # Mean-centered

# 4. Plot normalized contours
```
Key points:

- Raw Praat/Parselmouth output is NOT normalized; you must normalize manually
- Choose the normalization method based on your research questions
- Document which method was used, for reproducibility
- Consider voicing detection before normalization
The DTW pipeline processes your metadata.csv and audio files to produce comprehensive analysis results without requiring intermediate CSV storage. Here's what the pipeline does:
- Reads `metadata.csv` with student-teacher audio file pairs
- Extracts pitch contours from WAV files using Parselmouth
- Applies semitone normalization relative to each speaker's mean frequency
- Cost Matrix Computation: Uses a log-scale distance function (see the sketch after this list)
- Optimal Path Finding: Dynamic programming over the cost matrix
- SARGAM Note Mapping: Maps frequencies to Sa, Re, Ga, Ma, Pa, Dha, Ni
- Duration Metrics: Calculates student performance duration only
- Cost Aggregation: Both average and maximum methods for each note pair
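Below is a minimal sketch of the cost-matrix and path-finding steps. It assumes the frame-level cost is the absolute difference of log-frequencies; the pipeline's exact distance function and step constraints may differ:

```python
import numpy as np

def dtw_log_cost(student_f0, teacher_f0):
    """Illustrative DTW: log-scale cost matrix plus dynamic-programming alignment.

    Both inputs are 1-D arrays of voiced pitch values in Hz.
    """
    s = np.log(np.asarray(student_f0, dtype=float))
    t = np.log(np.asarray(teacher_f0, dtype=float))
    n, m = len(s), len(t)

    # Cost matrix: absolute difference of log-frequencies
    cost = np.abs(s[:, None] - t[None, :])

    # Accumulated cost via dynamic programming
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                # step in student
                acc[i, j - 1] if j > 0 else np.inf,                # step in teacher
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,  # step in both
            )
            acc[i, j] = cost[i, j] + prev

    # Backtrack from the end to recover the optimal alignment path
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = [(a, b) for a, b in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                      if a >= 0 and b >= 0]
        i, j = min(candidates, key=lambda ij: acc[ij])
        path.append((i, j))
    path.reverse()
    return cost, acc, path
```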
Each pair's results are stored in a dictionary of the following form:

```python
result = {
    'pair_id': 'pair_0',
    'cost_matrix': numpy_array,        # NxM matrix of log distances
    'accumulated_cost': numpy_array,   # DTW accumulated costs
    'optimal_path': [(i, j), ...],     # List of aligned frame indices
    'total_dtw_cost': float,           # Final DTW cost
    'path_length': int,                # Number of alignment points
    'student_duration': float,         # Duration in seconds
    'note_correspondences': {
        'student_notes': ['Sa', 'Re', ...],
        'teacher_notes': ['Sa', 'Re', ...],
        'note_pair_costs': {('Sa', 'Sa'): [0.02, 0.03, ...], ...}
    },
    'cost_aggregation': {
        'average': {('Sa', 'Sa'): 0.025, ('Sa', 'Re'): 0.067, ...},
        'max': {('Sa', 'Sa'): 0.045, ('Sa', 'Re'): 0.089, ...}
    }
}
```
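The SARGAM note names above come from mapping frequencies to scale degrees. A hypothetical mapping, assuming an equal-tempered shuddha scale relative to the tonic Sa (the pipeline's actual mapping may differ, e.g. in temperament or scale choice):

```python
import numpy as np

# Hypothetical semitone offsets from the tonic (Sa) for a shuddha scale
SARGAM_OFFSETS = {0: 'Sa', 2: 'Re', 4: 'Ga', 5: 'Ma', 7: 'Pa', 9: 'Dha', 11: 'Ni'}

def frequency_to_sargam(freq_hz, tonic_hz):
    """Map a frequency to the nearest SARGAM note, folded into one octave."""
    semitones = int(round(12 * np.log2(freq_hz / tonic_hz))) % 12
    # Snap to the nearest scale degree (circular distance within the octave)
    nearest = min(SARGAM_OFFSETS,
                  key=lambda off: min(abs(off - semitones), 12 - abs(off - semitones)))
    return SARGAM_OFFSETS[nearest]

# With a G3 tonic (about 196 Hz, matching the s_scale column), 220 Hz maps to 'Re'
print(frequency_to_sargam(220.0, 196.0))
```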
The pipeline generates `dtw_analysis_results.csv` with the following columns:
- `pair_id`, `student_file`, `teacher_file`
- `total_dtw_cost`, `path_length`, `student_duration`
- `avg_cost_Sa_to_Sa`, `avg_cost_Sa_to_Re`, etc.
- `max_cost_Sa_to_Sa`, `max_cost_Sa_to_Re`, etc.
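A sketch of how the per-pair result dictionaries shown above could be flattened into these columns (`all_results` is assumed to be a list of such dictionaries):

```python
import pandas as pd

rows = []
for result in all_results:
    row = {
        'pair_id': result['pair_id'],
        'total_dtw_cost': result['total_dtw_cost'],
        'path_length': result['path_length'],
        'student_duration': result['student_duration'],
    }
    # Expand each aggregated note-pair cost into its own column
    for method, prefix in (('average', 'avg'), ('max', 'max')):
        for (s_note, t_note), cost in result['cost_aggregation'][method].items():
            row[f'{prefix}_cost_{s_note}_to_{t_note}'] = cost
    rows.append(row)

pd.DataFrame(rows).to_csv('dtw_analysis_results.csv', index=False)
```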
Typical usage of the DTW pipeline:

```python
# Extract pitch data directly
pitch_data = process_metadata_csv('metadata.csv', normalization_method='semitones')

# Initialize DTW analyzer
dtw_analyzer = DTWAnalyzer(pitch_data)

# Run analysis on all pairs
results = dtw_analyzer.run_full_analysis()

# Visualize the first pair
first_pair = list(results.keys())[0]
dtw_analyzer.visualize_cost_matrix(results[first_pair])
```
(DTW Plots yet to be added)
For more robust voicing decisions, pitch can be combined with intensity. Note that pitch and intensity are computed on different frame grids, so the intensity track is resampled at the pitch frame times before thresholding:

```python
def advanced_pitch_analysis(audio_file_path):
    """
    Enhanced pitch analysis with voice activity detection.
    """
    sound = parselmouth.Sound(audio_file_path)

    # Extract pitch and intensity
    pitch = sound.to_pitch()
    intensity = sound.to_intensity()

    # Get arrays
    pitch_values = pitch.selected_array['frequency']
    intensity_values = intensity.values.flatten()
    times = pitch.xs()

    # Resample the intensity track at the pitch frame times so the
    # two sequences can be compared element-wise
    intensity_at_pitch = np.interp(times, intensity.xs(), intensity_values)

    # Simple voice activity detection based on an intensity threshold
    intensity_threshold = np.percentile(intensity_at_pitch[intensity_at_pitch > 0], 25)
    voiced_frames = (pitch_values > 0) & (intensity_at_pitch > intensity_threshold)

    return {
        'times': times,
        'pitch': pitch_values,
        'intensity': intensity_at_pitch,
        'voiced': voiced_frames
    }
```
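A possible usage, again with a sample file from the layout above:

```python
analysis = advanced_pitch_analysis("audio_subset/student/20220408145911.wav")
voiced_f0 = analysis['pitch'][analysis['voiced']]
print(f"Median voiced F0: {np.median(voiced_f0):.1f} Hz")
```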
```python
def calculate_pitch_statistics(results):
    """
    Calculate detailed statistics for pitch data.
    """
    stats_data = []
    for result in results:
        if result['student_pitch'] is not None:
            # Keep only voiced frames (non-zero F0) for each speaker
            student_voiced = result['student_pitch'][result['student_pitch'] > 0]
            teacher_voiced = result['teacher_pitch'][result['teacher_pitch'] > 0]

            stats = {
                'file_pair': f"{result['student_file']} vs {result['teacher_file']}",
                'student_mean_f0': np.mean(student_voiced) if len(student_voiced) > 0 else np.nan,
                'student_std_f0': np.std(student_voiced) if len(student_voiced) > 0 else np.nan,
                'student_range_f0': np.ptp(student_voiced) if len(student_voiced) > 0 else np.nan,
                'teacher_mean_f0': np.mean(teacher_voiced) if len(teacher_voiced) > 0 else np.nan,
                'teacher_std_f0': np.std(teacher_voiced) if len(teacher_voiced) > 0 else np.nan,
                'teacher_range_f0': np.ptp(teacher_voiced) if len(teacher_voiced) > 0 else np.nan,
                'voiced_percentage_student': len(student_voiced) / len(result['student_pitch']) * 100,
                'voiced_percentage_teacher': len(teacher_voiced) / len(result['teacher_pitch']) * 100
            }
            stats_data.append(stats)
    return pd.DataFrame(stats_data)
```
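The two helpers chain together; for instance (writing the summary CSV named in the directory layout above):

```python
results = process_metadata_and_plot("metadata.csv")
stats_df = calculate_pitch_statistics(results)
stats_df.to_csv("pitch_data_extracted.csv", index=False)
print(stats_df.describe())
```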
- **Parselmouth Installation Issues:**

  ```bash
  # Try updating pip first
  pip install --upgrade pip

  # Then install parselmouth
  pip install praat-parselmouth

  # If that fails, try with conda
  conda install -c conda-forge praat-parselmouth
  ```
- **Audio File Format Issues:**
  - Ensure audio files are in WAV format
  - Check the sampling rate (44.1 kHz recommended)
  - Verify file paths are correct
- **Memory Issues with Large Files:**

  ```python
  # Process files in chunks for large datasets
  def process_in_chunks(df, chunk_size=10):
      for i in range(0, len(df), chunk_size):
          chunk = df.iloc[i:i+chunk_size]
          # Process chunk
          yield chunk
  ```
- **Pitch Detection Parameter Tuning:**

  ```python
  # Adjust parameters based on speaker characteristics
  # For male speakers:
  pitch_floor_male = 50
  pitch_ceiling_male = 300

  # For female speakers:
  pitch_floor_female = 100
  pitch_ceiling_female = 500

  # For children:
  pitch_floor_children = 150
  pitch_ceiling_children = 800
  ```
- **Prepare your data structure** (see the directory layout above)

- **Install dependencies:**

  ```bash
  pip install praat-parselmouth pandas matplotlib numpy
  ```

- **Run the main script:**

  ```bash
  python pitch_contour_extractor.py
  ```

- **Check outputs:**
  - Individual plots in the `pitch_plots/` directory
  - Summary statistics in `pitch_data_extracted.csv`
  - Console output for processing progress
This tutorial provides a comprehensive approach to pitch contour extraction and analysis using modern Python tools integrated with Praat's powerful acoustic analysis capabilities.