Sentence Complexity Clustering on L2 Texts

This project explores unsupervised clustering of L2-written sentences based on syntactic and lexical complexity features.
The goal is to discover sentence-level writing patterns across proficiency levels using clustering and visualization techniques.

Project Motivation

This work was developed as part of a personal NLP research portfolio to showcase data preprocessing, feature engineering, and unsupervised learning skills on real-world L2 language data.
The project builds on previous research into syntactic complexity in second language writing.

Methodology

The pipeline includes:

Data Cleaning & Sentence Splitting
Feature Engineering
- Sentence length
- Type-Token Ratio (TTR)
- POS ratios (noun, verb, adjective)
- Syntactic depth (dependency distance)
Feature Correlation Analysis
Dimensionality Reduction
- PCA
- t-SNE (for visualization)
Clustering
- KMeans clustering of sentence features
Interpretation
- Example sentences + feature averages per cluster

All steps were implemented manually in Python without external linguistic toolkits (like TAASSC) to demonstrate full control over feature engineering.

Data

This project uses all 5600 samples in Written Essays (WE) from the ICNALE corpus, a publicly available collection of English essays written by second language users.

Note:
The ICNALE dataset is not included in this repository due to licensing restrictions.
To reproduce the full pipeline, please request access and download the dataset from the official ICNALE website linked above.
Once obtained, place the dataset files in data/raw/ as described in data/raw/README.md.

Results

Two main visualizations were created:

The PCA plot shows the global structure of sentence complexity clusters across two principal components.
A large distinct yellow cluster (Cluster 4) represents sentences that differ greatly (likely longer or more complex sentences), while a dense purple core (Cluster 0) suggests short or simple sentence structures.
The t-SNE plot focuses on local structure and reveals tighter, well-separated neighborhood clusters.
Distinct shapes and densities suggest subtle syntactic and lexical variation across learner sentence groups.

Cluster Interpretation Method

The clusters were interpreted using a two-step process combining qualitative review and quantitative feature analysis.

First, a sample of sentences from each cluster was reviewed. This provided initial qualitative impressions of the sentence types based on length, grammatical structure, and writing style. Example observations included short, awkward beginner sentences (e.g., "I agree this statement.") and very long, complex argumentative sentences with hedging and multi-clause structures.

Second, sentence-level linguistic features were analyzed numerically across clusters. Key features included sentence length (num_tokens), type-token ratio (ttr), and part-of-speech ratios (noun_ratio, verb_ratio, adj_ratio). These values provided objective evidence to refine and validate the initial hypotheses.

Both methods consistently revealed five distinct sentence types:

Cluster 0: Short, simple sentences with high verb density; typical of beginner writers.

Cluster 1: Medium-length factual statements with dense noun phrases; often used for general explanations.

Cluster 2: Longer opinion sentences mixing simple and compound structures; associated with transitional learners.

Cluster 3: Short sentences with high adjective use; descriptive or comparative structures.

Cluster 4: Very long, complex argumentative sentences; characteristic of advanced L2 writers attempting native-like syntax.

The agreement between qualitative and quantitative approaches strongly suggests that sentence complexity features can meaningfully group learner writing styles without prior labels. This validates the effectiveness of the feature design and unsupervised clustering approach.

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
data		data
notebooks		notebooks
outputs		outputs
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Sentence Complexity Clustering on L2 Texts

Project Motivation

Methodology

Data

Results

Cluster Interpretation Method

About

Uh oh!

Releases

Packages

Languages

License

Risang-L/sentence-complexity-clustering

Folders and files

Latest commit

History

Repository files navigation

Sentence Complexity Clustering on L2 Texts

Project Motivation

Methodology

Data

Results

Cluster Interpretation Method

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages