Aydin62/A_Conceptual_Text_Analyzer-Clusterer

A CONCEPTUAL TEXT ANALYZER-CLUSTERER

Aydin Manzouri, 2025


SUMMARY


This notebook provides a demo toolbox for conceptual analysis and clustering of text data.


Objective

To analyze and cluster texts based on their conceptual loads, via a hybrid concept-aggregate approach


Contents

It offers the following:

(A) General

a.1. Utilizes spaCy for NLP

a.2. Works with a hard-coded sample concept_lexicon, which is an aggregate-concept dictionary with entries:

"aggregate": ['concept_1', 'concept_2', ...]

a.3. Is capable of working with both single docs and batches
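As a minimal sketch of the lexicon format described in a.2, the dictionary below maps each aggregate to its member concepts. The aggregate and concept names are invented for illustration and are not the notebook's actual lexicon; the inverse lookup is likewise an assumed convenience, not something the README describes.

```python
# Hypothetical sample lexicon: aggregate name -> list of member concepts.
# These entries are illustrative only, not the notebook's actual lexicon.
concept_lexicon = {
    "economy": ["market", "inflation", "trade"],
    "environment": ["climate", "pollution", "forest"],
}

# Assumed helper structure: concept -> aggregate, handy when matching tokens.
concept_to_aggregate = {
    concept: aggregate
    for aggregate, concepts in concept_lexicon.items()
    for concept in concepts
}

print(concept_to_aggregate["inflation"])  # economy
```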

(B) Working with single documents

b.1. Function analyze_txt integrates the pipeline for single docs as:

filepath → read_txt → nlp → token_ext → concept_matcher → concept_aggregator

b.2. concept_aggregator gives a tuple (detailed, aggregated) of data

b.3. Functions json_saver and json_loader save and load the above data tuple in JSON format, respectively

b.4. Function aggreg_visu generates and saves a bar chart from aggregated

b.5. And function concept_heatmap generates and saves a heatmap from detailed
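The single-document steps in b.1 to b.3 can be sketched roughly as follows. This is a standard-library approximation under stated assumptions: a regex tokenizer stands in for the spaCy step, the lexicon is hypothetical, and the shapes chosen for detailed (aggregate → per-concept counts) and aggregated (aggregate → total count) are guesses at what the notebook produces. The function names mirror the README, but the bodies are not the author's code.

```python
import json
import re
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical lexicon, for illustration only.
concept_lexicon = {
    "economy": ["market", "inflation", "trade"],
    "environment": ["climate", "pollution", "forest"],
}

def token_ext(text):
    """Stand-in for the spaCy step: lowercase word tokens via a regex."""
    return re.findall(r"[a-z]+", text.lower())

def concept_aggregator(tokens, lexicon):
    """Return (detailed, aggregated) counts (assumed shapes).

    detailed:   aggregate -> {concept: count} for concepts found in the text
    aggregated: aggregate -> total count over its concepts
    """
    counts = Counter(tokens)
    detailed = {
        agg: {c: counts[c] for c in concepts if counts[c] > 0}
        for agg, concepts in lexicon.items()
    }
    aggregated = {agg: sum(found.values()) for agg, found in detailed.items()}
    return detailed, aggregated

text = "Inflation hurts trade. Trade shapes the market, and climate shapes trade."
detailed, aggregated = concept_aggregator(token_ext(text), concept_lexicon)
print(aggregated)  # {'economy': 5, 'environment': 1}

# JSON round trip: a tuple has no JSON equivalent, so store it as an object
# with two named fields and rebuild the tuple after loading.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "analysis.json"
    path.write_text(json.dumps({"detailed": detailed, "aggregated": aggregated}))
    loaded = json.loads(path.read_text())
restored = (loaded["detailed"], loaded["aggregated"])
```

Storing the pair under named keys is one possible design choice; writing the tuple directly would also work, but it would come back from JSON as a list.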

(C) Working with multiple documents

c.1. Function batch_preprocess loads multiple text files and prepares the data for the next steps

c.2. Function batch_plot generates both plot types, a bar chart and a heatmap, for the documents in a batch

c.3. Functions batch_json_saver and batch_json_loader are batch-process analogs of their respective single-process functions

c.4. Function vectorizer converts batch-preprocessed data into vectorized format to be used in ML operations. It combines detailed and aggregated data into a single DataFrame

c.5. Finally, function cluster performs unsupervised learning, in the form of KMeans clustering. It:

  • receives data in vectorized format,
  • performs clustering,
  • applies PCA to reduce the high-dimensional data to two dimensions,
  • generates and saves the resulting 2D plot,
  • and returns a tuple (df_combo, cluster_labels)
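The vectorizing and clustering steps in c.4 and c.5 might look roughly like the sketch below, which assumes pandas and scikit-learn are installed. The per-document counts, the column prefixes, and the parameters n_clusters and seed are all hypothetical. Unlike the notebook's cluster function, this version skips the plot and returns the 2D PCA coordinates alongside (df_combo, cluster_labels).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical batch: per-document concept counts and aggregate totals.
batch = {
    "doc_a": {"detailed": {"market": 3, "trade": 2}, "aggregated": {"economy": 5}},
    "doc_b": {"detailed": {"market": 2, "inflation": 3}, "aggregated": {"economy": 5}},
    "doc_c": {"detailed": {"climate": 4, "forest": 1}, "aggregated": {"environment": 5}},
    "doc_d": {"detailed": {"climate": 2, "pollution": 3}, "aggregated": {"environment": 5}},
}

def vectorizer(batch):
    """Combine detailed and aggregated counts into one DataFrame (one row per doc)."""
    rows = {}
    for name, data in batch.items():
        row = {f"concept_{k}": v for k, v in data["detailed"].items()}
        row.update({f"aggregate_{k}": v for k, v in data["aggregated"].items()})
        rows[name] = row
    # Concepts absent from a document become NaN; treat them as zero counts.
    return pd.DataFrame.from_dict(rows, orient="index").fillna(0)

def cluster(df, n_clusters=2, seed=0):
    """Run KMeans on the full feature matrix, then project to 2D with PCA."""
    features = df.to_numpy()
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    coords = PCA(n_components=2).fit_transform(features)  # for plotting only
    return df, labels, coords

df_combo, cluster_labels, coords_2d = cluster(vectorizer(batch))
print(dict(zip(df_combo.index, cluster_labels)))
```

Clustering is done on the full-dimensional vectors and PCA is used only to obtain plotting coordinates, which matches the order of the steps listed above. The coordinates can be passed to any plotting library to draw the 2D scatter plot.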


