This notebook provides a demo toolbox for conceptual analysis and clustering of text data.
Its aim is to analyze and cluster texts based on their conceptual loads, using a hybrid concept-aggregate approach.
It offers the following:
a.1. Utilizes spaCy for NLP
a.2. Works with a hard-coded sample concept_lexicon, which is an aggregate-concept dictionary with entries: "aggregate": ['concept_1', 'concept_2', ...]
a.3. Is capable of working with both single docs and batches
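
As an illustration of the entry format in a.2, here is a minimal sketch of such a lexicon together with an inverted lookup. The aggregate names, the concepts, and the helper name concept_to_aggregate are hypothetical and are not taken from the notebook.

```python
# Hypothetical sample lexicon: the aggregate names and concepts are illustrative only.
concept_lexicon = {
    "emotion": ["joy", "fear", "anger"],
    "nature": ["river", "forest", "mountain"],
}

# Invert the lexicon so that each concept points back to its aggregate.
# This makes it cheap to find the aggregate for any matched concept.
concept_to_aggregate = {
    concept: aggregate
    for aggregate, concepts in concept_lexicon.items()
    for concept in concepts
}

print(concept_to_aggregate["river"])
```

An inverted mapping like this is one possible way to go from a matched concept to its aggregate; the notebook may resolve aggregates differently.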
b.1. Function analyze_txt integrates the pipeline for single docs as: filepath → read_txt → nlp → token_ext → concept_matcher → concept_aggregator
b.2. concept_aggregator returns a data tuple (detailed, aggregated)
b.3. Functions json_saver and json_loader save and load the above data tuple in JSON format, respectively
b.4. Function aggreg_visu generates and saves a bar chart from aggregated
b.5. Function concept_heatmap generates and saves a heatmap from detailed
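
The single-document flow in b.1 to b.3 can be sketched roughly as follows. This is not the notebook's code: every function with a _sketch suffix is a hypothetical stand-in, a regex tokenizer replaces the spaCy nlp and token_ext steps, the text is passed in directly instead of being read from a file, and the shapes assumed for detailed (concept to count) and aggregated (aggregate to summed count) are guesses at what the tuple might hold.

```python
import json
import re
from collections import Counter

# Hypothetical lexicon in the "aggregate": [concepts] format.
concept_lexicon = {
    "emotion": ["joy", "fear"],
    "nature": ["river", "forest"],
}

def token_ext_sketch(text):
    # Stand-in for the nlp + token_ext steps: lowercase word tokens via regex.
    return re.findall(r"[a-z]+", text.lower())

def concept_matcher_sketch(tokens, lexicon):
    # Keep only the tokens that appear as concepts in the lexicon.
    known = {c for concepts in lexicon.values() for c in concepts}
    return [t for t in tokens if t in known]

def concept_aggregator_sketch(matches, lexicon):
    # Assumed shapes: detailed maps concept -> count,
    # aggregated maps aggregate -> summed count of its concepts.
    detailed = dict(Counter(matches))
    aggregated = {
        agg: sum(detailed.get(c, 0) for c in concepts)
        for agg, concepts in lexicon.items()
    }
    return detailed, aggregated

def analyze_text_sketch(text, lexicon):
    # Chain the stand-in steps in the same order as the pipeline in b.1.
    tokens = token_ext_sketch(text)
    matches = concept_matcher_sketch(tokens, lexicon)
    return concept_aggregator_sketch(matches, lexicon)

detailed, aggregated = analyze_text_sketch(
    "Joy by the river; fear in the forest, joy again.", concept_lexicon
)

# JSON round trip: the pair is serialized as a two-element array.
payload = json.dumps([detailed, aggregated])
restored_detailed, restored_aggregated = json.loads(payload)
```

Storing the pair as a two-element JSON array is one simple option; JSON has no tuple type, so a loader has to unpack the array back into two values as shown.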
c.1. Function batch_preprocess loads multiple text files and prepares the data for the next steps
c.2. Function batch_plot generates both plot types (bar chart and heatmap) for a batch of documents
c.3. Functions batch_json_saver and batch_json_loader are batch-process analogs of their respective single-process functions
c.4. Function vectorizer converts batch-preprocessed data into vectorized format to be used in ML operations. It combines detailed and aggregated data into a single DataFrame
c.5. Finally, function cluster performs unsupervised learning, in the form of KMeans clustering. It:
- receives data in vectorized format,
- performs clustering,
- applies PCA to high-dimensional data,
- generates and saves the resulting 2D plot,
- and returns a tuple (df_combo, cluster_labels)
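
The vectorizing and clustering steps in c.4 and c.5 might look roughly like the sketch below. It assumes pandas and scikit-learn are installed, uses made-up per-document counts, and leaves out the plotting step. Only the names df_combo and cluster_labels come from the description above; the other variable names, the choice of two clusters, and the KMeans settings are assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical per-document counts, as batch preprocessing might produce them.
detailed_rows = {
    "doc_a": {"joy": 5, "fear": 4, "river": 0, "forest": 0},
    "doc_b": {"joy": 4, "fear": 5, "river": 1, "forest": 0},
    "doc_c": {"joy": 0, "fear": 0, "river": 5, "forest": 4},
    "doc_d": {"joy": 0, "fear": 1, "river": 4, "forest": 5},
}
aggregated_rows = {
    "doc_a": {"emotion": 9, "nature": 0},
    "doc_b": {"emotion": 9, "nature": 1},
    "doc_c": {"emotion": 0, "nature": 9},
    "doc_d": {"emotion": 1, "nature": 9},
}

# One row per document; concepts missing from a document count as 0.
df_detailed = pd.DataFrame.from_dict(detailed_rows, orient="index").fillna(0)
df_aggregated = pd.DataFrame.from_dict(aggregated_rows, orient="index").fillna(0)

# Combine detailed and aggregated columns into a single DataFrame.
df_combo = pd.concat([df_detailed, df_aggregated], axis=1)

# KMeans with an assumed k=2; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(df_combo.to_numpy())

# Project the feature space down to 2D, e.g. for a scatter plot.
coords_2d = PCA(n_components=2).fit_transform(df_combo.to_numpy())
```

Here coords_2d holds one 2D point per document, which could then be drawn as a scatter plot colored by cluster_labels. The numeric label assigned to each cluster is arbitrary, so only the grouping of documents is meaningful.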