NP-Search-Tools

NP-Search-Tools is a modular toolkit for preprocessing text, generating embeddings, training and running inference on topic models, and building semantic graphs. The output is structured for integration with the NP-Backend-Dockers backend for indexing and further analysis.

Main Modules

All core functionalities are located in the src directory and are organized into the following modules:

Preprocessor

Cleans and tokenizes raw text. The default pipeline detects language, normalizes text, and applies a customizable sequence of transformations.

Main Components:

  • Lemmatizer
    You can choose either the custom rule-based lemmatizer for Spanish or faster alternatives such as those provided by SpaCy. By default, SpaCy is used (vocabulary: false in options.yaml).

  • NgramProcessor
    Provides functions to search, filter, replace, or validate n-grams.

  • TextProcessor
    Builds and executes the preprocessing pipeline. It integrates all previously mentioned tools and supports custom stopwords, ngrams, and vocabularies.
    Supported steps:

    • lowercase
    • remove_tildes
    • remove_extra_spaces
    • remove_punctuation
    • remove_urls
    • lemmatize_text
    • pos_tagging
    • clean_text: keeps words with at least min_len characters and filters out malformed tokens.
    • convert_ngrams
    • remove_stopwords
    • correct_spelling (needs improvement)
    • tokenize_text

    Example pipeline:

    methods = [
        {"method": "lowercase"},
        {"method": "remove_urls"},
        {"method": "lemmatize_text"},
        {"method": "clean_text", "args": {"min_len": 1}},
        {"method": "remove_stopwords"}
    ]
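
    This pipeline could be applied with TextProcessor roughly as follows. This is a minimal sketch: the import path, constructor arguments, and method name are assumptions; check src/Preprocessor for the actual interface.

    from src.Preprocessor.TextProcessor import TextProcessor  # hypothetical import path

    # Hypothetical constructor; the real class may also take stopword, ngram, and vocabulary files.
    processor = TextProcessor(methods=methods)  # "methods" is the list defined above

    raw = "El Ayuntamiento publica la licitación del servicio de limpieza: https://example.org"
    tokens = processor.process(raw)  # assumed entry point that applies the configured steps in order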

The module is executed via the preprocess.py script, which loads data, applies preprocessing, and saves the results:

python preprocess.py --options config/options.yaml

It merges input data from various sources (insiders, outsiders, minors) using merge_data, and processes the _title_ and _summary_ columns (concatenated). Update the code if input parquet formats change. The pipeline uses parallelization when possible (configurable via use_dask).
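
When use_dask is enabled, the heavy text processing can be spread across partitions. A rough sketch of that pattern (not the exact code in preprocess.py, which also handles the merged parquet inputs):

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"text": ["primer documento de prueba", "segundo documento de prueba"]})

def preprocess_text(text: str) -> str:
    # Placeholder for the configured pipeline (lowercase, lemmatize, remove stopwords, ...)
    return text.lower()

# Split the dataframe into partitions and map the preprocessing function in parallel.
ddf = dd.from_pandas(df, npartitions=4)
ddf["text"] = ddf["text"].map(preprocess_text, meta=("text", "object"))
result = ddf.compute()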

Configuration (options.yaml)

  • use_dask: Enable parallel processing.
  • subsample: Limit dataset size for faster runs or debugging.
  • pipe: List of preprocessing operations to apply (as shown above).
  • dir_*: Paths to data, metadata, stopwords, ngrams, vocabulary, and output files.
  • use_stopwords, use_ngrams: Select specific files, or "all" to load every available file.

See the commented options.yaml file for detailed descriptions of each setting.
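
For orientation, these settings can be read in Python as shown below; the keys are the ones documented above, and the full structure is defined in the commented options.yaml:

import yaml

with open("config/options.yaml", "r", encoding="utf-8") as f:
    options = yaml.safe_load(f)

use_dask = options.get("use_dask", False)     # enable parallel processing
subsample = options.get("subsample")          # limit dataset size for debugging
pipe = options.get("pipe", [])                # ordered list of preprocessing steps
use_stopwords = options.get("use_stopwords")  # specific stopword files, or "all"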

Data Directory

Configured via options.yaml; defaults to /app/data or /app/data/data.zip.

/app/data or /app/data/data.zip
├───ngrams
│    ├───ngrams.txt
│    └───proposed_ngrams.txt
├───stopwords
│    ├───common_stopwords.txt
│    ├───administración.txt
│    └───municipios.txt
├───vocabulary
│   └───vocabulary.txt
└───metadata
    ├───insiders.parquet
    ├───outsiders.parquet
    └───minors.parquet

Topic models

Contains several topic model implementations, all wrapped under the common BaseModel class. The module can be adapted by changing the data source and the models to use.

Models:

  • BaseModel
    A foundational class for all other models. Any subclass extending this base must implement the following core methods: _model_train, _model_predict, and load_model. In addition to defining the training and inference interfaces, it provides utility methods for saving documents, as well as saving and loading topic keys, document-topic distributions, and related data.

  • BERTopic
    Training and prediction follow the scheme described in the BERTopic documentation (https://maartengr.github.io/BERTopic/algorithm/algorithm.html). The embedding model used for Spanish, Catalan, Galician, and Basque is paraphrase-multilingual-MiniLM-L12-v2 (see the sketch after this list).

  • Mallet
    A wrapper for Mallet; you only need to provide the location of the bin/mallet binary. The rest of the setup is the same as for regular Mallet.

  • Tomotopy
    Uses Tomotopy (https://bab2min.github.io/tomotopy/v/en/). There are two versions, tomotopyLDA and tomotopyCT, with similar implementations.

  • NMF
    A simple implementation of NMF that uses a TF-IDF vectorizer or a CountVectorizer.

  • Gensim
    The basic Gensim model, used in the tests as a baseline.

  • topic_models_test.py
    Main script. Pass the --options parameter pointing to the YAML file that contains all settings. All models are executed first, and then model quality information is printed (execution times, PMI, etc.).
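
As a reference for the BERTopic entry above, training with the multilingual sentence transformer mentioned there looks roughly like this; the wrapper in src adds the BaseModel interface around it, so treat this as a sketch of the underlying library usage rather than the project's exact code:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# docs: list of (preprocessed) documents; a realistically sized corpus is needed for clustering
docs = ["licitación de servicios de limpieza viaria",
        "contrato de suministro de material informático"]

# Embedding model used for Spanish, Catalan, Galician, and Basque (see above)
embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

topic_model = BERTopic(embedding_model=embedding_model, language="multilingual")
topics, probs = topic_model.fit_transform(docs)

# Inference on unseen documents
new_topics, new_probs = topic_model.transform(["obras de rehabilitación del edificio municipal"])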

Embeddings

Provides the Embedder class to compute embeddings using Word2Vec or BERT. These embeddings are used internally for topic labeling and semantic representations of topics.
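
As a sketch of the Word2Vec side (the Embedder class in src wraps this; the averaged topic representation below is an assumption about how topic descriptions are embedded, not confirmed project code):

import numpy as np
from gensim.models import Word2Vec

# Tokenized, preprocessed corpus (illustrative)
sentences = [["licitación", "servicio", "limpieza"],
             ["contrato", "suministro", "material", "informático"]]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# One possible way to embed a topic description: average its word vectors.
topic_words = ["licitación", "contrato", "servicio"]
vectors = [w2v.wv[w] for w in topic_words if w in w2v.wv]
topic_embedding = np.mean(vectors, axis=0)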

Graph

Builds and visualizes a semantic graph using Bhattacharyya distances between topic distributions. Community detection is performed with the Louvain algorithm, and node layout is computed with ForceAtlas2. Graphs include visual overlays of communities and key metrics.

Note: The Graph module requires a separate Docker container with Python 3.8, as ForceAtlas2 is not compatible with Python 3.9 and above. This is distinct from the Docker environment used for preprocessing and training topic models.
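
For intuition, a minimal sketch of the distance and community steps; the edge threshold, weighting, and Louvain implementation here are assumptions (build_graph.py may differ), and the ForceAtlas2 layout itself runs inside the container described below:

import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities

def bhattacharyya_distance(p: np.ndarray, q: np.ndarray) -> float:
    # D_B(p, q) = -ln( sum_i sqrt(p_i * q_i) ) for two discrete distributions
    bc = np.sum(np.sqrt(p * q))
    return float(-np.log(max(bc, 1e-12)))

# Topic distributions, one per node (illustrative random data)
rng = np.random.default_rng(0)
thetas = rng.dirichlet(np.ones(10), size=20)

G = nx.Graph()
for i in range(len(thetas)):
    for j in range(i + 1, len(thetas)):
        d = bhattacharyya_distance(thetas[i], thetas[j])
        if d < 1.0:  # illustrative threshold
            G.add_edge(i, j, weight=1.0 / (d + 1e-12))

# Louvain community detection (networkx >= 2.8; python-louvain is an alternative)
communities = louvain_communities(G, weight="weight", seed=0)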

Build and Run (from Graph directory):

# Build Docker image
docker build -t np_graphs .
# Run graph construction
docker run --rm \
    -v /absolute/path/to/models:/data/source \
    np_graphs \
    python3 build_graph.py \
    --path_model /data/source/<model_directory_name>

Utils

General-purpose helper functions, e.g., for parallelizing work or loading data.

Output Structure

After running all modules, the following output directory structure is generated:

model
├── graphs
│   ├── draw_graph_communities_communities.csv # CSV file containing community assignments for each node in the graph.
│   ├── draw_graph_communities.png             # PNG image visualizing the graph with community overlays.
│   ├── draw_graph_forceatlas2.png             # PNG image visualizing the graph layout using the ForceAtlas2 algorithm.
│   ├── draw_graph_lcc.png                     # PNG image visualizing the largest connected component of the graph.
│   ├── graph_edges.csv                        # CSV file containing the edges of the graph with source and target nodes.
│   └── graph_nodes.csv                        # CSV file containing the nodes of the graph with their attributes.
├── infer_data                     # Contains data and outputs related to the inference process
│   ├── corpus.txt                 # Raw text data used for inference (documents not included in the training corpus)
│   ├── corpus_inf.mallet          # Processed input for inference in Mallet format
│   ├── doc-topics-inf.txt         # Document-topic distribution after inference
│   ├── s3.npz                     # NumPy file storing S3 metric data 
│   └── thetas.npz                 # NumPy file storing inferred topic distributions
├── model_data                     # Stores model-related data and outputs
│   ├── TMmodel                    # Directory for various model-related files
│   │   ├── alphas.npy             # NumPy array storing alpha parameters (topics' size)
│   │   ├── alphas_orig.npy        # Original alpha parameters (in case modifications to the alphas file are made)
│   │   ├── betas.npy              # NumPy array storing beta parameters
│   │   ├── betas_ds.npy           # Downsampled beta parameters
│   │   ├── betas_orig.npy         # Original beta parameters
│   │   ├── distances.npz          # NumPy file storing the sparse matrix of document similarities.
│   │   ├── distances.txt          # Text file containing the string representation of the top similarities between each pair of documents.
│   │   ├── edits.txt              # Text file for documenting edits
│   │   ├── ndocs_active.npy       # NumPy array storing the number of active documents
│   │   ├── pyLDAvis.html          # HTML file for visualizing topic models (PyLDAvis)
│   │   ├── s3.npz                 # NumPy file storing S3 metric data
│   │   ├── thetas.npz             # NumPy file storing theta parameters (document-topic distribution)
│   │   ├── thetas_orig.npz        # Original theta parameters
│   │   ├── topic_coherence.npy    # NumPy array storing topic coherence scores
│   │   ├── tpc_coords.txt         # Text file storing topic coordinates
│   │   ├── tpc_descriptions.txt   # Text file for topic descriptions
│   │   ├── tpc_embeddings.npy     # NumPy array storing the Word2Vec embeddings of the topic descriptions.
│   │   ├── tpc_labels.txt         # Text file storing curated topic labels
│   │   └── vocab.txt              # Text file storing vocabulary
│   ├── corpus_train.mallet        # Training corpus in Mallet format
│   ├── corpus_train.txt           # Training corpus in text format
│   ├── diagnostics.xml            # XML file for model diagnostics obtained from Mallet
│   ├── dictionary.gensim          # Gensim dictionary file
│   ├── doc-topics.txt             # Document-topic distribution after training
│   ├── inferencer.mallet          # Mallet inferencer file
│   ├── model.pickle               # Pickle file storing the trained model
│   ├── topic-keys.json            # JSON file storing topic keys
│   ├── topic-keys.txt             # Text file storing topic keys
│   ├── topic-report.xml           # XML file for topic report
│   ├── vocab_freq.txt             # Text file storing vocabulary frequency
│   ├── vocabulary.txt             # Text file storing vocabulary
│   └── word-topic-counts.txt      # Text file storing word-topic counts
├── train_data                     # Contains data used for training the model
│   ├── corpus.mallet              # Corpus in Mallet format for training
│   ├── corpus.txt                 # Raw text data for training
│   ├── corpus_aux.txt             # Auxiliary text file for the corpus
│   ├── import.pipe                # Pipe file for importing inference data
│   └── train.config               # Configuration file for training
└── trainconfig.json               # JSON file containing the training configuration
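
A quick way to inspect some of these artifacts in Python; this assumes thetas.npz is a SciPy sparse matrix and the .npy files are plain NumPy arrays, which matches the naming above but should be verified against the TMmodel code:

import numpy as np
from scipy import sparse

model_dir = "model/model_data/TMmodel"

alphas = np.load(f"{model_dir}/alphas.npy")          # topic sizes
betas = np.load(f"{model_dir}/betas.npy")            # topic-word distributions
thetas = sparse.load_npz(f"{model_dir}/thetas.npz")  # document-topic distributions (sparse)

print(f"{len(alphas)} topics, {thetas.shape[0]} documents, vocabulary of {betas.shape[1]} terms")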

Docker

Preprocessing and topic modeling can be run in a Docker container with Python 3.11 and CUDA support.

Build Image:

docker build -t image_name .

Run:

docker run --gpus all --rm -it -v /path/to/data:/app/data image_name

In this command, the data directory of the host machine is mounted into the Docker container at the /app/data location.

Main scripts

Utility scripts used to generate final topic models, graphs, metadata, and evaluations:

  • generate_graphs.py: Graphs for all models in a directory
  • generate_hierarchy.py: Treemap of topic hierarchy per CPV code
  • generate_users_eval.py: Excel sheets for topic evaluation
  • get_info_index_bsc.py: Generates topic & objective JSON for BSC API
  • predict_objective_with_s3.py: Predicts missing objectives using the S3 metric
  • train_tm.py: Trains topic models per CPV or globally

Acknowledgements

This work has received funding from the NextProcurement European Action (grant agreement INEA/CEF/ICT/A2020/2373713-Action 2020-ES-IA-0255).

[EU logo] [Next Procurement logo]
