semantic-tagging-tools

A simple tag expander and merger that works from the tags already in your dataset.

The model you use for synonym matching will HEAVILY influence the results you get.

Overview

1. Collect all unique tags; a set() makes this easy.
2. Pass the tags, in the format "x","y","z", to the embedding model (I used mistral-embed).
3. Normalise the new embeddings and pass them to faiss.
4. Adjust the threshold for matching within faiss.
4b. For each tag in your original list, the model is supplied with the tag and the candidates from faiss that fall within your similarity threshold (distance).
4c. The model (I chose mistral-small v3) returns a JSON object with the valid synonyms it picked.
5. Rewrite the original label.
6. Profit?
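A rough sketch of steps 1, 3 and 4 may help make the flow concrete. It is not the code in semantic_merge.py: the toy 2-D vectors stand in for real embeddings, the brute-force loop stands in for the faiss search, and the names `normalise`, `candidates_within` and `toy_embeddings` are made up for illustration.

```python
import math

def normalise(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def candidates_within(tag, vectors, threshold):
    """Return the other tags whose similarity to `tag` is >= threshold, best first.

    Brute-force stand-in for the faiss neighbour search (assumption).
    """
    query = vectors[tag]
    scored = []
    for other, vec in vectors.items():
        if other == tag:
            continue
        sim = sum(a * b for a, b in zip(query, vec))
        if sim >= threshold:
            scored.append((sim, other))
    scored.sort(reverse=True)
    return [other for _, other in scored]

# Step 1: collect the unique tags with a set.
dataset = [["car", "red"], ["automobile", "crimson"], ["car", "tree"]]
unique_tags = sorted({tag for row in dataset for tag in row})

# Step 2 stand-in: hypothetical toy vectors instead of a real embedding model.
toy_embeddings = {
    "car": [1.0, 0.1],
    "automobile": [0.9, 0.2],
    "red": [0.1, 1.0],
    "crimson": [0.2, 0.9],
    "tree": [-1.0, -1.0],
}

# Step 3: normalise every vector.
vectors = {tag: normalise(toy_embeddings[tag]) for tag in unique_tags}

# Step 4: find candidates for one tag within the similarity threshold.
print(candidates_within("car", vectors, threshold=0.9))
```

The candidates returned here are what would be handed to the LLM in step 4b. Raising or lowering `threshold` is the knob that step 4 refers to.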

Thoughts

I used mistral-small because it's free and good enough for the task, though it will make mistakes sometimes. A bigger, smarter model (like Qwen 32B+ or Claude) should give you better results.

I believe normalising before feeding the vectors to faiss is probably unnecessary, but in testing the results seemed better with it on than without.
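One possible reason normalising matters: on unit-length vectors the inner product is exactly the cosine similarity, and the squared L2 distance becomes 2 - 2*cos, so a single threshold means the same thing for every tag. If the embedding model already returns unit-length vectors (I have not checked this for mistral-embed), the step is a no-op. A small check with made-up vectors, plain Python only:

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Return v scaled to length 1."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalised) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical vectors, both of length 5.
a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = unit(a), unit(b)

raw_ip = dot(a, b)        # depends on vector length, not just direction
cos = cosine(a, b)        # direction only
unit_ip = dot(ua, ub)     # matches cos once the vectors are normalised
sq_l2 = sum((x - y) ** 2 for x, y in zip(ua, ub))  # matches 2 - 2 * cos

print(raw_ip, cos, unit_ip, sq_l2)
```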

Currently, the program goes through the list and, if necessary, batches the candidates. (For my testing I set both batch size and k neighbours to 10, so batching never kicks in; please change this.) This means that if word X has 200 candidates and you have a batch size of 100, the LLM gets two batches of 100 and the synonyms from both are combined at the end. You might want to adjust the semantic threshold a little so you don't get too many overlapping candidates.
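The batching idea can be sketched roughly like this. It is an illustration, not the repo's code: `fake_llm` is a hypothetical stand-in for the real model call, and the `{"synonyms": [...]}` reply shape is an assumption about what the JSON might look like.

```python
import json

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fake_llm(tag, batch):
    """Hypothetical stand-in for the LLM call; returns a JSON string.

    It "accepts" any candidate that shares its first letter with the tag,
    purely so the example has something deterministic to return.
    """
    picked = [c for c in batch if c[0] == tag[0]]
    return json.dumps({"synonyms": picked})

def collect_synonyms(tag, candidates, batch_size, ask=fake_llm):
    """Ask the model once per batch and merge the answers in order."""
    merged = []
    for batch in chunked(candidates, batch_size):
        reply = json.loads(ask(tag, batch))
        for syn in reply.get("synonyms", []):
            # Only keep synonyms that were actually offered in this batch,
            # and skip ones already collected from an earlier batch.
            if syn in batch and syn not in merged:
                merged.append(syn)
    return merged

candidates = ["cab", "bus", "coupe", "van", "cart"]
print(collect_synonyms("car", candidates, batch_size=2))
```

With a batch size of 2 the five candidates are sent as three batches, and the picks from each batch are appended to one list at the end.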

Info on faiss: https://github.com/facebookresearch/faiss/wiki/

Examples:

ocha221/semantic-tagging-tools
